
An automated way to beat the digg effect

19 replies
libervisco

As much as I am proud to say Libervis.com *finally* got its first digg frontpaging, I am ashamed to say that the server didn't take all that traffic well. Both Nuxified.org and Libervis.com are on the same server and both sites crashed, apparently due to too many mysql queries.

The only reason these sites are up right now is that someone took the story off the digg homepage prematurely, which lessened the amount of traffic coming in and let the sql server recover.

Well, we can't let this sort of thing happen again. Be it from digg or some other site, such awesome traffic peaks are generally a good thing for our growth, and if we just remain permanently vulnerable to them instead of benefiting from them, that just won't do.

So on to the topic.. one idea I have about beating the digg effect is to do the following:

  • Check whether traffic is coming from digg.com or not.
  • Check the amount of traffic coming from digg.com and if it is higher than (insert the highest amount we can reasonably bear) then do the following (basically an if statement).
  • Append .nyud.net:8080 to the host name of the URI that is receiving excess traffic (but not necessarily other URLs), in which case the page is automatically mirrored and served through the Coral Content Distribution Network.

Actually the first step may not even be necessary. We can just check for traffic amounts, if that's possible, and redirect the suspected URI over CDN.

Also, once the traffic levels fall below the excess level the URI should return to normal.
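
Roughly, I imagine it would go something like this in PHP (completely untested; the load check below is just a placeholder for whatever measure we'd actually end up using):

<?php
// rough sketch: if we're over our comfortable load and the visitor was referred
// from digg.com, bounce them to the Coral CDN copy of this very page
$over_threshold = true; // placeholder for a real traffic/load check
$referer = isset($_SERVER['HTTP_REFERER']) ? $_SERVER['HTTP_REFERER'] : '';

if ($over_threshold && strpos($referer, 'digg.com') !== false) {
    // appending .nyud.net:8080 to the host name makes Coral mirror the page
    $coral_url = 'http://' . $_SERVER['HTTP_HOST'] . '.nyud.net:8080' . $_SERVER['REQUEST_URI'];
    header('Location: ' . $coral_url, true, 302);
    exit;
}
?>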

It may not be an ideal solution, since the traffic going to Coral isn't registered by our own server (so we miss the actual traffic details), and the CDN doesn't update frequently enough to account for comments posted to a story, but if it works it may be a surefire way to protect a site from crashing.

An alternative solution may be to automatically generate a static page when excess traffic is detected. Maybe I'm complicating things too much with the above and that would be the better solution...

Anyway, would something like the above actually be doable in PHP, possibly aided by the .htaccess file? Also, if you've got other suggestions I'm all eyes and ears! :)

Thanks
Danijel

dylunio

Yay on getting Dugg :)

I was wondering what had happened to the network this evening, and this explains things. I know nothing about high-volume website management, so I can't say much more than that your proposal sounds sane.

dylunio

tbuitenh

The easiest solution might be to put a customized digg button below articles, which always uses Coral, and to put comments on a separate page, which somehow always uses Libervis directly instead of Coral.

libervisco

You mean the button which submits the story to digg with the Coral extension added to the submitted URL? That seems quite easy, but not necessarily very effective, since people may still just submit it to digg manually using the direct URL.

Also, putting comments on another page would be a bit awkward to read, IMHO.. I know some sites do it, but I like comments below the article.

I think just auto-generating a static page when necessary might be the best way. The static version doesn't have to contain any comments, just the content in a nice readable form and a link to the full version. People can still access the full version, but since the primary traffic hits the static page (so no mysql queries), it won't come in such big amounts to the actual article fed from the db.

tbuitenh

Manually changing .htaccess seems like the way to go, then. By trying to detect the digg effect, you actually put extra load on the server, potentially making things worse if the digg-wave comes in very fast... But if it does come in fast, manual changing will be too late...

Making a digg button that uses a coraled static page with a link to the original (where we should somehow trick that link into not being changed by Coral) should help at least some of the time: when people use it. The way to make them use it might be labeling it "safe digg that won't kill our server" :D

libervisco

You're probably right... I just read about a Drupal module someone made that automatically creates static versions of site pages and redirects access to those instead of the database, refreshed every five minutes (minimum). The problem is that a five-minute refresh may be too much to apply sitewide (even for registered members). I'll have to test and see if it can be tuned to make only some pages static, and only for anonymous users; then it might be acceptable, and even a better option than Coral, because at least the traffic would still be monitored since it comes to our server.

Coral is really the option to use as a last resort. The big disadvantage is that, it being a mirror, the traffic isn't coming to our site at all... and the Coral cache refreshes way too slowly (on the order of hours).

Anyway, I'll try some things.

libervisco

Ok, for now I will be creating static pages for stories that could be dugg, with a link to the full dynamic version. In addition to that, I will use 1-minute database caching (acceptable enough during expected high-load periods) and the throttle module available for Drupal (which temporarily disables certain functions under high load).

Now, as a way to make sure no one accidentally diggs the full dynamic version of a page, I could put a rule in .htaccess that denies digg.com access to anything with /article/ as the parent in the URI and instead redirects those requests to URIs with /static/ in place of /article/ and an added .html extension.

Why? Here's an example. A normal article published on Libervis.com (the same rules would apply to Nuxified) has a URL like http://www.libervis.com/article/gnu_solaris_the_free_os_of_the_future. I have generated a static page for that article at http://www.libervis.com/static/gnu_solaris_the_free_os_of_the_future.html

I simply want to avoid any traffic from digg being directed to the former URL. Instead, that traffic would bypass it and end up at the latter URL which shows a static page. All of this should be possible by using .htaccess. I would take care of manually creating the static pages in the /static/ subdirectory.

Now I just have to find out the proper way to do it in .htaccess, but that's the concept. If it can be done (and something tells me it can), I think that's the best solution. It completely bypasses mysql so all the server has to endure is pure bandwidth and loading of the static page (which is just html and some images). We have almost unlimited bandwidth so it's not a problem.

If anyone has ideas of how to do it in .htaccess I welcome them. Also, I'd appreciate any opinions or suggestions from you gurus. ;)

tbuitenh

A problem here might be that the traffic doesn't come from digg, but is referred from it. I don't know if .htaccess is that flexible, but it's worth investigating.

The perfect solution would be to put a proxy (such as squid) between libervis and the rest of the web, but I guess we can't do that until we have our own dedicated server. I remember that when I tried orkut (it was very beta) I occasionally got squid error pages, so it seems a proxy can work for interactive sites, as long as the software serving the pages to the proxy also works fine...

libervisco

I know blocking certain referrers is possible with .htaccess, as I've been blocking some spam referrers that way, so if there is a way to basically issue an "if" statement to Apache in .htaccess, it should be possible to say "if the referrer is digg.com, then rewrite the URL to this" (this being the static page URL).
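
Something along these lines might do it, although I haven't tested it and it assumes mod_rewrite is available on our shared host:

# untested sketch: requests referred from digg.com get the static snapshot
RewriteEngine On
RewriteCond %{HTTP_REFERER} digg\.com [NC]
RewriteRule ^article/(.+)$ /static/$1.html [R=302,L]

If that works, anyone following the digg link lands on the static page and the database never gets touched.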

As for squid, you're right, it won't work on shared hosting. I've taken a look at some dedicated options from both my current host and other companies, and it's still a bit out of my current budget, so... we'll have to wait for that.

But this shared hosting we've got now isn't bad at all. Since we're being sponsored, we are on a less busy high-performance server (so sharing with fewer people) and we can receive up to 100 000 visits per month, which averages out to over 3 000 per day. The problem is not that the server can't take a lot of traffic. It's when a large surge comes in a single minute after a story gets published on digg.com. It's hard for mysql to take that anywhere, sometimes even on dedicated hosting as I hear, without additional equipment (load balancing and clustering). We just have to optimize and use static pages where possible. That's the best solution. If we are successful with that, then we can continue growing quite comfortably even on current hosting while riding out surges like this.

So yes, we still have room for more traffic; we just need smarter management of traffic surges. :)

libervisco

Heh, looks like Libervis.com will get another stress test soon (and Nuxified is on the same server). That GNU/Solaris story keeps getting diggs... Wish us luck! :)

tbuitenh

Boost looks like the solution.

libervisco

Yeah, I'm doing almost the same thing manually for now though. That module is alpha code.

libervisco

This is not exactly an automated solution, but I've got another idea, although it's nothing especially unique I suppose.

I could put the html of the entry on another server (which can better handle the digg effect) and call it from within Nuxified.org. This way I don't have to do a redirection or a static version, and I can give the full original link to digg.com. When the traffic comes it will load the page, but instead of pulling the actual article from the Nuxified database it would be pulled from that other server.

One caveat, though, is that it would still have to run the php script that reads the Nuxified database, figures out that the data is somewhere else, and fetches it from there, meaning that mysql would still be hit. However, I suppose the amount of data drawn from mysql matters too, right? So if the stored record is basically just one line, it would be much less stress on mysql to fetch that one line than the whole article?

Or maybe the sheer number of queries would still bring it down? If that's the case, I guess the alternative is to serve the page without touching the database at all, but that's a bit trickier to do in Drupal, I imagine...

libervisco

I got it! Thanks to Drupal's flexibility and my VPS we can now redirect most of the digg load to the VPS server, for specific articles which could be dugg (I'll call them "diggable").

What's happening is this. Anonymous visitors are coming in droves from digg, hitting the URL of the page that was dugg. Drupal extracts that URI and checks whether it is the URI of the currently diggable article. It also checks whether the user is anonymous. If both conditions are true then, instead of loading the page from our database (and overloading our mysql), it loads the snapshot stored on another server (basically a static mirror, updated every hour to pick up new comments).

If anyone registers or logs in, they will see the page normally as any other page, loaded from the database.

What's great about this solution is that it can significantly reduce the load and the risk of getting crushed by the digg effect, without changing the URL or doing any redirections that could harm the page rank and reputation of the dugg URL (so we lose *nothing* at all). In fact, the only small disadvantage is that some anonymous users will see comments up to 1 hour old, that's all. If they log in, they'll see all new comments (or as fresh as db caching allows, up to 5 minutes old).

Here is the script I'm using, with some comments. I'll be saving this for use every time we have a diggable article.

<?php
/* Digg beats our shared hosting sites to hell... well, no more! If an article is likely to be dugg, just drop this script into page.tpl.php and change the following details:

* the URI of the diggable article
* the URL of the snapshot of the page elsewhere

Once the digg effect calms down, this can be commented out, in which case all visitors will view the article completely normally.

The snapshot is on our other VPS server, where an hourly cron wgets the article page and saves it in the proper location with the proper file name. This way, if there are any comments, anonymous users are at most an hour behind.

This rocks!!! :D
*/

global $user;

$diggable_uri = '/article/kazehakase_review';                          // the diggable article
$snapshot_url = 'http://www.libervis.net/misc/kazehakase_review.html'; // its static snapshot
$allowed_role = 'anonymous user';                                      // only for anonymous users!

$uri_request_id = $_SERVER['REQUEST_URI'];                             // the URI being requested

// Serve the snapshot only when the request is for the diggable article and the
// visitor is anonymous; everyone else gets the normal page from the database.
if (substr($uri_request_id, 0, strlen($diggable_uri)) == $diggable_uri
    && is_array($user->roles)
    && in_array($allowed_role, $user->roles)) {
  print file_get_contents($snapshot_url); // print the snapshot instead of the dynamic page
  return;
}
?>

Isn't this great? :D

EDIT: Alright, I realized that I was (and possibly still am) a bit confused about fetching an html file from somewhere else being some sort of saviour. Apache here still needs to serve it. So the only benefit of *that* is that I can more easily control my other VPS in terms of updating the snapshot, setting up the cron and whatnot... Another benefit is that we save this server a bit of work by having the other server do the snapshotting job.

Still, the above is very significant in one way: it bypasses the database completely (for anonymous visitors). No matter where exactly we fetch the snapshot from, the database is not part of the game. So yeah, it ends up like plain old caching, but without the need for redirections (we keep the URL unchanged the whole time). :)

tbuitenh

I don't get it. Why are you trying to generate lots of traffic between libervis.net and nuxified.org? Wouldn't it be way more efficient to store the snapshot version on nuxified? I can imagine it would be more efficient to move the images to the other server, though.

libervisco

Yeah, that's what I figured out later too... I think I'll make the snapshot here, unless I can somehow do a full redirection without changing the URL in the address bar (in which case it really would be beneficial).
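
If I do keep the snapshot on this server, the only change to the script above should be pointing it at a local file instead of fetching it over HTTP, something like this (the path is just an example):

$snapshot_url = 'misc/kazehakase_review.html'; // hypothetical local path, relative to the Drupal root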

And yeah, the images are on the other server either way.

libervisco

I should have *never* put this thing on the Nuxified server. The article has been sitting there as a blank page for *hours* now because, guess what:

/bin/sh: line 1: /usr/bin/wget: Permission denied

That's right. My shared hosting doesn't let me use wget. But did I know that before setting the cron??? No.

If I had just left it on the first solution (getting it from the other server), we wouldn't have had that damn downtime and the story could've even gotten to digg. Way to go, Danijel, way to go!

THAT's what I'm talking about when I say it's easier to control this on another server.

Man, it pisses me off to find out an article I've been counting on was down for the crucial hours of its existence!

tbuitenh

And what did we learn today? Always test before putting your code to use!

libervisco

How do I test without SSH access? I have to wait an hour to see the command in action...