Wikipedia is a large and popular site. In addition to many thousands of human visitors each day, we receive visits from various web spiders. Some of these spiders are fairly friendly, like Googlebot; they're well-behaved, obey the restrictions in our robots exclusion file against crawling the effectively infinite array of dynamically generated pages, and their work helps us by making us available in search engines and bringing new people to the site.
Some spiders aren't so friendly. Maybe somebody thinks it'd be fun to make an offline copy of the wiki, so they point a poorly-behaved site copier at www.wikipedia.org and try to download every link as fast as possible. This can mean generating potentially many millions of pages: every version of every page, diffs between each pair of versions, and Recent Changes, history, backlinks and contributions listings with dozens of combinations of time and numerical limits.
When I catch one of these guys, for the sake of keeping Wikipedia online I ban the user agent of the tool (if they're not hiding it) and for good measure the IP address (often they come right back with another tool or the same one with the user agent string faked).
A few of these people could probably become useful contributors. It'd be good to handle these cases automatically and in a friendlier fashion with some request throttling.
Delay system
Some tools such as mod_throttle will progressively delay incoming requests from an IP address once it has passed a threshold.
- Legit but too-fast tools will just be slowed down to a friendlier level.
- If the client sends many requests simultaneously, this will tie up server threads. We'd want to place a hard limit on connections as well.
- mod_throttle 3.1.2 and earlier contain a design flaw that makes them vulnerable to attack, and as of this writing no fix has been released.
- The website now says, "Snert's Apache modules currently CLOSED to the public until further notice. Questions as to why or requests for archives are ignored."
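To make the delay idea concrete, here is a minimal sketch of a progressive per-IP delay. It is not mod_throttle's actual algorithm; the class name `ProgressiveDelay` and the parameters `threshold`, `window`, `step` and `max_delay` are made up for illustration, and the numbers are arbitrary.

```python
import time
from collections import defaultdict, deque


class ProgressiveDelay:
    """Hypothetical helper: not taken from mod_throttle or any real module."""

    def __init__(self, threshold=10, window=60.0, step=0.5, max_delay=10.0):
        self.threshold = threshold  # requests allowed per window with no delay
        self.window = window        # length of the sliding window, in seconds
        self.step = step            # extra seconds of delay per excess request
        self.max_delay = max_delay  # cap so a client is never stalled forever
        self._hits = defaultdict(deque)  # ip -> timestamps of recent requests

    def delay_for(self, ip, now=None):
        """Record one request from `ip` and return the delay to apply to it."""
        now = time.monotonic() if now is None else now
        hits = self._hits[ip]
        # Forget requests that have fallen out of the sliding window.
        while hits and now - hits[0] > self.window:
            hits.popleft()
        hits.append(now)
        excess = len(hits) - self.threshold
        if excess <= 0:
            return 0.0
        # Each request over the threshold adds `step` seconds, up to the cap.
        return min(excess * self.step, self.max_delay)
```

A server would call `delay_for(ip)` once per request and sleep for the returned number of seconds before answering. Note that sleeping still holds a server thread for the duration, so this does nothing about the simultaneous-connection problem; a hard connection limit would have to be enforced separately.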
503 Service Unavailable
We can send a 503 and cut off the connection, freeing it up for legit users.
- Frees up connections for legitimate traffic
- Bad spider gets an error message, which may discourage it from continuing at all
- 503 can include a suggested delay for retry
- 503 message could include an explanation of the problem and a link to the database download page. Some spiderers may be happy to get the data in an alternate format that doesn't strain our servers so much.
- May scare off legit users who somehow trip the limit
- How would this interact with search engines that manage to trip it?
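A rough sketch of the 503 approach, assuming a simple fixed-window request counter per IP. The class `ThrottleGate`, its `limit` and `window` parameters, and the `DUMP_URL` placeholder are all hypothetical; the real download page address and sensible limits would need to be filled in.

```python
from http import HTTPStatus

# Placeholder only: stands in for the real database download page.
DUMP_URL = "https://example.org/database-download"


class ThrottleGate:
    """Hypothetical per-IP gate that answers over-limit requests with a 503."""

    def __init__(self, limit=30, window=60):
        self.limit = limit    # requests allowed per window
        self.window = window  # window length, in seconds
        self._counts = {}     # ip -> (window_start, request_count)

    def check(self, ip, now):
        """Return None to let the request through, or (status, headers, body)."""
        start, count = self._counts.get(ip, (now, 0))
        if now - start >= self.window:
            # The previous window has expired, so start counting afresh.
            start, count = now, 0
        count += 1
        self._counts[ip] = (start, count)
        if count <= self.limit:
            return None
        # Suggest retrying once the current window has run out.
        retry_after = max(1, int(start + self.window - now))
        headers = {
            "Retry-After": str(retry_after),
            "Content-Type": "text/plain; charset=utf-8",
            "Connection": "close",
        }
        body = (
            "You are requesting pages too quickly. "
            f"Please wait {retry_after} seconds before trying again.\n"
            f"If you want a full copy of the data, see {DUMP_URL}\n"
        )
        return HTTPStatus.SERVICE_UNAVAILABLE, headers, body
```

The `Retry-After` header carries the suggested delay, and the body carries the explanation and the pointer to the dump. Whether search engine crawlers honour `Retry-After` well enough to avoid losing pages from their index is exactly the open question above; this sketch does not answer it.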
Triggering the throttle
Modules for Apache (broken links!):