404 handler caching

We have a good server-side file caching system already, but there are some possibilities for enhancing it.

Rationale

It's almost always true with Web servers in general and Apache in particular that serving static files from the file system is going to be much, much faster -- usually an order of magnitude faster -- than doing any dynamic processing. In other words, any PHP page is going to be significantly slower than a plain HTML page.

For this reason, we could use a directory in a MediaWiki installation as a static file cache.

Design

  • There's a Web directory where static pages go -- say, "/var/www/wiki".
  • URLs for articles point to that directory: the link for "Foo Bar" renders to "/wiki/Foo_Bar".
  • Initially, an HTTP hit for "/wiki/Foo_Bar" will fail -- no page there.
  • Apache has a directive -- ErrorDocument -- for mapping a script to handle missing files (among other errors). We map a script -- /w/404Handler.phtml or something -- to handle these errors for the "/wiki/" directory.
 <Directory /var/www/wiki/>
   ErrorDocument 404 /w/404Handler.phtml
 </Directory>
  • The 404 handler tries to retrieve the article from the database. If the article doesn't exist, it shows an edit form for the article, just like a broken link works now. If the article is Special:, it lets wiki.phtml do the work.
  • If the article _does_ exist, the handler renders the page *to a file in the cache directory*, e.g. "/wiki/Foo_Bar.html", then opens that file and serves it to the current user as well, returning an HTTP 200 response on success. (There are probably other ways to optimize this, but the alternatives I can see would require two hits to the server, which is wasteful; it's easier and cheaper just to read back the file that was written.) A sketch of such a handler follows this list.
  • When the next hit for "/wiki/Foo_Bar" comes in, the Web server will find the file there (MultiViews will pick up the ".html" version -- a future 404 handler might also write out ".xml" or other document formats) and serve it directly, without running any PHP code.
  • On saving an edit of a page, the software simply *deletes* the cached HTML file. This will trigger the 404 handler on the next hit for the page, which will regenerate the cached file. Similar cache invalidation would be necessary for moving or deleting a page (see the invalidation sketch after this list).
  • If an edit would change the appearance of links to a page (say, creating a new article turns broken links into good links, deleting an article turns good links into broken links, or the page crosses the default stub threshold, if one is set), saving would also delete the cached versions of any pages that link to the changed page. These, too, will be regenerated by the 404 handler on the next hit, with the new link appearance.
    You don't need this, see #Using Cascading Style Sheets
  • A garbage-collection cron job runs every N minutes to keep the size of the cache to a reasonable level (# bytes, # files). It deletes the least recently used pages, by filesystem access time. (There's a possible race condition where a huge influx of activity could overflow the cache between garbage collection runs. If the installation is susceptible to this -- say, it has hard disk space limits -- garbage collection could happen in the 404 handler rather than in a cron job. This would obviously slow display of pages, though.)
  • Of course, logged-in users want to get their pages rendered _just_so_, with question marks and [edit] links, etc. They should also have all their "My page" and other links working. It may be possible to do most of this with JavaScript -- checking the UserId cookie, and showing and hiding parts based on that. Then the same .html file can be served to logged-in and not-logged-in users, with the client side doing the customization.
  • But probably the easiest way is just to run the dynamic pages every time for logged in users. We can use the Apache rewrite engine to serve different stuff, based on whether you're logged in or not. We use a RewriteCond line to check if the UserId cookie exists:
 # Cookie-header entries are separated by "; ", not "&"; "ampescape" is a
 # RewriteMap (defined elsewhere) that escapes the title for the query string.
 RewriteCond %{HTTP_COOKIE} (^|;\s*)UserId=([^;]+)(;|$)
 RewriteRule ^/wiki/(.*)$ /wiki.phtml?title=${ampescape:$1} [L]
  • This kinda depends on having a significant number of users not-logged-in. If 100% of users are logged in, we get no benefits, and a slight cost from Rewrite-checking the cookie on every hit.
  • A more aggressive version could try to cache some of the other features of MediaWiki, like user contributions, page histories, etc., with the same strategy.
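
To make the flow above concrete, here is a minimal sketch of what the 404 handler could look like. It assumes a hypothetical renderArticleHtml() function standing in for whatever MediaWiki call actually renders a title to HTML, and it hard-codes the cache directory and entry-point paths used in the examples above; it is a sketch under those assumptions, not a drop-in implementation.

 <?php
 // 404Handler.phtml -- sketch only. renderArticleHtml() is a hypothetical
 // stand-in for the real MediaWiki rendering call.
 $cacheDir = '/var/www/wiki';   // static cache directory from the example above
 
 // Apache passes the originally requested URI to an ErrorDocument script
 // in the REDIRECT_URL environment variable.
 $uri   = $_SERVER['REDIRECT_URL'] ?? $_SERVER['REQUEST_URI'];
 $title = urldecode(preg_replace('#^/wiki/#', '', $uri));   // "/wiki/Foo_Bar" -> "Foo_Bar"
 
 if (strpos($title, 'Special:') === 0) {
     // Special: pages stay dynamic -- let wiki.phtml do the work.
     // (Filesystem path to the entry point is an assumption; adjust to the install.)
     $_GET['title'] = $title;
     include '/var/www/wiki.phtml';
     exit;
 }
 
 $html = renderArticleHtml($title);            // hypothetical renderer
 if ($html === null) {
     // Article doesn't exist: show the edit form, like a broken link does now.
     header('Location: /wiki.phtml?title=' . urlencode($title) . '&action=edit');
     exit;
 }
 
 // Article exists: write it into the cache, then serve the same bytes
 // to the current user with a 200 instead of the 404.
 $cacheFile = "$cacheDir/$title.html";
 file_put_contents($cacheFile, $html);
 
 http_response_code(200);
 header('Content-Type: text/html; charset=UTF-8');
 readfile($cacheFile);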
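
And a similarly hedged sketch of the invalidation and garbage-collection side. getTitlesLinkingTo() is hypothetical (in MediaWiki terms it would be a query against the links table), and the cache path and size budget are only illustrative.

 <?php
 $cacheDir = '/var/www/wiki';   // same assumed cache directory as above
 
 // Called when a page is saved, moved or deleted: drop its cached copy and,
 // if its link appearance changed, the cached copies of pages linking to it.
 function invalidateCached($title, $linkAppearanceChanged = false) {
     global $cacheDir;
     @unlink("$cacheDir/$title.html");
     if ($linkAppearanceChanged) {
         foreach (getTitlesLinkingTo($title) as $linker) {   // hypothetical lookup
             @unlink("$cacheDir/$linker.html");
         }
     }
 }
 
 // Cron job: trim the cache back to a byte budget, least recently used first
 // (by filesystem access time).
 function trimCache($maxBytes) {
     global $cacheDir;
     $files = glob("$cacheDir/*.html");
     usort($files, function ($a, $b) { return fileatime($a) - fileatime($b); });
     $total = array_sum(array_map('filesize', $files));
     foreach ($files as $file) {
         if ($total <= $maxBytes) {
             break;
         }
         $total -= filesize($file);
         unlink($file);
     }
 }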

Advantages

  • Reduce the amount of dynamic page servicing. This is pretty much the big one. Per-page server costs should drop by at least one order of magnitude and maybe two.
  • Apache handles all the tweaky optimizations like compression, content negotiation, client-side caching, etc. It simplifies the MediaWiki code.

Disadvantages

  • There'd probably be some problems with funky characters -- spaces, punctuation, etc. "/wiki/Foo Bar" and "/wiki/Foo_Bar" should probably (usually) map to the same file. Some work needed here.
  • You lose view counting. However, a periodic Web server log checker could provide the same functionality -- reading the Web log and updating the database with new view counts (see the sketch after this list).
  • Directory entries. Having a few thousand files in a single directory could get kinda lousy for some file systems. There may be some rewrite tricks that would allow writing to multiple sub-directories, like we do with images in the /upload dir right now, while having them appear to be in the /wiki directory.
  • Redirect articles are tricky.
  • Anonymous users won't see new-message notification.
  • If there are multiple Web servers sharing a single database, this wouldn't work as described. The delete-on-update strategy means we either have to share a cache directory, or have some kind of communication mechanism between Web servers. Either way, it's a little dicey. However, an efficient shared cache may actually still beat dynamic page handling by a long shot.
    If multiple Web servers share a single database, then they can share the /wiki-cache directory, too (NFS). --Bodo Thiesen 23:54, 18 Jan 2005 (UTC)
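
A sketch of the log-based view counter mentioned above: it tallies successful /wiki/ hits per title from an Apache access log and hands the totals to an (assumed) updateViewCount() that adds them to the page counters. The log path and log format are assumptions.

 <?php
 // Count page views from the Web server log instead of in the PHP handler.
 $counts = array();
 foreach (file('/var/log/apache2/access.log') as $line) {   // assumed log path
     // Matches e.g.: "GET /wiki/Foo_Bar HTTP/1.1" 200
     if (preg_match('#"GET /wiki/([^ ?"]+)[^"]*" 200#', $line, $m)) {
         $title = urldecode($m[1]);
         $counts[$title] = isset($counts[$title]) ? $counts[$title] + 1 : 1;
     }
 }
 foreach ($counts as $title => $n) {
     updateViewCount($title, $n);   // hypothetical: add $n to the page's counter
 }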

Implementations

Issues to consider

  • Wikimedia servers use very extensive Squid caching for anonymous content. This duplicates that. Is it necessary or beneficial to do so? Would other MediaWiki users be better served by advice to use the Squid approach than by this?
  • A directory containing fully rendered pages is a possible aid to search.
  • How does it scale to 5,000,000 files in one directory? Or to just the 1,000,000 or so in the immediate future of Wikimedia projects? Does it require a specific file system to perform in something better than completely unacceptable O(number of pages) time? It needs to be O(1) or very close to it. The traditional resolution is to store the files in a directory tree so that n at each level stays small. CRC-32 as a hashing algorithm for the page name is one possible approach, using each result byte of the CRC-32 value for one directory level (a sketch of such a layout follows this list). Using two bytes with one million pages, and assuming uniformity, would give about 15 pages per leaf directory. It takes roughly 17 million pages for the number of files per leaf to reach 256, and at present that looks to be sufficient, so perhaps CRC-16 would do. Note that CRC-32 was found to produce a problematic number of collisions when (insert forgotten name of largest router maker) used it to hash millions of URLs; how large "problematic" was wasn't described.
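
A sketch of such a hashed layout, using two bytes of the CRC-32 of the title for two directory levels; the names, depth, and cache root are illustrative only.

 <?php
 // Map a page title to a two-level hashed cache path, e.g.
 // "Foo_Bar" -> "/var/www/wiki/a7/3c/Foo_Bar.html"
 function cachePathFor($title, $cacheDir = '/var/www/wiki') {
     $crc = sprintf('%08x', crc32($title));   // CRC-32 as 8 hex digits
     $dir = $cacheDir . '/' . substr($crc, 0, 2) . '/' . substr($crc, 2, 2);
     if (!is_dir($dir)) {
         mkdir($dir, 0775, true);             // create both levels as needed
     }
     return "$dir/$title.html";
 }

The Web server would then need an internal rewrite from the flat "/wiki/Foo_Bar" URL to the hashed path (for example via a RewriteMap program applying the same hash), so the visible URLs stay unchanged.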

Using Cascading Style Sheets

Instead of simply creating a bad link like this:

<a href="/wiki/Example_Bad_Link" title="Example Bad Link">Example Bad Link</a>

Create the link like this:

<span class="broken-link"><a href="/wiki/Example_Bad_Link" title="Example Bad Link" class="link-Example_Bad_Link">Example Bad Link</a></span>

Now add a

<link rel="stylesheet" type="text/css" href="/wiki/Example_Bad_Link.css"/>

for each bad link to the head section. The page renderer now also has to create the .css file when the page comes into existence (it's trivial), and only the page-deletion code has to delete that file (a short PHP sketch follows the example below). The content of the file /wiki/Example_Bad_Link.css would then be:

.link-Example_Bad_Link { styles for good links }
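
In PHP terms, the renderer's extra work could look roughly like this; the cache directory and the example rule body are assumptions carried over from the examples above.

 <?php
 // Written when the page is created; deleted only when the page is deleted.
 function writeLinkStylesheet($title, $cacheDir = '/var/www/wiki') {
     $css = ".link-$title { /* styles for good links */ }\n";
     file_put_contents("$cacheDir/$title.css", $css);
 }
 
 function removeLinkStylesheet($title, $cacheDir = '/var/www/wiki') {
     @unlink("$cacheDir/$title.css");
 }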

Advantage

  • No page that links to the new article needs to be deleted from the cache when the article is created.

Disadvantage

  • Browsers will make a lot of extra requests to download the .css files (and in most cases the request fails, so the browser has to redo the job on the next page view).
  • A page which was deleted will still show up as existing. But in general, pages which are linked to are rarely deleted, so this will most likely remain a theoretical issue; if it really matters, the page-deletion code can still delete all dependent pages, as in the original proposal.