Reducing transit requirements

Outgoing bandwidth is a fairly small cost compared to our total operation. However, we, the server administrators, sometimes forget that we aren't the only ones who have to pay for bandwidth: the clients have to pay for it too. In developed countries this cost is quite reasonable, but in poorer countries, due to lower per-capita international capacity, the cost can be exorbitant, to the point of putting Internet access out of reach of the bulk of the population. Below is an idea for a set of technical measures we can take to reduce international bandwidth requirements, and thus reduce the cost to ISPs of serving Wikipedia.

The ancient method for reducing transit bandwidth costs was for the ISP to use a caching HTTP proxy. However, this is less effective than it once was, because of the growth of dynamic websites such as webmail services. Wikipedia is in an unusual position in that it has a large number of pages (several million) with a broad variety of update frequencies. To avoid having the user see an old version of a page, we pretend to be a dynamic service, so ordinary ISP proxies do not cache our pages.

Within our own organisation, we use a hack to cache web pages in a proxy. We have the server strip off Cache-Control headers for a particular client list, thus allowing downstream caches to store the content. We then send a stream of HTCP CLR messages to each cache, thus notifying it of changes in a timely fashion. However, this scales poorly to the situation of hundreds of isolated ISP proxies in cities around the world. Firstly, we would need to reconfigure our own caches every time a foreign cache is added. Secondly, there is no standard way to tell the foreign cache what Cache-Control header it should send to downstream clients. So here's what I suggest:

  • Add a new HTTP response header, say X-CLR-Cache-Control, which overrides the Cache-Control header for clients which are subscribed to the CLR stream.
  • Create an API allowing clients to subscribe to the CLR stream. This interface would allow the negotiation of a "shared secret" for HTCP authentication, so HTTPS would probably be a sensible protocol. It would be free of charge but require regular resubscription.
  • Add a new REASON to the CLR OP-DATA (see RFC 2756 section 6.5), specifying that the item has been updated, and that pre-fetching the object during the daily off-peak period would be recommended.
  • Produce a patched Squid RPM containing these changes, and assist foreign ISP techs in setting it up.
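
As a rough illustration of the first item, the header override could be sketched as follows. This is only one reading of the proposal, not an existing Squid feature: the function name `effective_cache_control`, the `subscribed` flag and the example header values are all assumptions made for the sketch. A subscribed cache applies X-CLR-Cache-Control to its own storage decisions, while an unsubscribed cache sees only the ordinary Cache-Control header.

```python
from typing import Mapping, Optional

# Header name suggested in the text above; it is not a standard HTTP header.
CLR_HEADER = "X-CLR-Cache-Control"


def effective_cache_control(headers: Mapping[str, str], subscribed: bool) -> Optional[str]:
    """Return the caching policy a proxy should apply to a response.

    `subscribed` is a hypothetical flag meaning "this proxy currently
    receives the CLR stream for the origin server".
    """
    # HTTP header names are case-insensitive, so normalise them first.
    lowered = {name.lower(): value for name, value in headers.items()}
    override = lowered.get(CLR_HEADER.lower())
    if subscribed and override is not None:
        # The proxy will be told about changes, so it may trust the override.
        return override
    # Otherwise fall back to the ordinary header (None if it is absent).
    return lowered.get("cache-control")


# Illustrative header values only; they are not taken from a real response.
response = {
    "Cache-Control": "private, s-maxage=0, max-age=0, must-revalidate",
    "X-CLR-Cache-Control": "public, s-maxage=86400",
}
print(effective_cache_control(response, subscribed=True))
print(effective_cache_control(response, subscribed=False))
```

In this reading the ordinary Cache-Control header is still what the proxy forwards to browsers, which keeps end users from holding stale copies that no CLR message can reach.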

The off-peak pre-fetch mechanism mentioned above needs a bit more explanation. Request patterns between an isolated city and a foreign server are strongly diurnal. I'm not sure if current pricing structures would permit a cost saving, but at the very least, using the times when the connection is nearly idle to pre-fetch content would reduce response times as seen by the user. An extra REASON is required because some items which are stored in the cache are either very rarely requested, or expensive for the content server to serve, or both (such as our history pages). Those items should simply be purged, not pre-fetched, so the cache needs a way to tell the two cases apart.
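
A minimal sketch of how a cache might act on the two kinds of CLR message is given below. Everything here is hypothetical: the `Reason` names stand in for REASON codes that do not exist in HTCP today, and `PrefetchingCache` is a toy in-memory model, not Squid code. The off-peak window is assumed to be configured by the ISP.

```python
from collections import deque
from datetime import time
from enum import Enum
from typing import Callable, Deque, Dict, List


class Reason(Enum):
    """Hypothetical CLR reasons; UPDATED_PREFETCH is the proposed addition."""
    PURGE_ONLY = "purge-only"
    UPDATED_PREFETCH = "updated-prefetch"


class PrefetchingCache:
    def __init__(self, off_peak_start: time, off_peak_end: time) -> None:
        self.store: Dict[str, bytes] = {}
        self.pending: Deque[str] = deque()  # URIs waiting for the off-peak window
        self.off_peak_start = off_peak_start
        self.off_peak_end = off_peak_end

    def on_clr(self, uri: str, reason: Reason) -> None:
        # Every CLR invalidates the stored copy, whatever the reason.
        self.store.pop(uri, None)
        # Only queue a re-fetch when the origin says it is worthwhile.
        if reason is Reason.UPDATED_PREFETCH and uri not in self.pending:
            self.pending.append(uri)

    def is_off_peak(self, now: time) -> bool:
        start, end = self.off_peak_start, self.off_peak_end
        if start <= end:
            return start <= now < end
        # The window wraps past midnight, e.g. 23:00 to 04:00.
        return now >= start or now < end

    def run_prefetch(self, now: time, fetch: Callable[[str], bytes]) -> List[str]:
        """Fetch all queued URIs if we are inside the off-peak window."""
        if not self.is_off_peak(now):
            return []
        fetched = []
        while self.pending:
            uri = self.pending.popleft()
            self.store[uri] = fetch(uri)
            fetched.append(uri)
        return fetched


cache = PrefetchingCache(time(1, 0), time(5, 0))
cache.store["/wiki/Squid"] = b"old"
cache.store["/w/index.php?title=Squid&action=history"] = b"old history"
cache.on_clr("/wiki/Squid", Reason.UPDATED_PREFETCH)
cache.on_clr("/w/index.php?title=Squid&action=history", Reason.PURGE_ONLY)
print(cache.run_prefetch(time(14, 0), lambda uri: b"new"))  # outside the window
print(cache.run_prefetch(time(2, 30), lambda uri: b"new"))  # inside the window
```

The history page is dropped and never re-fetched, which is the point of having a separate REASON: the origin server decides which objects are worth the off-peak transfer.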

There's a complication in that the foreign Squid server somehow has to guess what additional headers to send in the pre-fetch request. It knows the URI, but it doesn't know, say, which Accept-Encoding header is most likely to be useful. This information could be communicated by the CLR subscription interface, or perhaps discovered by examining client traffic.
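
The "examining client traffic" option could be as simple as counting the Accept-Encoding values that real clients send and reusing the most common one for pre-fetch requests. The sketch below assumes exactly that; the class `HeaderProfile` and its methods are invented for illustration, and a real implementation would also have to deal with the other request headers that the response may vary on.

```python
from collections import Counter
from typing import Optional


class HeaderProfile:
    """Tracks which Accept-Encoding values clients of this proxy send."""

    def __init__(self) -> None:
        self.counts: Counter = Counter()

    def observe(self, accept_encoding: Optional[str]) -> None:
        # Treat a missing header as the empty string so it is counted too.
        value = accept_encoding or ""
        # Normalise case, spacing and order so "gzip, deflate" and
        # "deflate,gzip" are counted as the same preference.
        tokens = sorted(t.strip().lower() for t in value.split(",") if t.strip())
        self.counts[",".join(tokens)] += 1

    def best_guess(self) -> Optional[str]:
        """Return the most common value, or None if nothing was observed.

        An empty string means most clients sent no Accept-Encoding header.
        """
        if not self.counts:
            return None
        return self.counts.most_common(1)[0][0]


profile = HeaderProfile()
for seen in ["gzip, deflate", "deflate,gzip", "gzip", None]:
    profile.observe(seen)
print(profile.best_guess())
```

Pre-fetching with the majority value means the stored variant matches what most clients will ask for; clients with a different preference would still cause an ordinary cache miss.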

These changes would benefit the ISP financially, and I would hope that by giving free access to even small ISPs, we will maximise the chances of competitive pressure leading to these savings being passed on to the end-user. Larger institutions that are able to negotiate with the ISP, such as schools, may be able to organise cheap access to Wikipedia and similar sites, while having the bulk of the Internet blocked.

Note that although I've phrased this in terms of helping those who are less well-off, there are similar gains for the manageability and bandwidth requirements of our own squids. There's no reason our own squids couldn't use the subscription API, and benefit from pre-fetching.

I don't have any immediate plans to implement this myself, so if you think this sounds like an interesting project, feel free to jump in. (Tim Starling)