Cluster report, September-November, 2005
These months were once again remarkable in Wikimedia's growth history. Since September, request rates have doubled, a lot of information has been added, modified and expanded, and more users have arrived. To cope with that, the site had to improve both its software and hardware platforms again.
Of course, more hardware was thrown at the problem. In mid-September three new database servers (thistle, ixia, lomaria) were added to the pool, retiring an ancient class of hardware from service. With the current data growth rates, the 'old' 4GB-RAM boxes could no longer keep up, except in quite limited roles.
40 dual-Opteron application servers have been deployed, conserving our limited colocation space while providing lots of performance for the buck. One batch of them (20) was deployed just this week. They're equipped with larger drives and more memory, which lets us place various unplanned services on them (9 Apache servers are storing old revisions as well), and some servers participate in a shared memory pool, running memcached.
One really efficient purchase was the $12k image server 'amane', which provides us with storage space and even the ability to make backups at current loads. It now runs a highly efficient and lightweight HTTP server, lighttpd. So far images are being served, but the growth of Wikimedia Commons will force us to find a really scalable and reliable way to handle lots of media.
Additionally, 10 more application servers have been ordered together with a new batch of Squid cache servers. These 10 single-Opteron boxes will have 4 small, fast disks and should enable efficient caching of content.
As all this gear was bought with donated money, we really appreciate the community's help here. Thank you!
The Yahoo-supplied cluster in Seoul, Korea has finally gone into action, bringing cached content closer to Asian locations and hosting master databases and an application cluster for the Japanese, Thai, Korean and Malaysian Wikipedias.
For internal load balancing, Perlbal was replaced by LVS, and we've got a nice flashy donated load-balancing device that may be deployed soon as well. LVS has to be handled with care: several tiny misconfiguration incidents seriously affected site performance. Lately the cluster has become quite big and complex, and we now need more sophisticated and extensive sanity checks and test cases.
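To give a flavour of the kind of sanity check meant here, below is a minimal sketch in Python. It is not our actual tooling and does not talk to LVS at all: the function name `check_pool`, the host names and the `min_active` threshold are all hypothetical, and the pool is assumed to be available as a plain mapping from backend name to weight.

```python
from typing import Dict, List


def check_pool(weights: Dict[str, int], min_active: int = 2) -> List[str]:
    """Return a list of human-readable problems found in a backend pool.

    `weights` maps a (hypothetical) backend name to its balancing weight;
    a weight of 0 is taken to mean the backend is drained.
    """
    problems: List[str] = []

    # A negative weight is never meaningful, so report each one.
    for host, weight in sorted(weights.items()):
        if weight < 0:
            problems.append(f"{host}: negative weight {weight}")

    # Make sure enough backends would actually receive traffic.
    active = [host for host, weight in weights.items() if weight > 0]
    if len(active) < min_active:
        problems.append(
            f"only {len(active)} active backend(s), "
            f"expected at least {min_active}"
        )
    return problems


if __name__ == "__main__":
    # Hypothetical pool: one backend drained, one mistyped weight.
    pool = {"app1": 10, "app2": 0, "app3": -5}
    for problem in check_pool(pool):
        print(problem)
```

Running a check like this before a configuration change is pushed would catch the "tiny misconfiguration" class of mistakes early; a real check would also need to probe the backends themselves.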
There is a lot of work going on to establish more failover capabilities: we will have two active links to our main ISP in Florida. The static HTML dump is becoming nice and usable and may help us in case of serious crashes. It can be served from the Amsterdam cluster as well!
Over the last several days we managed to bring the cluster into quite proper working shape; now it's important to fix everything and prepare for more load, more growth and yet another expansion. We hope that, with the help of the community, we will be able to solve all our performance and stability issues and avoid being Lohipedia :)
Lots of various problems have been solved so far to achieve what we have now, and a lot of low-hanging fruit has been picked. What we are dealing with now is complex and needs manpower and fresh ideas as well. Discussions are always welcome in #wikimedia-tech on Freenode (except during serious downtimes :).
And, of course, Thanks Team (or rather, Family)! It is amazing to work together!