Hardware capacity growth planning

Note: this page is out of date.

See cache strategy for the caching architecture we use. It's critical to site performance and to understanding the roles of the systems described here.

The original document was written in late September of 2004. An update on October 11th of 2004:

In early October the traffic growth steadied and may be returning to the long-term trend.
The final proposed order for October 2004 is now available.

Our next order

For early October we anticipate ordering one "set" of computers. A set is a fairly balanced mixture of one Squid cache server, four or five Apache web servers and one database slave. We expect to add sets like this as needed, adjusting the exact balance as necessary.

From the vendor we have been using, we'd order this:

1x Squid:
- Single P4 CPU around 2.4-3.0GHz
- 4GB of RAM
- Two 10,000 RPM SATA drives (we want to test performance of these in cache miss situations). We would try more if the case we normally used supported more.
SM-1151SATA P4 3.2GHz, 4x1GB ECC DDR400 RAM, Seagate 200GB drive, CD-ROM, floppy, $2462. This time we'll actually get this with 1GB and use it as a web server, replacing an older machine which is being upgraded to 4GB and moved to Squid service, using the two drives from this one.

4x Web servers:
- Single P4 CPU around 2.6-3.0GHz (We'd be happy with dual or single CPU boxes, AMD or Intel)
- 1 or 2GB of RAM (1GB this time)
- 200GB or so SATA drive for log/utility storage.
SM-1151SATA P4 3.2GHz, 2x512MB ECC DDR400 RAM, Seagate 200GB drive, CD-ROM, floppy, $1361 each, $5444 total

1x Database slave (to be configured for search):
- Dual Opteron CPU, slower end of the range (we don't care about make, want to be sure we don't run out of CPU, though actual use is typically 20%)
- 4GB of RAM (8 would be helpful)
- 6 x 10,000 RPM SATA drives in RAID 0 (we know that 7200RPM in RAID 1 does well; we want to see how this configuration does, and we don't care about losing it all if a drive dies: it's just a slave and won't ever see data which isn't already safely stored elsewhere)
- Write caching RAID controller (very important for database speed)
SM-2280SATA, Dual Opteron 242 1.6GHz, 4x1GB ECC DDR400 RAM, 6x Western Digital Raptor 10,000 RPM 74GB SATA drives in RAID 0, CD-ROM, floppy, $5352

TOTAL: $13,258
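
As a quick check of the arithmetic, the total can be recomputed from the quoted line items. This is only a sketch: the names `ORDER` and `order_total` are made up for illustration, and the Squid line uses the quoted 4GB price even though this order will actually take that box with 1GB, so the real invoice may differ.

```python
# Line items from the proposed order: (description, quantity, unit price in USD).
# Prices are the quoted figures from the text; names here are illustrative only.
ORDER = [
    ("Squid cache server (SM-1151SATA)", 1, 2462),
    ("Web server (SM-1151SATA)", 4, 1361),
    ("Database slave (SM-2280SATA)", 1, 5352),
]


def order_total(items):
    """Sum quantity * unit price over all line items."""
    return sum(qty * price for _, qty, price in items)


total = order_total(ORDER)
print(f"TOTAL: ${total:,}")  # prints "TOTAL: $13,258"
```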

If we were looking to buy from IBM, we'd be investigating the Cluster 1350 system.

Question - don't you think a UPS system would also be a wise investment?

Our growth rate

As of December 2004 the cluster has 39 servers.

The performance-limited doubling time was about 20 weeks; the recent one has been about 8 weeks. Ignoring anticipated software improvements, the possibility that we would reach the limit of demand for what we offer, and cost considerations, the 8-week doubling time would see us having over 1,000 computers by September 2005.
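
The projection is plain exponential growth: the server count after t weeks is the starting count times 2^(t / doubling time). A minimal sketch follows, assuming the 39 servers of December 2004 as the starting point and roughly 40 weeks until September 2005. Both the starting point and the interval are assumptions made for illustration, not the figures behind the original estimate.

```python
def projected_servers(start, weeks, doubling_weeks):
    """Server count after `weeks`, doubling every `doubling_weeks` weeks."""
    return start * 2 ** (weeks / doubling_weeks)


START = 39   # server count in December 2004, from the text
WEEKS = 40   # assumed: roughly December 2004 to September 2005

fast = projected_servers(START, WEEKS, 8)    # recent 8-week doubling time
slow = projected_servers(START, WEEKS, 20)   # earlier 20-week doubling time

print(f"8-week doubling:  {fast:.0f} servers")   # 1248
print(f"20-week doubling: {slow:.0f} servers")   # 156
```

Under these assumptions the 8-week rate gives five doublings and lands above 1,000 machines, while the 20-week rate gives only two doublings.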

Background information

For longer term capacity growth we will need to expand in these general areas. For all areas, we want at least three systems. People are relying on our presence as the emerging encyclopedia of the Internet and we don't want to be unavailable.

Database

The database architecture has a general master/slave setup, with roles split across the group. For all but the search servers, the system must be capable of allocating most of the system RAM in a single chunk because MySQL's InnoDB engine makes a single allocation for its cache. We currently use Opteron systems with Fedora Core 2.

Master and standby masters. These systems need to be able to handle high write and significant read loads. Ideally, three of these with specifications in the range of dual Opterons, 8-16 GB of RAM and 6 or more 15K SCSI drives. The InnoDB storage engine in MySQL is very CPU-efficient and typically hits RAM/cachability and disk performance limits before it uses the full capacity of a dual CPU system. There are periods where CPU use is higher but it seems unlikely that more than four CPUs per system would be of benefit and the query servers are expected to reduce this CPU load at lower cost than additional CPUs in these systems. The standby systems participate in load sharing until a fault hits the master. More systems are better than higher capability systems, except when it comes to disk space - space is being consumed rapidly and that growth is likely to accelerate. See growth in storage size of current articles for the larger projects. If query or storage needs rise beyond the capability of a single system of moderate cost we expect to partition the data along its natural project/language boundaries, paying careful attention to load balancing across the different time zones which participate in the different language projects.
Storage servers. We are in the process of adjusting the database schema to allow us to place the text part of our article history on different servers. This has every version of every article for all projects, so we have comprehensive records of authorship. This will offload the seldom accessed text storage job from the systems with very fast and costly disk systems. The sort of systems which fit this job are RAID or network attached storage systems. As with everything, three or more are ideal, so we can stand at least one failure. Overall database size grew from about 90GB to about 140GB between the start of June and mid September, mostly due to these data. We're working on producing a growth curve for it.
Query servers. These do the bulk of the query processing, and that proportion will rise as we adjust more code to use them. Because MySQL requires a different memory allocation for search than for most tables (MyISAM for search, InnoDB for the rest), it favors two sets of systems: one optimised for search, the other for all other queries. The specifications for each role are essentially the same: modest CPU capability (dual Opteron) and moderately fast disk systems (IDE RAID 10) with 4-16GB of RAM. More RAM is better than less, but more systems are better than spending the same money on the most expensive RAM type, where the last RAM doubling can cost as much as another system. We've been pleased by the performance of a dual Opteron system with a 6 disk 7200RPM IDE RAID 10 setup and 4GB of RAM. The number of these will increase as necessary to handle the query load, eventually reaching one per 5-10 web servers. We currently use one for every 7, with some speed bottlenecks; planned changes in schema and queries are expected to reduce them. At least three would be good, but we'll probably need many more as use of the site increases. Standby master systems serve as part of this pool.
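
The query server ratio above can be turned into a rough sizing rule. The sketch below is illustrative only: the function name and the example count of 28 web servers are hypothetical, while the ratios (one per 5-10 web servers, currently one per 7) and the floor of three come from the text.

```python
import math

MIN_SERVERS = 3  # "at least three" for redundancy, from the text


def query_servers_needed(web_servers, web_per_query=7, minimum=MIN_SERVERS):
    """Query servers for a given web server count at a given ratio.

    `web_per_query` is how many web servers one query server supports;
    the result never drops below `minimum`.
    """
    if web_per_query <= 0:
        raise ValueError("web_per_query must be positive")
    return max(minimum, math.ceil(web_servers / web_per_query))


# Hypothetical farm of 28 web servers at the ratios mentioned in the text.
for ratio in (5, 7, 10):
    print(ratio, query_servers_needed(28, ratio))
```

For small farms the redundancy floor dominates; the ratio only starts to matter once the web server count exceeds three times the ratio.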

Web servers/page builders

These read from the databases and produce the final web pages. Parsing the wiki text is CPU-intensive in PHP, and per-page rendering times are in the range of 300ms and up. Caching is used to reduce the need to do this repetitively, so longer term the main load from this is to support those editing articles, who need repeated parsing to preview their work. Ideal for this role would be blade servers with RAM in the 512MB to 2GB range and small disks of any speed. It's a pure CPU power and rack space efficiency game. For redundancy, if blade servers are used, we'd want to start with at least three, possibly partially filled, so we aren't hurt unduly if one fails. Alternatively, one rack unit single or multiple CPU boxes are well suited to this role. We're currently using single CPU P4s with RAM in the 512MB-2GB range. Disks are likely to be a long-term reliability liability. We use Memcached for caching at present and it shares the RAM with the Apache web servers. We're in the early stages of testing a BDB-backed caching alternative, which may allow the use of disk rather than RAM, though perfection would be enough RAM to cache several versions of every article in this web server farm. It is unlikely to be economic to do that as size grows, but the more inexpensive RAM the caching system has, the better. We currently use 6GB of this cache and have 10GB of article text. Skins vary and aren't yet fully done via CSS, so space for several copies would be useful, though at steadily reducing efficiency as the size grows.

Media storage

We have recently purchased a 1TB RAID 5 box to handle our storage needs for images and eventually more multimedia. This is estimated to be sufficient for 6-12 months. We're likely to want several more. This has modest write loads. RAID 5 IDE arrays or network attached storage systems are ideal. As always, three or more are desired for redundancy.

Squid cache servers

We use these to serve about 80% of all hits. The software is single-threaded, so single CPU systems are all that is necessary. CPU load and response time rise with the number of simultaneous connections to be serviced, so a fair number of moderately capable systems, each with relatively few connections to service, should do better than many big systems. RAM for caching is shared between these. We currently use four single CPU P4s with 4GB of RAM and slow IDE drives. Squid is reported to do well with a large number of very small drives, since it spreads its disk caching load across them all. We have no experience with this. For load surges from media attention, experience and benchmarks have shown that the Squids can easily serve the pages quickly from cache, provided the total number of simultaneous connections doesn't become excessive. For our normal load we tested switching more and more load to one Squid and found that performance would start to suffer significantly as the total number of cache misses it saw rose. So, our typical load handling capability is fairly well measured by CPU use, with some margin above 100% use where page serving times start to degrade at an increasing rate. We expect to locate Squid sets of three or more machines in various locations around the world, using donated hosting and bandwidth. We use Squid 2 and are planning to evaluate Squid 3, which has some architectural improvements (better polling) which may significantly raise the simultaneous connection capacity of each machine.
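
To see how the Squid hit ratio translates into web server load, here is a rough, pessimistic sketch. It assumes every request the Squids do not serve costs a full 300ms parse on one web server CPU, ignoring Memcached and the fact that many hits are images rather than pages. The function name and the 500 hits per second figure are hypothetical; only the 80% and 300ms figures come from this page.

```python
import math


def render_cpus_needed(hits_per_sec, squid_miss_ratio=0.20, render_seconds=0.3):
    """Upper-bound estimate of fully busy web server CPUs.

    Requests missing the Squids arrive at hits_per_sec * squid_miss_ratio,
    and each is assumed to occupy one CPU for render_seconds.
    """
    busy_cpus = hits_per_sec * squid_miss_ratio * render_seconds
    # Round first so floating-point noise cannot push the ceiling up by one.
    return math.ceil(round(busy_cpus, 6))


# Hypothetical load of 500 hits per second across the whole site.
print(render_cpus_needed(500))
```

The same function shows why the cache matters: with no Squids at all (a miss ratio of 1.0), the same load would need five times as many CPUs.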

Load balancing

We currently do DNS-based load balancing to the Squids and ping-type response time measurement to determine which Apache should service the request. This is currently sensitive to both network topology and speed differences between individual systems, though it steadies down and balances more evenly as load rises. Software changes which factor in current CPU load on the web servers, rather than solely their response time, are on the to-do list. Hardware-based load balancing would be helpful.

The current implementation of Squid to Apache load balancing uses the Squid ICP (Internet Cache Protocol), which requires an actual Squid server running on each Apache. A project is currently in progress, and showing much promise, to replace this vestigial Squid with a customized client which will reply to the actual Squid servers with a more finely tuned indication of Apache server load.
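
One way to picture the planned change from pure response-time selection to load-aware selection is a combined score per web server. The sketch below is not ICP and not the customized client described above: the `Backend` type, the scoring formula and the `cpu_weight` parameter are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Backend:
    name: str
    response_ms: float  # ping-type response time as seen by the Squid
    cpu_load: float     # 0.0 = idle, 1.0 = fully busy (hypothetical metric)


def score(backend, cpu_weight=1.0):
    """Lower is better. With cpu_weight=0 this is response time alone."""
    return backend.response_ms * (1.0 + cpu_weight * backend.cpu_load)


def pick_backend(backends, cpu_weight=1.0):
    """Return the backend with the lowest combined score."""
    return min(backends, key=lambda b: score(b, cpu_weight))


backends = [
    Backend("apache-a", response_ms=2.0, cpu_load=0.9),  # close but busy
    Backend("apache-b", response_ms=3.0, cpu_load=0.1),  # farther but idle
]

print(pick_backend(backends, cpu_weight=0.0).name)  # apache-a
print(pick_backend(backends, cpu_weight=1.0).name)  # apache-b
```

With response time alone, the nearby but busy server keeps winning, which matches the sensitivity to topology described above. Weighting in CPU load shifts requests to the idle machine.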

Backup

We do not yet have an adequate backup system. We copy the databases to local disks in the same racks as the main systems. As offers and/or purchases allow we'll be using some form of off site database backup, likely via continuous replication to one or more sites. Local backup may best be handled by network attached storage or large RAID systems with cheap drives or a near line storage setup of some sort. Today we use duplicate copies on inexpensive IDE drives in the web servers and multiple database slaves.

Development platforms, human factors

We do not yet have an adequate development test system and interpersonal communications could be improved further:

- At the moment we can't test the full architecture - Squids to web servers to database servers - in a system fully isolated from production. Configuration changes are often done for the first time on the live system.
- The developers have a limited ability to test their work on realistic systems local to them.
- All developers and support staff would benefit from mobile systems, mobile connectivity and high speed internet connections to increase their availability and make it easier for them to fulfill their roles.
- Investment in developer and support training materials is currently minimal.
- The wiki, email and IRC provide excellent communications. Still, internet-based video conferencing items and IP telephony would be of use to further improve communication, both for the development and support team and for the Wikimedia Foundation board members.