Q3 and Q4 2004 hardware order worksheet

Permanent growth is one of the characteristics of all Wikimedia projects. The English Wikipedia doubles its number of articles in about a year; some of the other languages grow faster and double in about half a year. The number of words grows by a factor of 10 per year. Server traffic has been increasing at an average rate of 90% per quarter. To cope with this growth, the server farm has to grow with the needs of the community.
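
As a rough illustration of what 90% growth per quarter implies for capacity planning, here is a minimal projection sketch; the baseline request rate is a placeholder, not a measured figure:

```python
# Back-of-the-envelope traffic projection assuming ~90% growth per quarter.
# The 1,000 req/s baseline is a placeholder, not a measured figure.
baseline_req_per_s = 1000.0
quarterly_growth = 0.90

for quarter in range(1, 5):
    projected = baseline_req_per_s * (1 + quarterly_growth) ** quarter
    print(f"After Q{quarter}: ~{projected:,.0f} req/s "
          f"({projected / baseline_req_per_s:.1f}x today)")
```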

The Wikimedia budget has $31,000 (high figure) provisionally budgeted for the purchase of new servers in the third quarter (1 July to 30 September) of 2004 and $48,000 for the fourth quarter (1 October to 31 December) of 2004. If page requests do not grow at the expected 85% this quarter, then we will have to adjust the budget accordingly.

Below is a worksheet that is being used by Wikimedia developers to figure out which servers will be purchased in these two quarters.

See Milestones

Current situation

Currently, we have

3 Squid servers
Caches that deliver pages to non-contributors (people who only read, without editing). These seem to have some capacity in reserve, especially since the recent addition of the third box. In early July one failed for about ten hours. While load was switched to the other two without trouble, jeronim reported that site performance was noticeably slower. So we need at least one more to handle a single failure without performance degradation. It looks as though one more every two to three months may be the sort of plan we'll need.
7 Apache servers
Web servers that render the web pages from the wikitext. These servers are currently running at more than 100% load during peak hours.
2 MySQL servers
Database servers storing all the wiki pages and computing all the complicated things like watchlists. The second one has just been set up, so utilization of the pair is probably below 50% at the moment. However, the software to make use of both is not ready or not yet deployed, and database error messages are very common. Searching and similar database-intensive features are already disabled; the situation would be much worse with them enabled. This is a bottleneck that software improvements could relieve.
1 file server
Central file server for images and other media and for some central configuration files. It is currently using NFS, but NFS has serious problems with scaling and lacks any built-in failover. Most current slowdowns are caused by this box being busy. It is also currently used to compute access statistics and to hold the big database dumps.
1 broken file server
Currently at Silicon Mechanics for repair. It is supposed to share the load with the other file server. How this load sharing would be done is still under discussion and might only be possible by moving to a different network file system.
See Wikimedia servers

Challenges

The two databases use replication to keep data in sync. If the replication fails (which is not unlikely), a new database copy has to be made. Currently this requires shutting down both database servers, as on June 19. Having a second database replica that is not accessed at all (and therefore shouldn't fail) would make it possible to bring a broken copy back online without taking the entire site down. The current plan is to use the NFS server for this purpose. Zwinger, the currently active NFS server, has only 1 GB of memory, which is not enough to be a database slave and a file server at the same time.

The general doubling of traffic will not be covered by the roughly 20% reserve the Apaches can currently provide.

Using an APC remote power switch, the servers can be turned off and on, but console access to the systems is not possible during boot. Broken GRUB settings have resulted in longer downtimes of some machines.

Solution approach

For NFS, purchase a new central NFS server to be used as the primary node for media files, computing statistics and providing database dumps for download. Key characteristics are a big, fast disk subsystem, sufficient memory to cache the most often requested files, and two CPUs so that statistics can be computed in the background. Will, the currently broken second NFS server, will become the backup of the central NFS server; Will will not be able to compute statistics in the background. Zwinger will be used as a DB server only, acting as a second DB slave so that a broken slave can be recovered without shutting down the master.

For the Apache farm, get another 5 Apache servers, equipped with one CPU and moderate memory. A disk is not strictly needed, but a small disk is not a major cost factor anyway.

A 16-port serial terminal server can provide console access to the most important servers, with the exception of a few web servers.

Budget impact

Position | Specification | Count | Cost
NFS server | 2-CPU Opteron, 4 GB RAM, 5×200 GB SATA RAID, 3-year next-business-day on-site | 1 | US$4,300
Apache | Pentium 4 2.8 GHz, 1 GB RAM, 40 GB disk, 3-year return to depot (SM1150A) | 5 | US$6,000
Terminal server | 32-port serial, 1-port Ethernet, SSH, cabling | 1 | US$2,200
Total | | | US$12,500
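
As a quick sanity check, the line items above can be summed and compared against the quoted total; the figures are taken directly from the table:

```python
# Sanity-check the budget table: each entry is (position, count, total cost in US$).
line_items = [
    ("NFS server", 1, 4300),
    ("Apache", 5, 6000),
    ("Terminal server", 1, 2200),
]

total = sum(cost for _, _, cost in line_items)
for name, count, cost in line_items:
    print(f"{name:15s} x{count}  US${cost:,}  (US${cost / count:,.0f} each)")
print(f"{'Total':15s}     US${total:,}")   # expected: US$12,500
```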

An alternative

There were some concerns about concentrating so much storage in one machine, and other issues, such as a desire to put performance-sensitive scripts on machines not being used for heavy processing. One alternative proposal was:

  • Will to be the main image server, but not run PHP. This prevents an attack that hammers image/media files from hurting the PHP servers and whole-site performance.
  • New 1 GB RAM dual Opteron to be an Apache, database dump server (offloading that from Zwinger, where big downloads on fast connections have seemed to hurt whole-site performance) and big batch-job server (stats mainly), relying on Apache/Squid load balancing to shift page building away from it when it is doing batch tasks. A 200 GB or so hard drive for this task, maybe a second for mirroring.
    Maybe thinking of 400 GB of space (in RAID 5, 4 drives and one spare). The dump size will increase and it would be nice to keep some old backups.
    Some concern about concentrating this in one machine. Can't that be done on the web servers with big drives? How many days of archive would a 200 GB drive on a web server hold? (See the back-of-the-envelope sketch after this list.)
  • Zwinger to be the main PHP server, holding anything which affects site performance if it is interfered with. We've seen slowdowns when database dumps or big file copies used Zwinger, and removing this work from Zwinger will help site performance and perceived reliability. Zwinger would also hold the replicated database backup, but this would only run part time. Zwinger would not hold anything involving large file copies.
  • 200GB or so drives in two existing web servers to hold copies of will and zwinger data in case they fail.
  • 200 GB drives in three existing web servers to start tests of the Coda redundant, distributed file system, with a view to eventually moving to it for large file storage. Another existing web server as the controller machine for the Coda system.
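
As a back-of-the-envelope answer to the archive question above, here is a minimal sketch; the dump size and dump interval are assumptions chosen for illustration, not measured values:

```python
# Rough estimate of how many dumps (and days of archive) fit on a drive.
# dump_size_gb and dump_interval_days are assumptions for illustration only.
def raid5_usable_gb(drives_in_array, drive_gb):
    # RAID 5 usable capacity is (n - 1) drives' worth; a hot spare adds nothing.
    return (drives_in_array - 1) * drive_gb

dump_size_gb = 10            # assumed size of one compressed full dump
dump_interval_days = 7       # assumed: one full dump per week
drive_gb = 200

dumps = drive_gb // dump_size_gb
print(f"A {drive_gb} GB drive holds ~{dumps} dumps, "
      f"i.e. roughly {dumps * dump_interval_days} days of archive.")
print(f"RAID 5 over 3 x {drive_gb} GB drives: {raid5_usable_gb(3, drive_gb)} GB usable")
```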

Budget impact

Position | Specification | Count | Cost
NFS server | 2-CPU Opteron, 1 GB RAM, 4×200 GB SATA RAID, 3-year next-business-day on-site (SM-2280SATA) | 1 | US$3,000
Disks | 200 GB ATA | 3 | US$600
Apache | 2-CPU Opteron, 1 GB RAM, 200 GB SATA, 3-year return to depot (SM-2280SATA) | 2¹ | US$4,600
Terminal server | 32-port serial, 1-port Ethernet, SSH, cabling | 1 | US$2,200
Total | | | US$10,400

¹ First buy one for benchmarking; buy the second one two months later.

See also: Apache hardware -- budget Athlon 2.8 with 1 GB RAM from €370, 19-inch rackmount from €450

Vendor Info


Better network FS for performance, DB scaling and replication

Disk I/O on Zwinger is never really maxed out by NFS, disk space on the Apaches is mostly unused, and the NFS server is a single point of failure (SPOF) whose failure mode is that the entire cluster freezes (NFS supports neither load balancing nor failover). It also doesn't scale, because the number of NFS up-to-date-check packets grows with the number of Apaches. There is a limit on the number of such check packets the NFS server can handle, and that limit seems to be largely independent of disk I/O (all those requests shouldn't be there in the first place anyway).

I think there's no way around a network filesystem that provides

  • caching on clients
  • stateful connections with cache-invalidation callbacks to make the thing scale (NFS does one validation request per open()! See the sketch after these lists.)
  • replication of volumes across several servers
  • built-in load balancing between servers for each volume (active-active)
  • reintegration mechanism to get servers back in sync after being down or disconnected, conflict resolution

Really good to have:

  • distributed locking on byte ranges of a file
  • shared access to block device
  • support for storing files distributed across cluster members, no central server
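
To make the scaling argument a little more concrete, here is a purely illustrative sketch comparing per-open validation traffic (the NFS model) with callback-based invalidation (the AFS/Coda model); the client counts and request rates below are assumptions, not measurements from the Wikimedia cluster:

```python
# Purely illustrative comparison of file-server validation traffic.
# All figures are assumptions, not measurements from the Wikimedia cluster.
apaches = 20                    # assumed number of web-server clients
opens_per_sec_per_client = 50   # assumed file opens per second on each Apache
writes_per_sec = 2              # assumed cluster-wide writes that change files

# NFS-style: every open() triggers an up-to-date check against the server,
# so validation traffic grows with the number of clients times their open rate.
per_open_checks = apaches * opens_per_sec_per_client

# Callback-style (AFS/Coda): the server notifies caching clients only when a
# file actually changes, so traffic scales with the write rate instead.
callback_messages = writes_per_sec * apaches   # worst case: all clients cache the file

print(f"Per-open validation: ~{per_open_checks} requests/s at the file server")
print(f"Invalidation callbacks: ~{callback_messages} messages/s in the worst case")
```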

AFS and en:Coda (file system) were my first candidates: I installed Coda at home and on Zwinger, did some benchmarks at home (and use it for my music collection instead of NFS). Unfortunately, some stress testing with Bonnie++ revealed problems with the Coda client software at high write loads.

So I've moved on to OpenAFS, Arla, InterMezzo, Lustre, NFSv4, GFS and OpenGFS. Some preliminary notes are at http://wikidev.net/Network_file_systems. My current favourites are OpenAFS/Arla and GFS/OpenGFS.

GFS allows many MySQL instances to run read-write on the same DB files. This would solve the never-ending db-slave out-of-sync and too-small-slave problems, plus it would allow all DBs to be used read-write with transparent failover. Oracle uses GFS for major parallel DB clusters, so it's very likely to work with MySQL as well. However, this would require purchasing Fibre Channel SAN hardware to perform really well.

Gwicke 11:40, 21 Jun 2004 (UTC)

  • MySQL on GFS can't do active-active (multiple r/w on the same files) because it doesn't use a distributed lock manager (one of the key elements of Oracle Parallel Server on GFS), so there's only a reliability benefit from an active-dormant configuration.
  • I would suggest you think about a master MySQL server for updates and then doing read-only queries on replicas. You burn some disk space, but you gain cache efficiency.
  • Consider mod_backhand so that apache instances can hand off to each other.
  • I don't know what interconnects you have, but NFS over gigabit ethernet is noticeably faster than fast ethernet for some applications. I work with a 1,000,000+ user education portal; our servers share data via NFS, and our system has about 200% more capacity using gigE.

- Joshua (just passing)

(Anonymous: Just a note, you can get Fibre Channel JBODs on eBay for cheap. We buy Dell PowerVault 610s (same as Clariion DAE-R or EMC DAE-R, lots of weird rebranding going on) with 10x18GB HDDs for around $300-400 each, use QLogic 2200 HBAs and then software RAID the disks. Been working great so far! I think they are 4U though so if rack space is a premium you might need something a little newer.)

Some questions/notes

  • Many high traffic sites use a separate set of webservers for images (photo.net is a good example of this). Why is using NFS better than having a separate webserver(s) for your media?
  • It may be that buying a SAN will be a better investment than continuing to buy mid-size RAID arrays on all of your servers. You could do your replication from that, which would be nice.
  • I understand the trade-off of rack space versus hardware cost as far as Opterons versus Athlons go, but it typically takes a few years for the rack-space savings of the more expensive, smaller boxes to pay for themselves. Perhaps it would be better to do your cost calculations on a one- to two-year timeline? Especially if you don't pay for rack space now.
  • You mentioned maintenance of all these machines: trust me, it will get worse. Here's an architecture for your servers that would let you VERY quickly add new ones onto the rack, and provision them to the greatest points of pain for your current needs:
    • You'll need to buy systems with netbooting ethernet cards. No hard drives necessary
    • You'll need a DHCP server and something like the Linux Terminal Server Project serving O/S images
    • You'll need a quality failover proxy server setup (SQUID does this I believe)
    • You'll need a SAN
    • You would then create configurations, one for each "type" of server that you have, so, SQUID servers, PHP Servers, Replicating MySQL servers, Image Servers, etc.
    • Then you'd tie an ethernet card's MAC address to the configuration that you want it to have (see the sketch after this list).
    • On boot of the machine, the following would happen: It would go to the network, get an IP address, download just enough of an operating system to mount its root directory on/over NFS, or over the SAN, and then load.
    • Shortly thereafter, it would be ready to go, and would tell your proxy servers that it is ready to go into the mix, or somehow join the MySQL replication (I know nothing about MySQL, so is this possible?)
    • If the server dies, the load balancers take note, and don't send to that server anymore.
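
A minimal sketch of the MAC-to-role idea above: in a real setup this mapping would live in the DHCP/netboot configuration rather than in Python, and all MAC addresses, role names and image paths below are invented for illustration:

```python
# Hypothetical MAC-to-role table for netboot provisioning; the MAC addresses,
# role names and image paths are invented for illustration only.
ROLE_BY_MAC = {
    "00:30:48:aa:bb:01": "squid",
    "00:30:48:aa:bb:02": "apache-php",
    "00:30:48:aa:bb:03": "mysql-replica",
    "00:30:48:aa:bb:04": "image-server",
}

IMAGE_BY_ROLE = {
    "squid": "images/squid-root.img",
    "apache-php": "images/apache-root.img",
    "mysql-replica": "images/mysql-root.img",
    "image-server": "images/media-root.img",
}

def image_for(mac: str) -> str:
    """Pick the root image a machine should netboot, based on its MAC address."""
    role = ROLE_BY_MAC.get(mac.lower(), "unprovisioned")
    return IMAGE_BY_ROLE.get(role, "images/rescue-root.img")

print(image_for("00:30:48:AA:BB:02"))   # -> images/apache-root.img
```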

This would give you a setup that would let you plug in new commodity hardware very quickly for your server needs, remove the need to buy hard drives for all of them, and mean you don't have to worry very much if one of them fails.

We do something somewhat similar to this at my firm using VMWare's ESX product, and we're so much happier than our previous world that I can't possibly communicate it to you. Our total system administration time has gone _way_ down.

- Peter

A search server

A dedicated search server seems to be needed now. Updating and reading can become slow on Ariel too (updating the searchindex is very slow). Having a dedicated search server will free the main DB server and the slave, because the searchindex will be generated only on the search server.

Proposal:

  • Dual Opteron 242 with FC2 OS
  • 4 GB RAM
  • 6 × 200 GB SATA drives in RAID 10 (so 600 GB of usable disk space; see the quick check below)
  • SM-2280SATA: US$4,400
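
A quick check of the RAID 10 figure above (RAID 10 mirrors drives in pairs, so usable space is half the raw capacity):

```python
# RAID 10 usable capacity: drives are mirrored in pairs, so usable = raw / 2.
drives = 6
drive_gb = 200
usable_gb = (drives * drive_gb) // 2
print(f"{drives} x {drive_gb} GB in RAID 10 -> {usable_gb} GB usable")  # 600 GB
```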

Offsite backup, distributed caching

(Please move to a separate paragraph or delete... - Eric)

Proposals

  • There should be some kind of offsite backup of database snapshots. Even RAID can sometimes fail.
  • Web caches could be distributed a bit: at least one Squid at a separate location/uplink/IP range, to reduce bottlenecks and DDoS vulnerability.


Quick proposal

Here's a quick proposal; we need servers now, and we can buy more later:

Position | Specification | Count | Cost
Search DB server | Dual Opteron 242, 4 GB RAM, 6×200 GB SATA drives in RAID 10, SM-2280SATA | 1 | US$4,400
NFS server and its backup | 2-CPU Opteron 242, 1 GB RAM, 6×200 GB SATA RAID (SM-2280SATA) | 2 | US$7,600
Apache | 2-CPU Opteron 246, 1 GB RAM, 200 GB SATA (SM-1280SATA) | 3 | US$8,700
Terminal server | 32-port serial, 1-port Ethernet, SSH, cabling | 1 | US$2,200
Disk for Suda | 146 GB SCSI, 10,000 rpm, same model | 1 | ~US$700
Total | | | US$23,600

Plan

  • Add the SCSI disk to Suda, reinstall Suda with FC2 with a RAID 10 of 4×146 GB 10,000 rpm SCSI disks (for the DB slave). Suda won't be an Apache.
  • The NFS server will replace Zwinger; its backup will be ready if any problem arises.
  • Zwinger will become an Apache.
  • Will could then be sent for repair.
  • Search queries and updates will go only to the search DB server.
  • Rabanus and Yongle (4 GB RAM each) will be switched over to Squid duty.

Rack units used: 10 (supposing the terminal server takes 1U).

Hmm, the three Opterons would cost about as much as 20 single-CPU Athlon XPs in mini-tower cases, or 15 of them in 1U 19-inch cases. I doubt that six Opteron CPUs can deliver the same speed, so this looks like a very bad deal to me.
NFS will very likely be replaced sooner or later, so buying too much dedicated hardware for it now might not be a good idea. The 20 Apaches would come with 80 GB HDs each, so a total of about 1.4 TB of free space that could be used in a network RAID 1 (a feature currently being worked on for GFS) or for AFS volumes. Apart from that, there are iSCSI SAN solutions around that might be better and probably cheaper than an NFS server. -- Gwicke 11:29, 27 Jul 2004 (UTC)
Sooner or later, as you said. I am talking about now! We need a reliable central NFS server now! Shaihulud 21:51, 28 Jul 2004 (UTC)
And your 15 Apaches will take 15U; dual Opterons take only 3U. One day we will pay for rack space, so think about the future. And you talk about cheap hardware, while I am talking about well-known hardware with Supermicro motherboards. We don't actually have a technician to work on the servers every day.... If you take hardware from Silicon Mechanics, one P4 2.8 GHz 1U costs $1,200 each....
Another alternative for the Apache servers would be four SM-1280SATA 1U dual-Opteron servers, each with 2× Opteron 242, 1 GB RAM, a Seagate 160 GB 7200.7 SATA drive and dual Gigabit Ethernet, for $2,172.00 each, $8,688.00 total --217.238.42.253 20:39, 28 Jul 2004 (UTC)