Grants:PEG/WM DE/Improve toolserver reliability/Report

Report accepted
This report for a Project and Event grant approved in FY 2009-10 has been reviewed and accepted by the Wikimedia Foundation.

Report on the outcome of the project "Improve toolserver reliability".

In summary, the project was a full success. The only setbacks were some delays in hardware delivery and setup.

In Summer 2009, we bought:

  • 2 SunFire X4250, 32GB RAM, 16x 146G disk
  • 2 SunFire X2100, 4GB RAM
  • 1 Sun Storage 2530 array, 5x 146G disk

The cost exceeded the originally estimated 40000 USD by a few percent, due to a weak dollar at the time. The difference was covered by Wikimedia Germany. Delivery took some time, since some parts turned out to be defective and had to be replaced. Delivery was completed only by the end of 2009.

While the original plan was to buy three X4250s as database servers, we decided (after consulting with Erik Möller) to buy only two and to free up a third X4250 that was already in service as an NFS head. To take over that role, we bought two X2100s to form a high-availability (HA) cluster that serves the home directories of the Toolserver users (NFS) and provides other critical services such as LDAP. This gave us a more reliable solution for NFS, as well as three X4250s to act as redundant database servers, as originally planned.

We started moving user directories and services to the HA cluster in November 2009, and finished the migration by the end of December. We began setting up the redundant database servers in January and had finished by February.

As a result of the new setup, the reliability and performance of the Toolserver improved significantly. There are fewer problems with overloaded databases, and the redundant copies spared us at least one outage of several days when one of the servers experienced a hardware failure. The redundant servers in the HA cluster allowed us to perform routine maintenance without causing downtime for the online tools (previously, rebooting a server meant up to 60 minutes of outage for the online tools).

We believe that the Toolserver reliability improvement project was very successful, improving even more aspects of the Toolserver cluster than originally planned. We thank the Wikimedia Foundation for their support! We also thank River Tarnell for the outstanding work on setting up and maintaining the Toolserver infrastructure.