February 2005 server crash

Please note that this page is not an official statement of the Wikimedia Foundation.

"<brion> it seems small until you have to copy the whole thing to another computer over a network while thousands of people curse your downtime" -- seen on #wikipedia@freenode

Summary

All Wikimedia's Florida servers went offline at 22:14 UTC on Monday, February 21, 2005 due to tripped circuit breakers inside the colocation facility. Read-write service to all our wikis was restored at 22:26 UTC on Tuesday, February 22, 2005. Recovery work meant things ran a bit slower until all servers were restored.

Events

At about 22:14 UTC on Monday, February 21, 2005, network contact to all Wikimedia servers at our Florida colocation facility was severed. Initially we diagnosed the problem as the network switch crashing and cutting off access to our servers. (There have been some incidents with this switch, and related network changes are slated for the coming week.)

After about half an hour of network poking and phone calls we got hold of Jimbo, who got hold of the colo; they let us know that there was in fact a power problem: circuit breakers had tripped on both of the circuits feeding our machines. This knocked out everything, including the machines with dual redundant power supplies.

The cause of the circuit breakers tripping is not known to us at this time; please call the colo and ask if you're interested in more details. (We're kidding here: we really don't want dozens, or thousands, of people calling our colo to ask. :-))

With no services available, attempting to visit any of our sites was of course not very successful. There was no DNS server to resolve the host names, and no web server to contact even if you could resolve them. Also no mailing lists, etc. (Yes, we know off-site DNS is A Good Idea: we're working on it.)
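For illustration only, one quick way to see this kind of single-site DNS exposure from the outside is to resolve each listed nameserver and check whether they all sit on the same network. The domain and host names below are placeholders, not our actual setup:

    # List the authoritative nameservers for a domain, then resolve each one.
    # If every nameserver address is in the same facility, a single power or
    # network failure there takes out DNS along with everything else.
    dig NS example.org +short
    dig A ns0.example.org +short
    dig A ns1.example.org +short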

Once the switch was back and network access was restored, we started contacting and recovering servers as they completed their reboots and filesystem checks, and bringing things back online.

The first priority was the main file server, DNS resolution, and the mail server. Within a couple hours, the main file and DNS servers were back online, as well as many of the web servers and two of the web proxy servers.

The second priority was setting up a downtime message so visitors would have some idea of what was going on.

The third priority was recovering the databases and getting the sites back online. Here we had bigger problems.

Database recovery

Our databases run under MySQL, almost exclusively using the InnoDB transactional storage engine. In theory, this is resistant to nasty events such as crashes and power outages. In practice, there are still problems: if data writes are not actually on disk when the database thinks they have been committed, the data can still come out corrupt after an outage. (See LiveJournal's woes a few weeks ago, when a similar problem affected some of their database servers.)
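Whether a power cut can do this depends partly on how aggressively each layer flushes to disk. As a rough sketch, these are the standard MySQL/InnoDB settings involved (the right values depend on the hardware underneath):

    # Does InnoDB flush and sync its log at every transaction commit?
    # 1 = flush-and-sync per commit (safest); 0 or 2 trade durability for speed.
    mysql -e "SHOW VARIABLES LIKE 'innodb_flush_log_at_trx_commit'"
    # How data and log files are flushed to disk (fdatasync, O_DIRECT, ...).
    mysql -e "SHOW VARIABLES LIKE 'innodb_flush_method'"

Even with these set safely, a disk or controller cache that acknowledges writes it has not actually persisted can still lose committed data, which is exactly the write-caching question raised under lessons learned below.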

This, unfortunately, was the case for us as well. We had one master database, four actively replicating slave servers, and one slave used for reporting and backups. At the time of the power loss the reporting slave was applying updates, catching up on a backlog which had accumulated during last week's old-article-version compression work; it was 31 hours behind real time.
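For reference, how far behind a slave is can be read off its replication status. A rough sketch (field names vary a bit between MySQL versions, and Seconds_Behind_Master only exists in newer releases):

    # On a slave, show which master log file/position has been read and
    # executed; the gap between them (or Seconds_Behind_Master, where
    # available) is the replication lag.
    mysql -e "SHOW SLAVE STATUS\G" | egrep -i 'Master_Log_File|Master_Log_Pos|Seconds_Behind_Master'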

The master and all four active slaves failed to complete InnoDB recovery on MySQL startup, due to corruption of pages in the InnoDB data store. Full recovery from these corrupted databases might have been possible with additional tools, but would have been error-prone and labor-intensive, and might not have yielded good results.
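That "additional tools" route would have looked something like InnoDB's standard forced-recovery procedure. The sketch below is illustrative only and is not what we ended up doing:

    # In my.cnf, under [mysqld], set a forced-recovery level, e.g.:
    #   innodb_force_recovery = 4
    # (levels 1-6; higher levels skip more of the recovery steps that crash
    # on corrupt pages, at the cost of more potential data loss)
    # Restart mysqld, dump whatever is still readable...
    mysqldump --all-databases > salvaged.sql
    # ...then rebuild with a clean data directory and reload the dump.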

Fortunately the reporting/backup slave was completely intact. We copied its data set to several other machines, started reapplying the last 31 hours' worth of logged updates, and began bringing the wikis back up in read-only mode.
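Reapplying the backlog essentially means replaying the master's binary logs from the point the intact slave had reached, as recorded in its replication status. Roughly, with placeholder file and host names:

    # Replay the master's binary logs into the recovering server, starting
    # with the file the intact slave had last applied (and skipping ahead to
    # its recorded position within that file).
    mysqlbinlog master-bin.098 master-bin.099 | mysql -h db-new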

About 24 hours after the initial power failure, we had two fully caught-up servers, a suitable new master and a backup slave, and put the wikis back into full read-write mode.
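Promoting the replacement master is mostly a matter of pointing each remaining slave at it once that slave has the same data. Schematically (the host, credentials, log file, and position below are all placeholders):

    # On each slave: stop replication, point it at the new master at the
    # agreed log file/position, and start replicating again.
    mysql -e "STOP SLAVE;
              CHANGE MASTER TO
                MASTER_HOST='db-new',
                MASTER_USER='repl',
                MASTER_PASSWORD='placeholder',
                MASTER_LOG_FILE='db-new-bin.001',
                MASTER_LOG_POS=4;
              START SLAVE;"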

Copying the good data to additional servers is still ongoing, and a few slower functions have been temporarily disabled until more servers are caught up and online for Wednesday's daytime traffic rush.
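As the quote at the top of this page suggests, "copying the good data" mostly means shipping the whole database data directory over the network to each new server while it is not being written to. A rough sketch, with placeholder host and path names:

    # With mysqld stopped on the source (or writes otherwise frozen), copy
    # the entire data directory to the new server. This step is what makes
    # recovery time scale with database size.
    rsync -a /var/lib/mysql/ db-new:/var/lib/mysql/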

Lessons learned

  • A number of servers have configuration issues where they do not automatically start the necessary services on boot, which required additional manual work.
  • We currently don't have off-site backup DNS; our backup DNS server is on the same local network. Stupid, stupid, stupid! We know this is wrong, and need to set up an off-site backup.
  • After the LJ fiasco, we talked about the possibility of putting a UPS backup battery in our cage, but this hadn't been done yet. Hopefully we'll get that done sooner rather than later! (Though it's been suggested, on the obligatory Slashdot thread and elsewhere, that in-cage UPSes may violate the fire code, since the master Molly Switch won't shut them off.)
  • We could have been back online some hours sooner if that one good database had been current; time had to be spent letting it catch up.
  • We'd have been back online within 2-4 hours of the power failure if all the databases had remained intact. We need to look more closely at the configuration and the hardware to see why those databases were corrupted. Is write caching enabled when it should not be? (It is enabled, and 4 of the 5 servers have battery-backed cache, so the problem is probably elsewhere; a quick check is sketched below.) Do we need to fix our configuration or, like LJ, is our hardware lying to us? Could it be a known issue involving Linux 2.6 on Opterons and InnoDB, as conjectured in a Slashdot comment (see links below)?
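One quick check for the write-caching question in the last item, assuming ordinary ATA disks under Linux (hardware RAID controllers have their own vendor tools, and the device name below is a placeholder):

    # Report whether the drive's volatile write cache is currently enabled.
    hdparm -W /dev/hda
    # If it is enabled with no battery-backed cache in front of it, committed
    # transactions can be lost on power failure; it can be turned off with:
    #   hdparm -W 0 /dev/hda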

It could have been worse

All of the databases might have been corrupted. We'd then have had to spend additional hours recovering a corrupted database or, worse, going back to the February 9 backup dump and updating from there. When he returned from his trip, James found messages from MySQL people asking how we were doing and offering assistance, but we didn't need to take them up on it.

References