According to the official Gmail blog, Ben Treynor, VP Engineering and Site Reliability Czar, says Gmail was down for roughly 100 minutes yesterday because a portion of its servers had been taken offline for upgrades. The remaining servers were able to cope with the redirected traffic, but the routers responsible for directing web queries to those servers couldn’t handle the additional load.
From the blog post:
Here’s what happened: This morning (Pacific Time) we took a small fraction of Gmail’s servers offline to perform routine upgrades. This isn’t in itself a problem — we do this all the time, and Gmail’s web interface runs in many locations and just sends traffic to other locations when one is offline.
However, as we now know, we had slightly underestimated the load which some recent changes (ironically, some designed to improve service availability) placed on the request routers — servers which direct web queries to the appropriate Gmail server for response. At about 12:30 pm Pacific a few of the request routers became overloaded and in effect told the rest of the system “stop sending us traffic, we’re too slow!”. This transferred the load onto the remaining request routers, causing a few more of them to also become overloaded, and within minutes nearly all of the request routers were overloaded. As a result, people couldn’t access Gmail via the web interface because their requests couldn’t be routed to a Gmail server. IMAP/POP access and mail processing continued to work normally because these requests don’t use the same routers.
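To make that cascade a little more concrete, here is a toy simulation. It is purely illustrative: the capacities, the load figures and the name simulate_cascade are all made up by me, and it assumes traffic is split evenly across whichever routers are still up, which is almost certainly a simplification of what Google's real request routers do.

```python
def simulate_cascade(capacities, total_load):
    """Return how many routers are still serving after each round.

    Hypothetical model: load is shared equally by the active routers,
    and any router whose share exceeds its capacity drops out.
    """
    active = list(capacities)
    history = [len(active)]
    while active:
        share = total_load / len(active)
        survivors = [c for c in active if c >= share]
        if len(survivors) == len(active):
            break  # everyone copes, the system is stable
        active = survivors
        history.append(len(active))
    return history


# Ten made-up routers with capacities 100, 105, ..., 145 (arbitrary units).
capacities = list(range(100, 150, 5))

print(simulate_cascade(capacities, 1000))  # load the routers can carry
print(simulate_cascade(capacities, 1010))  # slightly underestimated load
```

In this made-up setup a 1% bump in load is enough to knock out the weakest router, and each dropout pushes more traffic onto the survivors until none are left.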
Even though the service is free (to most – I have an account which I pay for), I can’t see this as being an acceptable answer. If you are the “VP Engineering and Site Reliability Czar” you, or the people under you, should have been able to predict what would happen in this exact scenario. This is what labs and load testing are for.
I think what bugs me most about this whole ordeal is that Treynor states that the service was unavailable for “about 100 minutes” but then concludes his post with “Gmail remains more than 99.9% available to all users, and we’re committed to keeping events like today’s notable for their rarity.”
Let’s do some basic math here, folks. Google tells me that there are 525,948.766 minutes in a year.
If 99.99% is the availability figure then 0.01% must be the acceptable outage window, right?
0.01% of 525,948.766 minutes = 52.5948 minutes per year.
100 minutes of stated outage - 52.5948 minutes allowed per year = 47.4052 minutes of availability that we are all owed back from Google.
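For anyone who wants to check my arithmetic, here it is as a few lines of Python. The function name allowed_downtime is just something I made up for this sketch, and the 99.99% figure and the per-year window are my assumptions, not anything Google has stated.

```python
MINUTES_PER_YEAR = 525_948.766  # the year length used in the post


def allowed_downtime(availability_pct, period_minutes):
    """Minutes of outage permitted by an availability percentage."""
    return period_minutes * (100 - availability_pct) / 100


outage_minutes = 100  # "about 100 minutes", per the Gmail blog
budget = allowed_downtime(99.99, MINUTES_PER_YEAR)
overage = outage_minutes - budget

print(f"allowed per year: {budget:.4f} minutes")
print(f"over budget by:   {overage:.4f} minutes")
```

The last digit may differ slightly from my hand figures above because of rounding.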
Based on my complicated calculations it appears as though Google has exceeded the downtime allowed by its 99.99% uptime mandate. I know the people at Google are smart but I don’t think they’re smart enough to rewrite the laws of basic mathematics. But maybe they are smarter than me, as I was unable to find anything that stated that the 99.99% figure was a per-year calculation. If the 99.99% uptime is spread over 100 years then yes, Google, you are still justified in your uptime calculations. If it is a yearly figure, however, I want to know how I go about recouping the 47.4052 minutes of availability that I am owed… I’m sure you’ll let me know.
UPDATE – Well, I guess Google doesn’t owe me anything because they are only stating 99.9% uptime. My mistake. I have also found out that, based on the Google Apps SLA, they are stating 99.9% availability per month – http://www.google.com/apps/intl/en/terms/sla.html … very tricky 🙂
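For completeness, here is the same back-of-the-envelope sum redone with 99.9%, over both a year and a month. This is only a sketch: allowed_downtime is my own hypothetical helper, and I am assuming an "average" month of one twelfth of a year, which is not how the SLA itself defines its measurement window.

```python
MINUTES_PER_YEAR = 525_948.766
MINUTES_PER_MONTH = MINUTES_PER_YEAR / 12  # assumption: an average month


def allowed_downtime(availability_pct, period_minutes):
    """Minutes of outage permitted by an availability percentage."""
    return period_minutes * (100 - availability_pct) / 100


yearly_budget = allowed_downtime(99.9, MINUTES_PER_YEAR)
monthly_budget = allowed_downtime(99.9, MINUTES_PER_MONTH)

print(f"99.9% over a year:  {yearly_budget:.2f} minutes allowed")
print(f"99.9% over a month: {monthly_budget:.2f} minutes allowed")
```

Which of those two windows actually applies to a given account depends on the SLA text, which this sketch makes no attempt to model.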