As a follow-up to my post on the Google Gmail outage, I thought I’d share the email notification that Google sent to all of its “premium account holders”.
I know it’s a form letter, but it does convey their apologies, link to the incident report, and detail the credit that they will be providing for the outage. I really do appreciate the credit because, based on the terms of their SLA, Google was well within its rights to simply claim that the outage fell within the downtime the SLA allows. Good for you, Google 🙂
Dear Google Apps customer,
We would like to follow up on the recent Gmail outage and resulting credit through our service level agreement (SLA) with you.
Between 12:45 PM to 2:15 PM PDT | 19:45 – 21:15 GMT on Tuesday, September 1, 2009, Google Apps Gmail users were unable to access their accounts through the Gmail web interface. Users could continue to access their accounts via IMAP and POP. No data was lost during this time; messages were received and delivered, but could not be displayed.
As a result of this incident, we are extending a 3-day SLA credit to your account. This credit will be reflected in an automatic 3-day extension to your Google Apps term date, and no action is needed on the part of your administrators.
We understand that this service outage has affected our valued customers and their users, and we sincerely apologize for the disruption and any impact.
Following are the key points from the incident report:
On Tuesday, September 1, a small portion of Gmail’s web capacity was taken offline during a routine upgrade and service update. This is normal operating procedure as the Gmail web interface runs in multiple locations, and Gmail’s request routing automatically directs users’ requests to available servers. However, we underestimated the increased load that some of the new updates placed on request routing.
As a result, at approximately 12:30 PDT, a few request routers became overloaded and responded by refusing all incoming requests. This response transferred the load to the other request routers, and as the effect rippled through the system, almost all of the request routers became overloaded. As a result, users could not access Gmail through the web interface since their requests could not be routed to a Gmail server. Gmail processing and access through the IMAP/POP interfaces continued as usual because these processes use different request systems.
Upon receiving the error alerts, the Gmail Engineering team immediately began analyzing the issue and initiated a series of actions to help alleviate the symptoms. After determining the root cause to be insufficient available capacity, the Engineering team deployed a large-scale addition of request routers through Google’s flexible capacity server systems. As they distributed incoming traffic across the expanded pool of request routers, access to the Gmail web interface returned to normal.
During the incident, we published ongoing reports to the Google Apps dashboard, Gmail Help Center, the Enterprise and Gmail blogs, and the GoogleAtWork and Google Twitter feeds, to help provide customers with the latest status and available workarounds.
The complete incident report (http://www.google.com/appsstatus/ir/buuqdnt6fcervea.pdf) in the Google Apps Status Dashboard describes the corrective and preventative measures to address the underlying causes of the issue and to help prevent recurrence. For ongoing service performance information, please see the Google Apps Status Dashboard at http://www.google.com/appsstatus.
Once again, we apologize for the impact that this incident has caused. Thank you very much for your continued support.
Sincerely,
The Google Apps Team
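For anyone curious about the mechanics, the cascade the letter describes (a few overloaded request routers refusing traffic, which pushes their load onto the rest until almost all of them fall over) can be sketched as a toy simulation. This is only an illustration under my own assumptions, not Google's actual routing logic: the function name `simulate_cascade`, the even split of load across routers, and every number below are hypothetical.

```python
def simulate_cascade(total_load, capacities):
    """Return the capacities of the routers still serving once things settle.

    Toy model (an assumption, not Google's design): load is split evenly
    across the healthy routers, and any router whose share exceeds its
    capacity refuses all requests and drops out of the pool.
    """
    healthy = list(capacities)
    while healthy:
        share = total_load / len(healthy)
        survivors = [cap for cap in healthy if cap >= share]
        if len(survivors) == len(healthy):
            break  # every remaining router can carry its share
        healthy = survivors  # failed routers hand their load to the rest
    return healthy


# Hypothetical pool: eight routers that can take 100 units, two that take 80.
pool = [100] * 8 + [80] * 2

# Before the update: 750 units of load, 75 per router, everyone copes.
print(len(simulate_cascade(750, pool)))               # 10

# After the update: 850 units. The two weaker routers drop out at 85 each,
# the other eight then face 106.25 each, and the whole pool collapses.
print(len(simulate_cascade(850, pool)))               # 0

# The fix described in the letter: a much larger pool for the same load.
print(len(simulate_cascade(850, pool + [100] * 10)))  # 20
```

The point of the sketch is that a fairly small increase in load is enough to tip the whole pool over once the first routers start refusing requests, which matches the "rippled through the system" description in the letter.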