
So You Want to Write a Security Book, Eh?
 – Andrew Hay
 – Friday, September 18 * 8:00pm – 9:00pm
Have you ever thought about writing a security book, but were not sure where to start? What kind of book should you write? How do you get a publisher? What can you expect to make off your book?
Join Andrew Hay, author of the OSSEC Host-based Intrusion Detection Guide, Nagios 3 Enterprise Network Monitoring, and the Nokia Firewall, VPN, and IPSO Configuration Guide, to learn the pros and cons of being a security author and to learn if you’ve got what it takes to write the next great security book.
Full details here: http://www.sans.org/ns2009/night.php
According to the official Gmail blog, Ben Treynor, VP Engineering and Site Reliability Czar, says Gmail was down for roughly 100 minutes yesterday because a portion of its servers had been taken offline for upgrades. The remaining servers were able to cope with the redirected traffic, but the routers responsible for directing web queries to those servers couldn’t handle the additional load.
From the blog post:
Here’s what happened: This morning (Pacific Time) we took a small fraction of Gmail’s servers offline to perform routine upgrades. This isn’t in itself a problem — we do this all the time, and Gmail’s web interface runs in many locations and just sends traffic to other locations when one is offline.
However, as we now know, we had slightly underestimated the load which some recent changes (ironically, some designed to improve service availability) placed on the request routers — servers which direct web queries to the appropriate Gmail server for response. At about 12:30 pm Pacific a few of the request routers became overloaded and in effect told the rest of the system “stop sending us traffic, we’re too slow!”. This transferred the load onto the remaining request routers, causing a few more of them to also become overloaded, and within minutes nearly all of the request routers were overloaded. As a result, people couldn’t access Gmail via the web interface because their requests couldn’t be routed to a Gmail server. IMAP/POP access and mail processing continued to work normally because these requests don’t use the same routers.
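The cascade described in the quote can be illustrated with a small Python sketch. This is a toy model and not Google's actual routing logic: the function name `simulate_cascade`, the capacity numbers, and the rule that load is spread evenly across the routers still standing are all assumptions made for illustration.

```python
def simulate_cascade(capacities, total_load):
    """Return how many routers are still healthy after each round.

    Assumed toy model: the total load is split evenly across the healthy
    routers, and any router whose share exceeds its capacity drops out,
    pushing its share onto the routers that remain.
    """
    healthy = list(capacities)
    history = [len(healthy)]
    while healthy:
        share = total_load / len(healthy)
        survivors = [c for c in healthy if c >= share]
        if len(survivors) == len(healthy):
            break  # nobody is overloaded, so the system is stable
        healthy = survivors
        history.append(len(healthy))
    return history


# Hypothetical fleet of ten routers with slightly different capacities.
ROUTER_CAPACITIES = [100, 105, 110, 115, 120, 125, 130, 135, 140, 145]

if __name__ == "__main__":
    print(simulate_cascade(ROUTER_CAPACITIES, 950))   # load within every router's capacity
    print(simulate_cascade(ROUTER_CAPACITIES, 1020))  # slightly underestimated load
```

In this model a load of 950 leaves all ten routers healthy, while a load of 1020 overloads only the weakest router at first, yet the fleet goes from 10 to 9 to 7 to 0 healthy routers within three rounds. A small underestimate is enough to take everything down.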
Even though the service is free (to most – I have an account which I pay for) I can’t see this as being an acceptable answer. If you are the “VP Engineering and Site Reliability Czar” you, or the people under you, should have been able to predict what would happen in this exact scenario. This is what labs and load testing are for.
I think what bugs me most about this whole ordeal is that Treynor states that the service was unavailable for “about 100 minutes” but then concludes his post with “Gmail remains more than 99.9% available to all users, and we’re committed to keeping events like today’s notable for their rarity.”
Let’s do some basic math here, folks. Google tells me that one year is 525,948.766 minutes.
If 99.99% is the availability figure then 0.01% must be the acceptable outage window, right?
0.01% of 525,948.766 minutes = 52.5948 minutes per year.
Stated 100 minutes of outage – Allowed 52.5948 minutes per year = 47.4052 minutes of availability that we are all owed back from Google.
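The same arithmetic as a short Python sketch, for anyone who wants to plug in other numbers. The function name `allowed_downtime_minutes` is made up for illustration. The minutes-per-year constant is the figure used in the calculation above, and treating the availability figure as a per-year number is an assumption.

```python
MINUTES_PER_YEAR = 525_948.766  # figure used in the calculation above


def allowed_downtime_minutes(availability_pct, period_minutes=MINUTES_PER_YEAR):
    """Minutes of outage permitted by an availability percentage over a period."""
    return (100.0 - availability_pct) / 100.0 * period_minutes


# Assumption: 99.99% is measured over one year.
budget = allowed_downtime_minutes(99.99)
outage = 100  # minutes of downtime stated in the blog post
overrun = outage - budget

if __name__ == "__main__":
    print(f"Allowed downtime per year: {budget:.2f} minutes")
    print(f"Minutes over budget:       {overrun:.2f} minutes")
```

Passing a different `availability_pct` or `period_minutes` shows how much the answer depends on which percentage is promised and over what window it is measured.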
Based on my complicated calculations it appears as though Google has exceeded the downtime allowed by its 99.99% uptime mandate. I know the people at Google are smart but I don’t think they’re smart enough to rewrite the laws of basic mathematics. But maybe they are smarter than me, as I was unable to find anything that stated that the 99.99% figure was a per-year calculation. If the 99.99% uptime is spread over 100 years then, yes Google, you are still justified in your uptime calculations. If it is a yearly figure, however, I want to know how I go about recouping my 47.4052 minutes of availability that I am owed… I’m sure you’ll let me know.
UPDATE – Well, I guess Google doesn’t owe me anything because they are only stating 99.9% uptime. My mistake. I have also found out that, based on the Google Apps SLA (http://www.google.com/apps/intl/en/terms/sla.html), they are stating 99.9% availability per month. Very tricky 🙂
I had an idea early this morning that may or may not work and may or may not have been attempted before. Frankly, if it has been done before, it hasn’t been done in a while so it’s time to kick it off again. In an effort to get to know more about my peers and friends I’m going to start the ball rolling on the “5 Things You Might Not Know About…” project. The rules:
Hopefully this gets the ball rolling. I’m going to tag the following people in the hopes that they join in on the insanity: Michael Santarcangelo, Justin Foster, Anton Chuvakin, Jennifer Jabbusch, and Erin Jacobs.