Google Explains Recent Gmail Outage

By Chris Rowson     Sep 2, 2009 in Internet
Google 'Reliability Czar' Ben Treynor explains why Gmail failed, how it was fixed and what can be done to prevent this from happening again.
Gmail users across the planet flew into a panic yesterday when Google's extremely popular email service failed unexpectedly.
Google's Ben Treynor explained that the outage began with routine maintenance to Gmail's infrastructure.
This morning (Pacific Time) we took a small fraction of Gmail's servers offline to perform routine upgrades. This isn't in itself a problem — we do this all the time, and Gmail's web interface runs in many locations and just sends traffic to other locations when one is offline.
Treynor went on to explain that the request routers, which balance Gmail's traffic across its servers, became overloaded one after another, leaving almost all of them overloaded within a matter of minutes.
However, as we now know, we had slightly underestimated the load which some recent changes (ironically, some designed to improve service availability) placed on the request routers — servers which direct web queries to the appropriate Gmail server for response. At about 12:30 pm Pacific a few of the request routers became overloaded and in effect told the rest of the system "stop sending us traffic, we're too slow!". This transferred the load onto the remaining request routers, causing a few more of them to also become overloaded, and within minutes nearly all of the request routers were overloaded.
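The chain reaction described above can be illustrated with a toy model. This is only a sketch under simplifying assumptions, not Google's actual routing logic: the function name `simulate_cascade`, the capacity figures and the rule that load is shared evenly among the routers still accepting traffic are all hypothetical.

```python
def simulate_cascade(capacities, total_load):
    """Return how many routers are still serving once the system settles.

    Assumption (hypothetical model): load is split evenly across the routers
    that are still accepting traffic, and any router handed more than its
    capacity withdraws completely, pushing its share onto the rest.
    """
    active = list(capacities)
    while active:
        share = total_load / len(active)
        survivors = [c for c in active if c >= share]
        if len(survivors) == len(active):
            break  # every remaining router can carry its share
        active = survivors  # overloaded routers drop out, so load shifts
    return len(active)


# Ten routers, two of them slightly weaker than the rest.
routers = [100] * 8 + [90] * 2

print(simulate_cascade(routers, 880))  # 10: everyone copes
print(simulate_cascade(routers, 920))  # 0: two drop out, then all the others
```

In this model a small increase in load, from 880 to 920 units, is enough to turn a healthy pool into a total outage, because each withdrawal raises the share carried by the routers that remain.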
Once they had established the nature of the failure, Google engineers worked frantically to bring extra request routers online and get Gmail back up.
...the team brought a LOT of additional request routers online (flexible capacity is one of the advantages of Google's architecture), distributed the traffic across the request routers, and the Gmail web interface came back online.
To prevent a repeat, Google engineers intend to put measures in place that isolate failures rather than allowing them to spread across the network.
Over the next few weeks Google engineers will work on improving Gmail's infrastructure. It is hoped that changes such as allowing overloaded routers to process traffic more slowly, rather than refusing requests completely, will stop this type of problem from happening again.
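The difference between a router that refuses traffic and one that merely slows down can be shown with a small comparison. This is a hedged sketch, not a description of Google's planned fix: the function `served_load`, the policy names `"withdraw"` and `"slow_down"`, and the even split of load between routers are assumptions made for illustration.

```python
def served_load(capacities, total_load, policy):
    """Return how much load is served once the system settles.

    Hypothetical policies:
      "withdraw":  an overloaded router stops accepting traffic entirely,
                   so its share moves to the remaining routers.
      "slow_down": an overloaded router keeps serving at its own capacity
                   and the excess is simply delayed.
    """
    active = list(capacities)
    if not active:
        return 0.0
    if policy == "slow_down":
        share = total_load / len(active)
        return float(sum(min(c, share) for c in active))
    if policy != "withdraw":
        raise ValueError(f"unknown policy: {policy}")
    while active:
        share = total_load / len(active)
        survivors = [c for c in active if c >= share]
        if len(survivors) == len(active):
            return float(total_load)  # the remaining routers carry everything
        active = survivors
    return 0.0  # every router has withdrawn


routers = [100] * 8 + [90] * 2

print(served_load(routers, 920, "withdraw"))   # 0.0
print(served_load(routers, 920, "slow_down"))  # 916.0
```

Under these assumptions the slow-down policy leaves only 4 of 920 units delayed, while the withdraw policy serves nothing at all, which is the kind of isolation the planned changes are aiming for.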
This failure shows how even the biggest businesses can get it wrong, and how difficult it is to model a large-scale system failure. It will also undoubtedly leave some businesses reevaluating what data they entrust to online services, and may affect confidence in Google's flagship 'Google Apps' web application business.