http://www.digitaljournal.com/tech-and-science/technology/amazon-typo-took-the-internet-down-on-tuesday/article/487072

Amazon typo took the Internet down on Tuesday

Posted Mar 3, 2017 by James Walker
Amazon has revealed the cause of a service outage that hit its Web Services cloud platform earlier this week. The problem caused many major websites and apps to be inaccessible for a short time. The company blamed human error, citing a typo as the root cause.
Amazon is upping the ante in the free shipping battle with Walmart.
Leon Neal, AFP/File
Much of Amazon's Web Services infrastructure went offline along the East Coast on Tuesday. Web Services is a cloud platform which other companies can use to rent servers. Sites including Airbnb, Medium, Netflix, Slack and Trello were among those impacted by the disruption.
Amazon eventually brought the servers back online a short time after the problems began. In a detailed report yesterday, it publicly explained how a routine maintenance operation went awry and took servers offline across the region.
On Tuesday morning, a member of Amazon's team was investigating issues that were causing the company's server billing systems to run unexpectedly slowly. The worker ran a command designed to shut down a small number of servers used by the billing system. A typo led to the command being entered incorrectly and a much larger number of servers was shut down than intended.
This caused a cascading effect across the datacentre. The servers that went offline were used to support other elements of Amazon's cloud infrastructure, preventing them from operating correctly. To get everything back online, Amazon ended up forcing a full restart of the affected systems. This took longer than expected as the servers had to validate the integrity of their data before responding to external requests.
At the height of the outage, Amazon Web Services was essentially unusable on the East Coast. Amongst the impacted sites was Amazon's own Web Services Health Dashboard, preventing the company from providing updates on the outage. It had to use Twitter instead to broadcast announcements.
Amazon apologised for the inconvenience to customers and users. It said it is making "several changes" to prevent a similar scenario occurring again. The tool used to remove storage servers now runs more slowly, preventing servers from being shut down in large numbers. Additionally, improved failure recovery mechanisms will make it quicker to restart core services going forward.
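To illustrate the kind of safeguard described above, here is a minimal Python sketch of a removal tool that releases servers only in small batches. It is not Amazon's tool: the function `plan_removal`, the parameters `max_batch` and `min_active`, and the `srv-N` server names are all hypothetical, and the minimum-capacity check is an assumption added for illustration.

```python
from typing import List


class RemovalRejected(Exception):
    """Raised when a removal request is refused outright."""


def plan_removal(active: List[str], requested: List[str],
                 max_batch: int, min_active: int) -> List[str]:
    """Return the servers that may be taken offline in this step.

    Hypothetical safeguards (assumptions, not Amazon's implementation):
    - unknown server names are rejected, so a mistyped ID fails loudly;
    - a request that would leave fewer than `min_active` servers running
      is refused as a whole;
    - at most `max_batch` servers are released per step.
    """
    unknown = [s for s in requested if s not in active]
    if unknown:
        raise RemovalRejected(f"unknown servers: {unknown}")

    remaining = len(active) - len(set(requested))
    if remaining < min_active:
        raise RemovalRejected(
            f"request would leave {remaining} servers, minimum is {min_active}"
        )

    # Release only a small batch now; the rest wait for a later step.
    return requested[:max_batch]


# Example: ten servers, a request for three, at most two released per step.
active = [f"srv-{i}" for i in range(10)]
print(plan_removal(active, ["srv-0", "srv-1", "srv-2"], max_batch=2, min_active=5))
# prints ['srv-0', 'srv-1']
```

With limits like these, a mistyped request for most of the fleet would be refused rather than executed, and even a valid request would take effect gradually.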
"We want to apologize for the impact this event caused for our customers," said Amazon. "While we are proud of our long track record of availability with the Amazon S3, we know how critical this service is to our customers, their applications and end users, and their businesses. We will do everything we can to learn from this event and use it to improve our availability even further."
Amazon's embarrassment was compounded by an unfortunate coincidence. As the outage began, Adrian Cockcroft, the company's Vice President of Cloud Architecture Strategy, was on stage publicly detailing the benefits of using Web Services. The executive is not thought to have been aware of the cascading errors his team was rushing to fix. Despite the widespread service disruption, Amazon has avoided describing the issues as an "outage," instead calling them a product of "high error data rates."