Amazon typo took the Internet down on Tuesday

Much of the Amazon Web Services infrastructure on the East Coast went offline on Tuesday. Web Services is a cloud platform from which other companies can rent servers. Sites including Airbnb, Medium, Netflix, Slack and Trello were among those affected by the disruption.
Amazon brought the servers back online a short time after the problems began. In a detailed report yesterday, it publicly explained how a routine maintenance operation went awry and took servers offline across the region.
On Tuesday morning, a member of Amazon’s team was investigating issues that were causing the company’s server billing systems to run unexpectedly slowly. The worker ran a command designed to shut down a small number of servers used by the billing system. Because of a typo in the command, far more servers were shut down than intended.
This caused a cascading effect across the datacentre. The servers that went offline were used to support other elements of Amazon’s cloud infrastructure, preventing them from operating correctly. To get everything back online, Amazon ended up forcing a full restart of the affected systems. This took longer than expected as the servers had to validate the integrity of their data before responding to external requests.
At the height of the outage, Amazon Web Services was essentially unusable on the East Coast. Amongst the impacted sites was Amazon’s own Web Services Health Dashboard, preventing the company from providing updates on the outage. It had to use Twitter instead to broadcast announcements.
Amazon apologised for the inconvenience to customers and users. It said it is making “several changes” to prevent a similar scenario occurring again. The tool used to remove storage servers now runs more slowly, preventing servers from being shut down in large numbers. Additionally, improved failure recovery mechanisms will make it quicker to restart core services going forward.
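The kind of safeguard described above can be illustrated with a minimal sketch. It is not Amazon's tooling: the names `Fleet`, `plan_removal`, `min_required` and `max_per_step` are hypothetical, and the minimum-capacity floor is an assumption added for illustration. The idea is simply that a removal request is refused if it would take too many servers offline, and is otherwise split into small batches so that a mistyped number cannot remove a large group at once.

```python
from dataclasses import dataclass
from typing import List


class CapacityRemovalError(Exception):
    """Raised when a removal request would breach the safety floor."""


@dataclass
class Fleet:
    # All fields are hypothetical and chosen for illustration only.
    name: str
    active_servers: int
    min_required: int  # assumed floor below which the subsystem cannot run


def plan_removal(fleet: Fleet, requested: int, max_per_step: int = 2) -> List[int]:
    """Return batch sizes for a removal request, or refuse an unsafe one."""
    if requested <= 0 or max_per_step <= 0:
        raise ValueError("requested and max_per_step must be positive")

    # Refuse outright if the request would leave too few servers running.
    if fleet.active_servers - requested < fleet.min_required:
        raise CapacityRemovalError(
            f"removing {requested} servers from {fleet.name} would leave "
            f"fewer than the required {fleet.min_required}"
        )

    # Otherwise remove slowly: never more than max_per_step at a time.
    batches = []
    left = requested
    while left > 0:
        step = min(left, max_per_step)
        batches.append(step)
        left -= step
    return batches


if __name__ == "__main__":
    billing = Fleet(name="billing", active_servers=10, min_required=6)
    print(plan_removal(billing, 3))  # a small request is split into batches
    try:
        plan_removal(billing, 40)    # a mistyped request is rejected
    except CapacityRemovalError as err:
        print("refused:", err)
```

With a check like this, an operator who types 40 instead of 4 gets an error rather than an outage, and even a valid request proceeds in small steps that can be halted if something looks wrong.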
“We want to apologize for the impact this event caused for our customers,” said Amazon. “While we are proud of our long track record of availability with Amazon S3, we know how critical this service is to our customers, their applications and end users, and their businesses. We will do everything we can to learn from this event and use it to improve our availability even further.”
Amazon’s embarrassment was compounded by an unfortunate coincidence. As the outage began, Adrian Cockcroft, the company’s Vice President of Cloud Architecture Strategy, was on stage publicly detailing the benefits of using Web Services. The executive is not thought to have been aware of the cascading errors his team was rushing to fix. Despite the widespread service disruption, Amazon has avoided describing the issues as an “outage,” instead calling them a product of “high error data rates.”
