Amazon's cloud crash caused by, of all things, a typo
Related Blog Posts
Research Triangle Park, N.C. — As humans become more reliant than ever on technology and the global Internet of Things, the human factor becomes even more dangerous as Amazon's massive cloud failure clearly demonstrated Thursday.
In explaining what happened, Amazon says an "input" was "entered incorrectly," and - boom!
The cascading effect on the Internet could have been catastrophic. Think of how you would have to live or do business if the Internet crashed entirely for hours, days - weeks.
"[A]n authorized S3 team member using an established playbook executed a command which was intended to remove a small number of servers for one of the S3 subsystems that is used by the S3 billing process," Amazon explained in a web post published Thursday.
"Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended."
Read that as a typo.
Or as The Associated Press described it: A "fat finger."
And since the Internet is so interlinked now, when Amazon went down, as the Drudge Report declared in an all-red emergency headline, much of the Internet on the East Coast went down, too - or was dramatically slowed.
Amazon customer or not, you paid for Amazon's typo, too.
Nothing simple about it
S3 refers to what Amazon calls its Simple Storage Service. But the system, while perhaps designed to be simple for customers to access and use, is hardly simple. Amazon has built a very successful, world-leading cloud system network with thousands and thousands of servers interlinked across high-speed networks at a variety of data centers. Yet the failure shows how dangerous the cloud is to companies and consumers who now rely on this network to store everything from records to recipes while using cloud power for research and everyday corporate functions.
Amazon apologized, of course, and says it is taking remedial steps such as further partitioning machines to (hopefully) prevent a cascade of failures in the future.
Amazon says it is making changes to its system to make sure incorrect commands won't trigger an outage of its web services in the future.
"[W]e want to apologize for the impact this event caused for our customers," Amazon noted. "While we are proud of our long track record of availability with Amazon S3, we know how critical this service is to our customers, their applications and end users, and their businesses. We will do everything we can to learn from this event and use it to improve our availability even further."
Here is the full explanation from Amazon. While technical, it offers a fascinating look inside the world of tech - and also serves as a reminder that maybe you need to be more careful with your own files (backups, alternative sources for access and storage).
Also, beware of typos.
The Amazon report
Summary of the Amazon S3 Service Disruption in the Northern Virginia (US-EAST-1) Region
The Amazon Simple Storage Service (S3) team was debugging an issue causing the S3 billing system to progress more slowly than expected. At 9:37AM PST, an authorized S3 team member using an established playbook executed a command which was intended to remove a small number of servers for one of the S3 subsystems that is used by the S3 billing process. Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended. The servers that were inadvertently removed supported two other S3 subsystems. One of these subsystems, the index subsystem, manages the metadata and location information of all S3 objects in the region. This subsystem is necessary to serve all GET, LIST, PUT, and DELETE requests. The second subsystem, the placement subsystem, manages allocation of new storage and requires the index subsystem to be functioning properly to correctly operate. The placement subsystem is used during PUT requests to allocate storage for new objects. Removing a significant portion of the capacity caused each of these systems to require a full restart. While these subsystems were being restarted, S3 was unable to service requests. Other AWS services in the US-EAST-1 Region that rely on S3 for storage, including the S3 console, Amazon Elastic Compute Cloud (EC2) new instance launches, Amazon Elastic Block Store (EBS) volumes (when data was needed from a S3 snapshot), and AWS Lambda were also impacted while the S3 APIs were unavailable.
S3 subsystems are designed to support the removal or failure of significant capacity with little or no customer impact. We build our systems with the assumption that things will occasionally fail, and we rely on the ability to remove and replace capacity as one of our core operational processes. While this is an operation that we have relied on to maintain our systems since the launch of S3, we have not completely restarted the index subsystem or the placement subsystem in our larger regions for many years. S3 has experienced massive growth over the last several years and the process of restarting these services and running the necessary safety checks to validate the integrity of the metadata took longer than expected. The index subsystem was the first of the two affected subsystems that needed to be restarted. By 12:26PM PST, the index subsystem had activated enough capacity to begin servicing S3 GET, LIST, and DELETE requests. By 1:18PM PST, the index subsystem was fully recovered and GET, LIST, and DELETE APIs were functioning normally. The S3 PUT API also required the placement subsystem. The placement subsystem began recovery when the index subsystem was functional and finished recovery at 1:54PM PST. At this point, S3 was operating normally. Other AWS services that were impacted by this event began recovering. Some of these services had accumulated a backlog of work during the S3 disruption and required additional time to fully recover.
We are making several changes as a result of this operational event. While removal of capacity is a key operational practice, in this instance, the tool used allowed too much capacity to be removed too quickly. We have modified this tool to remove capacity more slowly and added safeguards to prevent capacity from being removed when it will take any subsystem below its minimum required capacity level. This will prevent an incorrect input from triggering a similar event in the future. We are also auditing our other operational tools to ensure we have similar safety checks. We will also make changes to improve the recovery time of key S3 subsystems. We employ multiple techniques to allow our services to recover from any failure quickly. One of the most important involves breaking services into small partitions which we call cells. By factoring services into cells, engineering teams can assess and thoroughly test recovery processes of even the largest service or subsystem. As S3 has scaled, the team has done considerable work to refactor parts of the service into smaller cells to reduce blast radius and improve recovery. During this event, the recovery time of the index subsystem still took longer than we expected. The S3 team had planned further partitioning of the index subsystem later this year. We are reprioritizing that work to begin immediately.
From the beginning of this event until 11:37AM PST, we were unable to update the individual services’ status on the AWS Service Health Dashboard (SHD) because of a dependency the SHD administration console has on Amazon S3. Instead, we used the AWS Twitter feed (@AWSCloud) and SHD banner text to communicate status until we were able to update the individual services’ status on the SHD. We understand that the SHD provides important visibility to our customers during operational events and we have changed the SHD administration console to run across multiple AWS regions.
Finally, we want to apologize for the impact this event caused for our customers. While we are proud of our long track record of availability with Amazon S3, we know how critical this service is to our customers, their applications and end users, and their businesses. We will do everything we can to learn from this event and use it to improve our availability even further.
Published at: https://aws.amazon.com/message/41926/
WRAL TechWire Publisher and Editor Rick Smith dishes out tidbits from the local technology sector. Read more articles…
Please Log In to add a comment.
Latest for Insiders
- Community colleges demo their power in workforce development
- Windsor Circle's CEO on pivot, layoffs: 'Buck stops with me'
- SAS steps up analytics talent drive with $3.4M Clemson deal
- After $4M fund raiser, what's next for health data startup Bivarus?
- Raleigh Chamber chair: New CEO reflects commitments to development, diversity
- IBM's pledge to hire vets triggers mix of reactions, anger
- Field set for WRAL TechWire Awards - vote today
- Southeast Venture Conference evolves: Fusion set for June
- As Oracle cuts jobs, no discussion in trash-talking earnings call
- New FCC chair stresses 'bringing benefits of digital age to all Americans'