Amazon Says Employee Error Caused Tuesday's Cloud Outage
Worker took down too many servers while fixing billing system
Company pledges to make several changes to avoid recurrence
Amazon.com Inc. said a human error at its cloud business caused sweeping outages across the internet for several hours earlier this week.
Amazon said efforts to fix a billing system bug caused prolonged disruptions Tuesday. An Amazon Web Services employee working on the issue accidentally switched off more computer servers than intended at 9:37 a.m. Seattle time, resulting in errors that cascaded through the company’s S3 service, Amazon said in a statement Thursday. Nearly 150,000 sites, including ESPN.com and aol.com, use S3 to house data and to manage apps and software downloads, according to SimilarTech.com.
A major failure from what appears to be a minor maintenance procedure highlights that AWS, and the cloud computing industry in general, still have some maturing to do, said Ed Anderson, an analyst at Gartner Inc.
"The fact that an incorrect keyboard entry could bring down an entire region shows they have some operational issues," Anderson said. "Even though they are the world’s biggest cloud provider, they still have some work to do to refine their processes."
Amazon said it is "making several changes as a result of this operational event."
"While removal of capacity is a key operational practice, in this instance, the tool used allowed too much capacity to be removed too quickly. We have modified this tool to remove capacity more slowly and added safeguards to prevent capacity from being removed when it will take any subsystem below its minimum required capacity level," the company said.
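The two safeguards Amazon describes, slower removal and a minimum-capacity floor, can be illustrated with a minimal sketch. This is not Amazon's tool, whose internals have not been published. The names Subsystem, remove_capacity, min_required and max_per_step are hypothetical, chosen only to show the idea under those assumptions.

```python
from dataclasses import dataclass


class CapacityError(Exception):
    """Raised when a removal request would breach the capacity floor."""


@dataclass
class Subsystem:
    # All fields are hypothetical; they are not taken from any AWS tool.
    name: str
    active_servers: int
    min_required: int  # assumed floor below which the subsystem cannot operate


def remove_capacity(subsystem, requested, max_per_step=2):
    """Remove servers in small batches, refusing any request that would
    leave the subsystem below its minimum required capacity.

    Returns the list of batch sizes that were removed.
    """
    if requested < 0:
        raise ValueError("requested must be non-negative")

    # Safeguard 1: reject the whole request up front if it breaches the floor.
    if subsystem.active_servers - requested < subsystem.min_required:
        raise CapacityError(
            f"removing {requested} servers would take {subsystem.name} "
            f"below its minimum of {subsystem.min_required}"
        )

    # Safeguard 2: remove slowly, at most max_per_step servers per batch.
    batches = []
    remaining = requested
    while remaining > 0:
        batch = min(remaining, max_per_step)
        subsystem.active_servers -= batch
        remaining -= batch
        batches.append(batch)
    return batches


if __name__ == "__main__":
    index = Subsystem(name="index", active_servers=10, min_required=6)
    print(remove_capacity(index, 3))  # removed in batches of at most 2
    try:
        remove_capacity(index, 5)  # a mistyped, oversized request
    except CapacityError as err:
        print("blocked:", err)
```

In this sketch a mistyped, oversized request is rejected before any server is touched, and a valid one is carried out in small batches. A real tool would presumably also pause and check system health between batches.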
AWS has data centers around the world that handle the computing power for many large companies, such as Netflix Inc. and Capital One Financial Corp. Amazon and competitors like Microsoft Corp. and Alphabet Inc.’s Google are growing their cloud businesses as customers find it more efficient to shift their data storage and computer processes to the cloud rather than maintaining those functions on their own. Widespread adoption also increases the likelihood that problems with one service can have sweeping ramifications online.
Despite the incident, AWS is a reliable service and most customers affected by the outage will forgive and forget, Anderson said.
"AWS has a pretty good track record of being up and reliable," he said. "This outage was an anomaly, not part of a standard pattern."