Amazon cloud outage caused by human error

Jessica Kim Cohen - Tuesday, March 7th, 2017

Amazon has released a statement addressing last week's Amazon Web Services S3 outage, which disrupted a number of websites on the afternoon of Feb. 28.

The AWS S3 service disruption, which affected cloud computing services in the Northern Virginia region, took place while Amazon's S3 team was working on an issue in its S3 billing system. An authorized S3 team member incorrectly executed a command that was meant to remove servers for one of the S3 subsystems used by the billing process. As a result, more servers were removed than intended.

"The servers that were inadvertently removed supported two other S3 subsystems," according to the statement. As a result of the mistake, each of those subsystems required a full restart.

Since last week, Amazon has made a few changes to its protocols, including improving the recovery time of S3 subsystems and modifying the tools it uses to remove servers.

"While removal of capacity is a key operational practice, in this instance, the tool used allowed too much capacity to be removed too quickly," according to the statement. "We have modified this tool to remove capacity more slowly and added safeguards to prevent capacity from being removed when it will take any subsystem below its minimum required capacity level."
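The two safeguards described in the statement, a slower removal rate and a floor on remaining capacity, can be illustrated with a minimal sketch. This is not Amazon's tool: the `Subsystem` class, the `plan_removal` function, the `max_per_batch` parameter and the numbers used are all hypothetical, chosen only to show how such checks might work.

```python
from dataclasses import dataclass
from typing import List


class CapacityError(Exception):
    """Raised when a removal would drop a subsystem below its minimum."""


@dataclass
class Subsystem:
    # All fields are hypothetical; the statement does not describe a data model.
    name: str
    active_servers: int
    min_required: int


def plan_removal(subsystem: Subsystem, requested: int, max_per_batch: int) -> List[int]:
    """Return the batch sizes for removing `requested` servers.

    Refuses the whole request if it would leave the subsystem below
    its minimum required capacity, and otherwise splits the removal
    into batches of at most `max_per_batch` servers.
    """
    if requested <= 0 or max_per_batch <= 0:
        raise ValueError("requested and max_per_batch must be positive")

    # Safeguard 1: never go below the minimum required capacity.
    remaining = subsystem.active_servers - requested
    if remaining < subsystem.min_required:
        raise CapacityError(
            f"removing {requested} servers would leave {subsystem.name} "
            f"with {remaining}, below its minimum of {subsystem.min_required}"
        )

    # Safeguard 2: remove capacity slowly, in small batches.
    batches = []
    left = requested
    while left > 0:
        step = min(left, max_per_batch)
        batches.append(step)
        left -= step
    return batches
```

In this sketch, a request to remove 12 of 100 servers from a subsystem that needs at least 80 would be split into batches of 5, 5 and 2, while a mistyped request to remove 25 would be rejected before any server was touched.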

