The outage that numerous customers and businesses experienced for several hours was caused by nothing other than human error. To be specific, it was a typo.
Amazon has released an official statement explaining the outage at Amazon Web Services (AWS), its public cloud infrastructure provider.
In a blog post, Amazon wrote, "The Amazon Simple Storage Service (S3) team was debugging an issue causing the S3 billing system to progress more slowly than expected. At 9:37 AM PST, an authorised S3 team member using an established playbook executed a command which was intended to remove a small number of servers for one of the S3 subsystems that are used by the S3 billing process. Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended."
This small mistake took down two subsystems in the US-EAST-1 region, a massive data centre location. The loss of these two subsystems is what knocked so many services offline, including Quora, Twitch, Kickstarter, Slack, Business Insider, Expedia, and Atlassian’s Bitbucket and HipChat. The list also includes the AWS Service Health Dashboard (SHD), which Amazon needs in order to update its own status page.
Both subsystems required a full restart, and while they were restarting, S3 was unable to service requests. The process took longer than expected because the servers had not been completely restarted "for many years."
The index subsystem was fully recovered by 1:18 PM PST, while the placement subsystem was recovered by 1:54 PM PST. By that point, S3 was operating normally.
The company further noted that it is making "several changes" in response to the incident. Explaining how it plans to avoid such problems in the future, Amazon said, "While removal of capacity is a key operational practice, in this instance, the tool used allowed too much capacity to be removed too quickly. We have modified this tool to remove capacity more slowly and added safeguards to prevent capacity from being removed when it will take any subsystem below its minimum required capacity level."
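To make the two safeguards Amazon describes more concrete, here is a minimal Python sketch. It is not Amazon's tool: the `Subsystem` class, the `remove_capacity` function, the `max_per_step` parameter and all the numbers are hypothetical, chosen only to illustrate a minimum-capacity check combined with removal in small batches.

```python
from dataclasses import dataclass


class CapacityError(Exception):
    """Raised when a removal request would violate the capacity safeguard."""


@dataclass
class Subsystem:
    # All fields are hypothetical; the real tool's data model is not public.
    name: str
    active_servers: int
    min_required: int  # minimum capacity level the subsystem needs


def remove_capacity(subsystem: Subsystem, count: int, max_per_step: int = 2) -> int:
    """Remove `count` servers in small batches and return the number of batches.

    The whole request is refused up front if it would take the subsystem
    below its minimum required capacity, so a mistyped `count` fails
    before any server is removed.
    """
    if count <= 0:
        raise ValueError("count must be positive")
    if max_per_step <= 0:
        raise ValueError("max_per_step must be positive")
    if subsystem.active_servers - count < subsystem.min_required:
        raise CapacityError(
            f"removing {count} servers would leave {subsystem.name} "
            f"below its minimum of {subsystem.min_required}"
        )

    steps = 0
    remaining = count
    while remaining > 0:
        # Remove at most max_per_step servers per batch. A real tool would
        # presumably pause and check subsystem health between batches.
        batch = min(max_per_step, remaining)
        subsystem.active_servers -= batch
        remaining -= batch
        steps += 1
    return steps
```

With a hypothetical subsystem of 10 servers and a minimum of 6, a request to remove 3 servers goes through in two batches, while a follow-up request to remove 5 more is rejected and leaves the server count unchanged.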
Amazon has also started dividing parts of the index subsystem into smaller cells and changing the administration console for the AWS Service Health Dashboard.
"We will do everything we can to learn from this event and use it to improve our availability even further," the company concluded.