Date: 2017-03-13

On Tuesday, tens of thousands of websites, including Quora, Medium, and Giphy, were taken down as Amazon Web Services (AWS) servers crashed for a few hours. People immediately speculated about what had caused such a serious crash, with theories ranging from a software bug to a massive DDoS attack.

But as Amazon revealed, the cause of the crash was that a “command was entered incorrectly”, or in other words, a typo. Amazon’s Simple Storage Service (S3) team was debugging a billing issue and intended to “remove a small number of servers” to fix it. But thanks to the typo entered by an “authorized member” of the team, a far larger number of servers went down, setting off a chain reaction that rippled across the Internet.

Because of this typo, S&P 500 companies lost an estimated $150 million in those few hours, partly because online stores were unreachable or saw their loading times increase by triple-digit percentages. We roughly know how this incident happened, but it also exposes a deeper problem with the Internet. We like to think of the Internet as a place with millions of independent websites, yet this incident shows how much of its power is concentrated within a few companies. If we look at the underlying problems that made Tuesday’s incident so damaging, as well as how both Amazon and other companies should respond, it becomes clear that we need an Internet with more redundancy, one that cannot be taken down by a simple mistake or even sabotage.

One typo, much chaos

The details behind the Amazon outage show just how easy it is to shut down servers and how hard it can be to put things back together. When that one person entered the typo, the servers that shut down supported two other S3 subsystems, one of which managed the metadata and location information of all S3 objects in Amazon’s Northern Virginia data centers. Without those subsystems, requests to S3 in that region could not be completed, which burdened other servers and slowed down websites, including those serving PDF conversions.

To fix this problem, Amazon had to restart the affected systems, except that it had not actually restarted those servers for years. And just as your computer can have problems if you do not restart it for a long time, the servers needed a few hours to get back on track.

The good news is that Amazon is taking steps to ensure that this sort of event cannot happen again. The company is implementing new safeguards that will prevent so many servers from being taken down so quickly, working to ensure that S3 servers can recover faster after a shutdown, and fixing its service health dashboard so that people can learn sooner when something is wrong with AWS.
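To make the idea of such a safeguard concrete, here is a minimal sketch in Python. It is not Amazon's tooling. The `Fleet` type, the `remove_servers` function, and the two limits (`min_required` and `max_removed_per_step`) are hypothetical names invented for illustration, assuming a safeguard that refuses any command which removes too many servers at once or leaves too few running.

```python
from dataclasses import dataclass


class UnsafeRemovalError(Exception):
    """Raised when a removal request would violate a safety limit."""


@dataclass(frozen=True)
class Fleet:
    name: str
    active_servers: int
    min_required: int          # hypothetical floor needed to keep serving requests
    max_removed_per_step: int  # hypothetical cap on servers removed by one command


def remove_servers(fleet: Fleet, count: int) -> Fleet:
    """Return the fleet after removing `count` servers, or raise if unsafe."""
    if count <= 0:
        raise ValueError("count must be a positive number of servers")

    # Guard 1: a mistyped, oversized count is rejected instead of executed.
    if count > fleet.max_removed_per_step:
        raise UnsafeRemovalError(
            f"refusing to remove {count} servers from {fleet.name}: "
            f"limit is {fleet.max_removed_per_step} per command"
        )

    # Guard 2: never drop below the capacity the subsystem needs to run.
    remaining = fleet.active_servers - count
    if remaining < fleet.min_required:
        raise UnsafeRemovalError(
            f"refusing to leave {fleet.name} with {remaining} servers: "
            f"minimum is {fleet.min_required}"
        )

    return Fleet(fleet.name, remaining, fleet.min_required, fleet.max_removed_per_step)


if __name__ == "__main__":
    billing = Fleet("billing", active_servers=100, min_required=80, max_removed_per_step=5)
    print(remove_servers(billing, 3).active_servers)
    try:
        remove_servers(billing, 30)  # the "typo" case: far more than intended
    except UnsafeRemovalError as err:
        print(err)
```

With checks like these, an operator who types 30 instead of 3 gets an error message rather than an outage.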

Too much concentrated power?

But the question everyone needs to consider is what this incident reveals about how much power is concentrated in the cloud computing platforms that most websites use today. Slate points out that the Internet was originally designed to be decentralized so that it could not be taken out by a strike at a single point, such as a nuclear attack. Now, however, the collapse of a Northern Virginia data center can have serious effects on the Internet as a whole.

Some writers have even gone so far as to suggest that Amazon could be viewed as violating antitrust legislation with its dominant hold over the Internet. But the reality is that cloud computing is beneficial for everyone. It lets websites be hosted at a low price and eliminates the old days where they would scramble for packets of data to keep running. Furthermore, massive cloud hosts are more secure precisely because hacking into Amazon or Microsoft is such a herculean task compared to hacking into smaller providers. If Amazon were really some dominant monopoly, we would not be seeing other companies like Microsoft or Google trying their best to compete with it in cloud technology.

Still, businesses and users should learn from this debacle and adjust accordingly. A major reason why some websites were hit particularly hard was that all of their servers depended on the Northern Virginia data centers. A good website should be spread across multiple regions, creating a failsafe that reduces the chances of the site slowing down or crashing.
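As a rough illustration of what spreading a site across regions can look like in code, the Python sketch below tries each region in order and falls back to the next one when a request fails. The `fetch_with_failover` function and the `fetch` callback are hypothetical and not part of any AWS library. The region names are only examples (us-east-1 is AWS's Northern Virginia region), and the sketch assumes the same data has already been copied to every region listed.

```python
from typing import Callable, Dict, Sequence, Tuple


class AllRegionsFailedError(Exception):
    """Raised when no region could serve the request."""


def fetch_with_failover(
    key: str,
    regions: Sequence[str],
    fetch: Callable[[str, str], bytes],
) -> Tuple[str, bytes]:
    """Try each region in order; return (region, data) from the first that works.

    `fetch(region, key)` is a caller-supplied function (an assumption of this
    sketch) that returns the object's bytes or raises ConnectionError.
    """
    errors: Dict[str, str] = {}
    for region in regions:
        try:
            return region, fetch(region, key)
        except ConnectionError as exc:
            # Remember why this region failed, then move on to the next one.
            errors[region] = str(exc)
    raise AllRegionsFailedError(f"no region could serve {key!r}: {errors}")


if __name__ == "__main__":
    def simulated_fetch(region: str, key: str) -> bytes:
        # Stand-in for a real storage client: pretend one region is down.
        if region == "us-east-1":
            raise ConnectionError("region unavailable")
        return f"{key} from {region}".encode()

    region, data = fetch_with_failover(
        "logo.png", ["us-east-1", "us-west-2"], simulated_fetch
    )
    print(region, data)
```

A site that read only from the first region would have failed outright in this simulation, while the failover version keeps serving from the second.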

Cloud providers will screw up, get hacked, or have errors, as this incident shows, and businesses have to make sure that they are not totally dependent on a single provider. But Amazon is committed to fixing this problem, and no one wants to go back to the old days of decentralized data. Businesses will have to carefully monitor their own Internet and cloud security to prevent failures like this one from drastically hurting them.

