The Worst Cloud Outages of 2013 (So Far)

From Amazon to Dropbox and Microsoft to Google, we've seen some nasty cloud outages in the first half of the year. Which company failed the worst?

Warning: Stormy skies ahead

Cloud computing provides plenty of perks for both businesses and casual users, but while cloud servers may live in fluffy white shapes in the sky, they aren't immune to earthly errors.

As any cloud dweller knows, Web-based services can crash and burn just like any other type of technology. If the companies behind them are smart, you shouldn't lose any data in the long run -- but you'll likely lose a bit of sanity during the time the service is offline.

While 2013's only halfway done, we've already seen some cringe-worthy cloud failures this year. Here are the worst -- so far.

Amazon takes a nosedive

Date: Jan. 31, 2013

Duration: 49 minutes

Failure: Amazon Web Services has been responsible for major outages before, but it's not often that the company's own Amazon.com home page goes down. Earlier this year, though, it did just that: Instead of its usual array of eye-catching wares, Amazon.com displayed a simple text error message for almost an hour on an otherwise uneventful January day.

The message, "Http/1.1 Service Unavailable," shed little light on what was actually going on. Internet speculation initially suggested a denial-of-service attack might have been involved, but those claims appeared to be dubious. While Amazon never officially commented on the cause, later reports indicated an internal issue was more likely the culprit.

The Amazon outage aftermath

Fallout: An online retailer like Amazon obviously depends on, you know, being online in order to do business. Looking at the company's previous quarterly earnings and doing a little fancy math, some industry-watchers estimated the site could have missed out on close to $5 million in revenue during its 49 minutes offline.
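The "fancy math" behind estimates like this is usually simple proportional arithmetic: spread a quarter's revenue evenly across every minute, then multiply by the length of the outage. Here is a minimal Python sketch of that idea. The function name, the 91-day quarter, and the $12 billion revenue figure are illustrative assumptions, not Amazon's reported numbers or the analysts' actual method.

```python
def estimated_lost_revenue(quarterly_revenue: float,
                           outage_minutes: float,
                           days_in_quarter: int = 91) -> float:
    """Estimate revenue missed during an outage by assuming sales are
    spread evenly over every minute of the quarter."""
    minutes_in_quarter = days_in_quarter * 24 * 60
    revenue_per_minute = quarterly_revenue / minutes_in_quarter
    return revenue_per_minute * outage_minutes


# Hypothetical quarterly revenue, chosen only for illustration.
ILLUSTRATIVE_QUARTERLY_REVENUE = 12_000_000_000

loss = estimated_lost_revenue(ILLUSTRATIVE_QUARTERLY_REVENUE, outage_minutes=49)
print(f"Estimated missed revenue: ${loss:,.0f}")
```

A flat per-minute average ignores time-of-day and seasonal swings, so the result is best read as a rough order of magnitude rather than an accounting figure.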

Fix: Amazon stayed quiet on what it had to do to get its business back up in proper order, noting only that the issue affected just the home page -- not internal pages -- and had no impact on its AWS cloud hosting operation.

Dropbox drops the ball

Date: Jan. 10, 2013

Duration: Around 16 hours

Failure: The main selling point of a service like Dropbox is that you can rely on it as if it were your own local hard drive -- so when the service is unavailable for an entire day, it doesn't bode well for business.

That's precisely what happened on Jan. 10 of this year: Around 3:30 p.m. Pacific Time, Dropbox acknowledged its services were on the fritz, telling customers via Twitter that all client-syncing and file-uploading would be unavailable "for approximately the next hour."

Fast-forward to 7:09 the next morning, and -- hallelujah! -- the problem was resolved.

The Dropbox outage aftermath

Fallout: Suffice it to say, users who relied on Dropbox for their file storage needs weren't happy when their drives disappeared for a day. Dropbox customers took to Twitter to voice their dissatisfaction, using the hashtag #DropboxDown to organize the complaints.

"Dropbox is down. Users are wailing. Can't trust cloud 100%," one user said.

"I'm in a full flop sweat now that @Dropbox is down. Tried syncing lecture readings for the past half hour," wrote another.

Fix: Dropbox never divulged what went wrong on D-Day 2013, but Amazon sent out statements assuring publications that the outage had nothing to do with its oft-blamed Amazon Web Services.

Facebook falls flat

Date: Jan. 28, 2013

Duration: Two to three hours

Failure: Facebook users around the globe found themselves unable to keep up with their friends' status updates on the morning of Jan. 28. Insignificant as it may seem, Facebook is a site frequented by more than a few people -- so a few hours of downtime didn't go unnoticed.

Adding to the oddness of the situation, hacker group Anonymous had posted a video earlier in the month in which it threatened to attack Facebook and take the site down on that very same day. What actually happened?

The Facebook outage aftermath

Fallout: People went two to three hours without knowing what their former high school classmates ate for breakfast. The world, astonishingly, did not end.

Fix: Facebook said the downtime was the result of a DNS issue that "prevented people who typed 'facebook.com' into their browsers from reaching the site" -- an issue that was easy enough to resolve. There were no signs that Anonymous had any involvement with the outage.

Facebook's flop affected only the regular desktop website; the company's mobile sites and apps remained online during the incident.
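A DNS failure of the kind Facebook described means the hostname stops resolving to an IP address, even though the servers behind it may be perfectly healthy. The sketch below shows one way to check for that from a client's point of view in Python. The `resolve_addresses` helper and the stubbed resolvers are hypothetical names introduced for illustration; only `socket.getaddrinfo` and `socket.gaierror` come from the standard library.

```python
import socket
from typing import Callable, List


def resolve_addresses(hostname: str,
                      resolver: Callable = socket.getaddrinfo) -> List[str]:
    """Return the IP addresses a hostname resolves to, or [] if lookup fails."""
    try:
        results = resolver(hostname, 443, proto=socket.IPPROTO_TCP)
    except socket.gaierror:
        # The name could not be resolved: the symptom of a DNS outage.
        return []
    # Each result is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] holds the IP address.
    return sorted({entry[4][0] for entry in results})


def failing_resolver(*args, **kwargs):
    """Stand-in resolver that simulates a DNS outage without touching the network."""
    raise socket.gaierror("simulated DNS failure")


# Simulated outage: no addresses come back for the name.
print(resolve_addresses("example.com", resolver=failing_resolver))
```

Calling `resolve_addresses("example.com")` without the stub uses the system resolver, so a monitoring script could alert whenever the returned list is empty.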

Microsoft melts down, part one

Date: Feb. 1-2, 2013

Duration: Around two hours

Failure: February was a rough month for Microsoft. On Feb. 1, the company's Office 365 editing suite and Outlook.com mail service both stuttered. Users were unable to access the services for about two hours, which -- given the enterprise-connected nature of those products -- is no small amount of time.

Then, a day later, Microsoft's Bing search engine suffered a nearly two-hour outage of its own. How do we know? We Googled it, of course.

The Microsoft outage aftermath, part one

Fallout: On the Office 365 and Outlook.com front, user forums and social media lit up with irritated customers' complaints. And on the Bing side, both people who actually use Bing were probably pretty upset.

Fix: The main outage, according to Microsoft, was the result of "routine maintenance" gone awry. More specifically, the company said a "scheduled network configuration change" was the "root cause of the issue," and "necessary repair steps" were able to be "successfully implemented and validated" in order for the "impact" to be "mitigated."

Whew. Conversational speakers those engineers are not.

Microsoft melts down, part two

Date: Feb. 22, 2013

Duration: Over 12 hours

Failure: Microsoft's second February flub made the first look like a walk in the park. Early in the evening of Feb. 22, the company's Windows Azure cloud storage service went kaput, with all attempts at secure access timing out as unavailable.

In a likely related twist, other Microsoft services like Xbox Live, Xbox Music, and Xbox Video also started acting up, with users unable to access cloud-connected data or utilize any multimedia content tied to the products.

The Microsoft outage aftermath, part two

Fallout: Once again, forums and social media became the places for disgruntled customers to turn. And boy, were they ever disgruntled.

"What I've noticed is a complete lack of estimates on when issues will be resolved," one commenter wrote.

Some tweeters were less forgiving. "Windows Azure Storage, Amateur Hour Outage," one user quipped. "Running a global IT business - $37b: Building a cloud data system - $5b: Renewing a $10 SSL cert - Priceless," remarked another.

Fix: Microsoft revealed an expired SSL certificate was the cause of the crash (seriously?!). Once the company managed to "validate" and "implement" the "recovery options," all was well again.
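An expired certificate is one of the few outages that announces its own deadline in advance, since the expiry date is written into the certificate itself. As a rough sketch of how a team might watch for it, the Python below reads a certificate's "notAfter" field and computes the days remaining. The helper names, the 30-day warning window, and the sample dates are assumptions made for illustration, not a description of Microsoft's tooling.

```python
import socket
import ssl
import time
from typing import Optional

SECONDS_PER_DAY = 86400


def days_until_expiry(not_after: str, now: Optional[float] = None) -> float:
    """Days remaining before a certificate's notAfter timestamp.

    not_after uses the format returned by getpeercert(),
    e.g. "Jan  1 00:00:00 2030 GMT".
    """
    expires_at = ssl.cert_time_to_seconds(not_after)
    current = time.time() if now is None else now
    return (expires_at - current) / SECONDS_PER_DAY


def needs_renewal(not_after: str, warn_days: int = 30,
                  now: Optional[float] = None) -> bool:
    """True once the certificate is inside the warning window (or already expired)."""
    return days_until_expiry(not_after, now) <= warn_days


def fetch_not_after(hostname: str, port: int = 443) -> str:
    """Fetch the notAfter field from a live server's certificate.

    The default context verifies the certificate, so this raises an SSL
    error if the certificate has already expired.
    """
    context = ssl.create_default_context()
    with socket.create_connection((hostname, port), timeout=10) as sock:
        with context.wrap_socket(sock, server_hostname=hostname) as tls:
            return tls.getpeercert()["notAfter"]


# Hypothetical dates: a certificate expiring 10 days from "now".
sample_now = ssl.cert_time_to_seconds("Dec 22 00:00:00 2029 GMT")
print(needs_renewal("Jan  1 00:00:00 2030 GMT", warn_days=30, now=sample_now))
```

Run on a schedule against `fetch_not_after("your-host.example")`, a check like this would flag a certificate weeks before it lapses rather than hours after.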

Google Drive

Date: March 18-21, 2013

Duration: About 17 hours total

Failure: It all started on Monday, March 18, when many Google users faced slow load times or full-on timeouts while trying to access their Drive documents and files. That lasted for about three hours.

If only the troubles had stopped there. A day later, a second Google Drive outage kept some users from accessing the service for about two hours. Two days after that, the Schmidt really hit the fan when Drive went down for a whopping 12 hours.

Yikes. Where's Sergey's skydiving team when you need 'em?

The Google Drive outage aftermath

Fallout: The cloud-reliant crowd isn't one to stay quiet, so you can bet forums and social networks filled up fast with frustrated folks. Google did its best to keep users apprised of engineers' progress with its Apps Status Dashboard.

Fix: Google said the initial issue was related to a glitch in the company's network control software. The system apparently failed to load-balance and introduced unwanted latency into the company's servers. That, in turn, led to an issue with Drive's connection-management system.

Google said it had fixed the bug and tweaked its load-balancing setup to allow for "greater isolation" between its network services. The company also adjusted its Drive-specific software to make the service "more resilient" when it comes to latency and recovery.

CloudFlare doesn't fare well

Date: March 3, 2013

Duration: About an hour

Failure: CloudFlare's business revolves around protecting and accelerating sites around the Web, but on the morning of March 3, the company's own site and all of its services kicked the bucket, taking down some 785,000 other sites -- including WikiLeaks, 4chan, and some government websites -- along with them.

(A video from CloudFlare shows BGP sessions being withdrawn as the company's routers crashed.)

Credit: CloudFlare
The CloudFlare outage aftermath

Fallout: For about an hour, if you tried to get to any CloudFlare-connected site, all you'd get was a cryptic "No Route to Host" error message along with a mildly amusing sense of irony.

Fix: In a postmortem of the incident, CloudFlare said a systemwide failure of edge routers -- which connect CloudFlare's system to the rest of the Internet -- was the cause of the crash. While a few routers going down would typically cause traffic to be shifted, in this case, a bug took every single router offline at the same time.

Engineers found the offending code, cleared it out, then had to wait for teams in 23 data centers across 14 different countries to physically reboot all the routers.

Dropbox drops off the Web again

Date: May 30, 2013

Duration: About 90 minutes

Failure: After almost five months of mainly smooth sailing, Dropbox disappeared for a second time in 2013 at the end of May. This time, the service went offline for about an hour and a half, leaving users with no way to access their files or upload any new material.

The Dropbox outage aftermath

Fallout: After January's 16-hour Dropbox-dead fest, folks were understandably irked when the service seemed to shut down again. Luckily, the site returned to normal before too much irritation could boil up.

Fix: Dropbox stayed mum on the cause of its second 2013 crash, saying only that its service had returned to normal and that it "apologize[d] for any inconvenience that might have been caused."

The Fail Whale flails

Date: June 3, 2013

Duration: About 45 minutes

Failure: Twitter's infamous Whale of Doom reared his blubbery head this summer when the service crapped out for just under an hour. During that time, users were unable to access the service to send or read tweets. After the first 25 minutes, service returned for some folks but remained slow and spotty for a while longer.

In the grand scheme of things, of course, a single outage of 45 minutes is pretty impressive for a service that used to be offline more often than not.

The Twitter outage aftermath

Fallout: Timelines came up blank and tweets went undelivered for a short time -- and Google+ presumably saw a sudden spike in activity from all the people asking if Twitter was down for everyone else.

Fix: Twitter said an error during a "routine change" sent the Fail Whale swimming to the Web's surface. Engineers "rolled back the erroneous change" as soon as they pinpointed the problem, and the tweets started flowing again with no wily whales in their way.
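Rolling back a bad change generally follows the same pattern wherever it happens: snapshot the known-good state, apply the change, check health, and restore the snapshot if the check fails. The Python below is a minimal sketch of that pattern. The config keys, the health check, and the function name are all hypothetical, and nothing here reflects how Twitter's own deployment system works.

```python
from typing import Callable, Dict

Config = Dict[str, str]


def apply_with_rollback(live: Config,
                        change: Config,
                        is_healthy: Callable[[Config], bool]) -> bool:
    """Apply a change to the live config; restore the old state if unhealthy.

    Returns True if the change was kept, False if it was rolled back.
    """
    snapshot = dict(live)          # remember the known-good state
    live.update(change)            # apply the "routine change"
    if is_healthy(live):
        return True
    live.clear()                   # health check failed: roll back
    live.update(snapshot)
    return False


# Hypothetical example: a change that switches off a setting the service needs.
live_config = {"timeline_cache": "on", "region": "us"}
bad_change = {"timeline_cache": "off"}
kept = apply_with_rollback(
    live_config, bad_change,
    is_healthy=lambda cfg: cfg["timeline_cache"] == "on",
)
print(kept, live_config)
```

The faster the health check runs after a change lands, the shorter the outage, which is the difference between a blip and 45 minutes of Fail Whale.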