by David Taber

Google Calendar Outage: Lessons Learned

Opinion
Oct 13, 2010
Cloud Computing, Internet

As of Tuesday, Google's Calendar service was suffering interruptions lasting as long as 8 days for some furious users. What can you learn from it?

Like many small businesses, my firm uses Google for shared calendaring. It’s got enough features, it’s well integrated with e-mail and sync servers, it works, and it’s free. For these reasons, it’s got millions of users. So it doesn’t take a big problem to generate a lot of user noise.

This particular issue took 0.2% of calendars offline for several days. But that’s still 2,000 calendars, which means a lot of anxiety. We’ll get into how Google handled this in a moment, but let’s look at the general lessons first.

Peter Sandman has developed a way to predict the severity of people’s reactions to unhappy events. The higher the perceived danger, and the lower the person’s control over the situation, the more severe their reaction will be. This is why we’re much less afraid of electrocuting ourselves with a toaster than we are of being attacked by a shark, even though we’re 30 times more likely to die from an encounter with a toaster.

Unfortunately, all cloud-based solutions trigger both the perceived-danger and low-control sensitivities. In a cloud solution, users can’t see the machines, they can’t run a diagnostic, and they can’t run down the hall to yell at the IT people. Even the most sophisticated user has no idea how long it will take to recover their data, and can’t do anything about the steps involved. Further, if the service is simply down and users are presented with a blank page, they have no idea whether this is a temporary interruption or a complete data loss.

Google made this natural tendency worse by not applying resources early enough to stem the swell of resentment. In addition, many user symptoms were magnified by third-party products that Google had no idea how to respond to. Adding fuel to the fire, Google missed implied deadlines, and its communication was so sparse that it led to the exact opposite of crowd control.

What Your Team Can Learn

• It goes without saying that frequent backup of critical data would have made this much less of a user problem. Of course, nobody does this because it’s not automatic. Look for at least a semi-automated weekly backup of all system data from any cloud-based service. Make sure some database administrator (DBA) owns the task of doing a regular metadata backup (as I have yet to see this automated in any SaaS offering).

• Before you sign up for a cloud-based service, make sure that their service-level agreement (SLA) covers emergency situations and disaster recovery. Ask about their system performance/uptime dashboard (they do have one, don’t they?) and review it for completeness and frequency of update. Google’s simplistic “thumbs up/thumbs down” was not enough to quell user anxiety.

• Ask the service provider for a timeline of one of their recent service interruptions. All you’re looking for is a Gantt chart, or a spreadsheet — any evidence that they have actually analyzed their patterns of troubleshooting and response. If you can’t see any evidence of this, or they won’t share it with you even under non-disclosure…points off for them. At the very least, get them to tell you the date of the last outage they had.

• Get someone to review their discussion forums or other online community areas, looking for how they handled communications with the community during their last service problem. You’re looking for frequency of update, not speed of solution. (Of course you want a speedy solution, but crowd control is about having people feel informed and listened to.)

• If your company is serious about cloud computing, you’ll want to set up your own emergency response system to supplement what the cloud vendors do in three areas:

1. Troubleshoot users’ problems and verify that there isn’t cockpit error (also known as “problem exists between chair and keyboard” or PEBCAK).

2. Identify interactions between the cloud service and other software products that may exacerbate problems.

3. Provide clear, complete, and timely updates about the problem, the ramifications, the solution path, and timelines. Make sure your responders never make promises about a solution deadline, as repeated postponement will dramatically lower your credibility along with the SaaS provider’s.
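
The weekly backup suggested in the first lesson above can be semi-automated with very little code. What follows is a minimal Python sketch, not a vendor-specific implementation: `export_data` is a hypothetical callable standing in for whatever export mechanism your cloud service actually offers, and the `backup-YYYYMMDDTHHMMSS.json` file naming is an assumption made for illustration.

```python
import json
from datetime import datetime, timedelta
from pathlib import Path
from typing import Callable, Optional

BACKUP_INTERVAL = timedelta(days=7)  # "weekly", per the advice above
STAMP_FORMAT = "backup-%Y%m%dT%H%M%S"  # assumed naming scheme for snapshot files


def latest_backup_time(backup_dir: Path) -> Optional[datetime]:
    """Return the timestamp of the newest snapshot, or None if there is none."""
    if not backup_dir.is_dir():
        return None
    stamps = []
    for path in backup_dir.glob("backup-*.json"):
        try:
            stamps.append(datetime.strptime(path.stem, STAMP_FORMAT))
        except ValueError:
            continue  # skip files that do not follow the naming scheme
    return max(stamps, default=None)


def backup_is_due(backup_dir: Path, now: datetime) -> bool:
    """True when no snapshot exists or the newest one is at least a week old."""
    last = latest_backup_time(backup_dir)
    return last is None or now - last >= BACKUP_INTERVAL


def run_backup(
    backup_dir: Path,
    export_data: Callable[[], dict],  # hypothetical: wraps your vendor's export
    now: datetime,
) -> Optional[Path]:
    """Write a timestamped snapshot if one is due; return its path, else None."""
    if not backup_is_due(backup_dir, now):
        return None
    backup_dir.mkdir(parents=True, exist_ok=True)
    target = backup_dir / (now.strftime(STAMP_FORMAT) + ".json")
    target.write_text(json.dumps(export_data(), indent=2))
    return target
```

Called from a daily scheduled job, `run_backup` writes a new snapshot only when the newest one is at least seven days old, so a missed run is picked up the next day. Someone still has to own the job and check that the snapshots can actually be restored.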

David Taber is the author of the new Prentice Hall book, “Salesforce.com Secrets of Success” and is the CEO of SalesLogistix, a certified Salesforce.com consultancy focused on business process improvement through use of CRM systems. SalesLogistix clients are in North America, Europe, Israel, and India, and David has over 25 years of experience in high tech, including 10 years at the VP level or above.
