Cherwell IT Service Management Blog
Resources, Best Practices, and Solutions for ITSM Pros

What Google’s Glitch and Gmail Failure Teach Us About Service Management

Gmail and Google+ went down yesterday, and in a weird coincidence, Google users searching for the term “Gmail” hit a bizarre glitch that had them inadvertently sending emails to a guy named Dave from Fresno.

Read these TechCrunch articles if you’re interested in the details.

“Temporary Error 500. We’re sorry, but your Gmail account is temporarily unavailable.”

Wow. Really? Gmail? …I mean, Google?

For those of us in the IT service management world, the idea of Google experiencing this level of failure is sobering. And, if we’re honest, at least a little gratifying, but that’s another blog. Here are 5 things Google’s colossal technical failures of the past 24 hours can teach us about supporting customers in the unpredictable world of technology:

  1. The colossal failure WILL happen, and likely at the WORST possible time. If Google can experience this level of technical failure, you and I can forget about getting it right every time. Google made $51B last year, so resources weren’t the issue. This is a hard reality, because the people we serve in our organizations don’t typically have a frame of reference for “count on it, failure will happen.” Or rather, they might, right up until it happens to them.
  2. What can we do? Socialize the expectation that failure WILL happen. Mistakes will be made. Technology will let us down. But also be sure you can deliver on the promise of your services, and communicate what will happen when failure does occur: “Our team of professional technologists will respond quickly, communicate updates to you, and resolve the issues as quickly as possible.”
  3. Brace for the criticism (and the fact that there will be a lot of laughter at your expense). At the very moment Gmail went down, the engineers responsible for keeping Google online happened to be sitting down for a Q&A on reddit. Imagine that. In the words of TechCrunch writer Greg Kumparak: “Heh. Worst. Timing. Ever.” He went on to point out that this team of engineers is called the “Site Reliability Team,” the group “responsible for the 24×7 operation of Google.com.” Funny, really funny stuff. And I’m not even showing you all the funny tweets. Poking fun at Google+, one writer wrote, “The problem is currently affecting a huge number of users. Google+ is also down, although you’d be forgiven for not having noticed that sooner.” The laughter will come (after the screaming), and it will likely be at your expense.
  4. What can we do? Not much. Except laugh. And be reminded that we shouldn’t take ourselves too seriously. Let the criticism shape you as a leader: take responsibility, respond with patience, offer clear communication.
  5. Use this as an opportunity to reinforce and shape your culture. One of the most telling moments of the story came out during the reddit interview. The engineers kept answering questions while the services were down (because you can bet Google employs more than 4 engineers!). Here is the revealing exchange:

Reddit user notcaffeinefree asks: “Sooo…what’s it like there when a Google service goes down? How much freaking out is done?”

Google’s Dave O’Connor responds: “Very little freaking out actually, we have a well-oiled process for this that all services use— we use thoroughly documented incident management procedures, so people understand their role explicitly and can act very quickly. We also exercise this [sic] processes regularly as part of our DiRT testing. Running regular service-specific drills is a big part of making sure that once something goes wrong, we’re straight on it.” (http://tcrn.ch/1d2FQIK)

And that, in a nutshell, may be why Google banked $51B last year.
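
O’Connor’s answer is the real lesson: the procedure is written down, the roles are explicit, and the same procedure is rehearsed in regular drills. For teams that want to borrow that discipline on a smaller scale, here is a minimal, hypothetical sketch in Python of what an explicit, drillable runbook might look like. The service name, roles, and steps are invented for illustration; none of this reflects Google’s actual tooling.

    # A minimal, hypothetical sketch of a "documented incident procedure" in code.
    # The roles, steps, and service name are illustrative assumptions, not any
    # vendor's real runbook format.

    from dataclasses import dataclass, field
    from datetime import datetime


    @dataclass
    class RunbookStep:
        role: str      # who acts (e.g., "incident commander")
        action: str    # what they do, in plain language
        done: bool = False


    @dataclass
    class IncidentRunbook:
        service: str
        steps: list = field(default_factory=list)

        def execute(self, drill: bool = False) -> None:
            """Walk the documented steps in order, logging each one.

            The same procedure is used for real incidents and for drills,
            so a drill exercises exactly what a real outage would."""
            mode = "DRILL" if drill else "INCIDENT"
            start = datetime.utcnow()
            print(f"[{mode}] {self.service} runbook started at {start:%H:%M:%S}Z")
            for step in self.steps:
                print(f"  {step.role:>20}: {step.action}")
                step.done = True
            print(f"[{mode}] all {len(self.steps)} steps acknowledged")


    # Example: a service-specific runbook with explicit owners, so nobody has to
    # decide "who does what" while the service is down.
    mail_runbook = IncidentRunbook(
        service="mail-frontend",
        steps=[
            RunbookStep("incident commander", "declare the incident and open a channel"),
            RunbookStep("communications lead", "post a status update within 15 minutes"),
            RunbookStep("operations lead", "roll back the most recent release"),
            RunbookStep("communications lead", "publish resolution notice and postmortem ETA"),
        ],
    )

    if __name__ == "__main__":
        # Run it as a drill, the way DiRT-style exercises rehearse the real thing.
        mail_runbook.execute(drill=True)

The point isn’t the code; it’s the shape. When the steps and their owners are spelled out in advance, a drill and a real outage exercise the same muscle, and “very little freaking out” becomes possible for the rest of us, too.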
