by Esther Schindler

Microsoft’s Zune Meltdown: Three Lessons Developers Should Learn

Opinion
Jan 05, 2009
Developer

On December 31, all the 30GB Zune models turned into bricks because of a Leap Year firmware coding error. This quality assurance and testing debacle demonstrates three lessons every software developer should take to heart.

On the last day of 2008, every unit of an older model of the Microsoft Zune MP3 player (the one with 30GB of storage) locked up. The devices were back in operation a day later, and Microsoft explained the cause of the trouble:

“A bug in the internal clock driver related to the way the device handles a leap year. The issue should be resolved over the next 24 hours as the time change moves to January 1, 2009. We expect the internal clock on the Zune 30GB devices will automatically reset tomorrow (noon, GMT).”

With that data, Zune and technical users have some idea of what happened in the “Z2K9” incident. Microsoft’s Scott Hanselman wrote a very good technical analysis for programmers about the dangers of such edge cases, and apparently he’s not the only one to cover the bug. (My thanks to Indrajit Chakrabarty for the pointer.)

However, aside from “how not to write code like that,” there are three important things for developers and software QA professionals—and their managers—to take away from the experience.

This Was a Failure of the Software Development Process and QA Testing

It’s great that the technical problem was so easily addressed (“wait a day”), but it’s one heck of an embarrassment for Microsoft. I’m not talking about their PR issues per se, though Microsoft is still trying to live down the Red Ring of Death debacle with its Xbox. However, Microsoft has long had, shall we say, a less-than-stellar reputation for quality, and they did not do themselves any favors with this incident. I feel especially sorry for the authors of the new book, How We Test Software at Microsoft (cue: pointing and giggling) and the many smart people I have met from the company. (They have great people. Really. Some of the smartest techies I’ve met. But somehow Microsoft doesn’t seem to create a culture that demands quality.)

But the bottom line is that this problem was entirely preventable. As a London-based web developer pointed out to me, “Edge conditions such as year transitions on leap years really ought to be tested as a matter of course, and shouldn’t be that difficult to do on devices where you can adjust the clock.” The date problem really should have been spotted before it was checked in, he says; any sort of code review probably would have spotted the infinite loop possibility. So why wasn’t it done? Why wasn’t it caught?
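
To make the reviewer’s point concrete, here is a minimal sketch in Python of the kind of day-count-to-year conversion a clock driver might perform. It is not the Zune firmware: the function name `split_day_count`, the 1980 origin year, and the 1-based day count are all assumptions chosen for illustration. What matters is the shape of the loop. Every pass either returns or subtracts a full year, so the last day of a leap year (day 366) cannot leave it spinning.

```python
import calendar


def split_day_count(day_count, origin_year=1980):
    """Convert a 1-based day count from Jan 1 of origin_year
    into a (year, day_of_year) pair.

    Hypothetical helper for illustration only.
    """
    if day_count < 1:
        raise ValueError("day_count must be 1 or greater")

    year = origin_year
    remaining = day_count
    while True:
        days_in_year = 366 if calendar.isleap(year) else 365
        if remaining <= days_in_year:
            # The date falls inside this year: done.
            return year, remaining
        # Otherwise consume the whole year, so the loop always makes progress.
        remaining -= days_in_year
        year += 1


def check_year_boundaries():
    # 1980 is a leap year: day 366 is Dec 31, day 367 rolls over.
    assert split_day_count(365) == (1980, 365)
    assert split_day_count(366) == (1980, 366)
    assert split_day_count(367) == (1981, 1)
    # 1981 is an ordinary year: day 365 is Dec 31, day 366 rolls over.
    assert split_day_count(365, origin_year=1981) == (1981, 365)
    assert split_day_count(366, origin_year=1981) == (1982, 1)


if __name__ == "__main__":
    check_year_boundaries()
```

The five assertions are the edge conditions the quote describes: the last day of a leap year, the last day of an ordinary year, and the day after each.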

I do understand the notion of “ship on time,” and that some things get lost in the eternal desire to make a production date. Quality assurance testing is not the only victim. But this is a well-defined problem set with pretty darned obvious unit tests. (I won’t be surprised if I get e-mail messages from QA Tools companies telling me that their products include such tests as a matter of course. Just post a response to this post, folks. In this context, it’s fine.)

We all make mistakes. But the purpose of software engineering is to catch and fix errors before the product is released.

For further contemplation: Would your company’s software development process have caught an error like this?

Failure to Learn From History: It’s Date Math for Crissakes!

I could easily forgive this bug if it were terribly obscure, such as occurring only on Wednesdays when the user name is six characters long and starts with an “M.” But it’s not. It’s standard date math. The computer industry has been doing date math since the first mainframe, and every possible variation has been tackled, in every possible language, from Assembly language and COBOL to Python and Ruby. That’s not to say that we haven’t encountered date math problems before (see Classic WTF: The Bug That Shut Down Computers World-Wide) but to make a leap year mistake today? Sheesh.

There’s no excuse for a date math problem to crop up in your code, particularly for something as clearly understood as leap years. I’m not asking you to write a date routine to play a piece of music on the Thursday before Easter, after all. Just something that is well documented, with plenty of opportunity for code reuse. Why is any developer re-inventing this particular wheel?
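
As one illustration of that reuse, a language’s standard library will usually do the leap-year bookkeeping for you. The sketch below uses Python’s `datetime` module. The helper name `date_from_day_count` and the January 1, 1980 origin are hypothetical, picked only to have something to count from.

```python
from datetime import date, timedelta


def date_from_day_count(day_count, origin=date(1980, 1, 1)):
    """Return the calendar date for a 1-based day count from `origin`.

    Hypothetical helper: the library handles leap years, so there is
    no hand-written year loop to get wrong.
    """
    if day_count < 1:
        raise ValueError("day_count must be 1 or greater")
    return origin + timedelta(days=day_count - 1)


if __name__ == "__main__":
    last_day_2008 = date_from_day_count(10593)
    # tm_yday is the day of the year (1 through 366).
    print(last_day_2008, last_day_2008.timetuple().tm_yday)
    print(date_from_day_count(10594))
```

With this approach the year rollover is a property of a well-tested library rather than of new code, which is the whole argument for not re-inventing the wheel.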

For further contemplation: In what “wheel-reinvention” is your team currently engaged?

That Programmer is Probably Still in the Computer Industry

When I asked for input on LinkedIn, I described the Zune Z2K9 problem as an EPIC FAIL. To my surprise, several respondents took issue with my label, because this was “only” a consumer device, and 24 hours of interrupted audio is not critical to anybody’s life.

I disagree. This is a major failure, particularly if you take a long-term view. For one thing, I firmly believe that each of us should do our absolute best, all of the time. This is called a “work ethic.” If it’s your job to do something, it’s your job to do it right. Yes, I realize that people and events beyond your control can make this difficult; that doesn’t mitigate your responsibility. Even if we’re talking about a trivial MP3 player, it was that team’s job to create the best possible product that fulfilled all its promises, including Happy Customers. Goal: Create Great Stuff. End Result: an embarrassing failure which reflects poorly on our profession.

And as trivial as this device might be in mission-critical terms, it’s not a minor one for Microsoft’s sales. Though CIO.com puts relatively little attention on consumer devices (read 20 USB Gizmos That Have No Place in the Enterprise (But You’ll Love Just the Same) for a fun exception), I’m quite sure that no one at the company considers the Zune product success unimportant, and I’d bet dark chocolate that Microsoft dreams of overtaking iPod sales.

Yes, I’m glad this mistake was made on an MP3 player and not at a power plant or on a medical device. But here’s the thing: the Zune that failed was an older model. The programmer who wrote the errant code has moved on. Most probably the developer is still in the programming field, and has spent the last few years writing even more embedded systems for Microsoft or for other companies. For all we know, last week that programmer wrote the core software for a power plant or medical device. Does that change your perceptions?

Quality Assurance doesn’t apply only to fixing code. It applies to fixing broken application development processes, whether they are broken because Quality Assurance departments aren’t given the respect and resources they deserve, because developers leave it to QA to find every problem rather than bearing personal responsibility for their own code (my gut tells me this is comparatively rare, but I’m sure it happens), or for any other reason. As one correspondent noted, “It may have been a development problem or a requirements problem. Such things are always, ultimately, management problems.”

Of all the thousands of articles I’ve written, and the dozens of books that bear my name, the single quote from my 18 years of writing that’s used most often as a Usenet signature is something I wrote in the last issue of OS/2 Magazine in 1996: “Microsoft’s biggest and most dangerous contribution to the software industry may be the degree to which it has lowered user expectations.” Apparently, that quote is still frighteningly relevant.