Stop Complaining When Microsoft Pulls an Update to Fix it


Has it been a bad few months for Microsoft and updates? In August there was the Surface Pro update that was pulled and rewritten, and not released until September. Then a Lync Server update was pulled when some users couldn't install it because of certificate issues, and a new feature intended to simplify syncing new libraries in OneDrive for Business was removed and replaced too. After that the Outlook Web App started having problems in the new version of Chrome, because the browser dropped support for a way of handling dialog boxes that the Exchange team was still relying on.

Does all that mean a problem with quality control that bodes ill for Microsoft, especially as Windows 10 will have the option of frequent updates and improvements? Or does it just mean that the shift from the three-year update cycle of Office and Windows to the three-week release cycle of Azure is reaching more of Microsoft's products?

That's a shift that will make products more stable as well as giving users more features faster - but only if you remember that the principle of fast-moving services is that things will break, and what matters is how well you recover.

Think of Facebook, one of the original drivers of consumerization, as business users asked 'why can't the software and hardware I use at work be as simple and attractive as what I use online in my spare time?' Facebook's motto has long been 'move fast and break things'; as it's grown, that's shifted to emphasize reliability, but it's still about moving fast, because that's the only way you can be stable.

Unless you're building a system that will never change, stability in deployment is more like riding a bicycle than building a monument.

You'd expect Mark Russinovich, as the Azure CTO, to talk about releasing features and updates frequently as a way of getting stability; as he told us recently, "The only way to make it so you can get more stable is to release more often. Once you get your systems - from your engineering systems, to your deployment systems, to your monitoring systems - tuned, so you're getting things out quickly and detecting where health goes awry really quickly, then you don't have to let things bake forever." But you'll hear the same thing from members of the Windows team, like Jeffrey Snover, the creator of PowerShell - which comes out as part of the Windows Management Framework. That now gets updated between new versions of Windows, and Snover too talks about shipping being like riding a bicycle: if you're too slow, you are much less stable.

Perry Clarke, who runs the engineering team for Office 365, talks about updates in terms of entropy; "it's the statement that there's no free lunch," he jokes. "If you want to build systems that are really efficient you want to find processes that minimize the amount of entropy and disorder you create - they're called reversible. They don't look like massive big step functions of change that happen all at once." An on-premises update every three years? That's a huge step change from a stable if elderly state to a new state that's supposed to be better but can be painful to get to - not just because of the upgrade process, but because the new release is built as a single, interdependent system.

"When you deploy a huge amount of change, if we're not validating each one they get bound up in tightly coupled layers with a lot of changes across the system. When you're done you have a very large number of very good things and then a small number of things that are not so good and it's very hard to pull out those not so good things later on." He calculates the amount of work addressing interactions as being much worse than linear; "it's at least quadratic so there is a declining return on batching things up together."

Incremental cloud improvements are very like reversible processes. "You're making a whole bunch of very small changes, each one very reversible in the sense that the impact is small and you can easily undo them long before they are coupled with changes later on. So when you make mistakes you can reverse them and that causes a little bit of entropy when you do the reverse. But in the end you're not just spreading this randomness, this disorder across time; you actually end up introducing less entropy in the system to get the same amount of work done in the process."
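To put rough numbers on the "at least quadratic" point, here is a minimal sketch. It assumes, purely for illustration, that the interaction work in a release is proportional to the number of pairs of changes shipped together, and that changes in separately validated batches don't interact at all. The function names are hypothetical, and this is not a model Clarke or Microsoft has published.

```python
def pairwise_interactions(n: int) -> int:
    """Number of distinct pairs among n changes shipped in one batch."""
    return n * (n - 1) // 2


def total_interactions(total_changes: int, batch_size: int) -> int:
    """Pairs to reason about when total_changes ship in batches of batch_size.

    Simplifying assumption: each batch is validated before the next one
    ships, so only changes inside the same batch can interact.
    """
    full_batches, remainder = divmod(total_changes, batch_size)
    return (full_batches * pairwise_interactions(batch_size)
            + pairwise_interactions(remainder))


if __name__ == "__main__":
    # Same 1,000 changes, shipped as one big release or in small steps.
    for size in (1000, 100, 10, 1):
        print(size, total_interactions(1000, size))
```

Under those assumptions, shipping 1,000 changes as a single release leaves 499,500 possible pairwise interactions to untangle, while shipping the same changes ten at a time leaves 4,500.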

Add to that continuous validation the faster feedback loop - instead of finding out two years after you ship code you started working on three years ago that there's a problem (because customers waited for the service pack before installing), you see problems straight away.

That can be far more disruptive because that problem could hit everyone using the service at once, and if everyone sees downtime at once it's a much bigger issue than the same amount of disruption for the same number of people spread out over much longer periods of time. On-premises software is naturally isolated, so only one installation sees the problem at a time. Clarke is unperturbed by the bad publicity that might bring; "it's worth having a few kerfuffles to make sure proper attention is maintained," he says, declaring that "the disinfecting power of sunlight is a positive thing for the world."

Office 365 tries to get the same notion of isolation by 'flighting' changes to larger and larger 'rings' of deployment, starting with internal teams, like the few hundred people working on Exchange. "Every few days some change is rolled to the environment that affects thousands of people, and then we expand those rings out to all of Microsoft, which is tens of thousands of people, and then to a set of users or tenants we think are more risk tolerant. Eventually you will get it out worldwide, but that won't be for several weeks." That way the team has multiple chances to catch problems, and see how changes affect user productivity too.
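The ring idea can be sketched in a few lines. This is an illustrative outline only, not how Office 365 actually implements flighting: the ring names paraphrase Clarke's description, and `flight_change` and the `is_healthy` callback are hypothetical stand-ins for a real deployment system and its monitoring.

```python
from typing import Callable, List, Tuple

# Rings from smallest to largest audience (names are illustrative).
RINGS = [
    "feature team",
    "all of Microsoft",
    "risk-tolerant tenants",
    "worldwide",
]


def flight_change(rings: List[str],
                  is_healthy: Callable[[str], bool]) -> Tuple[List[str], bool]:
    """Roll a change out ring by ring, stopping at the first unhealthy ring.

    Returns the rings the change reached and whether the rollout completed.
    """
    reached: List[str] = []
    for ring in rings:
        reached.append(ring)          # deploy to this ring
        if not is_healthy(ring):      # hypothetical monitoring check
            return reached, False     # pull the change; wider rings never see it
    return reached, True


if __name__ == "__main__":
    # A problem that only shows up once all of Microsoft is using the change.
    reached, completed = flight_change(RINGS, lambda r: r != "all of Microsoft")
    print(reached, completed)
```

In that example run the change is pulled after the second ring, so the risk-tolerant tenants and the worldwide ring are never exposed to it.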

Experiments are much cheaper when you can just put them on the next train. "You have an underlying methodology for keeping your trains running, and you can have a lot of parallel trains running at different layers of scope so you don't create batches of change. Some of those trains will get cut off because fixing the defect isn't worth it when you can wait for another train that's only a day behind it."

Sometimes the problem isn't that the service changed. "The world is changing, people are constantly using the service differently; code that was working six months ago and had worked for three years, then people start doing something different that causes it to break. Then it's about detecting that, getting to the root cause and getting a fix or a workaround - and getting that down to minutes or hours, or even sub-second with autorecovery," says Clarke.

Those same principles of rings and flights apply to updates for devices and local software; Windows Update makes an update available to a small group initially and then to progressively more users. That gives Microsoft a chance to spot problems and pull updates.

Is there a quality problem? Clarke notes that we all have more devices, plus there are just more updates. "I do think the speed of feedback to the device and to the native client is increasing and that is leading to increasing rates of seeing updates. I haven't seen statistics that indicate the regression rate is increasing. The statistics we do see is that clients are getting better."

But devices and the software that runs on them haven't gone through as much of the process of decoupling changes from each other as cloud services have, making them possibly more brittle. "To some degree, devices and large legacy clients are the ones that are toughest to migrate to this world of frequent continuous updates," Clarke points out, "but I think they're all headed this way."

The question isn't just about the quality of updates; there's also the question of how much change we can cope with. Conversations with Microsoft customers used to be "how come you guys are such dinosaurs? We're going to bet on vendors who are not as good as Microsoft today but are going faster," Clarke says.

"Now the conversations are 'wow, could you slow down the change? My company is not sure it's comfortable with the rate of change happening'."

That might mean thinking more carefully about what we change. "If there's only so much change you can introduce, you have to be careful of your budget to make sure you get maximum value. Before the constant was developer hours and the amount of value I can get out of that is almost infinite because the cost of change is small. Now we're starting to get to the capacity of human beings to build muscle memory to be efficient." And that means picking the changes that make people more productive - which won’t be the same for everyone. "We're going to shift to thinking that productivity isn't a commodity, because no company is the same as everyone else. And the opportunity to do something really unique for your company is much more valuable and interesting," he says.

And really, that's one of the key principles of consumerization.

This story, "Stop Complaining When Microsoft Pulls an Update to Fix it" was originally published by CITEworld.
