Is One of VMware's Best Features a Really Bad Idea?

Who ever said moving VMs from one physical server to another was a way to improve stability or manage server capacity wisely?

One of the more interesting things about covering the computer business is the number of "Duh" moments involved.

I don't mean the conversations where you're way out of your depth and an enthusiastic savant is walking you through the IT equivalent of quantum physics for kindergarteners, hoping you'll pick up the technology quickly enough to talk about why it's important in the first place.

And I don't mean the monologues with those who only think they're savants, who mistake your suicidal stupor for awestruck reverence and their own need to sell you a bill of goods for evangelical zeal.

If you're curious about or involved with technology at all, you look forward to the former situation as an opportunity for a guided tour of the future, and dread the latter as evidence that relatives of the tick and tapeworm have successfully evolved into soul-sucking intellectual parasites. Unless you feel a need to pre-pay a karmic debt through suffering, in which case they're all good.

The best kind of "Duh" moment usually comes in the middle of a conversation with someone with impressive expertise on a topic you feel as if you already understand pretty well, until they undermine one of your major assumptions.

Like, say, the idea that being able to move a virtual machine from one physical server to another is a good thing. How could it not be?

Isn't one of virtualization's greatest features the ability to move an application and the VM on which it runs from a balky or overloaded physical server to a better one without having to do more than point and click?

Every data center manager wants or needs the ability to shift VMs around while they're running, preferably without even touching them, by setting up rules that say a VM should move when utilization on its host hits a predefined level. Right?
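The kind of rule being described here can be sketched in a few lines. This is a minimal illustration of a threshold-triggered migration policy, not any vendor's actual API; the function name, the 80 percent threshold, and the VM-selection heuristic are all assumptions for the sake of the example.

```python
# Sketch of a threshold rule: when a host's utilization crosses a
# predefined level, pick a VM to migrate off it. All names and numbers
# here are illustrative, not any vendor's real interface.

def pick_vm_to_migrate(host_load, vms, threshold=0.80):
    """Return the name of a VM to move, or None if the host is fine.

    host_load -- current utilization of the host (0.0 to 1.0)
    vms       -- dict mapping VM name to the share of host capacity it uses
    """
    if host_load < threshold:
        return None
    # Prefer the smallest VM that brings the host back under threshold,
    # minimizing the amount of running state that must be copied live.
    for name, load in sorted(vms.items(), key=lambda kv: kv[1]):
        if host_load - load < threshold:
            return name
    # No single VM is enough; fall back to evicting the largest one.
    return max(vms, key=vms.get)
```

It's exactly this kind of automatic, mid-flight move that the rest of the article calls into question.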

"In a development or QA environment, sure, that makes a lot of sense. I can't imagine why I'd want to do that in a production environment, though," says Chris Steffen, principal technical architect at Kroll Factual Data, a credit-reporting and financial-information services agency in Loveland, Colo.

"Hot-provisioning is a cool gimmick, but on a scale of one to 10, with 10 being critical, live migration of VMs is about a three," Steffen says. "I don't know what kind of environment other than QA, development or staging you'd want to do it in, but it's not going to be in any production environment I'm responsible for; it's not going to be in any environment dependent on any kind of SLA or performance requirements. The justification for the inherent risk just isn't there."

It would be easy enough to brush Steffen off on a partisan basis. He's not only a big user of Microsoft's Hyper-V and other virtualization products, he's THE big user, or at least the leading poster child for Microsoft virtualization products—which come under fire regularly from VMware and the rest of the industry for their inflexibility and lack of the kind of management tools that let you shift VMs around at will.

Kroll's top IT people and CEO surprised the rest of the company's tech crew by deciding in 2003 (the Bronze Age of x86-based virtualization) that the company would base its whole data-center strategy on virtualization and (ack!) would do it with Microsoft products.

The second part of the decision actually came later, after Microsoft promised all its newest technology, support direct from the dev teams, and regular air-drops of virtualization specialists to get Kroll up to speed up to acceptable performance levels.

Kroll is not, by any stretch of the imagination, a typical Microsoft virtualization shop, Steffen offered, about 30 seconds into our first conversation. He doesn't apologize for the ridiculous levels of support Microsoft has offered, but he doesn't apologize for Microsoft's faults, either.

Without extraordinary help, especially in the early days, Kroll could not have met the high service-level criteria that are absolute requirements in the credit-reporting business, where unquestioned data security, five-nines availability and sub-second response times are non-negotiable.

In that kind of unforgiving environment, shifting an application and its VM from one physical server to another -- which increases the possibility that minor differences in configuration, access to data, patch status, or even location within the network could cause a production server to glitch or crash -- is what's technically known as a "stupid risk."

Duh.

I knew that. The only data center managers willing to make changes to an active application or server are those eager to experience the thrill of job hunting and unemployment-insurance acquisition.

How many production apps or servers get patched or updated while they're actually running, compared to those that get patched on a regular schedule when they're scheduled to be offline for some limited time anyway?

"Most of Cisco's big-iron devices have dual power supplies and so forth, for redundancy," Steffen says. "But when you're showing someone around on a tour, do you pull out one of the plugs to prove it works? If you did, someone like me would be there ready to strangle you."

"I don't know what the risk is, but it's an unnecessary risk," Steffen says. "I can't think why I would do a hot move as opposed to a cold move."

In development or QA environments, or moving from QA to staging, or from staging to production, live migration is a great idea. But it's not critical. The time requirements in those situations are loose enough that crashing an application while moving it around isn't necessarily fatal. But neither is the time penalty of shutting it down before moving it.

A better approach -- carefully mapping out your available capacity and putting new VMs on hosts you know for sure have capacity and will continue to have it -- is one that Microsoft also hasn't been particularly good at, at least until its Virtual Machine Manager 2008 became available as a relatively stable beta. (Microsoft announced yesterday that VMM 2008 would ship in September.)

That requires good resource-mapping and VM management capabilities; but, more important, it requires better and different capacity planning than is typical in either physical or virtualization-enabled data centers right now.

Rather than put an app on a physical server and keep loading that host up with VMs until it's ready to overload, Steffen advises putting limits on the load any physical server gets.

He also advises keeping at least 10 percent of your total compute capacity in reserve—even if that means keeping one or more servers spinning and hot but otherwise idle—so you can expand new VMs into it when you need to.

It takes less than 15 minutes to launch a new VM when Kroll needs new capacity; installing it on a server with spare capacity enhances the flexibility without increasing the risk.
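Steffen's placement discipline can be summed up in a short sketch: cap the load on any one host, keep a fleet-wide reserve, and put each new VM only where capacity is known to exist. The function below is a hypothetical illustration of that policy; the 75 percent per-host cap is an assumed figure, while the 10 percent reserve comes from Steffen's own advice.

```python
# Sketch of capacity-first placement: cap each host's load, hold back a
# fleet-wide reserve, and place new VMs only where room already exists.
# The per-host cap is an assumed number; the 10% reserve is Steffen's.

PER_HOST_CAP = 0.75    # never load any one host past this fraction
FLEET_RESERVE = 0.10   # keep 10% of total compute capacity idle

def place_vm(hosts, vm_load):
    """Return the host for a new VM, or None if placing it would break
    the per-host cap or eat into the fleet-wide reserve.

    hosts   -- dict mapping host name to current utilization (0.0 to 1.0)
    vm_load -- fraction of one host's capacity the new VM needs
    """
    total_free = sum(1.0 - load for load in hosts.values())
    if total_free - vm_load < FLEET_RESERVE * len(hosts):
        return None  # placement would dip into the reserve
    # Prefer the most-loaded host that still fits under the cap,
    # leaving lightly loaded hosts free to serve as hot spares.
    candidates = [(load, name) for name, load in hosts.items()
                  if load + vm_load <= PER_HOST_CAP]
    if not candidates:
        return None
    return max(candidates)[1]
```

Because placement is refused up front rather than corrected later, there's never a running VM that has to be moved off an overloaded host.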

That's the beauty and flexibility of virtual servers, not the questionable ability to move a VM from one server to another while it's running, Steffen says.

Why launch a VM on any server you think is going to be overloaded enough that you'd have to move it later, without being able to secure its data and shut it down first, Steffen asks.

Fortunately, I had a good answer.

Duh.
