Apr 15 2010

High-Availability Benefits from Virtualization

Agencies combine virtualization with fault-tolerant hardware for ultimate resiliency — and to ease server management.

It’s been said that some organizations are too big to fail. The same is true of your critical applications and data. A server crash, a power failure, even user error can cause your systems to become unavailable precisely when you need them most. That’s why more federal agencies are turning to virtualization to ensure high availability of their most critical IT assets.

“The benefit to using virtualization for high availability is that it’s much simpler for IT managers,” says Dan Kusnetzky, vice president of research operations for the 451 Group. “You don’t have to change applications manually if they’re running inside encapsulated servers or clients using motion technology. Virtualization offers simplicity, in that you have multiple machines running on a single server and the workload can move back and forth as needed.”

Of course, high availability means different things to different people. For some, it’s having a virtualized system where, if a critical app or even an entire server fails, a new virtual machine automatically takes over within minutes or possibly seconds. For others, it’s using fault-tolerant servers that provide full hardware redundancy, allowing for real-time replication of processing and data storage and assuring uptime that approaches 99.999 percent. 
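To put that "five nines" figure in perspective, uptime percentages can be converted into the downtime they allow per year. The short Python sketch below does the arithmetic. The function name and the 365-day year are illustrative assumptions, not figures from any agency or vendor quoted here.

```python
# Illustrative arithmetic only: converts an availability percentage
# into the downtime it permits per year.
# Assumption: a 365-day (non-leap) year.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes


def annual_downtime_minutes(availability_percent: float) -> float:
    """Return the minutes of downtime per year allowed at a given availability."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


if __name__ == "__main__":
    for pct in (99.9, 99.99, 99.999):
        minutes = annual_downtime_minutes(pct)
        print(f"{pct}% uptime allows about {minutes:.2f} minutes of downtime per year")
```

At 99.999 percent, the allowance works out to a little more than five minutes a year, compared with nearly nine hours at 99.9 percent.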

Keep It Moving

“The apps we worry most about are our web-based electronic document workflow application, selected applications in our eTools web-based suite of contract management applications, the database underlying the eTools apps and, of course, everyone’s No. 1 mission-critical app, e-mail,” says Michael R. Williams, CIO for the Defense Contract Management Agency. DCMA is responsible for making sure Defense Department contractors meet all their contractual obligations.

Williams says the primary goal in moving DCMA to a virtualized environment was to save money by collapsing 17 data centers into two. Higher availability was just “the icing on the cake,” he says. “It turns out virtualization brings with it flexibilities that can be leveraged to increase redundancy (for example, clustering in virtual mode and virtual failover machines in standby status). That, plus the ability to rapidly fire up a virtual server to pick up the workload from a failed machine, increases availability.”

Virtualization alone, however, won’t guarantee continuous operation. The most reliable approach is to run the virtualized environment on fault-tolerant hardware, which keeps processing and data synchronized across redundant components so that a hardware failure does not interrupt the virtual machines.

When it comes to the need for continuous operation, few agencies can match the Federal Aviation Administration, which has used fault-tolerant hardware from Stratus for air traffic control and critical systems for more than 30 years. Two years ago, FAA began virtualizing its operations-critical international message switching system using Stratus systems running VMware ESX.

Building in fault-tolerant hardware to ensure that systems are continuously available can add 25 percent to 35 percent to the cost, says Denny Lane, director of product management for Stratus. But the alternative can be far costlier.

Customers rarely take the time to calculate how much downtime really costs them, Lane says. The amount of time it takes to resync and restart systems, the loss of data and productivity and the consequences of failing to comply with federal regulations can be enormous.

“Even if you have a clustered hardware solution and it fails and restarts, what’s going on during that time can be lost forever,” Lane says. “A high-availability system that gets rebooted may not be good enough. If you have gaps in the auditability of data, you can run into penalties.”

Kusnetzky agrees: “The lowest level of high-availability requirements can be met by virtual machine software combined with motion technology, but the highest levels of availability cannot be achieved by virtualization because the transition time is too long,” he says. “Put in boxes designed for continuous availability, have virtualization software running on them, and you’ll never see a failure.”