Designing a disaster recovery plan has traditionally forced companies to strike a delicate balance. To create a plan that restores operations quickly, an enterprise needs to invest significant capital; conversely, costs can be cut dramatically if the enterprise is willing to withstand longer periods of downtime. During the planning stages, while the network is running properly, the pressure to reduce costs is felt most strongly and often prevails. But when disaster strikes and the network goes down, everyone starts screaming to get it running again as fast as possible. Walking this tightrope is a major challenge, but with the advent of virtualization, it is quite possible to deploy a disaster recovery plan that restores operations quickly and at a reasonable cost.
At Transplace, we developed a new disaster recovery plan based on virtualization when we moved our infrastructure to a new production data center in 2007. We also took that opportunity to refresh our hardware and review our overall architecture. Previously, we ran daily backups and physically moved the data to an off-site location. With this process, we risked being down for half a day if we experienced a problem in the middle of the day. Because we backed up only once a day, we also risked losing a full day's worth of work. And the plan required dedicated servers that sat idle except when we executed a recovery.
After we moved into our new data center in Dallas at the end of 2007, we began to plan our new disaster recovery data center in Arkansas, into which we moved in February 2008. At the storage level, we deployed network-attached storage and SnapMirror software from Network Appliance to create virtual storage for our database and application servers. SnapMirror allows us to send copies of all changes to our backup facility on a near real-time basis without impacting the performance of the applications. Anytime a record changes in production, it sends a copy to our disaster recovery facility. This shared-storage approach also allows us to manage storage centrally. We buy storage only when we need it.
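The idea behind this change-shipping approach can be sketched in a few lines. This is not NetApp's actual SnapMirror protocol, just an illustration of the principle: every write to the primary store is forwarded to the disaster recovery copy immediately, rather than waiting for a nightly backup window. The class and record names are hypothetical.

```python
# Illustrative sketch of near-real-time change replication (not the
# real SnapMirror implementation): each write is applied locally and
# forwarded to the DR replica at the same time.
class ReplicatedStore:
    def __init__(self, replica):
        self.records = {}
        self.replica = replica  # dict standing in for the DR facility

    def write(self, key, value):
        self.records[key] = value
        # Ship the change immediately instead of batching it into a
        # once-a-day backup, so the replica lags by seconds, not hours.
        self.replica[key] = value

dr_copy = {}
primary = ReplicatedStore(dr_copy)
primary.write("shipment-42", "delivered")  # dr_copy now holds the change
```

The contrast with the old daily-backup model is the forwarding step inside `write`: the replica's staleness is bounded by per-change latency rather than by the backup schedule.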
At the database level, we deployed IBM P570s with AIX as the operating system, leveraging its logical partitioning technology. This combination allows us to partition each server to look like multiple servers, and we can run multiple database servers by sharing the capacity of the individual servers. In the disaster recovery facility, the database server runs four to six copies of Oracle that we use for testing and development most of the time, but if the need arises, we can shut down the virtual servers and run the disaster recovery instance of Oracle on that same server. This also allows us to make the most efficient use of our Oracle licensing costs, which are charged by each physical CPU core.
At the application server level, where we run VMware and Windows on Dell servers, the content of each virtual machine is also replicated to the disaster recovery site whenever an update occurs. As with the IBM database servers, we use a set of servers for testing and development. When we need to run a disaster recovery restore, we turn off the virtual servers for test and development, bring up the ones for disaster recovery, and we're good to go. All the data and content of the servers is quickly copied over.
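The cutover described above amounts to a simple swap: stop the test/dev virtual machines to free capacity, then start the disaster recovery instances on the same hardware. A minimal sketch, with hypothetical VM names and no real hypervisor API:

```python
# Hypothetical cutover orchestration: free capacity by stopping
# test/dev VMs, then start the DR VMs on the same physical hosts.
test_dev_vms = ["test-web-01", "dev-db-01"]
dr_vms = ["dr-web-01", "dr-db-01"]

def cut_over(running, stop, start):
    running = running - set(stop)   # shut down test/dev instances
    running = running | set(start)  # bring up DR instances
    return running

running = set(test_dev_vms)         # normal state: test/dev active
running = cut_over(running, test_dev_vms, dr_vms)
```

The point of the sketch is that the DR capacity is never idle: the same hosts do useful test/dev work until the moment they are needed for recovery.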
Four-Step Disaster Recovery Process
For enterprises ready to develop a disaster recovery plan, we recommend a four-step process that helps frame the project and ensures a reliable disaster recovery process:
Step 1: Enablement
Make sure all the data is properly transferring to the disaster recovery data center. Ensure that all the necessary hardware in the disaster recovery data center is in place, will remain stable and is running on up-to-date operating systems. Also, review all applications and decide how long you can go without each one. This helps you prioritize the most crucial applications: some might need to be restored in less than an hour, while you might be able to do without others for up to 12 hours. This part of the plan becomes an internal SLA.
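The application review can be captured as a simple inventory sorted by maximum tolerable downtime, which then doubles as the restore order during a recovery. The application names and hour figures below are illustrative, not Transplace's actual SLA:

```python
# Hypothetical application inventory: each entry records the maximum
# tolerable downtime (the internal SLA described above), in hours.
apps = [
    {"name": "reporting",   "rto_hours": 12},
    {"name": "tms-core",    "rto_hours": 1},
    {"name": "edi-gateway", "rto_hours": 4},
]

# Restore order during a recovery: tightest RTO first.
restore_order = sorted(apps, key=lambda a: a["rto_hours"])
```

Keeping the SLA in a machine-readable form like this also makes it easy to check, during quarterly tests, whether each application actually came back within its agreed window.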
Step 2: Testing
Develop detailed procedures and processes on how and how often to test the disaster recovery plan. We recommend at least once per quarter. You also need to determine how to measure success so that you can evaluate the testing and document the findings to compare one test to another with a high level of validity.
Step 3: Cutover Documentation
You need to document exactly how you will cut over if and when a disaster strikes. Some elements will resemble the test process, but procedures will differ when you execute them during a live disaster recovery. With all the pressure your IT staff will be under, it's critical that this step be clearly and thoroughly documented.
Step 4: Returning to Normal Production Infrastructure
Just as important as how to cut over to your disaster recovery infrastructure is knowing how to return to your normal production infrastructure. It’s not always a case of doing things in reverse, and it’s a process you should also test.
It’s important to bring all of the key vendors and your internal IT team into the same room at the same time. This gives everyone a chance to voice concerns, explain how their piece of the puzzle contributes to the overall project, and to understand the functions of the other parts of the project. If you get yourself into a position where you act as the go-between among your vendors, important information will undoubtedly be lost in translation.
Enterprises should take a good look at compression technologies. With all of the data that needs to be copied to the disaster recovery site all day long, it’s important to reduce the amount of bandwidth you require so that your network runs efficiently.
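To see why compression matters for replication traffic, consider that change records are often highly repetitive (similar field names, statuses and IDs), so even general-purpose compression shrinks them substantially. A quick sketch using Python's standard `zlib` module, with a made-up payload:

```python
import zlib

# Illustrative payload: repetitive change records like those shipped
# to a DR site all day long. Content is hypothetical.
record = b'{"order_id": 12345, "status": "in_transit"}'
payload = record * 1000

compressed = zlib.compress(payload, level=6)
ratio = len(compressed) / len(payload)  # fraction of bandwidth used

# Lossless: the DR site can reconstruct the original stream exactly.
restored = zlib.decompress(compressed)
```

Real WAN optimization appliances and replication products use more sophisticated techniques (deduplication, delta encoding), but the principle is the same: spend a little CPU at each end to cut the bandwidth the replication link requires.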
Looking back on the disaster recovery plan we started in 2007, we feel we have achieved the ultimate balance: a simple way to recover operations fast—but at a relatively lower cost than traditional disaster recovery systems. Without a doubt, virtualization played a vital role in helping us achieve this mission.
Vincent Biddlecombe is the CTO of Transplace. He has more than 15 years of experience in IT consulting with an emphasis on transportation management systems.