As a DBA, your job is to provide database systems that are reliable, secure and available when required. In order to achieve these goals, you must understand your customer's service requirements, usually conveyed through Service Level Agreements.
MORE ON SERVICE LEVEL AGREEMENTS
Far too many DBAs have no idea what their Service Level Agreements are, or if it's possible to meet them. Good DBAs know what they are, obtain budget to design and implement systems to meet them, and ensure they're met through Disaster Recovery planning and testing.
Let's take a look at what's typically included in Service Level Agreements, understand the ramifications of an availability target, and how we can ensure the SLAs are achievable.
Typical Service Level Agreements
Whilst every organization will have different Service Level Agreements, most fall into similar groups or categories. Common ones for SQL Server systems are as follows:
System Availability. e.g. The database servers must be available 7 days a week, from 6am to midnight
Acceptable Data Loss. e.g. No more than 15 minutes of data entry can be lost
Recovery Time. e.g. In the event of a disaster, the systems should be back up and running within one hour.
Performance. e.g. Transaction response time should not exceed 2 seconds.
Specific statements such as those presented above are required in order to design systems and processes to meet expectations. What's important at this point is understanding the need to make design and infrastructure decisions in the context of meeting Service Level Agreements.
How Many Nines?
System Availability, a common SLA item, is often measured in terms of the amount of 9's in the availability percentage target, with "Five Nines" representing 99.999% system availability. This translates to a 5 minute maintenance window, or allowed downtime, per year, or 0.8 seconds per day! In contrast, a 99% target represents 1.7 hours per week. Removal of each "nine" from the uptime target significantly reduces the cost of building an environment that meets the target, as Table 1 helps demonstrate.
Table 1: How Many 9s Do You Need?
This table shows the sharp decrease in downtime for each additional 9 in the availability target. 99% availability allows for about 3.5 days downtime per year. 99.999% allows 5 minutes!
|Availability Target||Downtime Per Year (Approx.)|
|90 percent||36 days|
|99 percent||3.5 days|
|99.9 percent||8 hours|
|99.99 percent||52 minutes|
|99.999 percent||5 minutes|
Perhaps the real question for a business is not "How much down time is acceptable?" or "How fast should the system be?" but "How much are you prepared to spend?" In posing this question, you should prepare an options paper with various configurations that list the corresponding cost and benefits as they relate to the Service Level Agreements.
A common entry in options papers relates to backup and recovery technology, an example of which is highlighted in Table 2. In this example, two backup options are presented along with their corresponding costs. The first option is a sophisticated Storage Area Network snapshot backup technology that allows near instant backup and recovery. The next option is the standard SQL Server backup and recovery. In both cases, the advantages and costs are presented.
Table 2: What Will It Cost?
Option papers that highlight the cost of various options, such as this example for backup and recovery, help in achieving realistic expectations when developing service level agreements.
|Item/Cost||SAN Snapshot Backup||Native SQL Backup|
|Benefits||Near-instant recovery||Reduced complexity|
|Recovery Time||about 5 minutes||about 45 minutes|
Options papers containing examples such as this present clear choices and sharpens the debate on what's really important. After all, what's the point in spending lots of money building and maintaining a "Five Nines" system when an hour of downtime per week is perfectly acceptable?
Fortunately, SQL Server 2008 introduces a number of new and enhanced features that assist you in meeting your Service Level Agreements.
Ensuring Service Level Agreements Can Be Met
It's often the case that despite your best efforts, the Service Level Agreements cannot be met, and you won't discover this until disaster strikes. In order to feel comfortable with the agreements in place, it's crucial that you anticipate and plan for disaster.
DBAs often make the mistake of defining disaster too narrowly. Small events can have just as big of an impact as the larger, less likely ones. An appropriate Disaster Recovery plan is one that anticipates a variety of disasters and implements processes to test that the recovery plans are valid through simulation. The next section, Planning for Disaster, addresses these topics.
Planning for Disaster
There are two ways of dealing with the potential for disaster: expect and plan for it, or ignore it, head to the pub and hope for the best. As attractive as that last option sounds, it's not something I recommend!
Planning and preparing for disaster can be an expensive and complicated process. A good DBA expects disasters to occur and obtains management support and funding to prepare, plan and test recovery plans for a variety of potential disasters. There is nothing like an actual disaster to make or destroy a DBA career!
Let's consider the types of disasters that may occur, what's involved in planning for a complete site failure, and the process of simulating disasters in order to test recovery plans.
Ask a DBA, or any IT person for that matter, for their definition of disaster, and most of the responses will involve cataclysmic events involving tsunamis, bushfires, landslides and various other events that destroy entire environments and all the hardware and data contained within. Whilst these are valid possibilities, they are rare and unlikely to happen to all but the most unlucky of us!
Far more likely are disasters that occur on a much smaller scale, like accidentally dropping a table (see Table 3). Without a plan for dealing with small events such as this, the impact on an organization can be just as large as the earlier examples.
Table 3: Minor and Major Disaster Scenarios
"Disasters" include smaller, more frequently occurring events such as disk failure. Whilst not as dramatic as a fire that destroys a whole building, they can have just as much impact.
|Disaster Category||Example||Possible Mitigation|
|Hardware Failure||Disk Failure||RAID Disks, Backups|
|Memory Errors||ECC RAM|
|Accidental/Malicious Damage||Drop Table/DB||Backups|
|Virus||Network Security, AV Software|
|Data Theft||Encryption, DB and Physical Security|
|Environmental Disaster||Fire||Second data center, DB mirroring|
Preventing and dealing with these issues is beyond the scope of this article. The important point here is for a DBA to anticipate and prepare recovery plans for a range of potential disasters, not just the big ones.
Component Complexity and Mean Time between Failure
It's not uncommon for SQL Server databases to be deployed on large clusters containing hundreds of thousands of individual hardware components, each of which is subject to failure. Whilst some components, particularly hard drives, are supplied with "Mean Time Between Failure" (MTBF) statistics by their manufacturers, such statistics are usually meaningless when considered in the context of a large system with inter-component dependencies. As such, the importance of anticipating, preparing and planning for disaster is crucial.
Complete Site Failure
The previous section provided examples of smaller, contained disasters. It's also important to plan and develop a process for a complete environment failure. Not only will this be required for large scale disaster purposes, it's also required in minimizing downtime when migrating systems from one physical location to another, or during a side by side upgrade. Whilst such migrations may never actually occur, having such a plan ready to go is a valuable asset, and the process of producing such a plan often leads to a deeper understanding of the systems you manage.
Planning for a complete environment failover is obviously orders of magnitude more complicated than individual disasters. It's common for database systems to be connected to a number of other systems, possibly those from other organizations via custom interfaces; therefore, a database recovery plan for a complete environmental failover should take into account all connected systems, and should therefore be part of a wider recovery plan.
Whilst each organization will have its own unique recovery process for a complete site failure, there are a number of common items that should be included in all such plans:
Declaration of Disaster. Who declares the disaster? Should failover to the backup environment be automatic or manual? Whilst automatic failover plans sound good. there are typically very expensive and must make a range of assumptions on the definition of "failure." Most sites, even if they have the technology for automatic failover, prefer a manual declaration and failover method.
Failover Systems. This can high or low tech. Examples include Automatic failover using Synchronous Database Mirroring, or an agreement with another organization to share computing resources when required.
Documentation. Detailed documents should be created and maintained to assist in the recovery process.
Offsite Backups. This can be as low tech as the DBA taking home backup tapes, or a more sophisticated technique involving block level disk replication to a remote site using SAN technology.
The sophistication of a disaster recovery plan is usually determined by both Service Level Agreements and the Declaration of Disaster method.
Simulating and Testing Recovery
As well as anticipating disasters and preparing recovery plans, a DBA must obtain funding to simulate disasters and test the corresponding recovery plan. Depending on the disaster and the sophistication of the recovery plan, this can be an expensive exercise.
A DBA should classify potential disasters in order of the most likely to the least likely, and prioritize funding and planning for those most likely to occur. Alternatively, events which will have the largest organizational impact may be given the highest priority. In any case, simulating and testing recovery is crucial in ensuring a proven plan is ready to go when required.
Finally, any plan should be tested on a regular basis, should involve all the relevant people, and ideally occur at a random, unprepared time, much like a fire-drill. Testing on a regular basis will ensure the continued accuracy of the recovery details, and, if nothing else, a random recovery test will liven up the day!
In summary, the idea behind disaster recovery planning is to reduce the surprise factor when it occurs. Anticipation and preparation are key attributes.
This article is based on SQL Server 2008 Administration in Action by Rod Colledge, to be published in January 2009 (ISBN: 193398872X). It is being reproduced here by permission from Manning Publications. Manning early access books and ebooks are sold exclusively through Manning. Visit the book's page for more information.