Click here to monitor SSC
  • Av rating:
  • Total votes: 26
  • Total comments: 5
Rod Colledge

Planning for Disaster

16 March 2010

There is a certain paradox in being advised to expect the unexpected, but the DBA must plan and prepare in advance  to protect their organisation's data assets in the event of an unexpected crisis, and return them to normal operating conditions. To minimise downtime in such circumstances should be the aim of every effective DBA. To plan for recovery, It pays to have the mindset of a pessimist. 

The true measure of a soldier’s worth is not the thousands of hours of drills they’ve completed, or their success in simulated war games. They’re remembered (or forgotten) for their instinct and bravery in the heat of battle, for the split second decisions they make which have a profound and enduring impact on the lives of many.

Be it a Purple Heart or a Victoria Cross, the crowning achievement of a soldier’s career is the moment in which the intersection of preparation and opportunity presents them with a platform upon which to prove their worth. They’re trained and paid for this moment, yet few will ever have the opportunity to respond to it.

In the context of Database Administration, the true measure of a DBA’s worth is not the number of training courses they’ve attended, or qualifications they’ve earned, or books they’ve written; it’s their response to an unexpected crisis, the speed and efficiency with which they recover their organization’s data and return them to normal operating conditions.

The best DBAs know they’re paid to be the guardians of their organization’s most crucial asset: data. They value its importance to the extent that they consider and simulate various failure scenarios in order to hone their response to them. Like a soldier on the battle field, optimism plays no part in their plans. They expect failure, and they know exactly what to do when it happens.

A Disaster Recovery (DR) plan is an essential element in any DBA’s toolkit. In this the first of the Seven Habits of the Highly Effective DBA, let’s take an inside look at the breadth and depth of an effective DR plan.

1. What could possibly go wrong?

Your next disaster will be unanticipated. It’s a motto I like to use when devising my DR plans. By necessity, DBAs need to take a pessimistic approach to their DR plan, in order to ensure that they consider as many as possible of the vast number of things that can (and do) go wrong.

When dealing with complex and fluid situations such as disaster recovery planning, half of the battle is in defining the parameters. Military leaders often refer to this as reducing the unknown unknowns, in other words, minimizing the number of things we don’t know we don’t know.

As a DBA, reducing the unknown unknowns means coming up with a list of possible disasters. A good starting point is to categorize disasters into one of the following four categories:

Human Errors; This category covers all of the unanticipated human failures that happen from time to time. I once worked with a DBA who used the diagramming tool in Enterprise Manager to create a basic model of a production database. He dropped a table from the database thinking his action would only remove it from the diagram. Other classics are DELETE commands run against a table without a WHERE clause, and shutting down a server instead of logging off.

Hardware Faults; As hardware prices fall, it is becoming more common for organizations to build redundancy into their server designs. RAID storage, dual power supplies, ECC RAM and clustered servers all combine to reduce the chances of a hardware fault bringing down a system. However rare such events may be today, we cannot ignore the possibility of their occurrence, and they need to be considered when developing our DR plans.

Environmental Disasters; Most people’s initial definition of disaster includes events in this category. While bushfires, flooding and landslides happen, they’re very rare compared to the other categories presented here. That said, the DBA must at least imagine the wholesale destruction of the database infrastructure that could result, and plan an appropriate response.

Security Breach; An often overlooked area in terms of disaster recovery, these types of disasters relate to situations in which unauthorized access to data constitutes a disaster. Be it company secrets, credit card numbers, payroll data or classified documents, unintended access to this information requires a swift and coordinated response to limit the damage.

Once we have a list of things we know could go wrong, we can go about designing a response to each of them. But before doing so, we need to consider the Service Level Agreements under which we operate.

2. What about Service Level Agreements?

Service Level Agreements (SLAs) define the service level that we agree to provide, typically in terms of factors such as the maximum transaction response time, the maximum numbers of hours of downtime per year, and the maximum amount of data loss that’s considered acceptable.

An important prerequisite to DR planning is to define and agree on a SLA with the business (the people that pay our salary). SLAs set the framework within which we can design an appropriate recovery response to each disaster we consider. For example, an SLA that stipulates zero downtime and data loss obviously requires a far more extensive and expensive infrastructure than that which would be required by a more-typical SLA, which allowed a set amount of downtime and data loss.

Designing database infrastructure and a DR plan in the absence of a SLA is a common mistake and frequently leads to overly expensive solutions and service level disappointment from the business, during an actual disaster. A DBA should never assume what service levels the business expects and the business should never expect that zero data loss and zero downtime comes cheap. Calibrating the infrastructure investment to service level expectations is an important task, and one that requires both parties to engage in clear communication on the various options, and their associated service levels and costs. A great catalyst for this discussion is the SLA

SLA Option Papers

When negotiating service level agreements with the business, it’s very easy for the conversation to turn into a circular argument around costs, options and requirements. For example, asking a business how much downtime and data loss is acceptable will typically result in the answers “none” and “none”, with the usual DBA response to this request being “how much do you want to spend”. It’s difficult for a business to answer this question without understanding what the options are, how much each of them will cost, and their relative strengths and weaknesses. When presented clearly, such as shown in the example in figure 1 for backup/restore, option papers enable the business to clearly see what they will get for their money, and helps to really sharpen the debate on what the real priorities are.

Figure 1; A simplified option paper comparing the costs, benefits and limitations of 3 backup/restore options.

A classic SLA metric is uptime target percentage, and is usually expressed using the number of nines in the target. For example, an uptime of 99 % or “2 nines” represents about 3.5 days of downtime per year. In contrast, an uptime of 99.999 % or "5 nines" represents about 5 minutes per year! A “5 nines” solution is exponentially more expensive than a "2 nines" solution, with a single downtime incident typically causing a target failure in a "5 nines" environment. A business needs to weigh up the benefits of a large infrastructure investment, required for a 5 nines solution, compared with the benefits of spending that money elsewhere, for example, on staff training. If a few days of downtime per year (spread over the whole year) is acceptable, then it does not make sense to throw money at expensive infrastructure when the investment in other areas will yield a better overall outcome.

So, once we’ve recognized that a) things go wrong and b) we need to talk to the business on aligning infrastructure investment with their service level expectations, the next step is to address the nuts and bolts of SQL Server High Availability and Disaster Recovery.

3. SQL Server High Availability Disaster Recovery

The terms High Availability and Disaster Recovery are often used interchangeably to talk about the same thing. I think there’s an important yet subtle difference:

  • Disaster Recovery addresses what needs to be done when unexpected downtime has occurred.
  • High Availability addresses what we need to do, as DBAs, to avoid down time.

The latter is a very broad statement, and deliberately so. While there are “big ticket” items such as failover clustering and database mirroring, there are a whole lot of smaller yet equally important things that also contribute and detract from the goal of a highly-available database. For example:

  1. Not installing a service pack may lead to a security vulnerability
  2. Inadequate physical security may enable someone to remove/unplug the server
  3. A lack of training may lead to simple, yet costly human errors

All of these mistakes, regardless of the presence of clustering and/or mirroring, will impact on the uptime target. High availability is therefore as much a state of mind as it is a SQL Server feature or option.

The scope of this article does not allow for a detailed discussion on each of the items that impact on High Availability. What I’d like to do instead is to compare and contrast the most commonly-used SQL Server High Availability techniques, beginning with Failover Clustering.

3.1 Failover Clustering

There are numerous components within a server that can fail, from power supplies to motherboards, CPU and memory. Some of these components can be designed redundantly by use, for example, of dual power supplies and ECC RAM, to prevent a failure from causing the server as a whole to fail. However, there are some faults that will cause a server wide failure, and for those situations, Failover Clustering provides many benefits:

  1. All databases on the failed server move to an alternate server in a single action
  2. SQL Server agent jobs, system configuration, maintenance plans logs also automatically move to an alternate server
  3. Application connection strings do not need to be aware of the alternate failover server(s); the clustering logic automatically handles the connection redirection

Failover clustering is flexible in terms of the number of possible configurations that are supported i.e. the number of passive servers in the cluster compared to the number of active servers. If clustering is configured incorrectly, a single server may end up taking the load of a number of active and failed servers, and so performance may be poor after failover.

The other major consideration for clustering is that, unlike some other High Availability solutions, the disk containing the database files is shared among the cluster nodes, and therefore acts as a single point of failure (notwithstanding RAID support), a shortcoming addressed by both Database Mirroring and Log Shipping.

3.2 Database Mirroring

Unlike clustering, servers in a database mirroring solution can be, and often are, physically separate, both geographically and in terms of their hardware configuration. There is no requirement for the servers to share a storage system, meaning that mirroring pairs are often setup across large geographical distances in order to provide protection from catastrophic events (think 9/11).

Database mirroring pairs can be maintained in either a synchronous manner, in which case transactions will not commit on the principal database until written to the mirror’s transaction log, or asynchronously, for faster mirroring performance (but a higher risk of data loss).

Depending on the connection technique, applications can be configured to automatically reconnect to the database mirror on failure of the principal, although this is more complex than failover clustering, and requires usage of the SQL Server Native Client (SNAC).

Each database is individually mirrored, meaning that a server containing lots of databases requires a much larger administrative effort when compared to failover clustering. Further, the SQL Agent jobs, logs and server configuration settings are not automatically moved with the database.

Introduced in SQL Server 2005, one of the big advantages of Database Mirroring is the real time (or near real time) delivery of transactions from the principal database to the mirror database. This was a big leap forward when compared to the closest available technology in SQL Server 2000, Log Shipping, leading some to question the need for this older technology moving forward. As we’ll soon see, Log Shipping remains an important tool in the box.

Consider a situation in which someone accidentally deletes every row in a table. In a mirroring solution, that change is automatically (or very soon after) made to the mirrored database and therefore provides no easy rollback option. In addition to a number of other advantages, Transaction log shipping addresses this situation nicely.

3.3 Log Shipping

Transaction log shipping works by periodically restoring the transaction log backups from a primary database to one or more secondary databases initially created from a full backup of the primary. The secondary database(s) remains in the NORECOVERY/STANDBY state, enabling it to restore transaction logs from the primary. This process, in effect, turns the secondary database into a mirror of the primary, with a configurable transaction latency (15 minutes, by default).

Transaction log shipping, whilst sometimes considered an older and slower version of mirroring, provides a number of unique advantages. Firstly, by default, log shipping ships transaction logs from the primary database to the secondary (mirror) database(s) every 15 minutes. In our earlier example of the accidental deletion of all the rows in a table, those 15 minutes affords us an opportunity to intervene in the automatic transaction log restore to the secondary database(s), therefore avoiding the replication of the unintended deletion. For example, we could;

  1. Disable the SQL Agent jobs created by log shipping that automatically backup, copy & restore transaction logs from the primary to secondary databases,
  2. Disconnect client connections from the Primary database,
  3. Take a new transaction log backup on the primary database,
  4. Manually copy and restore transaction logs to the secondary database using the STOPAT command to restore up to the point of the unintended delete command,
  5. Restore the secondary database WITH RECOVERY making it usable as a replacement for the primary database,
  6. Reconnect clients to the new database

Log shipping also provides a number of other advantages when compared to mirroring; there can be multiple destination/copy databases from a single source database, and the destination database can be created in a manner that enables it to be used for read-only reporting in between log restore operations, an option not available in mirroring (unless using database snapshots).

On the downside, log shipping does not offer automatic failover and client redirection, and transaction latency is typically much slower.

So, which is the best High Availability option?

3.4 Options Compared

What should be obvious by now is that each of the three options (Clustering, Mirroring and Log Shipping) provides unique strengths and weaknesses. For this reason, many sites choose to combine two or more of these technologies in order to maximize the strength of the overall solution. A common configuration is to use clustering within a primary data centre for local failure protection, and then mirror (or log ship) critical databases off site for added protection.

Figure 2 (reproduced with permission from SQL Server 2008 Administration in Action, Manning Publications) compares and contrasts the attributes of these three options.

Figure 2; High Availability options compared. Reproduced with permission, SQL Server 2008 Administration in Action, Manning Publications, 2009

3.5 Other Components of a HA Solution

There is much more to an effective High Availability solution than just SQL Server features or options. Further, there are many other ways in which DBAs can improve availability, some of which include:

  • Backups; Don’t underestimate the importance of a sound backup strategy. We’ll be covering this more in the upcoming articles in this series
  • Replication; Considered more of a data distribution technology than a High Availability solution; it is frequently used for the latter purpose nonetheless
  • SQL Azure; The cloud is here! SQL Azure Data Sync enables an on-premises database to be synced up (and down) to/from the cloud, providing an alternate failover point in the event of disaster

No strategy is foolproof, and the best-laid plan can sometimes look spectacularly inadequate in retrospect. In that context, we need to be constantly validating our plans, a task best achieved through the DBA Fire Drill.

4. Fire Drills

Earlier in the article, we discussed the important difference between High Availability and Disaster Recovery. We noted that High Availability is concerned with keeping systems up and avoiding downtime, and Disaster Recovery is about what to do when an unexpected event occurs which leads to downtime.

One of the crucial (and often overlooked) aspects of any Disaster Recovery plan is the importance of simulating various disasters in order to determine the effectiveness of the High Availability systems put in place. Such exercises, or Fire Drills as I like to call them, provide a number of benefits; firstly, the ability to measure/adjust the High Availability system’s effectiveness in preventing downtime, and secondly, as a means of fine tuning the Disaster Recovery plan should downtime result.

Depending on the failure, the Disaster Recovery plan will of course vary, but in all cases, there are a number of common inclusions:

  • Clearly documented restore scenarios including the location of restore/verification scripts
  • Decision escalation points. Some decisions are irreversible, and following the correct chain-of-command during an actual disaster is crucial in avoiding any rash decisions that may lead to permanent data loss
  • Contact phone numbers (including after hours numbers) for application support staff, senior DBAs, and business representatives
  • Follow up post mortem that examines what went wrong, the success/failure of the Disaster Recovery plan & High Availability design, and what steps can be taken in the future to avoid the same outage scenario

One of the common excuses given for not simulating disasters is the cost and time involved. While these are legitimate concerns, failure to run these simulations can lead to confusion, long outages and potentially irreversible mistakes under the pressure of real disaster.

One of the more successful approaches to disaster simulation is to conduct random fire drills during which DBAs are tasked with recovering from a random disaster, simulated in a test environment, at an unannounced time.

This technique is highly effective in training the DBA how to respond quickly, with the appropriate actions, during a real disaster. It is invaluable for newly-hired DBAs, enabling them to very quickly come up to speed with all the specific processes required for disaster recovery, but also livens up the day, and adds a competitive spark to the DBA team as a whole. DBAs often compete to see who can recover a system the fastest with the least amount of data loss!

Summary

In this article, we covered arguably the most important aspect of being a DBA; being prepared for an unexpected disaster. Despite its importance, a lot of DBAs ignore this element of their job, and those that do can only hope that luck runs their way.

In the next article, we’ll concentrate on a fundamental DBA task; storage design and validation.

Figure 2 in this article is taken from SQL Server 2008 Administration in Action by Rod Colledge and reproduced with permission from the publisher, Manning Publications. Save 35% on the print or eBook edition of SQL Server 2008 Administration in Action. Enter ‘redgate35’ in the Promotional Code box when you check out at manning.com

Rod Colledge

Author profile:

Rod Colledge is an independent SQL Server consultant based in Brisbane, Australia. He’s the author of SQL Server 2008 Administration in Action, was a technical editor on the SQL Server MVP Deep Dives book, writes for Quest and Simple-talk, is a frequent speaker at local SQL Server user groups, and received the SQL Server MVP award in January 2010. Rod blogs here at rodcolledge.com, is the founder and operator of sqlCrunch.com, and was a highly rated speaker at the 2009 PASS summit in Seattle. He enjoys spending time with his young family, playing golf very badly, and pretending to taste the difference between a Pinot Gris and a Sauvignon Blanc!

Search for other articles by Rod Colledge

Rate this article:   Avg rating: from a total of 26 votes.


Poor

OK

Good

Great

Must read
Have Your Say
Do you have an opinion on this article? Then add your comment below:
You must be logged in to post to this forum

Click here to log in.


Subject: Protect your most vulnerable asset first
Posted by: Andy Bailey, www.availabilityadvisor.com (not signed in)
Posted on: Thursday, March 18, 2010 at 6:24 AM
Message: As Rod points out, true environmental disasters are, thankfully, few and far between. However, I am continuously amazed at how many customers I meet who mix up the environmental stuff with technology failures and end up with a mishmash of a solution that, when it comes to the crunch, simply does not work.

From the technical perspective, protect your most vulnerable asset, the compute platform (the thing that is running the DB) – forget complex clustering and such like – use an ultra reliable server – fast to deploy, easy to manage and very cost effective. In the interests of full disclosure, I should point out here that I work for Stratus Technologies, though :).


Subject: Re: stratus
Posted by: andyirving (view profile)
Posted on: Monday, March 22, 2010 at 9:05 AM
Message: doesn't defend against human error, nor are they easy to manage! there's limited expansion available within the chasis and many configurations aren't supported. plus when they go wrong, they do go wrong big style. you're better off getting multiple commodity servers

Subject: This is very average article
Posted by: Roger Smith (not signed in)
Posted on: Monday, March 22, 2010 at 12:59 PM
Message: Very disappointing.

Simple-talk has lost its charm.

Subject: Easy to manage
Posted by: Andy Bailey (not signed in)
Posted on: Wednesday, March 24, 2010 at 12:03 PM
Message: I'll agree and disagree Mr Irving! If a human cocks it up, then so be it - isn;t that down to training? The servers are super easy to manage - how could you say such a thing - we have our own management console and there is more snmp and wmi information than you know what to do with. Current systems have huge expansion capability. If they go wrong, its usually back to the human I'm afraid. Come and have a "play" with them sometime. Open invitation.

Subject: Missing the point
Posted by: REClark (not signed in)
Posted on: Thursday, March 25, 2010 at 7:57 PM
Message: This is one of the better articles I have read in years in planning for, preparing and practicing for disaster recovery. With over 21 years of experience in the financial industry, one year in the manufacturing and advertising industry and two years in the health care industy I have seen several companies who are unprepared for a true disaster recovery. Or at least they will be extremely surprised when/if they are in a true DR situation because they did not plan or test their DR accordingly. IT and the business users are still not intuned to the DR concept. The first time a company encounters downtime is the first time they realize the cost and benefits of a solid DR. Being proactive is time consuming and expensive but on the other hand how much money is lost to a company when they are not functing with mission critical applications? Add to the money being lost but the companies brand name and customer experiences. PRICELESS!!!

 

Phil Factor
Searching for Strings in SQL Server Databases

Sometimes, you just want to do a search in a SQL Server database as if you were using a search engine like Google.... Read more...

 View the blog

Top Rated

Searching for Strings in SQL Server Databases
 Sometimes, you just want to do a search in a SQL Server database as if you were using a search engine... Read more...

Continuous Delivery and the Database
 Continuous Delivery is fairly generally understood to be an effective way of tackling the problems of... Read more...

The SQL Server Sqlio Utility
 If, before deployment, you need to push the limits of your disk subsystem in order to determine whether... Read more...

The PoSh DBA - Reading and Filtering Errors
 DBAs regularly need to keep an eye on the error logs of all their SQL Servers, and the event logs of... Read more...

MySQL Compare: The Manual That Time Forgot, Part 1
 Although SQL Compare, for SQL Server, is one of Red Gate's best-known products, there are also 'sister'... Read more...

Most Viewed

Beginning SQL Server 2005 Reporting Services Part 1
 Steve Joubert begins an in-depth tour of SQL Server 2005 Reporting Services with a step-by-step guide... Read more...

Ten Common Database Design Mistakes
 If database design is done right, then the development, deployment and subsequent performance in... Read more...

SQL Server Index Basics
 Given the fundamental importance of indexes in databases, it always comes as a surprise how often the... Read more...

Reading and Writing Files in SQL Server using T-SQL
 SQL Server provides several "standard" techniques by which to read and write to files but, just... Read more...

Concatenating Row Values in Transact-SQL
 It is an interesting problem in Transact SQL, for which there are a number of solutions and... Read more...

Why Join

Over 400,000 Microsoft professionals subscribe to the Simple-Talk technical journal. Join today, it's fast, simple, free and secure.