In my experience, disaster recovery (DR) planning is almost always put on the back burner, especially in organizations which have neither clearly-defined Service Level Agreements nor sane internal profitability tracking.
This seems to be a rash statement, however in reality it seems to be very difficult – especially in large organizations – to establish a consensus on the value of investing money in defining a coherent disaster recovery strategy. It is not often seen as a proactive cost-saving activity.
The initial conflict
There are several reasons why the value of a DR strategy is underestimated in many organizations; I believe that the most important reasons are:
- The core values of the business are easily misunderstood – the most valuable assets being the employees and their knowledge together with the company’s data, as well as the ability to retain both assets as long as possible in a coherent state.
- The lack of caution by the company’s decision makers, who at times fail to recognize the value of investing into a fail-proof DR strategy, or are unwilling to give it the priority it deserves.
- A misplaced faith by the business in the intrinsic reliability and resilience of computer systems. The consequence of disaster is often met with amazed disbelief.
The gap between IT and Business Decision Making
Where a DR plan is flawed, incomplete or missing, this is often due to considerable differences in the understanding of the steps needed to prevent or recover from disasters in many large enterprises. There seems to the lack of coordination between the ‘have’, ‘wants’ and ‘needs’ . In other words, very often there is a big difference between what the company currently has, what the company wants and what the company really needs as a DR strategy.
First of all, the ‘has‘ aspect is very often undocumented and rarely sufficient as well as being untested (unless the company has already suffered data-loss due to a disaster).
The ‘wants‘ aspect is mostly a consequence of the fascination of many of the IT staff with technology and it usually surpasses the budget of the IT department.
The ‘needs‘ aspect is dependent on the business model, and how the business is regulated by legal demands, vendors and customers agreements.
The bottom line is that the DR strategy has to be developed and enforced by a person or group which is equally involved in the business’s decision-making as well as the company’s IT strategy.
I have seen executives from the Business department coming to work for a month at the IT department where I’ve worked, with the purpose of defining the adequate golden balance between ‘have’, ‘wants’ and ‘needs’.
In the ‘downloads’ section of this article you will find a template which will help you to create a proposal for a solution-approach for the business side of the organization.
Keep in mind that the more convincing the presentation is, the more room for improvements in the DR planning that you will get approval for from the business.
How to meet the needs
I will discuss this mostly from the data perspective.
It is important to identify the priorities of the restore or recovery process after a disaster, and the interdependencies. Only by getting the sequence and priorities right can one minimize the cost and the time to recover IT support for the business operations. Recovering everything at the same time is not often practical or possible, and would take longer than prioritizing the recovery most important parts of the system.
The DR plan has to outline areas of the business data which have to be recovered first, and the systems that these essential systems, in turn, depend on.
To get this right, it pays to categorize data systematically. As an example of this, I illustrate in the picture below a representation of the way that the different types of data can be understood and categorized in the enterprise:
The data is sliced by two dimensions: time and access.
The access dimension represents the pattern in which the data is accessed; this can be multiple short running transactions that are serving user queries – which read or modify data often.
The time dimension indicates the way the data is treated from a historical perspective, i.e. how much of the data is vital for daily business operations, and how much of the data is treated as historical data used for analysis – not on a regular basis.
Each company/ enterprise has their own unique patterns of data access and data handling; however, applying a categorization to the data similar to the one above is a necessary initial step for every disaster recovery plan. As mentioned earlier, it is very important to be prepared to restore the most critical data first and not to attempt everything at the same time. Hence, without a similar categorization of the data it would be very hard to decide the approach to the disaster planning and the actions to be taken during recovery.
For example, if the day-to-day business depends on 10% of the data which is transaction intensive, and the rest 90% is just for analytical (billing, fraud etc) purposes and it is used less frequently, then the priority should be the 10% of the data.
Another example can be an enterprise where the daily business is relying heavily on 90% of the data for analysis purposes and decision making, and 10% of the transactional data is not so important.
Both cases will dictate different approaches to selecting the right technology for the DR: The DR team have a number of different technical options to building resilience into the existing systems.
- Clustering is a great way to have high availability; however it is an expensive solution: 2 identical servers are needed with a shared storage, which demands the purchase of additional storage for backups. (Storing backups on a clustered disk is just not a good idea!)
- Using a standby server can be much cheaper, since it does not have to be identical to the production server – the storage can be cheaper and it is not shared. An alias can be used for the application configurations and, in case of failure, the alias gets pointed to the standby server, and no application configs have to be changed.
- Offloading data to cloud services – this is a very debatable solution because of security and performance concerns. In certain cases, however, it might prove to be the cheapest option.
As a bottom line, the most important part of the DR planning is the correct segregation of data in way which allows for implementing the least expensive and fastest solution to restore business operations as usual.
How to prepare for a disaster
The best way to prepare for a disaster is to include the following items in your DR solution:
- Set of maps explaining the current architecture of the network, data flows and hardware involved.
- A list of on-call specialists who can help with hardware and software problems.
- List(s) of scripts which will help restore the most important part of the system first. These scripts should include both performance tuning queries and queries which modify objects. For example, a PowerShell script which shows the remaining database file space and disk space, a script which is used to create and swap partitions in SQL Server databases and so on.
- A list of applications and their dependencies. (Also keep track of which applications and data should be brought back online first.)
- A list of restaurants / shops around the restore site – disaster recovery may take longer than expected (it always does, actually!) and it is best not to be hungry and thirsty when full concentration on the task is essential.
- A list of hardware vendors in the area – for example, in the case of a disk failure running to the closest hardware store and buying a cheap hard drive may minimize a lot of downtime time in a disaster situation, compared to the option of waiting for the proper storage with the data being inaccessible.
- List of regional companies which provide last minute getaways (just in case any other step of the DR plan fails).
After defining the DR plan, it has to be tested regularly, and refined in the light of experience. Furthermore, each team member should be familiar with the steps which have to be taken in the case of disaster.
And when disaster happens- the real world example
The most important part is not to panic. (We have a big poster in the datacenter saying ‘DON’T PANIC!’ – it helps.)
When a disaster happens the most important part is actually knowing that a disaster has happened and what is affected – is it a physical device loss (i.e. SAN catching fire) or is it a software problem (i.e. a SQL Server instance shutting itself down because of software errors) or is it a logical problem (i.e. a user actually dropping their production database)? I have witnessed all of these examples in real life.
The first step of the process is figuring out the affected areas and the magnitude of the disaster.
Keeping a script which gives insight into the operational part of the system is very helpful in this case. Providing this script is beyond the scope here, but here are some ideas on what to include:
- A script which iterates through all disks, mountpoints and lists used and free space.
- A script which pings the physical server(s).
- A script which immediately copies the SQL Server logs and default trace files from the server(s) to a remote location. The problem with the default trace is that it gets overwritten quickly in a busy environment and if it is copied to a different location it might be able to provide vital information about who did what (- I wrote an article a while back about how to explore the Default Trace – https://www.simple-talk.com/sql/performance/the-default-trace-in-sql-server—the-power-of-performance-and-security-auditing/ ).
- A very smart idea is to have a script prepared which will gather performance information from the troubled server from a remote machine. It is very easy to setup a file which starts a performance counter gathering immediately after execution (I wrote an article a while back about Perfmon counters gathering – https://www.simple-talk.com/sql/performance/collecting-performance-data-into-a-sql-server-table/ ). Of course, it would be even more useful to have Redgate’s SQL Monitor installed, configured and running, because it can help tracing the steps that lead to the disaster.
- In the case of a clustered environment, one could have a script which displays which SQL Server instances are running on which node.
- One could also have a script which displays the current physical folder locations of the database and log files, as well as the location of the latest backup files (trust me – it is better to have this script because there can always be last minute surprises).
Of course there are many other smart scripts and notifications which can help during a disaster.
The next step after a disaster happens is to contain the disaster so it does not spread further.
A burning hard drive may be a problem but if a burning hard drive sets the entire server rack on fire then it is a much bigger problem; if a user drops one database it is a problem, but if they drop 10 more, it is a much bigger problem and so on.
Therefore, after establishing the cause of the disaster it is best to isolate it as soon as possible.
Keeping passwords and being able to access the systems is very important. This includes not only SQL Server but the whole range of passwords, even including routers and the like. Keeping the logins and passwords in a safe place but still accessible in the case of a disaster is critical for the success of the recovery operations.
And finally, keep in mind that there are quite a few options to have backdoors open in case of a disaster: there are many examples and here are a few:
- Using technology like HP iLO helps a great deal when a remote server starts behaving abnormally – for example if it won’t start after planned maintenance.
- Using remote desktop access products may help a lot in the case of a disaster happening while your (only) system administrator is on vacation.
- Enabling DAC and keeping the admin login information in a safe place is a great way to prepare for a SQL Server disaster. It happens sometimes that the SQL Server is not responsive in any other way and the Dedicated Admin Connection might be the only way in.
In conclusion, disaster recovery planning involves a lot of hard work but keep in mind that it is much better to be able to detect a potential disaster and prevent it then to fight hard to bring the system back up.
Nowadays there are many tools which can help with preventive maintenance and proactive problem detection. This makes it much easier to be prepared for the worst and then still hope for the best. (Remember, Murphy was actually an optimist.)