Matt Simmons

Unteachable Disaster Recovery Techniques

11 March 2010

There are some skills which are extensions of your instincts, and which you can only learn through years of experience. Matt Simmons had this brought home to him when he recently found himself minutes away from a data-loss disaster, and he doesn't quite know how he prevented it.

Oh, do I have a confession to make today! I’ll tell it straight: I got lucky. I got really lucky, and despite my admonitions to the contrary, I’m being hailed as a hero and savior of the day. Nothing could be further from the truth.

To give you some background on the story, I have three sites to deal with. Our primary site is the active one, where production runs and from where clients are served; we also have a secondary site to which production can be moved in the event of a catastrophic failure of the primary. Finally, there is a third “backup” site, where all of the data is pooled, written to media, and shipped to offsite storage. In a nutshell, the data flows from the primary to the secondary, and then to the backup site. All of our data follows this path, but last Friday I was concerned with the database.

We have what I would consider to be an antiquated database. Our byzantine license precludes us from doing anything more robust with it, so we just use log shipping to ensure that the data is replicated to all of the backup instances in a timely manner. Due to bandwidth considerations, this happens at the top of every even-numbered hour.
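On SQL Server, a log-shipping cadence like this is typically driven by paired scheduled jobs: a log backup on the primary and a restore on each secondary. A minimal sketch of what those jobs run, assuming SQL Server (the database, share, and file names here are hypothetical):

```sql
-- On the primary: back up the transaction log to a share the
-- secondaries can read (run by a job at the top of each even hour).
BACKUP LOG ProdDB
    TO DISK = N'\\backupshare\logs\ProdDB_20100305_1600.trn';

-- On each secondary: apply the shipped log, leaving the database
-- in a restoring state so later logs can still be applied.
RESTORE LOG ProdDB
    FROM DISK = N'\\backupshare\logs\ProdDB_20100305_1600.trn'
    WITH NORECOVERY;
```

The important property for this story is that a secondary stays restorable only because each log is applied WITH NORECOVERY; whatever is in the next shipped log, good or bad, gets applied the moment the restore job runs.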

Last Friday, I was working late in order to get through a particularly large backlog, and I was in the zone. I had my headphones on and not a care in the world, until I got an instant message from someone in our New York office. The question was simple, but profound: “How often do we back up the primary database?”

That’s not the sort of question that you should give a knee-jerk response to, even at the best of times. I considered the possible answers for a moment, and then I remembered that it was 5:45pm on a Friday. Ignoring the query, I immediately opened remote terminals to all of the backup databases and shut them down, preventing any data from being replicated from the primary.

Reasons and Repercussions

As it turned out, that last step saved us. What had happened was that someone in New York had accidentally deleted several thousand records from the database, and I managed to shut down the backup databases before the 6 o’clock log shipping. If I hadn’t been contacted when I was, or if I hadn’t immediately shut down the backup databases, then a scant few minutes later the transaction log containing the command to delete the records would have percolated throughout the backup instances, wiping the data everywhere. I got lucky.
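There is more than one way to slam that door. A sketch of two options, assuming SQL Server secondaries (the job name here is hypothetical):

```sql
-- Bluntest option, run on each secondary instance: stop the whole
-- service immediately, so the pending log (with the DELETE in it)
-- is never applied. SHUTDOWN WITH NOWAIT exits without checkpointing.
SHUTDOWN WITH NOWAIT;

-- Gentler option: disable the log-shipping restore job in SQL Agent,
-- so the instance stays up but the bad log sits unapplied on disk.
EXEC msdb.dbo.sp_update_job
    @job_name = N'LogShipping_Restore_ProdDB',
    @enabled  = 0;
```

Under time pressure the blunt option is defensible; the gentler one leaves the secondary available for inspection while you work out what the shipped log actually contains.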

Admittedly, had the data been lost from all of the instances, it still wouldn’t have been the end of the world; as I mentioned, we do move tapes offsite. I could have called our media storage people and had them ship us the tape. I could have then recovered the last instance of the database from the tape, and replayed it up to the point just before that mistaken command. Then, I’d have put it on an external hard drive, driven to the datacenter, copied it over, and activated it. It would probably only have taken two days. Not a catastrophe but, like I said, I got lucky and managed to avoid that particular headache.
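That "replay it up to the point of the mistaken command" step is a standard point-in-time restore. A sketch of what it looks like on SQL Server, assuming the full backup and subsequent log backups have already been pulled off tape (paths and the cut-off time are hypothetical):

```sql
-- Restore the last full backup, leaving the database able to
-- accept transaction logs.
RESTORE DATABASE ProdDB
    FROM DISK = N'D:\restore\ProdDB_full.bak'
    WITH NORECOVERY;

-- Replay the transaction log, but stop just before the moment the
-- mistaken DELETE was issued, then bring the database online.
RESTORE LOG ProdDB
    FROM DISK = N'D:\restore\ProdDB_log.trn'
    WITH STOPAT = '2010-03-05 17:40:00', RECOVERY;
```

The catch, of course, is that STOPAT only helps if you know roughly when the mistake happened; the tape round-trip, not the restore itself, is what turns this into a two-day exercise.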


The reason that I’m being hailed as a hero is because of my swift response in shutting down the secondary databases. What everyone marvels at, but no one really considers, is why I took that action. What made me shut down the databases with only the slightest hint that something was wrong? I don’t know, and that’s hard to deal with.

Since I don’t know, I can only chalk it up to experience. It was late on a Friday, that wasn’t a conversational kind of question, and then there was also my knowledge of the person asking it: that question didn’t strike me as the kind he would pose ‘off the cuff’, so to speak. In the end, I can only say that it was my intuition and experience in dealing with the network and those people that prompted my reaction. My biggest problem with this is that experience and intuition are impossible to teach. They must be acquired.

That being said, there are some things which can be taught; and so, yesterday, my supervisor and I held a training session for everyone in the company. The underpinnings of the data flow and backup policies were covered, and we highlighted the fact that what really saved us was the fact that someone spoke up immediately about a problem, and told a person who could respond effectively. I have worked hard to make sure that our users know that they can come and tell us about problems and mistakes without being rebuked, and our training session emphasized that behavior.

My experience certainly gave me pause, and I’m in the middle of evaluating how I can deal with what would have happened if we had missed that 15-minute window. If it happens again, at a less opportune time, what can I do to make sure that the recovery process isn’t stretched across two days? I have to answer that question now. Maybe you should reconsider some of the critical systems in your infrastructure, and re-evaluate your disaster recovery plan. If you don’t have one, I urge you to build one. Like many things, by the time it’s obvious that you need one, it’s too late.

Matt Simmons

Author profile:

Matt Simmons is an IT Administrator with several years’ experience on small and medium networks. He is currently employed in the financial services industry, and has previously worked in logistics and internet services. Matt maintains a systems administration blog, where he can be reached, and spends his spare time reading and learning new things.


Have Your Say

Subject: Absolutely!
Posted by: jawildman (view profile)
Posted on: Friday, March 12, 2010 at 2:20 PM
Message: I find myself doing this all the time. Simple questions set off alarms in my head, based on the experience I have with the environment, the person, the team, the application, etc, etc. I'm not doing prod support, so the alarms for me tend to lead to architectural or process discussions. But the principle is the same.

There is no way to teach this other than to spend the time in the trenches and get the scars. It's why old soldiers don't trust new recruits to cover their backs as well as another old soldier.

You've got to do the time...

Subject: and..
Posted by: jawildman (view profile)
Posted on: Friday, March 12, 2010 at 2:22 PM
Message: I'm a firm believer that just like there are always some skills (shooting a basketball) that no matter how much I practice, I won't be good at it (better yes, but not 'good'), so there is a mindset, outlook, sense, something that makes a sysadmin a good sysadmin. A few people have it. The rest just type commands and click icons.

Subject: oh how true...
Posted by: Anonymous (not signed in)
Posted on: Friday, March 12, 2010 at 3:55 PM
Message: Those of us who can relate to this know of our own instances in which we nearly averted disaster due to instinct, or were just a bit too late. Good topic.

Posted by: Pedro C (not signed in)
Posted on: Monday, March 15, 2010 at 7:45 AM
Message: I had similar situations in the past. Being a sysadmin is the best job in the world.

Subject: Yes.
Posted by: Benny Crampton (not signed in)
Posted on: Monday, March 15, 2010 at 8:20 AM
Message: Exactly this. Very good article. We downplay instincts as useful in a sysadmin job, but they are extremely useful, especially in a situation like this. A good base knowledge is not nearly as useful without experience and instincts behind it.

Subject: A Veteran of the wars
Posted by: ajligas (view profile)
Posted on: Monday, March 15, 2010 at 7:35 PM
Message: Great column!

I think most of us that have been around for a while can definitely relate to this subject. But you can never be reminded enough to stay at the top of your game and keep your reaction times at your best.

Subject: Thanks Everyone
Posted by: Matt Simmons (not signed in)
Posted on: Monday, March 15, 2010 at 8:49 PM
Message: Hello Everyone, and thanks very much for the kind words. I'm glad that you all have shared my experiences in terms of relying on instincts. It was a pretty crazy night, and I'm glad I got lucky.

It says a lot about the profession of system administration that we've all got this shared kind of experience. We live and die by seemingly random or chance events. But then I suppose that's not much different than in the "real world".

Thanks everyone.


Subject: come and tell us about problems
Posted by: Anonymous (not signed in)
Posted on: Tuesday, March 16, 2010 at 2:56 AM
Message: I could not agree more to your statement about users coming to you and telling you about problems and mistakes.
Yes, it might be satisfying in the short term to put the blame on some idiot for making a really really stupid mistake, but you really want that idiot to come to you and tell you about what he did. You want him to tell you just as it happened, and not three months after the fact, and after having interviewed half of the company.
If you give harsh punishment, your users will try to cover mistakes, but the consequences will still be there. But, if your users come and confess about what they did, and you do NOT give them a hard time, but help them get the problem solved, then they will come back.

Now, the problem with this policy: you will have some simple-minded dimwits (most probably higher up in the chain) who just refuse to learn the system they have to use, and come to you for every minor hiccup they encounter, including stuff you have explained to them again and again in great detail.

What will it be for you? Having to deal with dimwits, or not knowing about mistakes until it is too late? Choose your poison.

Subject: Log Recovery Tool
Posted by: TheSQLGuru (view profile)
Posted on: Thursday, March 25, 2010 at 7:06 AM
Message: Sharp thinking there!

I will note that if you had the ApexSQL Log Recovery tool you would have been able to very quickly and easily create an UNDO script to get the data back. Personally I think EVERYONE should have that capability in their production environments!!

Disclaimer: I have a close relationship with Apex, use their products and recommend them to my clients. Also, if you care to, you can mention TheSQLGuru sent you; you will get a discount and my daughter will get a few coins for her college fund.

Subject: Time is money
Posted by: jerryhung (view profile)
Posted on: Thursday, March 25, 2010 at 10:22 AM
Message: I had to do similar recovery last week

phone call came, asking us to restore
- 1st step: we tried to use RedGate Object Recover, but it cannot read logs to restore
- 2nd step: immediately restore the FULL, the DIFF, all the log till 15 mins before the "DELETE" with STOPAT
- 3rd step: use RedGate Data Compare to compare the tables, found ~1400 rows of data, confirm, and ran the script to re-insert the data
- 4th step: no awards, no heroes, life goes on as usual...just grateful for Red Gate :)

Subject: Training was a good move
Posted by: John L (view profile)
Posted on: Monday, March 29, 2010 at 4:27 PM
Message: The training session was a good move. That being said, I do have to cringe at the thought of an end-user being able to initiate the DELETE of death...

Subject: Been there -- done that -- tired of the t-shirt
Posted by: JimPen (view profile)
Posted on: Thursday, December 13, 2012 at 9:04 PM
Message: I've been a production DBA for years. I have had so many DR recoveries from end-users, HW and SW failures, I've lost count.

I've never been at a company where I could justify the cost of an Apex/Red Gate solution at a SQL level, because the IT seniors/company management looked at the OS level. So I've always had to resort to belt-and-suspenders type solutions.

I do the belt portion at the SQL level. I establish what level of recovery each DB/App needs (full/24 hour/point-in-time/1 hr loss/15 min/etc.). Then I set up SQL backups locally to account for that.

Then I do the suspender portion at the OS level -- replicated disks, weak servers, etc. at offsite. But I make sure that management knows that if we get to that level of DR the loss of data will be a min of X hours and a max of X hours. Management is told that explicitly, usually in writing, and has to "sign" off on it.

Then the inevitable user error becomes a matter of restoring a SQL DB locally and pulling the data. The HW/SW failure depends on the location of the backups and level of degradation.

I will admit things like clustering and replication do play in. I like clustering, and hate SQL replication.

I agree that DR planning can't be taught. But experience "learns" you where the holes are and how to plug them.

I don't want to take away from Matt's success from scant info, because I've done that as well. But sometimes your backup plans need other alternate backup plans to be the hero.

As for the late comment: I heard about this entry from a SQL Server Central daily mail.

