Unteachable Disaster Recovery Techniques

There are some skills which are extensions of your instincts, and which you can only learn through years of experience. Matt Simmons had this brought home to him when he found himself minutes away from a data-loss disaster, and he doesn't quite know how he prevented it.

Oh, do I have a confession to make today! I’ll tell it straight: I got lucky. I got really lucky, and despite my protestations to the contrary, I’m being hailed as a hero and savior of the day. Nothing could be further from the truth.

To give you some background on the story, I have three sites to deal with. Our primary site is active: production runs there, and clients are served from there. We also have a secondary center to which production can be moved in the event of a catastrophic failure of the primary site, and a third “backup” site where all of the data is pooled, written to media, and shipped to offsite storage. In a nutshell, the data flows from the primary to the secondary, then to the backup site. All of our data follows this path, but last Friday, I was concerned with the database.


We have what I would consider to be an antiquated database. Our byzantine license precludes us from doing anything more robust with it, so we just use log shipping to ensure that the data is replicated to all of the backup instances in a timely manner. Due to bandwidth considerations, this happens at the top of every even-numbered hour.
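The article doesn’t name the database product, but in a SQL Server-style setup (which the comments below suggest), log shipping boils down to backing up the transaction log on the primary on a schedule and restoring each shipped log on the standbys without recovering them. A minimal sketch, with the database name and share path purely hypothetical:

```sql
-- On the primary: back up the transaction log on the scheduled cadence
-- (here, at the top of every even-numbered hour, driven by an agent job).
BACKUP LOG ProductionDB
    TO DISK = N'\\backupshare\ProductionDB_log.trn';

-- On each standby: restore the shipped log, leaving the database in a
-- restoring state so that subsequent logs can still be applied.
RESTORE LOG ProductionDB
    FROM DISK = N'\\backupshare\ProductionDB_log.trn'
    WITH NORECOVERY;
```

The important property for this story is that a standby only applies what it is sent: if the standbys are shut down before the next shipment, whatever is in the primary’s log, including a disastrous delete, never reaches them.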

Last Friday, I was working late in order to get through a particularly large backlog, and I was in the zone. I had my headphones on and not a care in the world, until I got an instant message from someone in our New York office. The question was simple, but profound: “How often do we back up the primary database?”

That’s not the sort of question that you should give a knee-jerk response to, even at the best of times. I considered the possible answers for a moment, and then I remembered that it was Friday night at 5:45pm. Ignoring the query, I immediately opened remote terminals to all of the backup databases and shut them down, preventing any data from being replicated from the primary.

Reasons and Repercussions

As it turned out, that last step saved us. What had happened was that someone in New York had accidentally deleted several thousand records from the database, and I managed to shut down the backup databases before the 6 o’clock log shipping. If I hadn’t been contacted when I was, or if I hadn’t immediately turned off the backup databases, then a scant few minutes later the transaction log that contained the command to delete the records would have percolated throughout the backup instances, wiping the data everywhere. I got lucky.

Admittedly, had the data been lost from all of the instances, it still wouldn’t have been the end of the world; as I mentioned, we do move tapes offsite. I could have called our media storage people and had them ship us the tape. I could have then recovered the last instance of the database from the tape, and replayed the logs up to the point just before that mistaken command. Then I’d have put it on an external hard drive, driven to the datacenter, copied it off, and activated it. It would probably only have taken two days. Not a catastrophe but, like I said, I got lucky and managed to avoid that particular headache.
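That two-day fallback, restoring from tape and replaying logs to a point in time just before the mistake, can be sketched in SQL Server terms as follows. This is not the author’s actual script; the database name, file paths, and timestamp are hypothetical:

```sql
-- Restore the last full backup recovered from tape, without recovery,
-- so that transaction logs can still be replayed on top of it.
RESTORE DATABASE ProductionDB
    FROM DISK = N'D:\restore\ProductionDB_full.bak'
    WITH NORECOVERY;

-- Replay each log backup in sequence; on the final one, stop just
-- before the moment the mistaken DELETE was issued and recover.
RESTORE LOG ProductionDB
    FROM DISK = N'D:\restore\ProductionDB_log.trn'
    WITH STOPAT = '2009-06-12 17:40:00', RECOVERY;
```

The `STOPAT` clause is what makes point-in-time recovery possible at all; without log backups bracketing the mistake, the only option is to lose everything back to the last full backup.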


The reason that I’m being hailed as a hero is my swift response in shutting down the secondary databases. What everyone marvels at, but no one really considers, is why I took that action. What made me shut down the databases with only the slightest hint that something was wrong? I don’t know, and that’s hard to deal with.

Since I don’t know, I can only chalk it up to experience. It was late on a Friday, that wasn’t a conversational kind of question, and then there was also my knowledge of the person asking it: that question didn’t strike me as the kind he would pose ‘off the cuff’, so to speak. In the end, I can only say that it was my intuition and experience in dealing with the network and those people that prompted my reaction. My biggest problem with this is that experience and intuition are impossible to teach. They must be acquired.

That being said, there are some things which can be taught; and so, yesterday, my supervisor and I held a training session for everyone in the company. The underpinnings of the data flow and backup policies were covered, and we highlighted that what really saved us was that someone spoke up immediately about a problem, and told a person who could respond effectively. I have worked hard to make sure that our users know that they can come and tell us about problems and mistakes without being rebuked, and our training session emphasized that behavior.

My experience certainly gave me pause, and I’m in the middle of evaluating how I can deal with what would have happened if we had missed that 15-minute window. If it happens again, at a less opportune time, what can I do to make sure that the recovery process isn’t stretched across two days? I have to answer that question now. Maybe you should reconsider some of the critical systems in your infrastructure, and re-evaluate your disaster recovery plan. If you don’t have one, I urge you to build one. Like many things, by the time it’s obvious that you need one, it’s too late.



  • jawildman

    I find myself doing this all the time. Simple questions set off alarms in my head, based on the experience I have with the environment, the person, the team, the application, etc, etc. I’m not doing prod support, so the alarms for me tend to lead to architectural or process discussions. But the principle is the same.

    There is no way to teach this other than to spend the time in the trenches and get the scars. It’s why old soldiers don’t trust new recruits to cover their backs as well as another old soldier.

    You’ve got to do the time…

  • jawildman

    I’m a firm believer that, just as there are some skills (shooting a basketball) that no matter how much I practice I won’t be good at (better, yes, but not ‘good’), so there is a mindset, an outlook, a sense, something that makes a sysadmin a good sysadmin. A few people have it. The rest just type commands and click icons.

  • Anonymous

    oh how true…
    Those of us who can relate to this know of our own instances in which we nearly averted disaster due to instinct, or were just a bit too late. Good topic.

  • Pedro C

    I had similar situations in the past. Being a sysadmin is the best job in the world.

  • Benny Crampton

    Exactly this. Very good article. We downplay instincts as useful in a sysadmin job, but they are extremely useful, especially in a situation like this. A good base knowledge is not nearly as useful without experience and instincts behind it.

  • ajligas

    A Veteran of the wars
    Great column!

    I think most of us that have been around for a while can definitely relate to this subject. But you can never be reminded enough to stay at the top of your game and keep your reaction times at your best.

  • Matt Simmons

    Thanks Everyone
    Hello Everyone, and thanks very much for the kind words. I’m glad that you all have shared my experiences in terms of relying on instincts. It was a pretty crazy night, and I’m glad I got lucky.

    It says a lot about the profession of system administration that we’ve all got this shared kind of experience. We live and die by seemingly random or chance events. But then I suppose that’s not much different than in the “real world”.

    Thanks everyone.


  • Anonymous

    come and tell us about problems
    I could not agree more to your statement about users coming to you and telling you about problems and mistakes.
    Yes, it might be satisfying in the short term to put the blame on some idiot for making a really really stupid mistake, but you really want that idiot to come to you and tell you about what he did. You want him to tell you just as it happened, and not three months after the fact, and after having interviewed half of the company.
    If you give harsh punishment, your users will try to cover mistakes, but the consequences will still be there. But, if your users come and confess about what they did, and you do NOT give them a hard time, but help them get the problem solved, then they will come back.

    Now, the problem with this policy: you will have some simple-minded dimwits (most probably higher up in the chain) who just refuse to learn the system they have to use, and come to you for every minor hiccup they encounter, including stuff you told them several times again and again in great detail.

    What will it be for you? Having to deal with dimwits, or not knowing about mistakes until it is too late? Choose your poison.

  • TheSQLGuru

    Log Recovery Tool
    Sharp thinking there!

    I will note that if you had the ApexSQL Log Recovery tool you would have been able to very quickly and easily create an UNDO script to get the data back. Personally I think EVERYONE should have that capability in their production environments!!

    Disclaimer: I have a close relationship with Apex, use their products and recommend them to my clients. Also, if you care to, you can mention TheSQLGuru sent you and you will get a discount, and my daughter will get a few coins for her college fund.

  • jerryhung

    Time is money
    I had to do similar recovery last week

    phone call came, asking us to restore
    – 1st step: we tried to use RedGate Object Recover, but it cannot read logs to restore
    – 2nd step: immediately restore the FULL, the DIFF, all the log till 15 mins before the “DELETE” with STOPAT
    – 3rd step: use RedGate Data Compare to compare the tables, found ~1400 rows of data, confirm, and ran the script to re-insert the data
    – 4th step: no awards, no heroes, life goes on as usual…just grateful for Red Gate 🙂

  • John L

    Training was a good move
    The training session was a good move. That being said, I do have to cringe at the thought of an end-user being able to initiate the DELETE of death…

  • JimPen

    Been there — done that — tired of the t-shirt
    I’ve been a production DBA for years. I have had so many DR recoveries from end-users, HW and SW failures, I’ve lost count.

    I’ve never been at a company that I could justify the cost of an Apex/Red Gate solution at a SQL level because the IT seniors/company management looked at the OS level. So I’ve always had to resort to a belt & suspender type solutions.

    I do the belt level at the SQL level. I make sure I know what level of recovery each DB/App needs (full/24 hour/point-in-time/1 hr loss/15 min/etc.). Then I set up SQL backups locally to account for that.

    Then I do the suspender portion at the OS level — replicated disks, warm servers, etc. at offsite. But I make sure that management knows that if we get to that level of DR the loss of data will be a min of X hours and a max of X hours. Management is told that explicitly, usually in writing, and has to "sign" off on it.

    Then the inevitable user error becomes a matter of restoring a SQL DB locally and pulling the data. The HW/SW failure depends on the location of the backups and level of degradation.

    I will admit things like clustering and replication do play in. I like clustering, and hate SQL replication.

    I agree that DR planning can’t be taught. But experience "learns" you where the holes are and how to plug them.

    I don’t want to take away from Matt’s success from scant info, because I’ve done that as well. But sometimes your backup plans need other alternate backup plans to be the hero.

    As for the late comment: I heard about this entry from a SQL Server Central daily mail.