There are some skills which are extensions of your instincts, and which you can only learn through years of experience. Matt Simmons had this brought home when he was recently minutes away from a data-loss disaster, and he doesn't quite know how he prevented it.
Oh, do I have a confession to make today! I’ll tell it straight: I got lucky. I got really lucky, and despite my admonitions to the contrary, I’m being hailed as a hero and savior of the day. Nothing could be further from the truth.
To give you some background on the story, I have three sites to deal with. Our primary site is the active one, where production is run and from where clients are served. We also have a secondary center to which production can be moved in the event of a catastrophic failure of the primary site, and a third “backup” site where all of the data is pooled, written to media, and shipped to offsite storage. In a nutshell, the data flows from the primary to the secondary, then to the backup site. All of the various data does this, but last Friday, I was concerned with the database.
We have what I would consider to be an antiquated database. Our byzantine license precludes us from doing anything more robust with it, so we just use log shipping to ensure that the data is replicated to all of the backup instances in a timely manner. Due to bandwidth considerations, this happens at the top of every even-numbered hour.
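To make that schedule concrete, here is a minimal Python sketch that works out when the next log shipment would fire and how many minutes remain before it. The function names and the sample date are hypothetical, chosen purely for illustration; this is not how our database schedules anything, only the arithmetic behind "the top of every even-numbered hour".

```python
from datetime import datetime, timedelta


def next_log_shipment(now: datetime) -> datetime:
    """Return the next top of an even-numbered hour strictly after `now`.

    Hypothetical helper: it models only the schedule described in the text,
    not any real database's log-shipping configuration.
    """
    # Move to the top of the following hour.
    candidate = now.replace(minute=0, second=0, microsecond=0) + timedelta(hours=1)
    # Step forward hour by hour until we land on an even-numbered hour.
    while candidate.hour % 2 != 0:
        candidate += timedelta(hours=1)
    return candidate


def minutes_until_shipment(now: datetime) -> float:
    """Minutes remaining before the next scheduled log shipment."""
    return (next_log_shipment(now) - now).total_seconds() / 60


if __name__ == "__main__":
    # An arbitrary Friday at 5:45pm, used only as sample input.
    friday_evening = datetime(2024, 1, 5, 17, 45)
    print(next_log_shipment(friday_evening))
    print(minutes_until_shipment(friday_evening))
```

For a 5:45pm timestamp, the next even-numbered hour is 6pm, so the sketch reports a gap of 15 minutes before the logs would ship.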
Last Friday, I was working late in order to get through a particularly large backlog, and I was in the zone. I had my headphones on and not a care in the world, until I got an instant message from someone in our New York office. The question was simple, but profound: “How often do we backup the primary database?”
That’s not the sort of question that you should give a knee-jerk response to, even at the best of times. I considered the possible answers for a moment, and then I remembered that it was Friday night at 5:45pm. Ignoring the query, I immediately opened remote terminals to all of the backup databases and shut them down, preventing any data being replicated from the primary.
Reasons and Repercussions
As it turned out, that last step saved us. What had happened was that someone in New York had accidentally deleted several thousand records from the database, and I managed to shut down the backup databases before the 6 o’clock log shipping. If I hadn’t been contacted when I was, or if I hadn’t immediately turned off the backup databases, then a scant few minutes later the transaction log that contained the command to delete the records would have percolated throughout the backup instances, wiping the data everywhere. I got lucky.
Admittedly, had the data been lost from all of the instances, it still wouldn’t have been the end of the world; as I mentioned, we do move tapes offsite. I could have called our media storage people and had them ship us the tape. I could then have recovered the last instance of the database from the tape and replayed it up to the point of that mistaken command. Then, I'd have put it on an external hard drive, driven to the datacenter, copied it off and activated it. It would probably only have taken two days. Not a catastrophe but, like I said, I got lucky and managed to avoid that particular headache.
The reason that I’m being hailed as a hero is my swift response in shutting down the secondary databases. What everyone marvels at, but no one really considers, is why I took that action. What made me shut down the databases with only the slightest hint that something was wrong? I don’t know, and that’s hard to deal with.
Since I don’t know, I can only chalk it up to experience. It was late on a Friday, that wasn’t a conversational kind of question, and then there was also my knowledge of the person asking it: That question didn’t strike me as the kind he would pose ‘off the cuff’, so to speak. In the end, I can only say that it was my intuition and experience in dealing with the network and those people that prompted my reaction. My biggest problem with this is that experience and intuition are impossible to teach. They must be acquired.
That being said, there are some things which can be taught; and so, yesterday, my supervisor and I held a training session for everyone in the company. We covered the underpinnings of the data flow and backup policies, and we highlighted that what really saved us was that someone spoke up immediately about a problem, and told a person who could respond effectively. I have worked hard to make sure that our users know that they can come and tell us about problems and mistakes without being rebuked, and our training session emphasized that behavior.
My experience certainly gave me pause, and I’m in the middle of evaluating how I can deal with what would have happened if we had missed that 15-minute window. If it happens again, at a less opportune time, what can I do to make sure that the recovery process isn’t stretched across two days? I have to answer that question now. Maybe you should reconsider some of the critical systems in your infrastructure, and re-evaluate your disaster recovery plan. If you don’t have one, I urge you to build one. Like many things, by the time it’s obvious that you need one, it’s too late.