If you are part of a team that is required to ensure that an application stays running at all hours, then you're likely to experience that 3AM-callout feeling. Grant knows all too well what is required, and gives hard-won advice on the best way of keeping on top of the task of keeping the IT services running, no matter what time of day the problems occur.
What is it about 3AM anyway? That time was a big deal during the last US presidential election too. I’m not sure why, but there’s something about 3AM, regardless of your time zone, that seems to be when disk drives fill up or backups fail. Then again, maybe that’s just me. Until recently, I’ve been on-call non-stop since 1995. With a few exceptions, my worst moments usually started around 3AM. Being on-call is considered a major part of the DBA position. You’re to be there, ready, able, and knowledgeable, at any time, night or day, including 3AM. How the heck can you deal with it?
To you, the DBA or responsible IT person, being on-call should not simply be a matter of having your phone number added to an email list. Nor should your company look at on-call as just “having an email distribution list for on-call duties”. There really are a lot of aspects to setting up and managing your on-call processes. In this article, I’m going to discuss a number of different aspects of being on-call and setting up and managing the process. At the end I will supply a checklist for you to use to set up your own on-call processes.
If you’re the only DBA, or even the only IT person, then this is not for you. Most of us, however, split at least some of our duties with others. You may only have a team of three, but you have a team. You may be part of a huge IT organization with hundreds of members, which means you most certainly are part of a team. What’s one of the things we learn in kindergarten? Sharing. Sharing the duties of being on-call is one of the single best ways you can make it less onerous. This means you have to figure out the different people in the organization that are going to be on-call. You need to have a posted schedule, available to all, not just the on-call people, but also the people who will be calling, especially management. You need to know who you are supposed to track down when, because at 3AM your brain might not be firing on all eight cylinders.
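As a rough illustration of a posted schedule, here is a minimal sketch of a weekly round-robin rotation. The names, the weekly shift length, and the `build_rotation` and `who_is_on_call` helpers are all assumptions made up for the example; your rotation may well run on different rules.

```python
from datetime import date, timedelta
from itertools import cycle, islice


def build_rotation(people, start, weeks):
    """Return (week_start, week_end, person) tuples, one per week, round-robin."""
    if not people:
        raise ValueError("need at least one person in the rotation")
    schedule = []
    for i, person in enumerate(islice(cycle(people), weeks)):
        week_start = start + timedelta(weeks=i)
        week_end = week_start + timedelta(days=6)
        schedule.append((week_start, week_end, person))
    return schedule


def who_is_on_call(schedule, day):
    """Look up the person covering a given date, or None if out of range."""
    for week_start, week_end, person in schedule:
        if week_start <= day <= week_end:
            return person
    return None


if __name__ == "__main__":
    # Hypothetical three-person team starting on a Monday.
    rota = build_rotation(["Ana", "Ben", "Chi"], date(2024, 1, 1), weeks=6)
    for week_start, week_end, person in rota:
        print(f"{week_start} to {week_end}: {person}")
```

Printing the result somewhere everyone can see it, management included, matters more than how it is generated.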
Another consideration when sharing on-call is determining who gets called when. Let’s face it, not all problems are going to require a Kimberly Tripp level of knowledge about SQL Server internals to solve them. If a backup failed because a disk was full, I’m fairly certain a more junior IT person can handle it. So, if you have enough people, it might be worth establishing a two-tier approach. That way, you can have the simpler, easier-to-solve stuff go to the more junior members of the team. And what better incentive can you possibly have to stop being junior than to get more sleep?
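A two-tier approach can be as simple as a lookup table. The sketch below is only one way to express it: the problem categories and the `route` function are hypothetical, and it assumes that anything unrecognized should go to the senior tier rather than be guessed at.

```python
from enum import Enum


class Tier(Enum):
    JUNIOR = 1
    SENIOR = 2


# Hypothetical problem categories; replace with the ones your shop actually sees.
TIER_BY_PROBLEM = {
    "backup_failed_disk_full": Tier.JUNIOR,
    "agent_job_failed": Tier.JUNIOR,
    "database_corruption": Tier.SENIOR,
    "cluster_failover": Tier.SENIOR,
}


def route(problem, escalated=False):
    """Pick the tier to page. Unknown or escalated problems go to the senior tier."""
    if escalated:
        return Tier.SENIOR
    return TIER_BY_PROBLEM.get(problem, Tier.SENIOR)
```

The `escalated` flag covers the case where the junior person has looked at the problem and needs to hand it up.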
Training and Documentation
Now you know who is going to be on-call. But are these people qualified? You should establish what the minimum level of knowledge would be for a person to qualify for the on-call rotation. Do they need to know how to write a T-SQL backup script, or just know how to restart one in SQL Agent? It depends on the needs of your organization. Is the purpose of your on-call support to act as a triage unit that determines the scope of a problem, and then calls a second-tier on-call team? Then the need for extreme levels of training is reduced. If you only have a few people for the on-call rotation then you have a greater need to see that they are equally well-trained, or the entire process will fall to one or a few people.
As a major part of the training, you need to make sure you keep good documentation. The worst thing in the world, and I know this from personal experience, is to get that 3AM call because a data load process failed, and you don’t have a clue what the job does, where it came from, if it’s important, how to fix it, who to call if you can’t fix it…you get the point. At 3AM, the best response to a lack of knowledge is to call management and then go back to bed. Now your managers start phoning each other and wondering who dropped the ball. If it was your project that got into production without good documentation, you’re going to have some fun meetings the following morning. Writing down how to recover a failed process, or at least letting everyone know who to call if a process fails, is vital to having a smooth response.
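One way to keep yourself honest about documentation is to give every job a fixed set of questions that must be answered before it goes to production. The following is a minimal sketch; the `RunbookEntry` fields simply mirror the questions above and are my own assumption, not a standard.

```python
from dataclasses import dataclass, field, fields


@dataclass
class RunbookEntry:
    job_name: str
    purpose: str = ""             # what the job does
    origin: str = ""              # which project or team it came from
    importance: str = ""          # e.g. "can wait until morning"
    recovery_steps: list = field(default_factory=list)  # how to fix it
    escalation_contact: str = ""  # who to call if you can't fix it


def missing_fields(entry):
    """Names of fields left empty, so the gaps show up before 3AM, not during it."""
    return [f.name for f in fields(entry) if not getattr(entry, f.name)]
```

A check like `missing_fields` could be run as part of a release review, so that an undocumented job is caught while its author is still awake.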
The Tiger Team
Most on-call situations are minor little things, such as a locked file preventing a backup from completing. These can be addressed by normal on-call processing. But sometimes the outage that occurs at 3AM is major and will take hours or even days to recover from. While I was in the Navy, certain pieces of equipment were vital to the operation of the submarine. When these things went offline a “tiger team” was created. The tiger team went watch-and-watch, meaning they worked 24 hours a day until the situation was resolved. I’ve done the same thing in other jobs.
The concept is simple: when you have a major outage it has to be fixed. The first person on-call, or the first senior support person, takes the first shift and starts working on the problem. Do you have a second senior level person? Send them home or back to bed, immediately. The first person works up to 12 hours - and no more - on the issue. At the end of that time, they turn it over to Person #2, who goes another 12 hours. The cycle continues until the problem is resolved.
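To make the handover times explicit, the shift plan can be laid out in advance. This is a small sketch assuming a fixed 12-hour shift and a team of at least two; the `tiger_team_shifts` function and the names are invented for the example.

```python
from datetime import datetime, timedelta
from itertools import cycle

SHIFT_LENGTH = timedelta(hours=12)  # the 12-hour cap described above


def tiger_team_shifts(team, outage_start, shifts):
    """Return (start, end, person) for consecutive 12-hour shifts, alternating people."""
    if len(team) < 2:
        raise ValueError("watch-and-watch needs at least two people")
    plan = []
    start = outage_start
    for person, _ in zip(cycle(team), range(shifts)):
        end = start + SHIFT_LENGTH
        plan.append((start, end, person))
        start = end
    return plan


if __name__ == "__main__":
    # Hypothetical outage that starts at 3AM.
    for start, end, person in tiger_team_shifts(["Ana", "Ben"], datetime(2024, 1, 1, 3, 0), 4):
        print(f"{start:%a %H:%M} to {end:%a %H:%M}: {person}")
```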
Do you feel really comfortable with your ability to take a tail-log backup? Great. How about performing a point-in-time recovery? When was the last time you did it? Yesterday? Great! We’re done here. But if it was six months ago, or sometime last year, or you tested it once when you set up the backups five years ago, picture this: once more, it’s 3AM, the call comes in, and you need to do a point-in-time recovery now, half-awake, on too little sleep, when you haven’t done one for years, if ever.
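For practice sessions, it can help to generate the restore script rather than type it from memory. The sketch below builds the T-SQL for a tail-log backup followed by a point-in-time restore. It assumes the full recovery model, a single full backup plus an unbroken chain of log backups on disk, and hypothetical file paths; `point_in_time_script` and its helpers are names I made up. It only produces text for you to review, and it does not connect to a server.

```python
from datetime import datetime


def quote_name(name):
    """Bracket-quote a SQL Server identifier, doubling any closing brackets."""
    return "[" + name.replace("]", "]]") + "]"


def quote_literal(text):
    """Quote a string as a T-SQL Unicode literal, doubling single quotes."""
    return "N'" + text.replace("'", "''") + "'"


def point_in_time_script(database, tail_path, full_path, log_paths, stop_at):
    """Build the T-SQL for a tail-log backup followed by a restore to stop_at."""
    db = quote_name(database)
    stop = "'" + stop_at.strftime("%Y-%m-%dT%H:%M:%S") + "'"
    statements = [
        # Tail-log backup: captures log not yet backed up, leaves the DB restoring.
        f"BACKUP LOG {db} TO DISK = {quote_literal(tail_path)} WITH NORECOVERY;",
        f"RESTORE DATABASE {db} FROM DISK = {quote_literal(full_path)} WITH NORECOVERY;",
    ]
    # Apply each log backup in order, ending with the tail, stopping at stop_at.
    for path in list(log_paths) + [tail_path]:
        statements.append(
            f"RESTORE LOG {db} FROM DISK = {quote_literal(path)} "
            f"WITH STOPAT = {stop}, NORECOVERY;"
        )
    # Bring the database online once the stop point has been reached.
    statements.append(f"RESTORE DATABASE {db} WITH RECOVERY;")
    return statements
```

Run the generated statements against a test server, never production, until the sequence is second nature.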
You must practice recovery. It’s no different from the martial arts, where you practice techniques at all possible speeds against a variety of opponents. You must do the same thing with your on-call responses. Are you expected to respond to database corruption at 3AM? Then you’d better be recovering a corrupted database at least once a month. Set up a training program so that you and your team regularly go through the responses that you are documenting and are expected to perform. This needs to be a standard part of the on-call team’s training. Practice like it was a real emergency and then, when you mess stuff up, practice again until you get it right. Use your documentation and follow it step-by-step. If the documentation is wrong, fix it, immediately.
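A simple way to keep practice from slipping is to track when each drill was last run. Here is a minimal sketch that assumes a 30-day limit, matching the once-a-month suggestion above; the drill names and the `overdue_drills` function are hypothetical.

```python
from datetime import date, timedelta


def overdue_drills(last_practiced, today, max_age=timedelta(days=30)):
    """Return drill names not practiced within max_age, oldest first.

    last_practiced maps a drill name to the date it was last run, or None if never.
    """
    overdue = [
        (when or date.min, name)
        for name, when in last_practiced.items()
        if when is None or today - when > max_age
    ]
    return [name for _, name in sorted(overdue)]
```

Drills that have never been run sort to the front of the list, which is where they belong.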
Monitor your Servers
You’re on-call, right? What or who is calling you? Are you getting calls from business people to report that they can’t connect to the database? Then chances are you’re not doing this right. Your servers should be calling you. You should be able to set up mechanisms or purchase products that will keep track of your servers and your processes such that, when you have an issue, it’s the server that lets you know. Getting monitoring and alerts right is something that takes a lot of time and effort and will require constant review because of the changes to your systems and applications. But it’s a major part of the on-call process.
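As a tiny example of the server doing the calling, here is a sketch of a disk-space check that returns a message only when free space drops below a threshold. The 10% threshold, the `disk_alert` name, and the injectable `probe` parameter are assumptions for illustration; a real monitoring product does far more than this, and how the message reaches your phone is left out entirely.

```python
import shutil


def free_fraction(path):
    """Fraction of the volume holding `path` that is still free (0.0 to 1.0)."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total


def disk_alert(path, min_free=0.10, probe=free_fraction):
    """Return an alert message if free space is below min_free, else None."""
    free = probe(path)
    if free < min_free:
        return f"LOW DISK on {path}: {free:.0%} free (threshold {min_free:.0%})"
    return None
```

Passing a fake `probe` lets you test the alert logic without having to fill up a disk first.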
You need to establish, as fast as possible, a very high signal-to-noise ratio. This means you should get only alerts that are meaningful and vital. If you’re getting up at 3AM because there was a momentary spike on the CPU, one that no one could have noticed, let alone done anything about, you have an issue. About the second or third time this alert fires, you’re going to start ignoring it. As soon as you start ignoring alerts, you’re in trouble, because the next alert is going to be important. This particular alert is noise. Eliminate it, or set the threshold higher, or make it require a longer duration, or whatever is needed to make it fire only when there is an issue that you not only can fix, but want to fix.
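One common way to filter out a momentary spike is to require the condition to hold for several readings in a row before alerting. The sketch below shows that idea only; the function name, the threshold, and the sample counts are assumptions, and the right numbers depend on your systems.

```python
def sustained_breach(samples, threshold, min_consecutive):
    """True if `samples` has at least `min_consecutive` readings in a row above `threshold`."""
    run = 0
    for value in samples:
        # Extend the current run on a breach, reset it on a normal reading.
        run = run + 1 if value > threshold else 0
        if run >= min_consecutive:
            return True
    return False


if __name__ == "__main__":
    cpu = [20, 99, 25, 30]  # hypothetical one-sample spike: not worth a 3AM call
    print(sustained_breach(cpu, threshold=90, min_consecutive=3))
```

With one-minute samples, `min_consecutive=3` would mean the CPU has to stay above the threshold for three minutes before anyone gets woken up.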
This might seem like a silly topic, but how heavy a sleeper are you? 3AM may be the deepest part of your sleep cycle and you might need something serious to wake you up. Pick a ring tone that will do the job. I used to have the Dropkick Murphys’ version of Scotland the Brave as my ring tone. It worked well because it started off fairly sedate with just the bagpipes playing and then built up into a full-blown guitar and drums version of the song that would wake the dead. But then I started to hate the song after a while, so I changed to a really obnoxious normal ring tone that did the job just as well. But the key here is to wake up and get up, because if you’ve adjusted your signal-to-noise ratio, this is important.
This brings up a couple of extra points. Make sure your phone is charged. If I were your boss, I wouldn’t be pleased to hear that the European wing of the business had been offline for four hours because you forgot to charge your phone (neither was my boss when I did it once). Make sure your phone is where you can hear it. It’s great that you have a scary ring tone that wakes up zombies, but if you left the phone downstairs (as a former co-worker used to do, regularly), it’s not doing you any good, is it?
There was a time when being on-call meant being ready to charge in to the office at a moment’s notice (and man, am I glad those days are over). Now you should have remote access set up from your home. Whatever level of security is needed by the company, even if you have to install a special router or carry an “on-call” laptop, you need to be able to get to the systems when that 3AM phone call arrives. Most of us live quite a distance from our places of work (my new one is across the ocean, on an island off a different continent), so running into work is not a timely way to deal with system outages.
Today, with all the capabilities of tablets and phones, it’s entirely possible to access your servers from anywhere. If your company is serious about the on-call process, they should be willing to lay down the cost of unlimited data plans for the devices that need them. With these you can even tether your laptop, if needed, so that you have a full-fledged internet connection to your servers from anywhere. It might not be good enough for day-to-day work, but it will suffice when that call comes in and you’re at the local user group meeting.
This discussion is just an overview of all the possible decisions that have to be made around the on-call process for your business. But it should be enough to begin to set up a viable on-call system for most enterprises. Good luck, and I hope you’re not reading this at 3AM.
The On-Call Checklist
- On-call rotation, documented and published
- Establish and document tiered support, if needed
- Document the minimum levels of training and knowledge required to be on-call
- Document the processes that must be addressed regardless of the time
- Create an emergency response or “tiger team”
- Practice all the on-call tasks regularly
- Review the documentation as part of practice, and fix bad or missing documentation as soon as it’s discovered
- Set up monitoring and alerting
- Charge your phone and have it with you at all times when on-call
- Set up remote access
- Get a data plan for your mobile devices