Grant Fritchey

It's 3AM and I'm on call

27 April 2011

If you are part of a team that is required to ensure that an application stays running at all hours, then you're likely to experience that 3AM-callout feeling. Grant knows all too well what is required, and gives hard-won advice on the best way of keeping on top of the task of keeping the IT services running, no matter what time of day the problems occur.

Do you know where your SQL Server instances are?

You should.

What is it about 3AM anyway? That time was a big deal during the last US presidential election too. I’m not sure why, but there’s something about 3AM, regardless of your time zone, that seems to be when disk drives fill up or backups fail. Then again, maybe that’s just me. Until recently, I’ve been on-call non-stop since 1995. With a few exceptions, my worst moments usually started around 3AM. Being on-call is considered a major part of the DBA position. You’re to be there, ready, able, and knowledgeable, at any time, night or day, including 3AM. How the heck can you deal with it?

To you, the DBA or responsible IT person, being on-call should not simply be a matter of having your phone number added to an email list. Nor should your company look at on-call as just “having an email distribution list for on-call duties”. There really are a lot of aspects to setting up and managing your on-call processes. In this article, I’m going to discuss a number of different aspects of being on-call and setting up and managing the process. At the end I will supply a checklist for you to use to set up your own on-call processes.

Sharing

If you’re the only DBA, or even the only IT person, then this is not for you. Most of us, however, split at least some of our duties with others. You may only have a team of three, but you have a team. You may be part of a huge IT organization with hundreds of members, which means you most certainly are part of a team. What’s one of the things we learn in kindergarten? Sharing. Sharing the duties of being on-call is one of the single best ways you can make it less onerous. This means you have to figure out the different people in the organization that are going to be on-call. You need to have a posted schedule, available to all, not just the on-call people, but also the people who will be calling, especially management. You need to know who you are supposed to track down when, because at 3AM your brain might not be firing on all eight cylinders.

Another consideration when sharing on-call is determining who gets called when. Let’s face it, not all problems are going to require a Kimberly Tripp level of knowledge about SQL Server internals to solve them. If a backup failed because a disk was full, I’m fairly certain a more junior IT person can handle it. So, if you have enough people, it might be worth establishing a two-tier approach. That way, you can have the simpler, easier-to-solve stuff go to the more junior members of the team. And what better incentive can you possibly have to stop being junior than to get more sleep?
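To make the posted schedule and the two-tier idea concrete, here is a minimal Python sketch of a weekly rotation with routing by severity. Everything in it is an assumption for illustration: the names in the rosters, the weekly handover, the start date, and the "minor" severity label are hypothetical and not taken from any real tool.

```python
from datetime import date

# Hypothetical rosters; the names and tier contents are illustrative only.
TIER1 = ["alex", "blake", "casey"]   # junior: full disks, failed backups
TIER2 = ["dana", "evan"]             # senior: corruption, major outages

ROTATION_START = date(2011, 4, 25)   # a Monday, treated as week 0


def on_call(roster, day, start=ROTATION_START):
    """Return who in `roster` is on call for the week containing `day`."""
    weeks_elapsed = (day - start).days // 7
    return roster[weeks_elapsed % len(roster)]


def who_to_call(day, severity):
    """Route simple problems to tier 1 and everything else to tier 2."""
    roster = TIER1 if severity == "minor" else TIER2
    return on_call(roster, day)
```

Calling who_to_call(date(2011, 5, 2), "minor") picks the second person in the junior roster, because that date falls in week 1 of the rotation. Publishing the output of something like this where callers and management can see it covers the "posted schedule" requirement.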

Training and Documentation

Now you know who is going to be on-call. But are these people qualified? You should establish what the minimum level of knowledge would be for a person to qualify for the on-call rotation. Do they need to know how to write a T-SQL backup script, or just know how to restart one in SQL Agent? It depends on the needs of your organization. Is the purpose of your on-call support to act as a triage unit that determines the scope of a problem, and then calls a second-tier on-call team? Then the need for extreme levels of training is reduced. If you only have a few people for the on-call rotation then you have a greater need to see that they are equally well-trained, or the entire process will fall to one or a few people.

As a major part of the training, you need to make sure you keep good documentation. The worst thing in the world, and I know this from personal experience, is to get that 3AM call because a data load process failed, and you don’t have a clue what the job does, where it came from, if it’s important, how to fix it, who to call if you can’t fix it…you get the point. At 3AM, the best response to a lack of knowledge is to call management and then go back to bed. Now your managers start phoning each other and wondering who dropped the ball. If it was your project that got into production without good documentation, you’re going to have some fun meetings the following morning. Writing down how to recover a failed process, or at least letting everyone know who to call if a process fails, is vital to having a smooth response.

The Tiger Team

Most on-call situations are minor little things, such as a locked file preventing a backup from completing. These can be addressed by normal on-call processing. But sometimes the outage that occurs at 3AM is major and will take hours or even days to recover from. While I was in the Navy, certain pieces of equipment were vital to the operation of the submarine. When these things went offline a “tiger team” was created. The tiger team went watch-and-watch, meaning they worked 24 hours a day until the situation was resolved. I’ve done the same thing in other jobs.

The concept is simple: when you have a major outage it has to be fixed. The first person on-call, or the first senior support person, takes the first shift and starts working on the problem. Do you have a second senior level person? Send them home or back to bed, immediately. The first person works up to 12 hours - and no more - on the issue. At the end of that time, they turn it over to Person #2, who goes another 12 hours. The cycle continues until the problem is resolved.
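The handover cycle is easy to write down ahead of time so nobody has to work it out mid-outage. The sketch below is one possible way to lay out the shifts in Python; the function name and the placeholder people are hypothetical, and the only rule taken from the text is the 12-hour limit.

```python
from datetime import datetime, timedelta
from itertools import cycle

SHIFT = timedelta(hours=12)  # the hard limit per person described above


def tiger_team_shifts(people, outage_start, shifts):
    """Return (person, start, end) tuples, alternating through `people`."""
    plan = []
    start = outage_start
    # zip stops after `shifts` entries even though cycle() never ends.
    for person, _ in zip(cycle(people), range(shifts)):
        plan.append((person, start, start + SHIFT))
        start += SHIFT
    return plan
```

For an outage that starts at 3AM with two senior people, the first entry runs from 3AM to 3PM, the second from 3PM to 3AM the next day, and the third goes back to the first person.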

Practice

Do you feel really comfortable with your ability to take a tail-log backup? Great. How about performing a point-in-time recovery? When was the last time you did it? Yesterday? Great! We’re done here. But if it was six months ago, or sometime last year, or once when you set up the backups five years ago, picture this: it’s 3AM, the call comes in, and you need to do a point-in-time recovery now, half-awake on too little sleep, when you haven’t done one for years, if ever.

You must practice recovery. It’s no different from the martial arts, where you practice techniques at all possible speeds against a variety of opponents. You must do the same thing with your on-call responses. Are you expected to respond to database corruption at 3AM? Then you’d better be recovering a corrupted database at least once a month. Set up a training program so that you and your team regularly go through the responses that you are documenting and are expected to perform. This needs to be a standard part of the on-call team’s training. Practice like it was a real emergency and then, when you mess stuff up, practice again until you get it right. Use your documentation and follow it step-by-step. If the documentation is wrong, fix it, immediately.
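One way to keep a drill repeatable is to generate the restore script from a few inputs, so the steps are the same every time you practice. The Python sketch below builds the usual T-SQL sequence for a point-in-time recovery: a tail-log backup, a full restore, the log restores with STOPAT, and a final recovery. The function name, database name, and file paths are hypothetical placeholders, it assumes a simple chain of one full backup plus log backups, and it does no escaping of quotes in paths. Treat it as a practice aid for a test server, not a production procedure.

```python
def point_in_time_drill(db, full_backup, log_backups, tail_backup, stop_at):
    """Return the T-SQL statements for a point-in-time restore drill.

    The database name and every path are placeholders from the caller.
    """
    steps = [
        # Tail-log backup: captures log records not yet backed up and
        # leaves the database in the RESTORING state.
        f"BACKUP LOG [{db}] TO DISK = N'{tail_backup}' WITH NORECOVERY;",
        # Restore the full backup without recovering, so logs can follow.
        f"RESTORE DATABASE [{db}] FROM DISK = N'{full_backup}' WITH NORECOVERY;",
    ]
    # Apply each log backup in order, ending with the tail, up to stop_at.
    for path in list(log_backups) + [tail_backup]:
        steps.append(
            f"RESTORE LOG [{db}] FROM DISK = N'{path}' "
            f"WITH NORECOVERY, STOPAT = N'{stop_at}';"
        )
    # Bring the database online at the chosen point in time.
    steps.append(f"RESTORE DATABASE [{db}] WITH RECOVERY;")
    return steps
```

Printing the returned list gives you a script to paste into a query window on the practice server, and a convenient thing to compare against your written documentation.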

Monitor your Servers

You’re on-call, right? What or who is calling you? Are you getting calls from business people to report that they can’t connect to the database? Then chances are you’re not doing this right. Your servers should be calling you. You should be able to set up mechanisms or purchase products that will keep track of your servers and your processes such that, when you have an issue, it’s the server that lets you know. Getting monitoring and alerts right is something that takes a lot of time and effort and will require constant review because of the changes to your systems and applications. But it’s a major part of the on-call process.

You need to establish, as fast as possible, a very high signal-to-noise ratio. This means you should only get alerts that are meaningful and vital. If you’re getting up at 3AM because there was a momentary spike on the CPU, one that no one could have noticed, let alone done anything about, you have an issue. About the second or third time this alert fires, you’re going to start ignoring it. As soon as you start ignoring alerts, you’re in trouble, because the next alert is going to be important. This particular alert is noise. Eliminate it, or set the threshold higher, or longer, or whatever is needed to make it fire only when there is an issue that you not only can fix, but want to fix.
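Setting the threshold "longer" usually means requiring the condition to hold for several consecutive readings before anyone gets paged. Here is a minimal sketch of that rule in Python; the function name, the sample values, and the choice of three readings are assumptions for illustration, not settings from any particular monitoring product.

```python
def should_alert(samples, threshold, sustained):
    """Return True if the last `sustained` samples all exceed `threshold`.

    `samples` is a list of readings, oldest first (for example CPU %).
    """
    if sustained <= 0 or len(samples) < sustained:
        return False
    return all(value > threshold for value in samples[-sustained:])
```

With a threshold of 90 and three sustained readings, a history of 20, 99, 25, 30 stays quiet because the spike was momentary, while 50, 95, 96, 97 pages you because the load has stayed high.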

Ring Tones

This might seem like a silly topic, but how heavy a sleeper are you? 3AM may be the deepest part of your sleep cycle and you might need something serious to wake you up. Pick a ring tone that will do the job. I used to have the Dropkick Murphys’ version of Scotland the Brave as my ring tone. It worked well because it started off fairly sedate with just the bagpipes playing and then built up into a full-blown guitar and drums version of the song that would wake the dead. But then I started to hate the song after a while, so I changed to a really obnoxious normal ring tone that did the job just as well. But the key here is to wake up and get up, because if you’ve adjusted your signal-to-noise ratio, this is important.

This brings up a couple of extra points. Make sure your phone is charged. If I were your boss, I wouldn’t be pleased to hear that the European wing of the business had been offline for four hours because you forgot to charge your phone (neither was my boss when I did it once). Make sure your phone is where you can hear it. It’s great that you have a scary ring tone that wakes up zombies, but if you left the phone downstairs (as a former co-worker used to do, regularly), it’s not doing you any good, is it?

Remote Access

There was a time when being on-call meant being ready to charge in to the office at a moment’s notice (and man, am I glad those days are over). Now you should have remote access set up from your home. Whatever level of security is needed by the company, even if you have to install a special router or carry an “on-call” laptop, you need to be able to get to the systems when that 3AM phone call arrives. Most of us live quite a distance from our places of work (my new one is across the ocean, on an island off a different continent), so running into work is not a timely way to deal with system outages.

Today, with all the capabilities of tablets and phones, it’s entirely possible to access your servers from anywhere. If your company is serious about the on-call process, they should be willing to lay down the cost of unlimited data plans for the devices that need them. With these you can even tether your laptop, if needed, so that you have a full-fledged internet connection to your servers from anywhere. It might not be good enough for day-to-day work, but it will suffice when that call comes in and you’re at the local user group meeting.

Conclusion

This discussion is just an overview of all the possible decisions that have to be made around the on-call process for your business. But it should be enough to begin to set up a viable on-call system for most enterprises. Good luck, and I hope you’re not reading this at 3AM.

The On-Call Checklist

  • On-call rotation, documented and published
  • Establish and document tiered support, if needed
  • Document the minimum levels of training and knowledge required to be on-call
  • Document the processes that must be addressed regardless of the time
  • Create an emergency response or “tiger team”
  • Practice all the on-call tasks regularly
  • Review the documentation as part of practice, and fix bad or missing documentation as soon as it’s discovered
  • Set up monitoring and alerting
  • Charge your phone and have it with you at all times when on-call
  • Set up remote access
  • Get a data plan for your mobile devices

Author profile:

Grant Fritchey, SQL Server MVP, works for Red Gate Software as Product Evangelist. In his time as a DBA and developer, he has worked at three failed dot-coms, a major consulting company, a global bank and an international insurance & engineering company. Grant volunteers for the Professional Association of SQL Server Users (PASS). He is the author of the books SQL Server Execution Plans (Simple-Talk) and SQL Server 2008 Query Performance Tuning Distilled (Apress). He is one of the founding officers of the Southern New England SQL Server Users Group (SNESSUG) and its current president. He earned the nickname “The Scary DBA.” He even has an official name plate, and displays it proudly.

Have Your Say


Subject: Good Points
Posted by: @SQLShaw (not signed in)
Posted on: Wednesday, May 04, 2011 at 7:23 AM
Message: Thanks Grant. I don't think enough of us are talking about the support aspect about our work.


Subject: Good one...
Posted by: Rahul Singla (not signed in)
Posted on: Wednesday, May 04, 2011 at 8:23 AM
Message: Hi Grant, really good read... I went through it being largely an application developer because I had had some on call instances.. You seem to have had countless of them :)

Subject: great!
Posted by: maverick_ph (view profile)
Posted on: Thursday, May 05, 2011 at 7:59 PM
Message: kudos to you Grant! great article! i really love reading it over and over again..thank you again! more more more! keep it up! it help me a lot..

Subject: Thanks
Posted by: Grant Fritchey (view profile)
Posted on: Monday, May 09, 2011 at 8:54 AM
Message: Glad you all liked it.

maverick_ph, any ideas? What do you feel like you're not getting out of all the articles here on Simple-Talk and we'll get you more of that.

Subject: Brilliant
Posted by: Jedi (view profile)
Posted on: Tuesday, May 10, 2011 at 2:45 AM
Message: I thought those lonely, cold, pin-drop-quiet nights of keyboard tapping only applied to me, and whispering on the phone so as not to disturb your partner.

Shout out to the partners with exceptional patience!

Subject: Respect
Posted by: Nicolaas (not signed in)
Posted on: Tuesday, May 10, 2011 at 5:09 AM
Message: Grant, really good article. I shudder whenever I hear the Nokia ringtone...

Subject: Re
Posted by: Grant Fritchey (view profile)
Posted on: Friday, May 20, 2011 at 11:02 AM
Message: RE: Brilliant
Yeah, I didn't even mention the long suffering Mrs. Scary. Good point

RE: Respect
Funny how the noise on those things starts to really grate isn't it?

 
