Click here to monitor SSC
  • Av rating:
  • Total votes: 28
  • Total comments: 6
Grant Fritchey

Preventing Problems in SQL Server

15 August 2011

It is never a good idea to let your users be the ones to tell you of database server outages. It is far better to be able to spot potential problems by being alerted for the most relevant conditions on your servers at the best threshold. This will take time and patience, but the reward will be an alerting system which allows you to deal more effectively with issues before they involve system down-time

No one enjoys server outages. Well, most people don’t. I actually get a bit of a kick out of server outages, the adrenaline, the do-or-die, pull out all the stops troubleshooting, the weeping and gnashing of teeth for the person who caused the outage (as long as it wasn’t me)…  all that can actually be kind of fun. But, the fact of the matter is, businesses really don’t like server outages. It usually means lost revenue, and you may as well translate that directly to your paycheck.

How quickly you respond to outages makes a big difference to how long they last. The speed of the response is directly correlated to the accuracy of your monitoring system. But really, is that all monitoring can do for you? Send a message to let you know that the vacation you had planned this year needs to be cancelled because the server is offline, causing the company to hemorrhage money and it’s all getting taken out of your bonus? That almost makes it seem like you don’t need monitoring on the server at all. After all, you’ll get a call from the business when they realize the system is offline, and that can be your alert.

But there’s a lot more you can do with alerts than simply respond to system outages—you can get proactive.

Proactive Monitoring

Monitoring to report the failure of a piece of hardware, software, or a process, is an important part of your monitoring solution, but it’s only a part. Another important aspect is to monitor for events and statuses that don’t represent an emergency, but instead represent an impending emergency. Which is better, to get an alert that your log drive is running out of space, or to get an alert that your log drive has already run out of space? Despite my daredevil enjoyment of the chaos generated from an outage, I know that my job is to prevent problems, so I’d rather get the warning before the catastrophe has occurred.

The good news is, it’s not that hard to set up monitoring to see when your systems are running out of space, or any number of other alerts for that matter. You can do this through the alerts offered in the SQL Agent, or you can set up alerts through Policy Based Management, or you can use a third-party tool. All these mechanisms will enable you to find out if there are long running queries, or excessive blocking, or if a drive is running out of space. The trick is making sure that you’re responding to the alerts. Sounds easy, but it frequently isn’t… because of extraneous ‘noise’.

Noise

For the sake of our discussion, assume for a moment that you have an alert that will let you know when any hard drive you have is over 80% full. With that in hand, you no longer have to worry about the drive running out of space because you’re going to be alerted long before that occurs and you’ll be able to do something about it, right? Maybe.

Enabling the alert, you see that multiple servers have drives with less than 20% free space, so you get started resolving these issues, chortling to yourself that you won’t be responding to outages at 3 am.

Description: C:\Users\Heather\Documents\Work\RG Editing\Articles\Preventing Problems in SQL Server\3am.png

From Wake-Up call by Larry Gonick

But there’s a snag… one of the servers has less than 20% free space, but it’s on a drive that is 1 TB in size, meaning it actually has nearly 200 GB of free space. Looking at a history of the data on that drive you note that there has never been an increase of more than 20 GB over a six month period. The chance of running through 200 GB in less than a day is vanishingly remote. Now what? Each time your alert polls, you’re going to see this server that you don’t care about. What’s more, you find several other servers that are the same. Most people’s initial response to this very common situation is to ignore the one, two, or three drives that they know are not an issue.

There is a concept that started out in electrical engineering called the signal-to-noise ratio. Simply put, a pure transmission of electricity represents the signal. Any degradation of the pure transmission, from resistance,  outside interference, impurity of the transmission medium etc, is referred to as noise. You determine the quality of your transmission by dividing your signal by the amount of noise you have. This concept can be applied to general communication and can be absolutely applied to monitoring and alerting: Think about the actionable alerts that you want to do something about as signal and the alerts that you don’t care about for one reason or another as noise.

You want to achieve a very high signal-to-noise ratio with your alerts. It’s very important that you get the right information at the right time, so that you can take appropriate action;  as soon as you begin to introduce alerts that you don’t care about, such as the large drive that won’t run out of space any time soon, you’re introducing noise into your system. The more drives that don’t match your criteria, the more noise you’re introducing.

And so far I’ve only been talking about a disk usage alert. You have to multiply this by the hundreds of possible alerts that you could set up on a system. Not all alerts would be applicable to that system. Not all alert thresholds would be the same for that system. As you start to see these non-applicable alerts and alerts that are set to the wrong threshold, you’re dealing with noise in the system.

Adjusting Alerts

If you’re building your own monitoring system, one alert at a time, you’ll be able to make adjustments to each alert as you go. This will keep your signal-to-noise ratio nice and high. If, however, you’ve purchased a third party monitoring system, it will usually have a large number of alerts built into the system. When you turn your alerting system on for the first time, frequently your screen fills up with all sorts of dire warnings about the state of your servers. Usually, many, or even most, of these alerts are accurate, actionable information—good signal. You respond to the alerts and fix or avert the problems, making your systems more stable and avoiding disaster. But there’s also usually a lot of noise. Alerts are firing because they are inappropriate to your systems, or because the thresholds are too low or too high. While these tools are supposed to make your life easier, no one said that ease would be free (not counting the cost of the software of course). The mistake made by most people is to simply clear the offending alert (or to turn the alerting system off altogether because it’s “broken”!).

Description: C:\Users\Heather\Documents\Work\RG Editing\Articles\Preventing Problems in SQL Server\NoAlerts.png

From Derek on Alert by Larry Gonick

Making it go away is not fixing it. A good alerting system will make the cleared alert come back up. After all, the idea of alerts is to poke you in the eye, to alert you. Instead you need to take the time and trouble to make the adjustments to the system so that it’s useful for you. The developers who built the monitoring system just didn’t know how your system was configured. So if you operate regularly with a large number of disks at greater than 80%, you need to adjust the threshold on that free space alert so that you reduce the noise of false or unnecessary alerts, and increase the signal, the accurate, timely alerts.

This process can take a little time. I would recommend setting aside a few hours to adjust the alerts immediately after you bring a monitoring system online, and then again every couple of weeks. Within a few of these sessions, you’ll probably have everything well in hand so you can start the next process—turning alerts on and off.

As I said before, most monitoring systems only enable a subset of the provided alerts. You need to determine if the ones they enabled are appropriate to your system. Further, you need to decide if the alerts that are disabled are needed on your system and what the appropriate thresholds are. But remember, the goal is to get your system to provide as much good information, pure signal, as possible, so only turn on the alerts you really need.

Finally, you’re going to find certain systems, drives, databases, whatever, just don’t match the criteria for alert thresholds you need for the rest of your system. You will need to isolate these exceptions so that they are not generating noise. Most third-party tools provide a means to do this, either by disabling the alert for certain systems or by adjusting the threshold for certain systems. The key here is that exceptions should be exceptional. If you find that you’re creating an exception for every database on every system, then maybe you need to go back and adjust the alert threshold itself. You don't want to try to maintain and document too many exceptions. Oh, didn’t I mention that? Yeah, if you create exceptions, you should document them so when an outage occurs you’re not scratching your head wondering why you didn’t get an alert, because you’ll have a record that you disabled the alert. This documentation process is another reason why your exceptions should be exceptional.

Conclusion

The goal when you set up your monitoring system alerts is to generate a pure signal. Alerts let you know when a system is offline, but more importantly, when a system is about to go offline. You need to take on the responsibility and the labor to adjust your alerts in order to increase the purity of that signal, and you will need to create some exceptions to your alerting.

Going through these tasks should provide you with a far more useful alert system and avoid the situation where you ignore everything or even turn it all off to avoid the noise. Your reward will be a well adjusted, proactive, alerting system which will result in an improvement in system up-time.

Grant Fritchey

Author profile:

Grant Fritchey, SQL Server MVP, works for Red Gate Software as Product Evangelist. In his time as a DBA and developer, he has worked at three failed dot–coms, a major consulting company, a global bank and an international insurance & engineering company. Grant volunteers for the Professional Association of SQL Server Users (PASS). He is the author of the books SQL Server Execution Plans (Simple-Talk) and SQL Server 2008 Query Performance Tuning Distilled (Apress). He is one of the founding officers of the Southern New England SQL Server Users Group (SNESSUG) and it’s current president. He earned the nickname “The Scary DBA.” He even has an official name plate, and displays it proudly.

Search for other articles by Grant Fritchey

Rate this article:   Avg rating: from a total of 28 votes.


Poor

OK

Good

Great

Must read
Have Your Say
Do you have an opinion on this article? Then add your comment below:
You must be logged in to post to this forum

Click here to log in.


Subject: Great Article
Posted by: Josh Jones (not signed in)
Posted on: Friday, August 19, 2011 at 10:46 AM
Message: This is a great article Grant. I just began working in a new environment, and the first thing I noticed was that the previous DB manager had purchased a popular monitoring software, and subsequently the DBAs have been slowly turning the alerts off (instead of tuning them) because of the "signal to noise" ratio. So I'm pushing back on both the vendor AND the DBA staff to work on getting appropriate alerts, thresholds, and response processes in place. This is a good "reminder" for a lot of shops that this is an important part of managing the environment properly, which just helps everyone get more sleep :)

Subject: Thanks
Posted by: Grant Fritchey (view profile)
Posted on: Monday, August 22, 2011 at 6:43 PM
Message: Glad you found it useful. I agree. I think it's the one thing that people miss with all the monitoring tools.

Subject: Alerts can be incredibly useful.
Posted by: timothyawiseman@gmail.com (view profile)
Posted on: Wednesday, August 24, 2011 at 4:53 PM
Message: You make a good point. Having appropriate alerts has saved me more than once from havin a system become unusable. And much like unit-tests, when I find something that takes the system down that I did not have an alert for, I tend to try to create one to monitor for signs of it for next time.


Subject: Excellent
Posted by: Grant Fritchey (view profile)
Posted on: Friday, August 26, 2011 at 6:37 AM
Message: That's the way to run it Timothy. Well done.

Subject: Good article
Posted by: Anonymous (not signed in)
Posted on: Thursday, September 01, 2011 at 4:44 PM
Message: Excellent article. I love the honesty about enjoying downtimes. I'm not sure I would want to admit it publicly, but if I didn't enjoy them I would find another line of work.

One key bit of alerting I often see missed is the difference between getting alerted when something fails, and getting alerted when it failed to run. If you define an SLA for your jobs, and rules to check that they have run successfully, you won't ever have to wake up and find out that you have no backups for the last month because someone turned a sql agent off.

Subject: Failed to run
Posted by: Grant Fritchey (view profile)
Posted on: Tuesday, September 06, 2011 at 11:54 AM
Message: Excellent point. It's something that I do, but I should have pointed it out. You're 100% correct. Thanks for adding to the discussion.

And yeah, outages are fun aren't they? It feels wrong to love them... but I sure do, except when I have major plans like dinner with the Mrs. or a camping trip or something. But other than that. They're great!

 

Phil Factor
Searching for Strings in SQL Server Databases

Sometimes, you just want to do a search in a SQL Server database as if you were using a search engine like Google.... Read more...

 View the blog

Top Rated

SQL Server XML Questions You Were Too Shy To Ask
 Sometimes, XML seems a bewildering convention that offers solutions to problems that the average... Read more...

Continuous Delivery and the Database
 Continuous Delivery is fairly generally understood to be an effective way of tackling the problems of... Read more...

The SQL Server Sqlio Utility
 If, before deployment, you need to push the limits of your disk subsystem in order to determine whether... Read more...

The PoSh DBA - Reading and Filtering Errors
 DBAs regularly need to keep an eye on the error logs of all their SQL Servers, and the event logs of... Read more...

MySQL Compare: The Manual That Time Forgot, Part 1
 Although SQL Compare, for SQL Server, is one of Red Gate's best-known products, there are also 'sister'... Read more...

Most Viewed

Beginning SQL Server 2005 Reporting Services Part 1
 Steve Joubert begins an in-depth tour of SQL Server 2005 Reporting Services with a step-by-step guide... Read more...

Ten Common Database Design Mistakes
 If database design is done right, then the development, deployment and subsequent performance in... Read more...

SQL Server Index Basics
 Given the fundamental importance of indexes in databases, it always comes as a surprise how often the... Read more...

Reading and Writing Files in SQL Server using T-SQL
 SQL Server provides several "standard" techniques by which to read and write to files but, just... Read more...

Concatenating Row Values in Transact-SQL
 It is an interesting problem in Transact SQL, for which there are a number of solutions and... Read more...

Why Join

Over 400,000 Microsoft professionals subscribe to the Simple-Talk technical journal. Join today, it's fast, simple, free and secure.