No one enjoys server outages. Well, most people don’t. I actually get a bit of a kick out of server outages, the adrenaline, the do-or-die, pull out all the stops troubleshooting, the weeping and gnashing of teeth for the person who caused the outage (as long as it wasn’t me)… all that can actually be kind of fun. But, the fact of the matter is, businesses really don’t like server outages. It usually means lost revenue, and you may as well translate that directly to your paycheck.
How quickly you respond to outages makes a big difference to how long they last. The speed of the response is directly correlated to the accuracy of your monitoring system. But really, is that all monitoring can do for you? Send a message to let you know that the vacation you had planned this year needs to be cancelled because the server is offline, causing the company to hemorrhage money and it’s all getting taken out of your bonus? That almost makes it seem like you don’t need monitoring on the server at all. After all, you’ll get a call from the business when they realize the system is offline, and that can be your alert.
But there’s a lot more you can do with alerts than simply respond to system outages-you can get proactive.
Monitoring to report the failure of a piece of hardware, software, or a process, is an important part of your monitoring solution, but it’s only a part. Another important aspect is to monitor for events and statuses that don’t represent an emergency, but instead represent an impending emergency. Which is better, to get an alert that your log drive is running out of space, or to get an alert that your log drive has already run out of space? Despite my daredevil enjoyment of the chaos generated from an outage, I know that my job is to prevent problems, so I’d rather get the warning before the catastrophe has occurred.
The good news is, it’s not that hard to set up monitoring to see when your systems are running out of space, or any number of other alerts for that matter. You can do this through the alerts offered in the SQL Agent, or you can set up alerts through Policy Based Management, or you can use a third-party tool. All these mechanisms will enable you to find out if there are long running queries, or excessive blocking, or if a drive is running out of space. The trick is making sure that you’re responding to the alerts. Sounds easy, but it frequently isn’t… because of extraneous ‘noise’.
For the sake of our discussion, assume for a moment that you have an alert that will let you know when any hard drive you have is over 80% full. With that in hand, you no longer have to worry about the drive running out of space because you’re going to be alerted long before that occurs and you’ll be able to do something about it, right? Maybe.
Enabling the alert, you see that multiple servers have drives with less than 20% free space, so you get started resolving these issues, chortling to yourself that you won’t be responding to outages at 3 am.
But there’s a snag… one of the servers has less than 20% free space, but it’s on a drive that is 1 TB in size, meaning it actually has nearly 200 GB of free space. Looking at a history of the data on that drive you note that there has never been an increase of more than 20 GB over a six month period. The chance of running through 200 GB in less than a day is vanishingly remote. Now what? Each time your alert polls, you’re going to see this server that you don’t care about. What’s more, you find several other servers that are the same. Most people’s initial response to this very common situation is to ignore the one, two, or three drives that they know are not an issue.
There is a concept that started out in electrical engineering called the signal-to-noise ratio. Simply put, a pure transmission of electricity represents the signal. Any degradation of the pure transmission, from resistance, outside interference, impurity of the transmission medium etc, is referred to as noise. You determine the quality of your transmission by dividing your signal by the amount of noise you have. This concept can be applied to general communication and can be absolutely applied to monitoring and alerting: Think about the actionable alerts that you want to do something about as signal and the alerts that you don’t care about for one reason or another as noise.
You want to achieve a very high signal-to-noise ratio with your alerts. It’s very important that you get the right information at the right time, so that you can take appropriate action; as soon as you begin to introduce alerts that you don’t care about, such as the large drive that won’t run out of space any time soon, you’re introducing noise into your system. The more drives that don’t match your criteria, the more noise you’re introducing.
And so far I’ve only been talking about a disk usage alert. You have to multiply this by the hundreds of possible alerts that you could set up on a system. Not all alerts would be applicable to that system. Not all alert thresholds would be the same for that system. As you start to see these non-applicable alerts and alerts that are set to the wrong threshold, you’re dealing with noise in the system.
If you’re building your own monitoring system, one alert at a time, you’ll be able to make adjustments to each alert as you go. This will keep your signal-to-noise ratio nice and high. If, however, you’ve purchased a third party monitoring system, it will usually have a large number of alerts built into the system. When you turn your alerting system on for the first time, frequently your screen fills up with all sorts of dire warnings about the state of your servers. Usually, many, or even most, of these alerts are accurate, actionable information-good signal. You respond to the alerts and fix or avert the problems, making your systems more stable and avoiding disaster. But there’s also usually a lot of noise. Alerts are firing because they are inappropriate to your systems, or because the thresholds are too low or too high. While these tools are supposed to make your life easier, no one said that ease would be free (not counting the cost of the software of course). The mistake made by most people is to simply clear the offending alert (or to turn the alerting system off altogether because it’s “broken”!).
Making it go away is not fixing it. A good alerting system will make the cleared alert come back up. After all, the idea of alerts is to poke you in the eye, to alert you. Instead you need to take the time and trouble to make the adjustments to the system so that it’s useful for you. The developers who built the monitoring system just didn’t know how your system was configured. So if you operate regularly with a large number of disks at greater than 80%, you need to adjust the threshold on that free space alert so that you reduce the noise of false or unnecessary alerts, and increase the signal, the accurate, timely alerts.
This process can take a little time. I would recommend setting aside a few hours to adjust the alerts immediately after you bring a monitoring system online, and then again every couple of weeks. Within a few of these sessions, you’ll probably have everything well in hand so you can start the next process-turning alerts on and off.
As I said before, most monitoring systems only enable a subset of the provided alerts. You need to determine if the ones they enabled are appropriate to your system. Further, you need to decide if the alerts that are disabled are needed on your system and what the appropriate thresholds are. But remember, the goal is to get your system to provide as much good information, pure signal, as possible, so only turn on the alerts you really need.
Finally, you’re going to find certain systems, drives, databases, whatever, just don’t match the criteria for alert thresholds you need for the rest of your system. You will need to isolate these exceptions so that they are not generating noise. Most third-party tools provide a means to do this, either by disabling the alert for certain systems or by adjusting the threshold for certain systems. The key here is that exceptions should be exceptional. If you find that you’re creating an exception for every database on every system, then maybe you need to go back and adjust the alert threshold itself. You don’t want to try to maintain and document too many exceptions. Oh, didn’t I mention that? Yeah, if you create exceptions, you should document them so when an outage occurs you’re not scratching your head wondering why you didn’t get an alert, because you’ll have a record that you disabled the alert. This documentation process is another reason why your exceptions should be exceptional.
The goal when you set up your monitoring system alerts is to generate a pure signal. Alerts let you know when a system is offline, but more importantly, when a system is about to go offline. You need to take on the responsibility and the labor to adjust your alerts in order to increase the purity of that signal, and you will need to create some exceptions to your alerting.
Going through these tasks should provide you with a far more useful alert system and avoid the situation where you ignore everything or even turn it all off to avoid the noise. Your reward will be a well adjusted, proactive, alerting system which will result in an improvement in system up-time.