Your SQL Server monitoring system should be an ally, another member of your team. It should work away in the background, offering a set of accurate, reliable alerts that inform the rest of the team when an important job fails, or fails to run, when resource usage shows sustained spikes or abnormal trends, when database or server properties deviate from the established “standard”. For each alert, you should be able to take meaningful, corrective action, meaning that you can respond quickly to an emergency, before too many people are affected, and that the overall quality of database systems demonstrably improves over time.
Having painted the idyllic picture of our monitoring systems, let’s now admit the harsh reality that, for most of us, monitoring sucks. The problem is that most monitoring systems, custom-built or third party, are often:
- Over-configured – too many alerts, overwhelming the users with information
- Under-implemented – too few alerts provide enough useful information to pinpoint the problem quickly and take action
It doesn’t have to be like this. What follows are Ten Commandments that will make your monitoring solution work better, and allow you to get more sleep and have an easier time at work.
#1: Alerts must be actionable
How often have you received an alert on your mobile device, thought, “Yep, that’s the alert that comes in every night at 2 AM“, and cleared it (until the next night)? Persistent, non-actionable alerts are the single, biggest problem with most monitoring/alerting systems; every alert that comes through for which we cannot take any meaningful action is a waste of time and erodes our confidence in the monitoring system as a whole. Eventually, we just start ignoring the alerts, and then risk missing the few ‘good’ alerts that we really ought to have noticed.
For every alert that comes in you should be able to do something to fix it. If you can’t, then the alert is meaningless. Common alerts that are often meaningless include:
- CPU percent spike – if CPU spikes to 100% for a 30 seconds at 2 AM, absolutely no one cares…and yet in many systems this will trigger an alert.
- High connection count – for a brief period, the connection count exceeds some threshold level. What does it mean? Were we doing more business? Did something go wrong?
- Long running query – generic alerts for “any query that takes over 3 s to complete”. As a one-time event, a query runs slow (perhaps because auto-stats kicked in) and we get an alert.
- High memory utilization – SQL Server is supposed to use more of our installed memory, as required, and yet each time it does so, we get an alert.
All of these are examples of non-actionable alerts. If CPU spiked at 100% for 30 minutes and correlated with a slowdown in overall system performance, then yes, we want to take action. If a query repeatedly exceeds its normal execution time, significantly, we want to investigate it. In short, we’re looking for alerts based on sustained, high resource usage, and repeated patterns of abnormal behavior, not one off events.
Seeking Out More Meaningful Alerts
In addition to finding durations and thresholds on which to alert, which are meaningful for your system, analyze your system for “missing” alerts that are real indicators of urgent problems. For example, are you monitoring connections timeouts? How about hard page faults, where a request references a page that is not in SQL Server’s working memory set?
If your current monitoring system plagues you with non-actionable alerts, then the cure is simple: turn them off! Once a day for a week or two, examine every alert, mark the ones that were non-actionable, and then tune the monitoring, turning off those alerts that are truly meaningless, adjusting the threshold on others, and so on.
#2: Alerts must contain sufficient information
Alert! SPID 123 has been blocking SPID 456 for x minutes! We have to investigate that, right? What was each session running? Which applications originated each of these sessions? Unfortunately, the alert does not include this information and by the time we’ve logged in to investigate, both SPIDs have completed and so SQL Server retains no history on what either was running. What can we do? The answer is nothing. The alert contained insufficient information to make it actionable.
For every alert of this type, we need:
- Host names – what host (e.g. application servers) originated the SPIDs
- Application (program) names – most databases are shared; is a particular app running “bad” queries
- Query details – what queries was each session running (very easy to get from the DMVs)
- Login details – who ran these queries
- Competing processes – what else was running at the same time? Typically, was a backup program running?
Ultimately, our alerts need to provide enough contextual information that we can understand exactly why they fired, and take action.
#3: Avoid magic thresholds
A “magic” threshold is any fixed metric value that when exceeded is taken, universally, as a “bad sign” that requires immediate action. It also applies to any fixed metric you don’t fully understand but for which an alert exists.
A good and universal threshold value for the power setting of your servers is “on”. Beyond that, I can think of very few other threshold values that aren’t fully dependent on the environment, the applications and the patterns of activity on your servers.
Let’s say you discover an online article proposing that a certain metric should never exceed a certain threshold value. You update your monitoring system to watch that metric, and send an alert every time it exceeds the magic threshold. Two things can go wrong. Firstly, the value may simply be inappropriate for your environment. If you don’t fully understand the metric, or the reasoning behind the established threshold, then when the alert fires, it is not actionable.
The classic example of a ‘magic’ SQL Server metric is Page Life Expectancy, the average time (in seconds) that a page is expected to remain in buffer cache if it hasn’t been referenced. A Microsoft paper, written more than ten years ago, recommended that PLE should always stay above 300 s. In other words, unreferenced pages should remain in the cache for at least 5 minutes. Many DBAs came to treat PLE as a “magic bullet”, the first place to look if they had a performance problem. It began to crop up frequently in interview questions, and many DBAs to this day start to worry, and set alerts for when PLE dips below 300 s.
This threshold value may have relevance for systems built on 10+ years old hardware, where installed memory rarely exceed 4 GB, and so the buffer cache was small, but it is certainly relevant no longer (see the Further Reading section for more details). Secondly, regardless of the exact threshold level, this metric means little to end users, if the disk system is ‘keeping up’ and performing to expectations. It could mean that the system has a longer-term problem in that it is doing more disk reads than necessary, but it’s not something you need an alert for at 2 AM.
Aside from PLE, several other magic metrics give rise to countless non-actionable alerts, but two I see often are:
- Disk Queue Length – on modern disk systems, it is meaningless to set an alert every time the disk queue exceeds a certain length. Disk queues will build up occasionally but ideally will process very quickly, and that is what matters. Stop monitoring disk queue length, and instead monitor disk latency (seconds/read and seconds/write).
- Disk Percent Used – I’ve seen monitoring systems that use a single magic threshold (say, 80%) for the entire environment. This means that a DBA is woken at 2 AM to find out that a 2 TB volume has a paltry 400 GB of free space. A much better option is to set specific alerts for specific drives, with the threshold set to specific amount of remaining free space, rather than a percentage.
If you don’t fully understand a metric, turn it off, at least for the time being! It doesn’t mean it’s not an important alert, but until it is actionable, it is a meaningless alert in your environment. Turn in on at a later point, when you do understand it, you’ve established the right threshold for your environment and you will know what to do when the alert fires. It is always better to start small and build gradually.
Also, get rid of all “magic” and “uniform” thresholds. Instead, take a baseline, establish what is normal for your environment, and alert only when the value deviates significantly from the normal range established by your baseline, and stays there.
If you don’t capture baselines for all of your important metrics, and track their values over time, then you need to start; see the Further Reading section, at the end, for a few pointers. Unfortunately, few third party monitoring solutions do a great job of baselining, although the better ones do at least make it easy to do comparative graphing (comparing values for today vs. yesterday, this week vs. last week, and so on). This of course, is a manual, visual process. Ideally, this would be automated, with statistics algorithms for analyzing time-series measurements, and subsequent alerting.
#4: Alerts should not make excessive noise
As discussed in the first “commandment”, if your monitoring system develops a reputation for non-actionable alerts, it undermines trust. If, in addition, the system sends alerts with alarming frequency, flooding in-boxes, people simply stop paying attention. This is the “boy who cried wolf” syndrome and is a significant problem with most alerting systems.
Some time ago, I was doing some development work for a startup and, during a catch-up with the development manager, the QA person burst into the cube in a panic, screaming that the server was down and no one knew why. The first response of the manager, who remained commendably calm, was “weird…no-one has received an alert“.
As it turned out, they did have monitoring system, and it did send an alert to warn them of this situation. However, there had previously been so many complaints about alert overload that the previous week a sys admin had centrally configured Outlook to divert all alerts to a separate folder…that everyone then completely ignored.
Don’t let your monitoring system reach this sticky end!
#5: Alerts must make enough noise
On the flipside of the fourth commandment, it’s important that alerts make sufficient noise that the DBAs are unlikely to miss them. If your system “fires once and forgets”, and an email glitch prevents the alert reaching the recipient then it could cause obvious problems.
As with all things monitoring, the trick is establishing the right balance. Perhaps you can resend alerts once every 30 minutes or hour for low severity alerts, and more frequently for higher severity alerts.
Critically, though, make sure someone has to acknowledge receipt of alert, and clear it if the problem is fixed, or at least under control.
#6: Set multiple alert thresholds
Let’s walk through the timeline of a real-world disaster, which afflicted one of my clients. At 4.02 PM one Friday (the sort of time all nasty alerts prefer to strike, for some reason), an alert came in warning that a certain disk was 85% full. A DBA acknowledged the alert, and said he would “keep an eye on it”. At 6 PM, his shift ended and the weekend staff (offshore) took over. On Monday morning at 8.35 AM, the server crashed due to a full disk.
What happened was that disk usage crept up very gradually over course of the weekend until the disk was full. On Friday afternoon, the rise was imperceptible to the DBA and it didn’t seem like a pressing issue, so he cleared the alert. Unfortunately, the monitoring system was configured such that once it had sent an alert for an exceeded threshold, and someone had acknowledged and cleared it, it sent no further alerts, even though the value continued to exceed the threshold (unless the value first dipped back down below the threshold and then exceeded it again).
Ultimately, the cause was an anti-virus program, slowly creating file after file on the server. However, the important point from a monitoring perspective is that by setting multiple alert thresholds the DBAs would have spotted the problem before the disk filled up. For example, in addition to the 85% Full alert (“this could become a problem“), there should have been a 90% alert (“look at this very soon“) and a 95% full alert (“the system will go down in an hour if you don’t get out of bed and do something“).
There are many other good candidates for multiple alert thresholds. In the First Commandment, we discussed how “CPU spike” could be a non-actionable alert. However, we do care about sustained periods of 100% CPU, and with multiple alert thresholds, we can for example send a low severity alert, via email, after a few minutes of sustained high CPU usage, and then a higher severity alert to pager for longer periods of high usage.
The “percentage of available SQL Server worker threads” is another good metric to track. If SQL Server runs out of free threads, no one else will be able to connect to it and your phone will start ringing very soon after, so be sure set multiple alert thresholds, as the number rises.
Ultimately, multiple thresholds are potentially relevant for any baseline that is significantly outside normal range, for example setting low severity alert for “2x normal usage, a higher severity alert for “4x”, and so on.
#7: Always test your alerts
How many people would put code into production without testing it (put your hands down at the back)? Your monitoring system is a production system. In fact, it’s one of your most important production systems, since its job is to stop the others from failing your business.
Set your alerts for disk, CPU, network, memory, and so on, and then exercise them. Fill up your disks and make sure the alerts fire at the various thresholds and someone receives them. Exercise the CPU (see Further Reading for a script you can use to exercise CPU within SQL Server). Exercise your network, for example using IOMeter to access a remote disk. Unplug the server! I’ve seen monitoring software stop monitoring when it can’t talk to the server meaning that you won’t even know that it’s down.
If you don’t test your monitoring system, and its alerts, then you are, in effect, putting code into production without testing it.
#8: Avoid the observer effect
Generally, the “observer effect” refers to any means of collecting data for analysis of performance issues, which itself causes potential performance issues. Applied to monitoring systems, the observer effect refers to any performance problem, or any other issue, such as instability or insecurity, which the monitoring system itself introduces.
Most often, “over-monitoring” and putting too much load on server, is the cause. Often, the cause is such practices as capturing every PerfMon counter value every second, or running poorly filtered SQL Server Profiler traces. As we’ve discussed several times by now, don’t monitor anything if you’ve not sure what action you’d take if you got an alert on it. Likewise, set intelligent alert thresholds, designed specifically for your environment, based on careful baseline measurements.
However, I’ve also experienced the observer effect in other forms, caused by the monitoring product doing “unexpected” things. Here are a few red flags:
- Use of
xp_cmdshell– deprecated and locked down since SQL Server 2005, but some monitoring products still require it to be activated. This puts us in the ridiculous situation of having to reduce security in order to monitor! There is nothing the tool should need to do with
xp_cmdshellthat it couldn’t do more securely with WMI
- Use of
*– these object automation stored procedures have been known since SQL Server 2000 to cause memory leaks and other issues, and yet at least one major monitoring product still requires it to be enabled. Again, all the information the product could get this way is available through the public WMI APIs. In a double whammy,
- Badly filtered traces – often a monitoring tool will spin up a trace, in order to collect performance data. As discussed earlier, such traces, if poorly filtered, can cause performance problems. For example, often these traces will capture the
:stmtcompletedevent. Let’s say we have a query that in the
SELECTlist calls a scalar function that contains three statements, to perform a calculation, and the query returns a million rows. This will give rise to three million events in the trace, and if the trace can’t write out fast enough, the whole server will slow to a crawl.
- Any product that requires
sa– an obvious security concern. No product should require
saaccess to SQL Server; SQL Server 2005 introduced a permission called
VIEW SERVER STATEfor this exact purpose.
If you spot any of these red flags, work with your vendor or whoever built the monitoring system to get it changed!
#9: Monitor test and development servers
As a developer, this commandant is near and dear to my heart. Of course, in terms of monitoring priority, the production systems are the most important to any business, but many companies treat test and development servers as if they don’t really matter at all.
I remember an occasion a few years ago, working with a group of other consultants on a data warehouse project. The production server was admirably well-spec’d (32 cores, dedicated SAN and so on) and closely monitored for problems. By contrast, we developed against a single shared copy of the Data Warehouse on a 4-core virtual machine with 32 GB RAM, and no monitoring. I estimate that the total “billable” in the room was about $3000/hour, and yet each time one of us ran a query, we’d leave en-masse for a coffee break since there was nothing anybody else could do until it completed. Despite this, the management seemed uninterested in giving us the hardware upgrade for which we kept begging; it was not “in the budget”.
It’s very important to monitor each development and test server, as well as the production servers, in order to spot a looming problem before it takes down the server and stops your developers and testers dead in their tracks. If you don’t, in effect you are saying that you don’t care if your developers have a problem and don’t care that your company wastes money because the developers can’t do their jobs.
#10: Monitor the monitoring server
What do you do if your monitoring server goes down? Do you need a monitoring server to monitor the monitoring server? And so on?
No, but you definitely do need to know if your production server is no longer being monitored. A possible “lightweight” solution, which I’ve used on a few occasions, and which seems to work quite well, is to monitor the monitoring server with the production database server. This is just a simple ‘ping’ that runs once every 10 minutes or so to make sure the monitoring server is up, and sends a notification via database mail if the monitoring server goes down. While it may seem odd to run monitoring from your production server, this ensures that both of your most important systems are working together to keep each other up and running
Ultimately, the goal of every SQL Server monitoring system is to detect and alert on looming problems before they become serious problems, and to enable us to troubleshoot and tune SQL Server systems quickly and efficiently.
Issues arise when people over-configure the monitoring system, so that alerts fire too frequently–and apparently indiscriminately–based on short-lived transgressions of fixed “magic” thresholds. This can produce a scattershot of non-actionable messages, which waste both time and energy. In such cases, there is a tendency initially to see potential issues everywhere, followed shortly by exhaustion and eventually ignoring the alerts altogether.
At the other end of the spectrum, are under-configured alerting systems, featuring alerts that contain too little contextual detail, or alerts that fail to fire on repeatable thresholds. These can cause teams to miss imminent problems.
The keys to winning this battle, in brief:
- Remove meaningless alerts
- Establish relevant baselines for your system and alert on consistent deviation for the established “norm,” rather than on poorly understood magic values
- Ensure that every alert contains enough information to be fully actionable
- Monitor all relevant systems at the appropriate level; balance is key!
In so doing, you can tame previously unruly monitoring systems, and re-establish the team’s trust. When an alert fires, you’ll know that there is a real issue, requiring real action. Ultimately, the goal of every SQL Server monitoring system is to detect and alert on looming problems before they become serious problems, and to enable us to troubleshoot and tune SQL Server systems quickly and efficiently.
Page Life Expectancy: