As a support DBA, I get to troubleshoot a variety of setups, from a single SQL Server Express instance all the way to multi-instance database clusters. While it’s nice to have third-party tools to help diagnose issues, it’s a luxury I don’t always have. So when I get a call that the application is responding slowly, I have to start from scratch to track down the problem, knowing very little about the environment I’m troubleshooting. Though it can sometimes be frustrating, I’ve found I really enjoy the challenge of figuring out the cause. In my overly active imagination, I’m like a doctor, but instead of human illnesses, I diagnose diseases of a database nature.
We DBAs may not be dealing with life or death situations, but when the business is suffering, it can get pretty intense. There’s a saying in the medical community, “When you hear hoofbeats, think horses, not zebras,” meaning that although you may want to jump to a rare, unheard of problem, it’s probably the more common alternative. While I do find occasional off-the-wall problems, most of the issues boil down to common root causes, many of which are completely preventable.
One of the biggest obstacles doctors have to face is bad information from patients. Likewise, I often hear the phrase, “Nothing’s changed.” A major part of the whole troubleshooting process is retracing your steps; if the database started to slow down a week ago but was fine before that, it’s highly likely something did change. Was there an unusually large data load or higher number of sessions? A stored procedure or application code change?
One of the first things I do is look through the error logs. In addition to errors, some configuration changes are listed there, which may provide more insight. If I don’t find anything interesting, I’ll move on to the Windows logs. If the problem happened recently, I’ll check out the system_health session (on 2008 or higher) by querying the Extended Events DMVs, sys.dm_xe_sessions and sys.dm_xe_session_targets.
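As a sketch of what I mean, here’s roughly how I search the error log and pull the system_health data in T-SQL; the second query assumes SQL Server 2008 or later, where the system_health Extended Events session exists by default:

```sql
-- Read the current error log (first parameter: 0 = current log;
-- second parameter: 1 = SQL Server error log, 2 = SQL Agent log)
EXEC sys.sp_readerrorlog 0, 1;

-- Pull the system_health session's ring buffer target as XML (2008+)
SELECT CAST(t.target_data AS xml) AS system_health_data
FROM sys.dm_xe_sessions AS s
JOIN sys.dm_xe_session_targets AS t
    ON s.address = t.event_session_address
WHERE s.name = 'system_health'
  AND t.target_name = 'ring_buffer';
```

The XML takes some shredding to be readable, but deadlocks, severe errors, and long waits are all in there.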
On top of all that, I also have some code that I use to look at expensive queries; I’ll even use the GUI Performance reports for Average I/O and Average CPU to see if anything stands out. Of course, if it’s happening at that moment, I can find the culprit by querying the DMOs (Dynamic Management Objects), or even easier, just use sp_whoisactive, Adam Machanic’s lovely enhancement to sp_who. If I see a query taking up a lot of resources, I’ll look at its execution plan. I know these can be confusing, but there are a slew of books and articles that explain execution plans and offer tips for reducing execution time.
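For the expensive-query pass, a minimal version of the kind of DMO query I mean looks like this (numbers are per cached plan, accumulated since the last restart or cache flush):

```sql
-- Top 10 cached statements by average CPU; swap the ORDER BY to
-- total_logical_reads for an I/O-focused view
SELECT TOP (10)
    qs.execution_count,
    qs.total_worker_time / qs.execution_count AS avg_cpu_microseconds,
    qs.total_logical_reads / qs.execution_count AS avg_logical_reads,
    SUBSTRING(st.text, (qs.statement_start_offset / 2) + 1,
        ((CASE qs.statement_end_offset
              WHEN -1 THEN DATALENGTH(st.text)
              ELSE qs.statement_end_offset
          END - qs.statement_start_offset) / 2) + 1) AS statement_text
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
ORDER BY qs.total_worker_time / qs.execution_count DESC;
```

Add a CROSS APPLY to sys.dm_exec_query_plan(qs.plan_handle) if you want the execution plan in the same result set.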
The ankle bone’s connected to the foot bone
The human body, like a database, has so many different parts interacting with one another that it’s easy to go down the wrong path and misdiagnose the issue. High CPU is the biggest red herring of the database world; in my experience, it’s rarely an actual problem with the CPU itself. The trick is that a lot of different things can affect CPU utilization – from indexes and statistics to configuration settings and, my personal favorite, less-than-optimal code. Let’s look at some of these culprits in more detail.
Indexes and Statistics
On larger tables with high-cardinality data, indexes are a must. I would go as far as to say that not having a clustered index on these kinds of tables is sheer madness, because SQL Server is so much more efficient at retrieving data via a clustered index than from a heap. And because updating and deleting data also require reads, indexing is critical even in write-heavy environments.
To facilitate proper index creation, Microsoft provides a set of DMOs to pinpoint missing indexes. When joined properly, they provide information on possible index candidates for queries run since the instance was last started. Consider these as recommendations; I always advise careful consideration before creating any new indexes.
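“Joined properly” means something along these lines; treat the impact figures as rough estimates, not guarantees:

```sql
-- Missing-index suggestions accumulated since the last restart;
-- candidates only, each one deserves careful review before creation
SELECT d.statement AS table_name,
       d.equality_columns,
       d.inequality_columns,
       d.included_columns,
       s.user_seeks,
       s.avg_user_impact            -- estimated % improvement
FROM sys.dm_db_missing_index_details AS d
JOIN sys.dm_db_missing_index_groups AS g
    ON d.index_handle = g.index_handle
JOIN sys.dm_db_missing_index_group_stats AS s
    ON g.index_group_handle = s.group_handle
ORDER BY s.user_seeks * s.avg_user_impact DESC;
```

Watch for near-duplicate suggestions on the same table – they can often be consolidated into one index with a couple of included columns.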
Index Disclaimer: There can be too much of a good thing – the more indexes you have, the longer inserts and updates will take. The other thing to keep in mind is that indexes are not a one-time add-it-and-forget-it operation. They must be maintained to be useful. I advise a smart-indexing strategy, which uses DMOs to analyze the fragmentation first, then act accordingly. Ola Hallengren developed a very nice solution, which also performs other maintenance tasks as well, and can be found free of charge on his website: http://ola.hallengren.com/.
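The analyze-first step of a smart-indexing strategy boils down to a query like this; the 5% and 1,000-page cutoffs below are common rules of thumb, not hard limits:

```sql
-- Fragmentation for indexes in the current database; a common rule of
-- thumb is REORGANIZE between roughly 5-30% and REBUILD above ~30%,
-- and tiny indexes aren't worth touching at all
SELECT OBJECT_NAME(ips.object_id) AS table_name,
       i.name AS index_name,
       ips.avg_fragmentation_in_percent,
       ips.page_count
FROM sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, 'LIMITED') AS ips
JOIN sys.indexes AS i
    ON ips.object_id = i.object_id
   AND ips.index_id = i.index_id
WHERE ips.avg_fragmentation_in_percent > 5
  AND ips.page_count > 1000
ORDER BY ips.avg_fragmentation_in_percent DESC;
```

Ola’s solution wraps exactly this kind of logic, with thresholds you can configure.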
Statistics in a database are a sampling of the data; they’re used by the query optimizer to choose the most efficient execution plan. When data changes and statistics aren’t updated, you run the risk of the optimizer choosing a bad query plan. If a bad query plan is chosen, SQL Server will work harder than it needs to, which in turn artificially inflates the CPU usage. SQL Server typically creates statistics on its own just fine, and one of the settings you can adjust is Auto Update Statistics (which should be turned on in most cases). However, it’s a good idea to also schedule regular updates of statistics in a job, at least nightly for your busiest databases. When indexes are rebuilt, statistics are automatically updated, but that’s not the case with reorganization, so having it run separately will ensure they are kept up-to-date.
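A nightly statistics job can be as simple as the sketch below; dbo.Orders is a made-up table name standing in for your busiest table:

```sql
-- Refresh statistics database-wide; only touches stats whose
-- underlying data has changed since the last update
EXEC sys.sp_updatestats;

-- Or target a hot table with a full scan for maximum accuracy
-- (dbo.Orders is a hypothetical name)
UPDATE STATISTICS dbo.Orders WITH FULLSCAN;
```

On very large tables, FULLSCAN can take a while, so many shops reserve it for a weekend window and rely on sampled updates during the week.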
There are a couple of configuration settings that are worth mentioning, depending on the actual problem. Is the largest wait type CPU-related? I regularly encounter a problem with CXPACKET waits, which points to an issue with parallelism.
Nowadays, it’s not uncommon to see dual, quad or hex core CPUs in servers, and if hyper-threading is turned on, that can mean upwards of 20 logical processors! On higher-end servers, that number triples. So with this hardware, SQL Server may attempt to run queries in parallel, splitting the work across multiple worker threads to speed up execution. This incurs some overhead, as the threads have to be coordinated, but it often works well, especially with more elaborate queries that return a lot of data.
Where you run into trouble is with smaller queries – they actually run faster on a single thread, without the overhead and management that comes with parallelism. If you’re seeing a high percentage of CXPACKET waits on your system, you may be encountering this problem. There are a couple of different approaches to solve it. The first is to adjust the queries you think may be causing the waits by specifying a MAXDOP query hint. The second is to adjust the max degree of parallelism and cost threshold for parallelism settings at the server level. If your system is mostly OLTP, it makes sense to set the max degree of parallelism to a low number greater than zero (the default is zero, meaning SQL Server chooses dynamically). If your environment is a mixed OLTP/OLAP workload, be careful with this setting, as you may improve execution for some queries while making others worse. Here is where the cost threshold for parallelism may help – you can raise this number, so that a query’s estimated cost (a unitless figure, loosely based on seconds on a reference machine) must meet or exceed it before SQL Server will consider a parallel plan. These settings aren’t one size fits all; it’s more of a discovery process, testing in your own environment to see what works.
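Here’s how I check the waits and, when CXPACKET dominates, what the server-level changes look like. The values below are examples only – test in your own environment before committing to them:

```sql
-- Top waits since the last restart, skipping a few benign system waits
SELECT TOP (10) wait_type, wait_time_ms, waiting_tasks_count
FROM sys.dm_os_wait_stats
WHERE wait_type NOT IN ('SLEEP_TASK', 'LAZYWRITER_SLEEP', 'BROKER_TO_FLUSH',
                        'SQLTRACE_BUFFER_FLUSH', 'CLR_AUTO_EVENT')
ORDER BY wait_time_ms DESC;

-- If CXPACKET dominates, adjust the parallelism settings
-- (example values; both are advanced options)
EXEC sys.sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sys.sp_configure 'cost threshold for parallelism', 25;
EXEC sys.sp_configure 'max degree of parallelism', 4;
RECONFIGURE;

-- Per-query alternative: append OPTION (MAXDOP 1) to the offending query
```

I usually try raising the cost threshold first, since it leaves big queries free to go parallel while keeping the small ones on a single thread.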
Take two of these, and call me in the morning
Bad code is easily the most common root cause of performance problems – the cancer of the database world. And I’m not just saying that because developers and DBAs aren’t known for playing nice together. I have a list of syntax I look for when I come across a problem query: OR, NOT, LIKE, and the presence of seventeen joins (I’m only exaggerating a little). It’s very tempting to code stored procedures a certain way because that’s the way it’s always been done, regardless of how inefficient the method is. I have to confess, often the first thing that pops into my head when coding is a cursor, only because I’ve written them so many times in the past. Luckily, this is easily fixable, in most cases. The prescription is to simply stick to set-based results when possible. Whenever your code has to process one record at a time, it’s going to take longer than evaluating a set of data all at once.
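To make the prescription concrete, here’s a hypothetical row-by-row cursor next to its set-based equivalent; the table and column names are made up for illustration:

```sql
-- Row-by-row: the cursor touches one row at a time (hypothetical table)
DECLARE @ProductID int;
DECLARE price_cur CURSOR FOR
    SELECT ProductID FROM dbo.Products WHERE CategoryID = 3;
OPEN price_cur;
FETCH NEXT FROM price_cur INTO @ProductID;
WHILE @@FETCH_STATUS = 0
BEGIN
    UPDATE dbo.Products SET Price = Price * 1.10
    WHERE ProductID = @ProductID;
    FETCH NEXT FROM price_cur INTO @ProductID;
END;
CLOSE price_cur;
DEALLOCATE price_cur;

-- Set-based: one statement does the same work in a single pass
UPDATE dbo.Products
SET Price = Price * 1.10
WHERE CategoryID = 3;
```

Same result, but the set-based version lets the optimizer process the whole set at once instead of paying per-row overhead.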
This is also the case sometimes with subqueries, particularly correlated subqueries, in which the inner query executes for every record in the outer query. Often these can be rewritten as joins for faster performance.
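A typical rewrite looks like this, again with hypothetical table names; the first form runs the inner query once per customer, the second evaluates everything in one pass:

```sql
-- Correlated subquery: inner query executes for every outer row
SELECT c.CustomerID,
       (SELECT MAX(o.OrderDate)
        FROM dbo.Orders AS o
        WHERE o.CustomerID = c.CustomerID) AS LastOrderDate
FROM dbo.Customers AS c;

-- Set-based rewrite: one join plus a GROUP BY
SELECT c.CustomerID, MAX(o.OrderDate) AS LastOrderDate
FROM dbo.Customers AS c
LEFT JOIN dbo.Orders AS o
    ON o.CustomerID = c.CustomerID
GROUP BY c.CustomerID;
```

The optimizer can sometimes transform the correlated form itself, but I’d rather not leave it to chance.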
And then there are temp tables. While they’re not the root of all evil, they do have to be created, populated, and eventually dropped, plus they exist in a totally separate database. I’m not saying don’t ever use them; they definitely have their place in good code, and it is cool that you can index them when working with larger datasets. I just get concerned about temp tables when I see a lot of contention for the tempdb database. There are other ways to work with temporary data if you’re already overloaded with temp tables – for example, CTEs or table variables might prove a faster alternative, particularly if the data set returned is smaller and doesn’t need an index.
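For a small intermediate result, a CTE keeps everything out of tempdb entirely; this sketch uses made-up table names:

```sql
-- CTE alternative to a temp table for a small, short-lived result set
-- (dbo.Orders is a hypothetical table)
WITH RecentOrders AS
(
    SELECT CustomerID, OrderDate
    FROM dbo.Orders
    WHERE OrderDate >= DATEADD(DAY, -7, GETDATE())
)
SELECT CustomerID, COUNT(*) AS OrdersThisWeek
FROM RecentOrders
GROUP BY CustomerID;
```

A CTE is just an inline definition, not a materialized table, so it shines when the set is referenced once or twice; if you need to reuse a big intermediate set repeatedly, the temp table may still win.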
Occam’s Razor – Sometimes the simplest explanation works
There’s an old joke: a man walks into a doctor’s office complaining that it hurts to move his arm a certain way, to which the doctor responds, “Don’t move your arm that way.” While that advice may not be overly useful for DBAs, the truth of the statement remains: Every once in a while, the simplest solution is the right one.
For high I/O issues, I always look at file placement first. Are the data, log, tempdb and backup files on separate arrays? If not, how are they combined and what are the symptoms? I’ve come across situations where the highest wait types were log writes, while at the same time, the stored procedures used temp tables heavily. Looking at the file placement, I discovered the tempdb files and log files were stored on the same array. An analysis of the most heavily written log and data files should confirm the contention. Where do I see I/O contention? I’ll look at the largest wait types, run a couple of DMO scripts, and look in the error log. There are several I/O-related wait types, cleverly named with the letters “IO” somewhere in the title. Be careful with these though, it’s easy to jump to the conclusion that it’s a disk problem, when it usually isn’t. More often, the cause of high I/O goes back to inefficient code and index issues, as described earlier.
Querying the dm_io_virtual_file_stats function will help you see I/O contention at the file level, illustrating which databases are good candidates to have their own separate path. You may also see I/O error messages in the error log: SQL Server has encountered x occurrence(s) of I/O requests taking longer than 15 seconds to complete on file [filename] in database [xxx] (5). Pay close attention to the database listed, and also look at what jobs were running at the time. For example, it’s typical to see a message like this if you’re running a rebuild index command and some other resource-intensive command simultaneously. If the database listed is tempdb, you might check to see if it’s running with the optimal number of files. The general guideline is anywhere from ¼ – 1 identically sized files per CPU, but as with everything else, it’s best to test in your own environment and find your happy medium.
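The file-level view comes from a query like this; stall times are cumulative since the last restart, so what matters is which files dominate, not the absolute numbers:

```sql
-- Per-file I/O and stall times; files with disproportionate stalls
-- are the candidates for a separate array or path
SELECT DB_NAME(vfs.database_id) AS database_name,
       mf.physical_name,
       vfs.num_of_reads,
       vfs.num_of_writes,
       vfs.io_stall_read_ms,
       vfs.io_stall_write_ms
FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
JOIN sys.master_files AS mf
    ON vfs.database_id = mf.database_id
   AND vfs.file_id = mf.file_id
ORDER BY vfs.io_stall_read_ms + vfs.io_stall_write_ms DESC;
```

If tempdb’s files float to the top alongside a busy log file on the same array, you’ve likely found your contention.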
Bring on the Morphine
Back in the day when I took the Windows certification exams, I knew if I ever saw “Add more RAM” as one of the answers, it was the right one. RAM is like the morphine of the SQL Server world, and has been treated almost as if it’s a panacea for all kinds of performance problems. I can’t count the number of times I’ve been told a server needs more memory; and why not? It’s cheap, integral to database performance, and the general consensus is the more, the merrier. However, a careful analysis should be done before the decision to buy more RAM is made or it can end up like an unnecessary medical procedure – a lot of money is spent and nothing is fixed.
A lot of the underlying issues of high CPU and high I/O can manifest themselves as memory issues… did I mention bad code? With memory pressure, it’s a different kind of bad. Poorly written code that returns more rows than needed can consume large amounts of memory, and for no good reason.
There are settings you can tweak in SQL Server to optimize memory. An important one that should always be set is max server memory, particularly if you’re running CLR, linked servers, or other processes external to SQL Server’s memory. A setting introduced in SQL Server 2008 called ‘optimize for ad hoc workloads’ stores small compiled stubs of plans initially, so that infrequently used plans take up less space in the cache. In addition to settings, there are some DMO queries you can run to see memory pressure, as well as perfmon counters. I look at page reads and writes per second, as well as page life expectancy. The trick is to not panic with perfmon numbers – without a baseline, it’s difficult to tell what’s normal unless you see something totally off the scale. Not a big fan of the perfmon interface? You can alternatively query this information from within SSMS via the dm_os_performance_counters DMV.
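Putting those settings and the counter query together looks like this; the 28672 MB figure is purely an example, so size it for your own server:

```sql
-- Cap SQL Server's memory and enable the ad hoc setting
-- (example value only; leave headroom for the OS and external processes)
EXEC sys.sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sys.sp_configure 'max server memory (MB)', 28672;
EXEC sys.sp_configure 'optimize for ad hoc workloads', 1;
RECONFIGURE;

-- Page life expectancy straight from SSMS, no perfmon required
SELECT object_name, counter_name, cntr_value AS ple_seconds
FROM sys.dm_os_performance_counters
WHERE counter_name = 'Page life expectancy'
  AND object_name LIKE '%Buffer Manager%';
```

As with perfmon, a single PLE reading means little; capture it periodically and watch the trend against your own baseline.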
My bag of tricks
While doctors have their stethoscopes and lighted ear-checker gadgets, DBAs have scripts and really cool applications. Here are my top 5 favorite tools I can’t live without, all free and absolutely wonderful:
- Adam Machanic’s sp_whoisactive – really great for servers with thousands of processes
- Glenn Berry’s troubleshooting DMV scripts – where to find the DMO queries mentioned earlier
- SQL Live Monitor – cool free dashboard from Glen Small to see everything that’s going on at that moment
- M8 Free Clipboard – for quick access to code snippets I use daily
- Performance Dashboard Report (2005) (2012) – lots of great info, when I can use it (compatibility level can’t be set too low or it won’t work)
Good doctors look at the whole picture, a practice often overlooked by DBAs. It’s easy to point to the indexes or the code and end the diagnosis there, but a good DBA will dig further. SQL Server is complex; all the working pieces have to function together in harmony. One poorly configured setting can affect one component, which in turn affects others and causes an overall meltdown of the system. While it’s important to listen to the symptoms reported (even if they are lies) and investigate those particular areas, you must then look at everything else to see what’s really going on. Using a more holistic approach will undoubtedly result in a healthier, more finely tuned system.
So there you have it – all the basic database troubleshooting tips and medical references you can stand. Enjoy and happy databasing!