Database administrators tend to rely more on science than intuition, and may therefore seem exasperatingly cautious to other IT people. A relational database, however, is a complex machine that is asked to deal with intricate problems. When things go wrong, it usually takes knowledge, deduction and a systematic approach to track down the root cause of problems and to determine the best fix.
After working in the experimental sciences, I changed careers and began working with SQL Server. For me, one of the greatest contrasts between the two lines of work was the impulsive approach that many in application development took when diagnosing and remedying performance problems in relational databases. Mind you, I’m not referring to the specific metrics collected or the tools used; rather, I'm speaking about the overall strategies. They seemed to me to be haphazard and disjointed. Some examples:
- A standard set of metrics and data is rarely collected while the problem is occurring, and baseline statistics are rarely gathered for comparison against troubled periods.
- The first explanation that comes to mind is often accepted as correct without being tested.
- There’s little or no attempt to reproduce the issue outside of the live system, which leads to...
- No effort is made to ensure that the "fix" actually solves the problem in the best and most reliable way possible.
If we don’t work on these inevitable problems in a structured and balanced way, the most likely result is a lot of wasted effort, and therefore money. It becomes a "shoot from the hip" solution at best. While we may temporarily resolve the issue, unless we understand and can quantitatively prove that our fix will repeatedly work, the underlying problem may linger and rear its head at a later time.
Fortunately, there are some simple lessons we can take from the well-proven methods of scientific inquiry. Over the course of many years and countless experiments, the scientific method has, when used properly of course, given reliable, repeatable results. And if we’re not striving to achieve reliability and repeatability in our work of tuning SQL, then we’re likely to merely reduce the symptoms rather than to resolve the problem itself.
I’ll be describing a way of using the lessons learned from experimental science to suggest a general framework for investigating and resolving performance problems. While my examples will focus specifically on SQL Server, I think many of the points would apply equally to other aspects of technology.
I’ll state up front that this will not be a detailed discussion of what pieces of information to collect or what tools to use. If you want to learn more about those specifics, I would heartily recommend starting with the book SQL Server 2012 Query Performance Tuning, by Grant Fritchey, or viewing Jonathan Kehayias' excellent article A Performance Troubleshooting Methodology for SQL Server.
First, I want to briefly list a series of steps that are commonly referred to in discussions of the scientific method.
- Define a question. What problem are we trying to solve?
- Gather information and resources (observe).
- Form an explanatory hypothesis. What do we believe explains the behavior we are seeing?
- Test the hypothesis by performing an experiment and collecting data in a reproducible manner. The experiment should produce evidence that either confirms or refutes our hypothesis.
- Analyze the data.
- Interpret the data and draw conclusions: the data will either support our hypothesis or serve as the starting point for a new one.
- Publish results.
- Retest (frequently done by other scientists).
While some of the steps may be combined or slightly reworded in our discussion, these provide a good starting point. So, let’s begin!
Step One: Define a Problem
Too often when we’re alerted to performance issues the actual definition of the problem is far too vague and general. We hear cries of "The system is slow!", but further details are difficult to glean. Without a clear and agreed-upon problem statement, little else we do will be helpful.
What exactly constitutes "clear"? I think the best way to illustrate is by example. Here is a well-written and specific statement of a problem: "When users attempt to run the TPS report for the Alameda Corp client, the report times out and is not displayed." The key pieces of information here are threefold: 1) it describes the specific behavior that we need to investigate ("the report times out and is not displayed"), 2) it contains the actual steps necessary to reproduce the problem, and 3) it describes how we’ll determine whether the problem is resolved (the report runs without timing out).
Step Two: Gather Data and Observations
Before we can start considering ways to correct the problem, we must collect data that will help point us towards the cause of the issue. Some useful examples might include things like the load of the server (CPU, memory, disk usage), examining query plans, and looking at how up-to-date statistics are.
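As a rough sketch of what that collection might look like in T-SQL (the DMVs are standard, but the table name dbo.Reports is a hypothetical stand-in for whatever object is under suspicion), we could capture the top waits, the most expensive cached statements, and statistics freshness:

```sql
-- Server-level pressure: top waits accumulated since the last restart
SELECT TOP (10) wait_type, wait_time_ms, waiting_tasks_count
FROM sys.dm_os_wait_stats
ORDER BY wait_time_ms DESC;

-- Most expensive cached statements by average logical reads
SELECT TOP (10)
    qs.execution_count,
    qs.total_logical_reads / qs.execution_count AS avg_logical_reads,
    SUBSTRING(st.text, (qs.statement_start_offset / 2) + 1, 200) AS statement_text
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
ORDER BY avg_logical_reads DESC;

-- How fresh are the statistics on a suspect table? (dbo.Reports is hypothetical)
SELECT s.name, STATS_DATE(s.object_id, s.stats_id) AS last_updated
FROM sys.stats AS s
WHERE s.object_id = OBJECT_ID('dbo.Reports');
```

The point isn't this particular set of queries; it's that the same collection is run every time, so results from today's incident can be compared against a baseline and against the next incident.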
Often when I was investigating problems, other members of the technical teams (though, to their credit, never another DBA) would clamor vigorously for action rather than careful analysis. Nothing would be easier than proceeding directly with the standard panaceas such as clearing the plan cache or rebuilding statistics, but in doing so we’re likely to be temporarily fixing the symptoms rather than the underlying problem. Without capturing the relevant data we make it difficult to truly determine a cause, and therefore an appropriate prescription for correction.
Step Three: Analyze Data and Form a Hypothesis
First, in case some of you out there are not closet nerds versed in scientific terminology, let’s define "hypothesis":
A proposition, or set of propositions, set forth as an explanation for the occurrence of some specified group of phenomena, either asserted merely as a provisional conjecture to guide investigation (working hypothesis) or accepted as highly probable in the light of established facts.
To make that a little clearer, we’re basically trying to state succinctly the cause of the problem at hand.
To take our example of the slow-running report further, clearing the plan cache might well make the code run faster in the short term. However, in doing so, we’ve destroyed much of the data we need to analyze to find the true cause of the problem. We may also set ourselves up for repeated manual intervention (“Hey Bob, the query’s running slow, can you purge the cache again?”) or, even worse, demands to automate an overly aggressive solution (“We need you to set up a job to clear the plan cache every hour, stat!”).
Whatever we do to try to relieve the symptoms, we need to confirm our ideas of what has caused this by actually looking at our collected data. For example, we might note vastly different numbers in the "Estimated" versus "Actual" row counts in our query plans, or see that while the code generally performs well, certain calls take an inordinate amount of resources. Both of these hint at the real cause: code that’s not written to properly handle vastly differing amounts of data for different sets of parameters.
Here again, it’s important to ensure that our language is clear. If we don't agree on the cause, it’ll be even harder to come to a consensus on the solution. In that vein, here’s an example of what I would consider a well-written hypothesis: "The report is timing out because the query plan is showing a nested loop join that it expects will only run 30 times, when in fact the prior joins result in 1,000,000 rows. This is caused by the query plan being generated with parameters that have a smaller set of data than those it is run with." Again, there are two key elements here: 1) We define the cause of the problem (a bad query plan), 2) we say why the cause is present (the plan was generated with different parameters which results in a lot smaller set of data).
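One way to gather the evidence behind such a hypothesis is to capture the actual execution plan, which records both estimated and actual row counts for every operator. A minimal sketch, assuming a hypothetical dbo.GetTpsReport procedure standing in for the report's code:

```sql
-- Capture the actual execution plan for the problem call
-- (dbo.GetTpsReport and its parameter are hypothetical stand-ins)
SET STATISTICS XML ON;
EXEC dbo.GetTpsReport @ClientName = 'Alameda Corp';
SET STATISTICS XML OFF;
-- In the returned showplan XML, compare each operator's EstimateRows
-- attribute against ActualRows (under RunTimeCountersPerThread): a nested
-- loop join estimated at 30 rows but fed 1,000,000 supports the hypothesis.
```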
Step Four: Test the Hypothesis and Solution by Performing an Experiment and Collecting Data in a Reproducible Manner
Of all the steps discussed, I think this may be the most important one of all. Too often I see causes for performance problems thrown out on a whim. When I inquire as to how they know the problem has truly been identified, the conversation usually goes something like this:
- Me: So, how do you know that that particular JOIN in the query is the cause of the problem?
- Them: Because we changed it and it went faster.
- Me: Ok, how do you know that the change you made actually made it faster? If you undo your change, is it back to being slow? Did you look at the query statistics to see if the table used in the JOIN had a lot of logical IO against it?
- Them: Well, what else could it be? We didn't try rolling back the change, since it worked so well. Why waste time doing that?
- Me: OK, good luck with that. I'm opening a boutique...
Take our running example of the slow report: to confirm we’re correct, and we’re looking at a problem of bad ‘parameter-sniffing’, we’d try running the procedure first with a parameter set that results in a small amount of data (thereby caching the plan best for that size), then running it for the large one and confirming we observe poor performance. We’d also run the procedure with the “large” parameter set first, confirming that it does in fact perform adequately when a proper plan is compiled. Finally, we’d compare the optimal plans for the “large” and “small” parameter sets to see the differences. If our hypothesis is correct, we’d see significant changes that allow the “large” plan to better handle the size of the result set (perhaps a different join order or join type – nested loops for example generally don’t do well with large input sizes).
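That experiment might look like the following sketch (the procedure and parameter values are hypothetical, and DBCC FREEPROCCACHE should only ever be run against a test server):

```sql
-- Scenario A: cache the "small" plan first, then make the "large" call.
DBCC FREEPROCCACHE;                                   -- start from an empty plan cache
EXEC dbo.GetTpsReport @ClientName = 'Tiny Client';    -- compiles and caches the small-data plan
EXEC dbo.GetTpsReport @ClientName = 'Alameda Corp';   -- reuses that plan: expect the slowness

-- Scenario B: compile for the "large" parameter set first.
DBCC FREEPROCCACHE;
EXEC dbo.GetTpsReport @ClientName = 'Alameda Corp';   -- fresh plan: should perform adequately
```

If both scenarios behave as predicted, we have reproducible evidence for the parameter-sniffing hypothesis rather than a hunch.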
Having confirmed our hypothesis about ‘parameter sniffing’ we’d also test our proposed solutions. This includes testing (a) the relative efficacy of the possible solutions, (b) that the solution doesn’t resolve the problem by mere coincidence.
In the first case, we make sure we use the most efficient and robust solution possible. To continue with our example, people may jump immediately to a procedure-wide WITH RECOMPILE hint to resolve the problem. And while it certainly may, we might find we can get the same result with less overhead using a simpler statement-level OPTION (RECOMPILE) hint. Or, we may even find that the use of local variables (to force the optimizer into a “best guess” mode) resolves the problem perfectly well without any recompilation overhead at all. Without testing all three and comparing relevant metrics, we’ll never know.
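As a sketch of the three candidates (the procedure and table names are hypothetical; I'm using ALTER PROCEDURE since in our running example the procedure already exists):

```sql
-- Candidate 1: procedure-wide recompile. A fresh plan is compiled on
-- every execution of the procedure.
ALTER PROCEDURE dbo.GetTpsReport @ClientName varchar(100)
WITH RECOMPILE
AS
    SELECT r.* FROM dbo.Reports AS r WHERE r.ClientName = @ClientName;
GO

-- Candidate 2: statement-level recompile. Only the sensitive statement
-- pays the compilation cost, and the optimizer sees the actual value.
ALTER PROCEDURE dbo.GetTpsReport @ClientName varchar(100)
AS
    SELECT r.* FROM dbo.Reports AS r WHERE r.ClientName = @ClientName
    OPTION (RECOMPILE);
GO

-- Candidate 3: a local variable. The optimizer can't sniff the value,
-- so it falls back to an average-density "best guess" estimate, with
-- no recompilation overhead at all.
ALTER PROCEDURE dbo.GetTpsReport @ClientName varchar(100)
AS
    DECLARE @Client varchar(100) = @ClientName;
    SELECT r.* FROM dbo.Reports AS r WHERE r.ClientName = @Client;
GO
```

Each ALTER replaces the previous version, so the candidates can be benchmarked one after another against the same workload, comparing logical reads and CPU time rather than simply which one "feels" faster.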
In the second case, we verify that the change we’re making resolves the problem in and of itself, rather than by coincidental side effect. Sticking with our ongoing example, perhaps changing the query to use a scalar subquery rather than a JOIN appears to resolve the problem. But was it that particular change that made the difference, or was it the mere act of altering the stored procedure (thereby causing a new plan to be generated) which made it work (this actually happened to me once, and it was decidedly the latter)? Unless we revert our change and reproduce the problem, we can never truly be sure if what we did actually solved the issue. As all statistics professors are fond of saying, "Correlation does not imply causation" (meaning that just because two things happened together doesn’t mean one caused the other).
Throughout this ongoing testing, we need to be sure that each iteration starts from the same state. For example, before we do any kind of performance test, we always clear the plan cache and buffer cache to ensure that factors like reading data from disk and compilation are always part of the equation (please, please don’t take this as advice to do this in a live system; you’re on your own if you do that). We also want the state of the data to be the same, meaning if we move data from one table to another during the test, it must be moved back before we re-run after the change. Finally, we should endeavor to keep our test system as segregated as possible from other influences, such as other users running code against different databases or other virtual machines hogging the host's bandwidth. In today's world of shared SANs, virtual machines, and shared systems, this can be very difficult, but it’s worth the effort regardless. You can also mitigate this problem by using measures that are more reliably repeatable, such as logical reads or CPU ticks, rather than pure execution time (i.e. use STATISTICS IO and STATISTICS TIME over simply seeing how long something takes to execute). And for the love of Pete (whoever Pete is), please make sure you test with representative data. Having a handful of rows in a table that will, in the live system, be filled with millions every day isn’t a good way to ensure your test results will apply to the actual behavior of the system in use.
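A typical reset-and-measure harness for each test iteration might look like this sketch (test systems only; the procedure name is hypothetical):

```sql
-- TEST SYSTEMS ONLY: both DBCC commands below are disruptive on a live server.
CHECKPOINT;                -- flush dirty pages so the buffer pool can be fully emptied
DBCC DROPCLEANBUFFERS;     -- empty the buffer pool: data comes from disk on every run
DBCC FREEPROCCACHE;        -- discard cached plans: compilation happens on every run

-- Measure with repeatable metrics (logical reads, CPU time), not
-- wall-clock duration alone.
SET STATISTICS IO ON;
SET STATISTICS TIME ON;
EXEC dbo.GetTpsReport @ClientName = 'Alameda Corp';   -- hypothetical procedure
SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;
```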
For brevity's sake, I’m going to lump the original steps five through eight into...
Step Five: Analyze Your Test Results and Decide If the Solution Works
Once we’ve collected all our test results, we’ve enough information to decide whether our original hypothesis is true and if the solution is the appropriate one. Ideally, this shouldn’t be an individual decision, but one made by a group of professionals who have all reviewed the data at hand. At best the answer will be clear and easily agreed upon, but it never hurts to have other eyes (especially ones with different perspectives) look at the situation.
What if our testing failed and our proposed solution has to be rejected? At that point, we’d go back to step three and attempt to determine a new explanation and solution for the behavior we’re seeing. We might also collect new or different data to give more information on which to base our idea of the problem.
If you’ve made it this far, I compliment you on your patience! This may seem like a great deal of effort and time, especially when we have business breathing down our necks for a quick fix. However, I’ve found that it’s far more effective in the long run to take this measured approach. Without a careful, considered process for finding and fixing performance-related issues, we almost invariably set ourselves up for more crisis work in the future.
Now, lest I be accused of sacrificing critical up-time for scientific purity, I’ll say it clearly: there are times when you may need to break these rules in order to get the business back on line. To go back to our example, if you’ve a critical piece of code that is running slow, and you’ve identified that it likely has a bad plan, your first step should not be to insist that we reproduce the problem in a test system and provide a full written analysis. Instead, make sure you’ve got the essential metrics and then purge the plan to see if it gets a better one. Then, when the fire has been sufficiently put out, work on getting all the pieces in place to dig into the problem further. After all, a doctor doesn’t withhold pain-relief until being certain of the underlying cause (but doubtless is annoyed if the patient merely blurts out “Thanks!” and disappears until the next crisis).
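Even in firefighting mode we can be surgical: DBCC FREEPROCCACHE accepts a single plan_handle, so we can evict only the suspect plan after saving a copy for later analysis. A sketch, filtering on our hypothetical procedure name:

```sql
-- Find the suspect plan by filtering on the (hypothetical) procedure name
DECLARE @plan_handle varbinary(64);

SELECT TOP (1) @plan_handle = cp.plan_handle
FROM sys.dm_exec_cached_plans AS cp
CROSS APPLY sys.dm_exec_sql_text(cp.plan_handle) AS st
WHERE st.text LIKE '%GetTpsReport%';

-- Save the evidence before destroying it
SELECT qp.query_plan
FROM sys.dm_exec_query_plan(@plan_handle) AS qp;

-- Evict only this plan, not the entire cache
DBCC FREEPROCCACHE (@plan_handle);
```

This way the fire gets put out without wiping every other plan on the server, and the saved showplan XML is still available when the post-mortem starts.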
Is there a chance that this “fix it first” approach leads to a lost opportunity to properly correct something, since the issue is no longer burning someone's skin and therefore isn’t a priority? Certainly, that could happen. But ultimately we do need to be aware of the priority of getting a system functional. Even if you only follow steps one through three, you’re still doing better than most.
But when at all possible, it’s wise to follow a disciplined and consistent methodology for troubleshooting performance issues. So the next time you find yourself in firefighting mode, do you (and your clients) a favor, and instead of leaping into action, take the time to channel your inner Beaker and put the problem away for good.