Crime Scene Investigation: SQL Server

Published 7 November 2013 3:29 pm

“The packages are running slower in Prod than they are in Dev”

My week began with this simple declaration from one of our lead BI developers, quickly followed by an emailed spreadsheet demonstrating that, over 5 executions, an extensive ETL process was running, on average, 630 seconds faster on Dev than on Prod. The situation needed some scientific investigation to determine why the same code, the same data, and the same schema would yield consistently slower results on a more powerful server. Prod had yet to be officially christened with a “Go Live” date, so I had the time, and having recently been binge-watching CSI: New York, I also had the inclination.

An inspection of the two systems, Prod and Dev, revealed the first surprise: although Prod was indeed a “bigger” system, with double the RAM of Dev, the latter actually had twice as many processor cores. On neither system did I see much sign of resources being heavily taxed while the ETL process was running. Without any real supporting evidence, I jumped to a conclusion that my years of performance tuning should have helped me avoid: that the hardware differences explained the better performance on Dev.

We spent time setting up a Test system, similarly scoped to Prod except with 4 times the cores, and ported everything across. The results of our careful benchmarks left us truly bemused; the ETL process on the new server was slower than on both other systems. We burned more time tweaking server configurations, monitoring IO and network latency, several times believing we’d uncovered the smoking gun, until the results of subsequent test runs pitched us back into confusion.

Finally, I decided, enough was enough. Hadn’t I learned very early in my DBA career that almost all bottlenecks were caused by code and database design, not hardware? It was time to get back to basics. With over 100 SSIS packages and hundreds of queries, each handling specific tasks such as file loads, bulk inserts, transforms, logging, and so on, the task seemed formidable. And yet, after barely an hour spent with Profiler, Extended Events, and the wait statistics DMVs, I had a lead in the shape of a query that joined three tables containing millions of rows, returned just 3,279 rows, but performed 239K logical reads. As soon as I looked at the execution plans for the query in Dev and Test, I saw the culprit: an implicit conversion warning on a join predicate column that was numeric in one table and varchar(50) in another! I turned this information over to the BI developers, who quickly resolved the data type mismatches and found and fixed “several” others as well. After the schema changes, the same query against the same databases ran in under 1 second on all systems, and its logical reads dropped to fewer than 300.
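
For anyone curious about the mechanics, here is a rough sketch of the sort of thing that cracked it; the table and column names below are invented for illustration, not our actual ETL schema. The first query surfaces the statements doing the most logical reads from the plan cache, and the rest reproduces the kind of join-predicate type mismatch that produces an implicit conversion warning, along with the simple ALTER COLUMN fix the developers applied.

    -- Worst offenders by logical reads, straight from the plan cache
    SELECT TOP (10)
           qs.total_logical_reads,
           qs.execution_count,
           SUBSTRING(st.text, (qs.statement_start_offset / 2) + 1,
                     ((CASE qs.statement_end_offset
                         WHEN -1 THEN DATALENGTH(st.text)
                         ELSE qs.statement_end_offset
                       END - qs.statement_start_offset) / 2) + 1) AS statement_text
    FROM sys.dm_exec_query_stats AS qs
    CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
    ORDER BY qs.total_logical_reads DESC;

    -- Invented repro of the mismatch: OrderID is numeric in the fact table but
    -- varchar(50) in the staging table, so the join forces an implicit conversion
    CREATE TABLE dbo.StageOrders (OrderID varchar(50)   NOT NULL, LoadDate datetime NOT NULL);
    CREATE TABLE dbo.FactOrders  (OrderID numeric(18,0) NOT NULL, Amount   money    NOT NULL);

    SELECT f.OrderID, f.Amount
    FROM dbo.FactOrders AS f
    JOIN dbo.StageOrders AS s
         ON f.OrderID = s.OrderID;  -- numeric = varchar(50): CONVERT_IMPLICIT warning in the plan

    -- The fix: make the types agree
    ALTER TABLE dbo.StageOrders
        ALTER COLUMN OrderID numeric(18,0) NOT NULL;

Running the problem query with SET STATISTICS IO ON before and after the change is a quick way to confirm that the logical reads really have collapsed.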

The analysis also revealed that on Dev the ETL task was pulling data across a LAN, whereas Prod and Test were connected across a slower WAN, which in large part explained why the same process ran slower on the latter two systems. Loading the data locally on Prod delivered a further 20% gain in performance.

As we progress through our DBA careers we learn valuable lessons. Sometimes, with a project deadline looming and pressure mounting, we choose to forget them. I was close to giving in to the temptation to throw more hardware at the problem. I’m pleased at least that I resisted, though I still kick myself for not looking at the code on day one. It can seem a daunting prospect to return to the fundamentals of the code so close to roll out, but with the right tools, and surprisingly little time, you can collect the evidence that reveals the true problem. It is a lesson I trust I will remember for my next 20 years as a DBA, if I’m ever again tempted to bypass the evidence.

14 Responses to “Crime Scene Investigation: SQL Server”

  1. brianh123 says:

    Personally, the first thing I would have done is put a copy of the Dev database on Prod (under a different name) and tested performance to confirm the claim that the schema and data were the same. I never trust what people tell me. :-)
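
    A minimal sketch of that restore-under-a-different-name check, with made-up database, path, and logical file names (RESTORE FILELISTONLY shows the real ones):

        -- All names here are placeholders
        RESTORE FILELISTONLY FROM DISK = N'\\backups\Dev\ETL.bak';

        RESTORE DATABASE ETL_DevCopy
        FROM DISK = N'\\backups\Dev\ETL.bak'
        WITH MOVE N'ETL'     TO N'D:\Data\ETL_DevCopy.mdf',
             MOVE N'ETL_log' TO N'L:\Logs\ETL_DevCopy_log.ldf',
             STATS = 10;

    With the copy restored alongside the production database, the same queries can be timed against both on identical hardware.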

  2. Keith Rowley says:

    Maybe you will do better than the rest of us, but humans have a long history of ignoring our history and experience where we should be learning from it.

  3. smilee282000 says:

    Great article! It is always very tempting to throw in more hardware but embarrassing when it doesn’t work out the way you want after spending the money.

  4. Ron Dameron says:

    While I greatly appreciate all the bloggers who write about new features, my favorite posts are about people’s real-world experiences. Thank you.

  5. Natarshia says:

    Hi Rodney,
    Great post. The way you presented it was a great read, and it gave me some things to think about when I have to troubleshoot slow processes. It also reminded me of a time when we had a case of the same query running faster on a dev machine that was slower and had more data than the production server. Well, at the time I was busy and my co-worker came to my office, so I just verbally walked him through some stuff. Finally I just told him to compare the actual query plans from both servers. It turns out that, because there was more data on the dev server, the optimizer actually produced two different plans, and the plan for the server with the most data was more efficient. Who would have thought the same query running against more data would be faster? The lesson I learned is similar to yours: always start with the basics, checking the query plan and looking at the code.
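
    For anyone wanting to do the same side-by-side comparison, one rough approach (the filter text below is just a placeholder) is to capture the plans on each server like this:

        -- Capture the actual plan while running the statement on each server
        SET STATISTICS XML ON;
        -- run the query being compared here
        SET STATISTICS XML OFF;

        -- Or pull the cached plan for a statement already in the plan cache
        SELECT st.text, qp.query_plan
        FROM sys.dm_exec_query_stats AS qs
        CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle)    AS st
        CROSS APPLY sys.dm_exec_query_plan(qs.plan_handle) AS qp
        WHERE st.text LIKE N'%some distinctive text from the query%';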

  6. Natarshia says:

    *By slower I mean the dev machine was a slower machine in general, in terms of RAM and cores.

  7. tparikh says:

    I had the exact same situation a few years ago. A stored procedure was taking 25-30 minutes to process a few thousand rows. It was doing aggregate functions like ranking, sum, etc. The data type was the culprit: it was numeric in one place and varchar in the other. Once the type was fixed, it ran in less than 5 minutes. I looked like a champ after figuring it out and correcting the code.

  8. paschott says:

    We had something similar happen with some SSIS packages and had just purchased Pragmatic Works’ BIxPress. I had added their auditing to all of the packages so I could see what was happening. This made it really easy to look at the last run and realize exactly what was off. Admittedly, there was a lot that was off, but I could pinpoint some of the larger problems and make a lot of headway very quickly.

    It was definitely worth checking the code in this case so we could fix the items that were wrong, affecting too many or too few rows, or just not performing as well as they should have.

  9. pklages says:

    I feel that now, with virtualization (not sure if that was your case), throwing hardware at an issue is often the first and easiest reaction, but it usually doesn’t solve anything if the project was originally specced out properly.

    • paschott says:

      Well, throwing more hardware at a problem may mask some of the issues, but it’s definitely not a total solution. We’ve seen some of our reports benefit greatly from introducing an SSD into the environment, but even with that we still have some performance issues from time to time that require us to refactor code. In the short term, the hardware is often cheaper, but long-term there are other considerations and the technical debt needs to be paid.

  10. willliebago says:

    Yes! It is usually the code & the devs :)

  11. BThompson says:

    Ahh, so very close to something that was happening sporadically for me, where the CPU would spike to 100% and then disk IO would go through the roof, very quickly making the production system unusable. The end result to fix this was throwing lots of cores at it and adding back an index that had previously been removed. It has been stable ever since!

  12. Andrew Watson says:

    Reminds me a bit of when I was trying to tune a particular query for a client… I optimised it on my Dev machine, and was confident that it was faster. Put it into production, and it killed the system. Same database, different hardware, different results.

    A little investigation showed that on production the old query used a parallel plan. On my dev system, parallelism wasn’t available, so yes, I had found a more efficient serial query, but it was bad in comparison to the parallel one. (Back to the old query.)

    It is the simple things…
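
    A cheap way to check for that trap before deploying is to time both plan shapes on the production box; the query below is only a placeholder for whatever statement is being tuned:

        -- Placeholder query: the point is the hint, not the statement
        SELECT c.CustomerID, SUM(o.Amount) AS Total
        FROM dbo.Orders AS o
        JOIN dbo.Customers AS c ON c.CustomerID = o.CustomerID
        GROUP BY c.CustomerID;              -- optimizer free to choose a parallel plan

        SELECT c.CustomerID, SUM(o.Amount) AS Total
        FROM dbo.Orders AS o
        JOIN dbo.Customers AS c ON c.CustomerID = o.CustomerID
        GROUP BY c.CustomerID
        OPTION (MAXDOP 1);                  -- forced serial, roughly what a dev box without parallelism produces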

  13. DeafProgrammer says:

    Very simple. You cannot ‘mix’ oranges and apples together :)
