11 September 2012

Why is that SQL Server Instance under stress?

There are several reliable indications, using SQL Queries, of the what is causing SQL Server performance problems. Some of these are fairly obvious, but others aren't. Grant shows how you can get clues from any SQL Server as to the cause of stress.

When you hit performance problems with a SQL Server Instance, a few measures will tell you enough about their general nature that you can then focus in quickly on the actual cause. There are many different metrics that you can use to understand what’s up with your SQL Server instance. Each of these measures will give you a positive indication. It could, perhaps, be internal or external memory pressure, undue or uneven CPU loading, or IO bottlenecks. If you get a positive indication of where to look for the problem, you can then drill in to the detail. If, for example, you determine that you have excessive CPU stress in SQL Server the next step is to determine which query is causing the most CPU.

So what are these general indications? It is useful to get straight-forward information about the total processor time, a percentage measure of how much of the CPUs you’re using on your machine, or average read time for your disks which tells you how long it takes to read information from them. When you’ve done these easy ones, however, there are other methods of finding out what’s happening on your system that aren’t as easy, but which yield even more useful information. Let’s explore a few of these in more detail.

Is your instance experiencing memory stress?

Sys.dm_os_ring_buffers

The ring buffers within the operating system are simply a collection point for certain types of system messages. Largely these can be obscure system messages, many of them related to the management of the ring buffers themselves, but a few of the buffer messages are extremely interesting. For example, the operating system is aware of when it is running low on memory. When this happens, an entry is made into the ring buffers You can query to find out if you have had a memory alert, using the dynamic management object, sys.dm_os_ring_buffers.

There are actually two kinds of memory alerts. You can get a memory alert for the physical memory of the machine you’re running on. From a SQL Server point of view, this is referred to as external memory, since it’s not the memory managed by the SQL Server service. You also get alerts for the virtual memory; that memory that is being managed by the SQL Server service. We usually refer to this as internal memory. When either is running low, you will see an alert posted into the ring buffers. You’ll also see alerts when you have enough memory, or if you have a large increase in memory.

You can simply query the DMO like this:

That will return the information available, but you will soon find that the majority of the interesting information returned is in the “record” column. This is a text column that stores XML. To really pull together interesting information you’ll want to go for a more thorough query along these lines:

With this query I’ve first created a common table expression (CTE) called RingBuffer. In that I only do two things. First filter for a particular ring buffer type, ‘RING_BUFFER_RESOURCE_MONITOR’. These are where the memory messages are going to occur. Second, I CAST the ‘Record’ column from text to XML. From there I use the CTE within the query and use the XQuery commands to pull out all the interesting information from the XML data.

As an extra complication, the timestamp column in sys.dm_os_ring_buffers is actually a datetime value, but it’s based on cpu ticks so you’ll have to use that formula to convert that data to a human readable date and time.

To use sys.dm_os_buffers as part of a monitoring process, you just have to look for the two events, RESOURCE_MEMPHYSICAL_LOW or RESOURCE_MEMVIRTUAL_LOW. These are available in the ResourceMonitor/Notification property within the XML. They are an absolute indicator of a low memory condition on your machine, so if you get the alert, you were low on either external/OS/physical memory, or internal/SQL Server/virtual memory.

Is the system under load?

One question that constantly comes up, ‘is the system under load or not?’

There are a number of different ways to try to understand this, but only a few that will let you know with certainty whether you’re under load.

Sys.dm_os_workers

One of my favorite ways of determining precisely how much work is going on within the system is to look at sys.dm_os_workers. This measure won’t tell you what it is that is placing load on the system; nor will it allow you to understand what the implications of the load are. However, it’s a perfect measure of load on the system.

There is a lot of information returned from sys.dm_os_workers. The DMO is returning information about worker processes within the OS. You can view information about the process such as the last wait type, whether or not the worker has an exception, how many context switches it’s experienced; all sorts of stuff. The Books Online entry for the DMO even shows you how to determine how long a process has been in a runnable state.

But to use this as a measure for load, your query is extremely easy:

This really is as simple as that. As this number goes up or down, the load on your system is going up or down. Where there are lots of ‘RUNNING’ worker processes today that weren’t there yesterday, you have an increased load on the system. But bear in mind, you need to be able to compare this over time. Simply capturing a number won’t tell you anything. You have to be able to compare between two values.

Sys.dm_os_schedulers

Another way to measure system load is to take a look at the schedulers. These are the processes that manage the worker processes. Again, this is an absolute measure of load on your server. It can tell you how much work is being done within the system.

Querying the schedulers yields a lot of interesting information about how the system is being managed. You can see how many workers are being handled by any particular scheduler. You can see how many times that scheduler has yielded the CPU (gave up access to another process, because each process only gets limited access to CPUs), a count of the workers that are currently active within the scheduler and several other details.

But to see a measure of load you can run a very simple query:

Once more, this number is only meaningful if you can compare it to previous values. Using either the workers or the schedulers as a measure of system load is only as accurate as your baseline values, but if you maintain a collection of these values over time, you’ll be able to determine load on the operating system.

Has SQL Server got enough memory?

DBCC MEMORYSTATUS

This is a simply amazing collection of data. What you get is an output of all the various pieces of memory management within SQL Server. You’ll be able to see where every bit of memory has been allocated and managed within SQL Server. The command is frequently used when you’re dealing with customer support from Microsoft to troubleshoot specific issues. But, it’s another way of absolutely determining how well you’re doing with memory on your system.

If you simply run the command:

You will see all the various memory allocation and management processes within SQL Server. All of them. In fact, there’s so much information there that it quickly becomes pointless to try to make sense of it. The good news is, it’s possible to target a few particular pieces of information. If we specifically go after the Target Committed value and the Current Committed value we can make a determination if SQL Server has enough memory. It’s simple. If the Target value is higher than the Current value, you don’t have the memory you need within SQL Server. But retrieving these values is a little bit of a pain. Here’s one way to do it:

I’m creating a table variable and then loading all the output from MEMORYSTATUS using the TABLERESULTS to ensure that the output is a table. By using the Common Table Expression (CTE) to define the information selected from the table variable I can reference it twice with the SELECT statement and just JOIN the two values based on the ROW_NUMBER. It really works. If you get a negative value, you’re looking at memory issues.

Understand, DBCC MEMORYSTATUS is meant as a support mechanism for Microsoft. It’s not a part of the standard tool set. This means that it absolutely is subject to unpublished changes from one version of SQL Server to the next or even one service pack to the next. Use it to diagnose your memory issues with this understanding firmly in place.

Do I need a better, faster disk system?

Sys.dm_io_virtual_file_stats

This dynamic management object returns statistics about the behaviors of the files on your databases. The most interesting pieces of information here are the stalls, or waits, gathered and available. If you simply run the query:

You must pass it two pieces of information, the database id, which I’m getting by using DB_NAME to identify the database that I’m currently attached to and then passing that to DB_ID, and the file id. You can pass the value NULL, as I did, which will return all the files for that database.

The information returned is excellent, especially four columns; sample_ms, io_stall_read_ms, io_stall_write_ms, io_stall. Let’s take a look at what these represent and you’ll quickly understand how interesting they are to you as a DBA. The sample_ms is very straight forward. It’s the time since SQL Server was restarted. It provides you with the measuring information for understanding all the other values. Next is io_stall_read_ms. This represents the amount of time that processes were forced to wait on reads from this device. If you combined is_stall_read_ms and sample_ms, you get a precise measure of the percentage of time that your apps are waiting to read from a particular file on a particular database. You also get io_stall_write_ms, which represents the amount of time that processes have had to wait for writes. You can gather this metric over time to see how it grows, or use sample_ms in the same way you did with the reads. Finally, io_stall shows the amount of time that watis occurred on that file for any io operation. Again, you can gather it over time to see how it’s growing (because it will always only ever grow) or you can get a percentage of time spent waiting on disks by comparing to sample_ms.

These measures will tell you exactly how severe the issues are regarding io waits on your system. But they won’t pin point specific queries. Instead, this measure is primarily focused on determining if there are issues with your system. Do you need more disks, faster disks, etc.

How well is the CPU performing?

Sys.dm_os_wait_stats

I have this listed as an obscure method for gathering performance metrics, but it really shouldn’t be. By now everyone should have heard that understanding what the server is waiting on is a great way to understand what is causing the server to run slow. But I still see lots of people being shocked that they can find this information.

Sys.dm_os_wait_stats shows you an aggregated view of what the server has been waiting on since the last time it was started (or since the wait stats were cleared). This information is broken down by particular wait types, some of which are obscure indeed. I won’t attempt to document them here, even Microsoft doesn’t supply complete documentation on these. You will need to rely on internet searches for identifying what some wait types represent. Others are documented in the books online, so please use that excellent resource.

To query sys.dm_os_wait_stats you run a query like this:

The output is only five columns; wait_type, waiting_tasks_count, wait_time_ms, max_wait_time_ms and signal_wait_time_ms. The only one of these that is not immediately understandable by it’s name is the signal_wait_time_ms. This column represents the amount of time from when the thread was called and it actually started to execute. This time is included in the total time, wait_time_ms. The signal_wait_time_ms then is actually a measure of the wait time to get at the CPU. This makes it an excellent measure of how much load your CPU is carrying. Because of this, while you should absolutely look at wait statistics in general, you should always focus in on the signal_wait_time_ms separately in order to understand how well your CPU is performing. You can get quite sophisticated with this, or you can simply focus in like the following:

This will represent a cumulative total of all the waits on CPU that have occurred on the system. It’s an excellent indicator of load over time. You will need to compare one measure to another to see how it’s growing.

Conclusion

These are just a few examples of the general areas of ‘stress’ that you can check so that you can then quickly focus in on particular aspects of the system so as to understand what’s going on. With these measures you can quickly confirm or eliminate possible causes of performance problems.

Each of these measures provides a sufficiently positive indication to give you confidence that you have memory pressure or your CPU is under load. Once you understand the general nature of the stress, you’ll need to go to other, more standard metrics, in order to understand specifics such as which query is causing the most CPU.

Keep up to date with Simple-Talk

For more articles like this delivered fortnightly, sign up to the Simple-Talk newsletter

This post has been viewed 28761 times – thanks for reading.

Tags: , , , , , ,

  • Rate
    [Total: 67    Average: 4.4/5]
  • Share

Grant Fritchey is a SQL Server MVP with over 20 years’ experience in IT including time spent in support and development. Grant has worked with SQL Server since version 6.0 back in 1995. He has developed in VB, VB.Net, C# and Java. Grant has authored books for Apress and Simple-Talk, and joined Red Gate as a Product Evangelist in January 2011. Find Grant on Twitter @GFritchey or on his blog.

View all articles by Grant Fritchey

  • Bill Galashan

    A Cautionary Note
    Check out http://support.microsoft.com/kb/2587929 and the patch level of your sql server before running xquery on a production server. If not at least up to SQL 2008 R2 SP1 CU4 then this can cause stack dumps

    We do pretty much all of the details contained in this post with a few variations such as extracting the raw data to a non important server before querying it to get to all the information needed.

  • Anonymous

    MEMORYSTATUS missing current and virtual committed values
    I ran the query to retrieve the MEMORYSTATUS and am not finding any records with a value of ‘Target Committed’ or ‘Current Committed’. My SQL Server is running in a Virtual environment so is that the cause? If so are there other values I should be looking for?

  • Anonymous

    MEMORYSTATUS missing current and virtual committed values
    I ran the query to retrieve the MEMORYSTATUS and am not finding any records with a value of ‘Target Committed’ or ‘Current Committed’. My SQL Server is running in a Virtual environment so is that the cause? If so are there other values I should be looking for?

  • Phil Factor

    Wait stats
    How do wait stats tell you the I/O stress?

  • Grant Fritchey

    RE: MEMORYSTATUS
    It should be there. Remember that there are a large number of different result sets that come from DBCC MEMORYSTATUS. You’re looking for a very particular set.

  • Paulo A. Nascimento

    Alternatives
    Interesting article, queries and DMVs are a good comparison with the more "graphic" use of OS/SQL performance counters.

  • Andy Krafft

    Memory stress with lots of avail RAM?
    Grant, a great article but I’m confused. I’ve got a 32-bit Enterprise with 16GB using AWE. I seem to have lots of free memory but am getting RESOURCE_MEMPHYSICAL_LOW still. Why is this?

    Notification_Time Notification_type MemoryUtilization % Node Id Process_Indicator System_Indicator SQL_ReservedMemory_MB SQL_CommittedMemory_MB SQL_AWEMemory_MB TotalPhysicalMemory_MB AvailablePhysicalMemory_MB TotalPageFile_MB AvailablePageFile_MB TotalVirtualAddressSpace_MB AvailableVirtualAddressSpace_MB SinglePagesMemory MultiplePagesMemory Record Id
    ———————– —————————— ——————– ——————– —————– —————- ——————— ———————- ——————– ———————- ————————– ——————– ——————– ————————— ——————————- ——————– ——————– ——————–
    2012-09-24 10:52:40.577 RESOURCE_MEMPHYSICAL_LOW 100 0 2 0 2114 550 928 16894 15624 16645 15791 2559 834 23792 14944 1
    2012-09-24 10:59:08.360 RESOURCE_MEMPHYSICAL_LOW 100 0 2 0 2152 568 1856 16894 14659 16645 14832 2559 745 38184 15920 4

    bpool_committed_mb bpool_target_mb bpool_visible_mb
    ——————– ——————– ——————–
    3781 10944 1440

    phys_mem_mb user_virtual_address_space_size
    ——————– ——————————-
    16382 2047

    phys_mem_mb avail_phys_mem_mb sys_cache_mb kernel_pool_mb total_virtual_memory_mb available_virtual_memory_mb system_memory_state_desc
    ——————– ——————– ——————– ——————– ———————– ————————— —————————————————————————————————————————————————————————————————————————————————————-
    16382 11662 1428 93 16133 11830 Available physical memory is high

  • Andy Krafft

    Memory stress with lots of avail RAM?
    Grant, a great article but I’m confused. I’ve got a 32-bit Enterprise with 16GB using AWE. I seem to have lots of free memory but am getting RESOURCE_MEMPHYSICAL_LOW still. Why is this?

    Notification_Time Notification_type MemoryUtilization % Node Id Process_Indicator System_Indicator SQL_ReservedMemory_MB SQL_CommittedMemory_MB SQL_AWEMemory_MB TotalPhysicalMemory_MB AvailablePhysicalMemory_MB TotalPageFile_MB AvailablePageFile_MB TotalVirtualAddressSpace_MB AvailableVirtualAddressSpace_MB SinglePagesMemory MultiplePagesMemory Record Id
    ———————– —————————— ——————– ——————– —————– —————- ——————— ———————- ——————– ———————- ————————– ——————– ——————– ————————— ——————————- ——————– ——————– ——————–
    2012-09-24 10:52:40.577 RESOURCE_MEMPHYSICAL_LOW 100 0 2 0 2114 550 928 16894 15624 16645 15791 2559 834 23792 14944 1
    2012-09-24 10:59:08.360 RESOURCE_MEMPHYSICAL_LOW 100 0 2 0 2152 568 1856 16894 14659 16645 14832 2559 745 38184 15920 4

    bpool_committed_mb bpool_target_mb bpool_visible_mb
    ——————– ——————– ——————–
    3781 10944 1440

    phys_mem_mb user_virtual_address_space_size
    ——————– ——————————-
    16382 2047

    phys_mem_mb avail_phys_mem_mb sys_cache_mb kernel_pool_mb total_virtual_memory_mb available_virtual_memory_mb system_memory_state_desc
    ——————– ——————– ——————– ——————– ———————– ————————— —————————————————————————————————————————————————————————————————————————————————————-
    16382 11662 1428 93 16133 11830 Available physical memory is high

  • Anonymous

    Why is SQL Server stressed?
    could it be because they just install Oracle on a different server?

  • Anonymous

    Why is SQL Server stressed?
    🙂

  • Anonymous

    DBCC MEMORYSTATUS() WITH TABLERESULTS
    Is your CTE/query version specific? If I run your CTE/Query, I also do not get anything returned. If I look specifically for ValueName like ‘%Committed%’ I get several VM Committed (110+) and one Committed….

    I am running against SQL 2005 on a physical machine

    Thanks

  • Anonymous

    DBCC MEMORYSTATUS() WITH TABLERESULTS
    Is your CTE/query version specific? If I run your CTE/Query, I also do not get anything returned. If I look specifically for ValueName like ‘%Committed%’ I get several VM Committed (110+) and one Committed….

    I am running against SQL 2005 on a physical machine

    Thanks

  • Jeff Moden

    Cool…
    Nicely done, Grant.

  • Grant Fritchey

    re: Memory Stress
    Hard to know why for sure based on what you’ve got here. Are there other things running on the server? Does it have virus checking or anything like that? You’d need to investigate quite a lot more.

  • Grant Fritchey

    RE: Memory Status on 2005
    I have not tested this query on a 2005 server. The results might be different. I’m honestly unsure.

  • Grant Fritchey

    Jeff
    Thanks!

  • Keith Mac Lure

    Please expound
    HI Grant, Please could you expund a bit on the statement: "If you combined is_stall_read_ms and sample_ms, you get a precise measure of the percentage of time that your apps are waiting to read from a particular file on a particular database."

    What do you mean by ‘combine’ these two values? add them together? or do you devide the is_stall_read_ms by sample_ms to get a ‘percentage of time’?

  • Andy Krafft

    Memory stress with lots of avail RAM
    Hi Grant, no there is nothing else running: it is a dedicated server with no AV, etc. Very weird I agree and I just wanted to get confirmation from an expert that I was not misreading the output. I’ve looked at everything I can and think I’ll just have to rebuild it (on x64 probably). Thanks for the help. Andy

  • Keith Mac Lure

    Please expound
    HI Grant, Please could you expund a bit on the statement: "If you combined is_stall_read_ms and sample_ms, you get a precise measure of the percentage of time that your apps are waiting to read from a particular file on a particular database."

    What do you mean by ‘combine’ these two values? add them together? or do you devide the is_stall_read_ms by sample_ms to get a ‘percentage of time’?

  • Keith Mac Lure

    sorry for double post
    Form refresh error…

  • Anonymous

    RE: Memory Status on 2005
    Hi Grant — Thank you for the reply.

    Out of curiosity, which version did you write against?

    Cheers

  • Grant Fritchey

    re: io_stall_read_ms
    Sorry, yes, I didn’t mean ‘combine’ as in add, I just meant ‘use both numbers’. That wasn’t the clearest way to state that. Apologies.

    The sample time is the amount of time that has been sampled and the stall time is the wait time within the sample. You can divide one from the other to get a percentage.

    Again, sorry that wasn’t clear.

  • Grant Fritchey

    Version
    I’ve done all this work on 2012 and some on 2008R2.

  • Andy Krafft

    RE: Version / DBCC MEMORYSTATUS()
    Hi Grant

    I’ve too have had problems with your MEMORYSTATUS code on 2008R2 and found this article which seems to show the changes made in 2012.

    http://blogs.msdn.com/b/sqlosteam/archive/2012/07/11/memory-manager-surface-area-changes-in-sql-server-2012.aspx

    Andy