Click here to monitor SSC
  • Av rating:
  • Total votes: 65
  • Total comments: 10
Grant Fritchey

Statistics in SQL Server

08 January 2013

SQL Server's Query Optimiser relies on statistics for providing a decent query plan. If the statistics are wrong, outdated or missing then you have little realistic hope of having queries that perform well. It is therefore important to appreciate how SQL Server maintains distribution statistics.

As part of our ongoing discussions comparing SQL Server to Oracle, Jonathan Lewis and I are going to be talking about statistics next. As usual with these talks, I don’t have a clue how things work in Oracle, so I’m looking forward to the education. But, I have something of a handle on how statistics work in SQL Server. As a touch stone for those attending or for future reference, here’s some information about statistics within SQL Server.

What are Statistics?

SQL Server’s query optimizer uses distribution statistics to determine how it’s going to satisfy your SQL query. These statistics represent the distribution of the data within a column, or columns. The Query Optimizer uses them to estimate how many rows will be returned from a query plan. With no statistics to show how the data is distributed, the optimizer has no way it can compare the efficiency of different plans and so will be frequently forced to simply scan the table or index. Without statistics, it can’t possibly know if the column has the data you’re looking for without stepping through it. With statistics about the column, the optimizer can make much better choices about how it will access your data and use your indexes.

Distribution statistics are created automatically when you create an index. If you have enabled the automatic creation of statistics (the default setting of the AUTO_CREATE_STATISTICS database setting ) you’ll also get statistics created any time a column is referenced in a query as part of a filtering clause or JOIN criteria.

Data is measured two different ways within a single set of statistics, by density and by distribution.


Density is the easiest of the two to understand. Density is a ratio that shows just how many unique values there are within a given column, or set of columns. The formula is quite easy:

Density = 1 / Number of distinct values for column(s)

…which allows you to query your table to see exactly what the density might be:

FROM dbo.MyTable;

You can see the density for compound columns too. You just have to modify the query to get a distinct listing of your columns first:

SELECT 1.0 /  COUNT(*)
FROM dbo.MyTable) AS DistinctRows;

…and of course you can add columns to reflect the columns in your index so that you can see the density for yourself.

Density matters because the amount of selectivity of a given index, as determined by its density, is one of the best ways of measuring how effective it will be with your query. A high density (low selectivity, few unique values) will be of less use to the optimizer because it might not be the most efficient way of getting at your data. For example, if you have a column that shows up as a bit, a true or false statement such as, has a customer signed up for you mailing list, then for a million rows, you’re only ever going to see one of two values. That means that using an index or statistics to try to find data within the table based on two values is going to result in scans where more selective data, such as an email address, will result in more efficient data access.

The next measure is a bit more complex, the data distribution.

Data Distribution

The data distribution represents a statistical analysis of the kind of data that is in the first column available for statistics. That’s right, even with a compound index, you only get a single column of data for data distribution. This is one of the reasons why it’s frequently suggested that the most selective column should be the leading edge, or first, column in an index. Remember, this is just a suggestion and there are lots of exceptions, such as if the first column sorts the data in a more efficient fashion even though it’s not as selective. However, on to data distribution. The name of the storage mechanism for this distribution information is a histogram. Now a histogram, by statistician definition, is a visual representation of the distribution of the data; but distribution statistics use the more general mathematical sense of the word.

A histogram is a function that counts the number of occurences of data that fall into each of a number of categories (also known as bins) and in distribution statistics these categories are chosen so as to represent the distribution of the data. It’s this information that the optimizer can use to estimate the number of rows returned by a given value.

The SQL Server histogram consists of up to 200 distinct steps, or ‘bins’. Why 200? 1) it’s statistically significant or so I’m told, 2) it’s small, 3) it works for most data distributions below several hundred million rows, after that, you’ll need to look at stuff like filtered statistics, partitioned tables and other architectural constructs. These 200 steps are represented as rows in a table. The rows represent the way the data is distributed within the column by showing a pieces of data describing that distribution:

RANGE_HI_KEY This is the top value of the step represented by this row within the histogram.
RANGE_ROWS This number shows the number of rows within the step that are greater than the previous top value and the current top value, but not equal to either.
EQ_ROWS This number shows the number of rows within the step that are greater than the previous top value and the current top value, but not equal to either.
DISTINCT_RANGE_ROWS These are the distinct count of rows within a step. If all the rows are unique, then the RANGE_ROWS and the DISTINCT_RANGE_ROWS will be equal.
AVG_RANGE_ROWS This represents the average number of rows equal to a key value within the step.

This data is built one of two ways, sampled or a full scan. When indexes are created or rebuilt, you’ll get statistics created by a full scan by default. When statistics are automatically updated, SQL Server uses the sampled mechanism to build the histogram. The sampled approach makes the generation and updates of statistics very fast, but, can cause them to be inaccurate. This is because the sampling mechanism randomly reads data through the table and then does calculations to arrive at the values for the statistics. This is accurate enough for most data set, but otherwise you’re going to need to make the statistics as accurate as possible by manually update, or create, the statistics with a full scan.

Statistics are updated automatically by default, and these are the three thresholds that automatically cause an update:

  • If you have zero rows in the table, when you add a row(or rows), you’ll get an automatic update of statistics
  • If you have less than 500 rows in a table, when you add more than 500, and this means if you’re 499, you’d have to add rows to 999, you’ll get an automatic update.
  • Once you’re over 500 rows you will have to add 500 additional rows + 20% of the size of the table before you’ll see an automatic update on the stats.

That’s the default behavior. You get variations with filtered indexes and filtered statistics. Also, with SQL Server 2008R2 SP1 and in SQL Server 2012, you can set traceflag 2371 to ‘on’ in order to get a dynamic value instead of the fixed 20%. This will help larger databases get a more frequent update of statistics.

You can also update statistics manually. SQL Server provides two mechanisms for doing this. The first is sp_updatestats. This procedure uses a cursor to walk through all the statistics in a given database. It looks at the row modifier count and if any changes have been made, rowmodctr > 0, and then updates the statistics. It does this using a scanned update. You can also target individual statistics by using UPDATE STATISTICS and providing a name. With this, you can specify that the update use a FULL SCAN to ensure that the statistics are completely up to date, but using this requires that you write your own maintenance code to make this a permanent change.

Enough talking about what statistics are, let’s see how you can look at them and understand the data contained within them.


To see the current status of your statistics you use the DBCC statement SHOW_STATISTICS. The output is in three parts:

  • Header: which contains meta-data about the set of statistics
  • Density: Which shows the density values for the column or columns that define the set of statistics
  • Histogram: The table that defines the histogram laid out above

You can pull individual pieces of this data out by modifying the DBCC statement.

The information in the header can be quite useful (click through to enlarge):

Header Information

The most interesting information is found in the columns:

  • Updated: when was this set of statistics last updated. This is how you can get some idea of the age of a set of statistics. If you know that a table gets thousands of rows of inserts on any given day, but the statistics are from last week, you may need to manually update the statistics.
  • Rows and Rows Sampled: When these values match, you’re looking at a set of statistics that are the result of a full scan. When they’re different, you’re probably looking at a sampled set of statistics.

The rest of the data is nice to have and useful for a quick look at the statistics, but these are the really interesting ones.

The second set of data is the density measures. This set shows the density from a compound index:

The density from a compound index

The information is fairly self-explanatory. The All density column shows the density value as provided by the formula quoted above. You can see that the value gets smaller and smaller for each column. The most selective column is clearly the first. You can also see the Average Length of the values that make up the column and finally the list of Columns that define each level of the density.

Finally, this graphic shows a section of the histogram:

A section of the histogram

You can see how the data is distributed between each of the steps. With this you can see how well SQL Server is distributing your information. Since all the row counts here, except the average, are full numbers, this is another indication that this set of statistics is the result of a full scan. If the range rows are estimates they will include a number of decimal values.

You’re going to look at the histogram when you’re trying to understand why you’re seeing scans or seeks within SQL Server when you expected something else. The way that the data is distributed, showing the average number of rows for a given value within a particular range for example, gives you a good indication of how well this data can be used by the optimizer. If you’re seeing lots of variation between the range rows or the distinct range rows, you may have old or sampled statistics. Further, if the range and distinct rows are wildly divergent but you have up to date and accurate statistics, you may have serious data skew which could require different approaches to indexes and statistics such as filtered statistics.


As you can see there’s a lot to statistics within SQL Server. They are an absolutely vital part of getting the best performance possible out of the system. Understanding how they work, how they’re created and maintained, and how to look at them will help you understand the value of your own statistics.

Grant Fritchey

Author profile:

Grant Fritchey is a SQL Server MVP with over 20 years’ experience in IT including time spent in support and development. Grant has worked with SQL Server since version 6.0 back in 1995. He has developed in VB, VB.Net, C# and Java. Grant has authored books for Apress and Simple-Talk, and joined Red Gate as a Product Evangelist in January 2011. Find Grant on Twitter @GFritchey or on his blog.

Search for other articles by Grant Fritchey

Rate this article:   Avg rating: from a total of 65 votes.





Must read
Have Your Say
Do you have an opinion on this article? Then add your comment below:
You must be logged in to post to this forum

Click here to log in.

Posted by: Simon Richards (not signed in)
Posted on: Monday, January 21, 2013 at 6:05 AM
Message: Am I reading that correctly? are the descriptions for RANGE_ROWS AND EQ_ROWS supposed to be the same?

Great article, thanks very much.

Posted by: Grant Fritchey (view profile)
Posted on: Monday, January 21, 2013 at 6:40 AM
Message: No. If it reads that way I screwed up. RANGE_ROWS is the number of rows in the range, not counting the EQ_ROWS. The EQ_ROWS are the number of rows that match the RANGE_HI_KEY value.

Subject: Statistics after restarting server
Posted by: kevswag (view profile)
Posted on: Monday, April 8, 2013 at 4:57 AM
Message: Hi Grant

Is there anything that should be done with statistics after restarting your server? The developers of our ERP are telling me to update the statistics after every restart. Never heard of this before, are you able to clarify for me ? Thanks. On seperate note, just finished reading your book on execution plans. Great book and have made several performance improvements through what I have learned. Thank you.

Subject: Statistics after restarting server
Posted by: kevswag (view profile)
Posted on: Monday, April 8, 2013 at 5:47 AM
Message: Hi Grant

Is there anything that should be done with statistics after restarting your server? The developers of our ERP are telling me to update the statistics after every restart. Never heard of this before, are you able to clarify for me ? Thanks. On seperate note, just finished reading your book on execution plans. Great book and have made several performance improvements through what I have learned. Thank you.

Subject: Trace Flag 2371
Posted by: SQLSoup (view profile)
Posted on: Tuesday, April 9, 2013 at 1:46 PM
Message: Hi Grant,

Great article and great presentation on the topic at SQLSaturday #203 in Boston on 4/6/2013.

Have you seen any stats updates improvements with SQL Server 2008 SP1/ SQL 2012 trace flag 2371?

Changes to automatic update statistics in SQL Server – traceflag 2371:

Subject: Trace Flag 2371
Posted by: Grant Fritchey (view profile)
Posted on: Tuesday, April 23, 2013 at 9:10 AM
Message: I haven't managed a large production system since that traceflag was made available. However, testing suggests it's extremely helpful. I have talked to a few others about it and no one has reported any ill effects. You'll probably still have to do some manual statistics maintenance, but less.

Subject: Statistics after a restart
Posted by: Grant Fritchey (view profile)
Posted on: Tuesday, April 23, 2013 at 9:13 AM
Message: No. No need. Statistics are stored within the database. They're like data in the database. If it's still there after a restart, so will the statistics be. Nothing in the process of shutting down or starting up a server or attaching/detaching a database does anything to statistics. You procedure cache gets completely nuked though and so you'll see all new plans. If the stats are out of date those plans might not be optimal, but the cause of the change is the cache, not the statistics.

Subject: Thanks!
Posted by: RossM (view profile)
Posted on: Tuesday, June 25, 2013 at 11:11 AM
Message: I saw your talk in Birmingham last week on Parameter Sniffing where you mentioned statistics, which led me onto this article.

I've come away knowing a lot more about it than I had when I began reading, thanks a lot!

Going to go set up some scheduled updates right now :)

Subject: Stats Update
Posted by: Grant Fritchey (view profile)
Posted on: Tuesday, June 25, 2013 at 10:27 PM
Message: Good. Glad to hear it. I hope both the talk and this article are useful to you.

Subject: Video on Statistics
Posted by: kutbuddin (view profile)
Posted on: Thursday, September 18, 2014 at 4:10 AM
Message: This article is great, here is another video which i like the most giving more clarity on statistics.

Simple-Talk Database Delivery

Patterns & Practices Library

Visit our patterns and practices library to learn more about database lifecycle management.

Find out how to automate the process of building, testing and deploying your database changes to reduce risk and make rapid releases possible.

Get started

Phil Factor
How to Build and Deploy a Database from Object-Level Source in a VCS

It is easy for someone who is developing a database to shrug and say 'if only my budget would extend to buying fancy... Read more...

 View the blog

Top Rated

Clone, Sweet Clone: Database Provisioning Made Easy?
 One of the difficulties of designing a completely different type of development tool such as SQL Clone... Read more...

Database Lifecycle Management: Deployment and Release
 So often, the unexpected delays in delivering database code are more likely to happen after the... Read more...

SQL Server Security Audit Basics
 SQL Server Server Audit has grown in functionality over the years but it can be tricky to maintain and... Read more...

The PoSh DBA: Assigning Data to Variables Via PowerShell Common Parameters
 Sometimes, it is the small improvements in a language that can make a real difference. PowerShell is... Read more...

Issue Tracking for Databases
 Any database development project will be hard to manage without a system for reporting bugs in the... Read more...

Most Viewed

Beginning SQL Server 2005 Reporting Services Part 1
 Steve Joubert begins an in-depth tour of SQL Server 2005 Reporting Services with a step-by-step guide... Read more...

Ten Common Database Design Mistakes
 If database design is done right, then the development, deployment and subsequent performance in... Read more...

Temporary Tables in SQL Server
 Temporary tables are used by every DB developer, but they're not likely to be too adventurous with... Read more...

Concatenating Row Values in Transact-SQL
 It is an interesting problem in Transact SQL, for which there are a number of solutions and... Read more...

SQL Server Index Basics
 Given the fundamental importance of indexes in databases, it always comes as a surprise how often the... Read more...

Why Join

Over 400,000 Microsoft professionals subscribe to the Simple-Talk technical journal. Join today, it's fast, simple, free and secure.