Click here to monitor SSC
  • Av rating:
  • Total votes: 88
  • Total comments: 10
Grant Fritchey

SQL Server Statistics Questions We Were Too Shy to Ask

12 February 2013

If you need to optimise SQL Server performance, it pays to understand SQL Server Statistics. Grant Fritchey answers some frequently-asked questions about SQL Server Statistics: the ones we somehow feel silly asking in public, and think twice about doing so.

Try as I might, I find it hard to over-emphasize the importance of statistics to SQL Server. Bad or missing statistics leads to poor choices by the optimizer: The result is horrific performance. Recently, for example, I watched a query go from taking hours to taking seconds just because we got a good set of statistics on the data. The topic of statistics and their maintenance is not straightforward. Lots of questions occur to people when I’m doing a presentation about statistics, and some get asked in front of the rest of the audience. Then there are other questions that get asked later on, in conversation. Here are some of those other questions….

What’s the difference between statistics on a table and statistics on an index?

There is no essential difference between the statistics on an index and the statistics on a table. They’re created at different points and, unless you’re creating the statistics manually yourself, they’re created slightly differently. The statistics on an index are created with the index. So, for an index created on pre-existing data, you’ll get a full scan against that data as part of the task of creating the index which is also then used to create the statistics for the index. The automatically-created statistics on columns are also usually created against existing data when the column is referenced in one of the filtering statements (such as WHERE or JOIN .. ON). But these are created using sampled data, not a full scan. Other than the source and type of creation, these two types of statistics are largely the same.

How can the Query Optimizer work out how many rows will be returned from looking at the statistics?

This is the purpose of the histogram within the statistics. Here’s an example histogram from the Person.Address table in AdventureWorks2012:

If I were doing a search for the address ‘1313 Mockingbird Lane’ then the query optimizer is going to look at the histogram to determine two things:

  • Is it possible that this set of statistics contains this value
  • If it does contain the value, how many likely rows will be returned

The RANGE_HI_KEY column shows the top of a set of data within the histogram, so a search for ‘1313’ would place it within the step represented by row ten which has a RANGE_HI_KEY value of ‘137 Lancelot Dr.’ So the first question is answered. The optimizer will then look at the AVG_RANGE_ROWS to determine the average number of rows that match any given value within the step. So in this case, the optimizer will assume that there are 1.623188 rows returned for the value. But, these are all estimates. In reality, the database doesn’t contain the value ‘1313 Mockingbird Lane

When data changes, SQL Server will automatically maintain the statistics on indexes that I explicitly create, if that setting is enabled. Does it also maintain the statistics automatically created on columns?

As data changes in your tables, the statistics - all the statistics - will be updated based on the following formula:

  • When a table with no rows gets a row
  • When 500 rows are changed to a table that is less than 500 rows
  • When 20% + 500 are changed in a table greater than 500 rows

By ‘change’ we mean if a row is inserted, updated or deleted. So, yes, even the automatically-created statistics get updated and maintained as the data changes.

Where are the statistics actually stored? Do they take up much space? Can I save space by only having the essential ones?

The statistics themselves are stored within your database in a series of internal tables that include sysindexes. You can view some of the information about them using the system views sys.stats and sys.indexes, but the most detail is gleaned using a function, DBCC SHOW_STATISTICS. The statistics themselves take very little space. The header is a single row of information. The density is a set of rows with only three columns, equal in the number of rows to the number of columns defining the key columns of the statistic. Then you have the histogram. The histogram is up to 200 rows and never exceeds that amount. This means statistics do not require much room at all. While you can save a little bit of space by removing unneeded statistics, the space savings are too small to ever be worthwhile .

How do I know when stats were last updated?

You can look at the header using DBCC SHOW_STATISTICS. This contains a bunch of general information about the statistics in question. This example is from the Person.Address table:

As you can see, the last time the statistics were updated was January 4, 2013 at 7:01AM.

How reliable is the automated update of statistics in SQL Server?

You’d have to define what you mean by reliable. They are very reliable. They’re also sampled and automated to update on the criteria that we outlined in the first question. If the data in your system is fairly well distributed, as is usually the case, then the sampled statistics will work well for you. By ‘well distributed’ I mean that you’ll get a consistent view of all the available data by pulling just a sample of the data. Most systems, most of the time, will have reasonably well distributed data. But, almost every system I’ve worked with has exceptions. There always seems to be that one rogue table or that one index that’s got a very weird distribution of data, or gets updated very frequently, but not frequently enough to trigger an automatic update of the statistics. In this situation, the statistics can get stale or be inaccurate. But the problem isn’t the automated process. It’s that this data is skewed. These are the situations where you’ll need to manually take control of your statistics. This should be an exceptional event in most systems.

Are there any scripts or tools that will help me maintain statistics?

SQL Server provides two basic commands to help you to maintain your statistics, sp_updatestats and UPDATE STATISTICS. Sp_updatestats will look at all the statistics within a database to see if any rows have been modified in the table that the statistics support. If one or more rows have been modified, then you’ll get a sampled update of the statistics. UPDATE STATISTICS will update the statistics in the way that you specify, against the object that you specify; whether it is a table, an index, or a specific set of statistics.

Do we update statistics before or after we rebuild/reorganize indexes?

Just remember that, when you create an index, part of the task of creating that index is to create the statistics using a full scan of the data. Since a rebuild of an index is effectively a ‘drop and recreate’ of the index, the statistics are also recreated. I’ve frequently seen maintenance scripts that rebuild indexes and then update statistics using sp_updatestats. This basically replaces a good set of statistics based on a full scan with a less-accurate set of statistics based on random sampling. So if you are rebuilding indexes, I would not recommend updating the statistics since this is extra work for a less effective statistic.

Now when you reorganize an index, no modifications of any kind are made to the statistics. So if you’ve had enough modifications to the data that you feel you should also update the statistics, go ahead and do it for indexes that you’ve reorganized because you won’t be hurting any other work.

If you absolutely must update the statistics and rebuild the indexes and you’re not going to try to tightly control exactly which indexes and tables you do this on, then the best practice would be to update the statistics first and then rebuild the indexes second.

Is using UPDATE STATISTICS WITH FULL SCAN the same as the statistics update that happens during an index rebuild?

Yes. The keyword and tricky phrase that you have to look at is WITH FULL SCAN. That indicates that the full data set is used to create/update the statistics. Since rebuilding an index is, to a degree, almost the same as recreating the index, you’re getting a full scan of the data set to put the index back together. This full scan also updates the statistics. If you run UPDATE STATISTICS WITH FULL SCAN, once again, it’s looking at all the data.

How do you determine when you need to manually update statistics?

This is one of the hardest questions about statistics to answer, because there is no hard-and-fast formula. The short answer is that you need to determine whether the statistics accurately enough represent the distribution of the data, and update them if it doesn’t. You need to update your statistics manually when the automatic processes are either not occurring frequently enough to provide you with a good set of statistics or because the sampled nature of the automatic updates is causing your statistics to be inaccurate. The only way to know whether you need to do these things is to track the behavior of queries within your system in order to spot when the optimizer starts making bad choices. This is probably because the statistics are not reflecting the data in a way that leads to good execution plans. It’s very easy to say, but a lot of work to do.

How often should I run a manual update of my statistics?

The answer to that depends on many circumstances. I’ve worked on systems that never needed any manual intervention on the statistics at all. At other times I’ve experienced a problem where we were running a manual update of the statistics every two minutes (that was a horrifically broken database and indexing design, not a pattern to be emulated). More usually, you’ll be dealing with a table that gets lots of inserts, maybe with an index on a datetime column, or an identity column, so that the data is outside the range of the statistics, but, there’s not enough activity to cause the automatic update to fire. Then, you’ll need to manually update statistics on a scheduled basis.

But how often? Often enough. You’ll need to determine that period based on your system and your circumstances. There’s not a formula I can provide that will precisely tell you when to run a manual update on your statistics. Further, I can’t tell you how to sample your statistics either. You may be experiencing random sampling that is perfect , or you may need some more specified degree of sampling right up to a full scan. You’ll need to experiment to understand what the right answer is in your circumstances.

Is there a way to change the sample rate for particular tables in SQL Server?

You can define how statistics sample the table when you create them or update them. You can specify either the number of rows to be sampled or the percentage of the table to be sampled. You can even specify that 0 percent or 0 rows be sampled. This will update your statistics, but with no actual statistics data. It’s probably a dangerous choice. If you use sp_update stats or UPDATE STATISTICS you can specify, by using the RESAMPLE command, the sample rate that you specified when you created the statistics should then be reused. If you controlled the sample rate directly, the next time you use RESAMPLE, that sample rate will be used again. This includes statistics created on columns and on indexes.

One point worth noting is that the automatic update of statistics will be a sampled update. If you have a very specific need for a particular sample rate on statistics and there’s a good chance that the automatic maintenance could tax your system too much, you might consider turning off the automatic update for that table or set of statistics. You can do this by specifying NORECOMPUTE option when you either UPDATE or CREATE STATISTICS. You have to use the STATISTICS_NORECOMPUTE option when you create an index. You can specify NORECOMPUTE with a manual update of the statistics. But, if you do this, and the data is changing, you need to plan for manual updates to those statistics.

Can you create a set of statistics in SQL Server like you do in Oracle?

Oracle allows you to create custom statistics all the way down to creating your own histogram. SQL Server doesn’t give you that much control. However, you can create something in SQL Server that doesn’t exist in Oracle; filtered statistics. These are extremely useful when dealing with partitioned data or data that is wildly skewed due to wide ranging data or lots of nulls. Using AdventureWorks2012 as an example, I could create a set of statistics on multiple columns such as TaxAmt and CurrencyRateID in order to have a denser, more unique, value for the statistics than would be created by the optimizer on each of the columns separately. The code to do that looks like this:

CREATE STATISTICS TaxAmtFiltered

ON Sales.SalesOrderHeader (TaxAmt,CurrencyRateID)

WHERE TaxAmt > 1000 AND CurrencyRateID IS NOT NULL

WITH FULLSCAN;

This may help the optimizer to make better choices when creating the execution plan, but you’ll need to test it in any given setting.

Can you have statistics on a View?

No, and yes. No, your basic view is nothing but a query. That query doesn’t have statistics. It’s the same as a SELECT query inside a stored procedure or a batch statement. But, you can create a construct called an indexed view, or materialized view, which is a clustered index based on the query that defines the view. That’s an index, so it gets statistics just like any other index. Further, if you run queries that reference the columns in a filtering clause in the new clustered index, statistics can be created on those columns. While this is, strictly speaking, not the same as statistics on a view, it’s as close as you can get within SQL Server.

Are statistics created on temporary tables?

Yes. The major difference between a table variable and a temporary table is that a temporary table has statistics. The rules for the creation and maintenance of these statistics are exactly the same as for a regular table within SQL Server. So if you reference a column in a temporary table in a filtering command in T-SQL such as WHERE or JOIN, then a set of statistics will get created. Unfortunately, the creation of the statistics causes a statement recompile. This is a potential disadvantage of temporary tables: For small statements this is a cheap operation. For larger queries this can be very expensive. That’s a reason why you have to be careful about how you work with your temporary tables.

How does partitioning affect the statistics created by SQL Server?

Partitioning within SQL Server is defined through the creation of a clustered index. This clustered index has a full set of statistics based on all the data in the partition. If we’re talking about very large partitions and very large amounts of data, then there’s a good chance that the set of statistics may not be terribly accurate for the queries within your system. Oh, it will help you determine which partition to use very accurately, but within the partition you may still be seeing scans where a seek is possible. In order to help the optimizer, it’s a very good idea to create a set of manual statistics on each partition. This ensures that the data distribution of the partition is available to the optimizer to help it with making a good choice of query plan when executing your queries. For some additional details, read this overview from the SQL Server Customer Advisory Team (SQLCAT).

What kind of statistics are provided for SQL Server through a linked server?

None. The data on a linked server will have whatever statistics are provided by the database you’re connecting to through the linked server, on that server, but it won’t pass any statistics back to your system. If you need statistics on the data in the linked server then you’ll need to load that data into a table or temporary table (not a table variable) and create indexes and/or statistics on that table.

Can statistics be imported/exported?

Yes. If you look at the “Generate and Publish Scripts” wizard for a database it is possible to set up a situation where you not only script out the database, but the statistics that define that database as well. Within SQL Server Management Studio (SSMS), right click on the database in question and select “Tasks” from the context menu and then “Generate Scripts” from the sub-menu. This will launch the wizard. You can choose the objects you’re interested in and then click Next until you get to the “Set Scripting Options” step. Here, you want to click the ‘Advanced’ button. Scroll down and you’ll find the option “Script Statistics.” You can select to script out just the base statistics or include the histogram as you can see selected below.

You can then output the scripts. All the objects you selected are then generated out to a script which you can use to create a new database. If you take a look at the script, you can see one of two commands, either an UPDATE STATISTICS command for objects like indexes where a set of statistics are automatically created, or CREATE STATISTICS for sets of statistics that you created manually or were created automatically for you by SQL Server. Each of these has an additional option defined WITH STATS_STREAM and then a binary value:

While SQL Server can generate this binary information, you can’t. If you look for the documentation for STATS_STREAM you won’t find it. Obviously, you can use this feature since it’s supplied by Microsoft. You can even generate your own STATS_STREAM value by using DBCC SHOW_STATISTICS WITH STATS_STREAM. But the documentation there reminds us: Not supported. Future compatibility is not guaranteed. So exercise caution when using this.

Conclusion

Dealing with statistics is definitely one of the more frustrating tasks of maintaining your SQL Server instances: However, as you can see, you have quite a large number of options which will enable you to get your statistics optimized on your server. Just remember that just as your data is not always generic, there is no generic solution for the maintenance of statistics. You’ll need to conform the solution you use to the data and the queries that are running on your system.

Grant Fritchey

Author profile:

Grant Fritchey, SQL Server MVP, works for Red Gate Software as Product Evangelist. In his time as a DBA and developer, he has worked at three failed dot–coms, a major consulting company, a global bank and an international insurance & engineering company. Grant volunteers for the Professional Association of SQL Server Users (PASS). He is the author of the books SQL Server Execution Plans (Simple-Talk) and SQL Server 2008 Query Performance Tuning Distilled (Apress). He is one of the founding officers of the Southern New England SQL Server Users Group (SNESSUG) and it’s current president. He earned the nickname “The Scary DBA.” He even has an official name plate, and displays it proudly.

Search for other articles by Grant Fritchey

Rate this article:   Avg rating: from a total of 88 votes.


Poor

OK

Good

Great

Must read
Have Your Say
Do you have an opinion on this article? Then add your comment below:
You must be logged in to post to this forum

Click here to log in.


Subject: sp_updatestats
Posted by: ChrisPage (view profile)
Posted on: Friday, March 01, 2013 at 2:30 AM
Message: You mention that sp_updatestats will invalidate stats generated during an index rebuild but I was under the impression that sp_updatestats only worked on stats that need updating?

Subject: re: sp_updatestats
Posted by: Grant Fritchey (view profile)
Posted on: Tuesday, March 05, 2013 at 4:10 PM
Message: All it takes is a single row to be changed in any way and sp_updatestats will modify the statistics. Depending on your system, if you run a rebuild and then run sp_updatestats, nothing may happen, or, it'll take those stats created from a full scan and then replace them with sampled statistics.

Subject: Temp Tables vs Table Variables...
Posted by: mbourgon (view profile)
Posted on: Wednesday, March 06, 2013 at 9:14 AM
Message: Grant, great article. However, I was under the impression that both Table Variables and Temp Tables have statistics, but that the stats count for Table Variables was 1 unless you did a recompile after loading it.

Subject: @Chrispage
Posted by: mbourgon (view profile)
Posted on: Wednesday, March 06, 2013 at 9:25 AM
Message: From the BOL: "sp_updatestats updates only the statistics that require updating based on the rowmodctr information in the sys.sysindexes catalog view". That should mean that it uses the standard 20% rule.

However, the code for sp_updatestats looks like it does everything (since rowmodctr counts the number of rows changed).

"IF ( ( @ind_rowmodctr <> 0 )
OR ( ( @is_ver_current IS NOT NULL )
AND ( @is_ver_current = 0 )"

Subject: sp_updatestats
Posted by: Grant Fritchey (view profile)
Posted on: Tuesday, March 12, 2013 at 5:39 AM
Message: As mbourgon has already pointed out. You can look at the code for the procedure and it looks for a single row of changes. One or more is all it takes.

Subject: Table Variables
Posted by: Grant Fritchey (view profile)
Posted on: Tuesday, March 12, 2013 at 5:44 AM
Message: These do not have statistics. Not of any kind. Not ever. There is documentation for it all over MSDN. Do a search for table variables on this article: http://msdn.microsoft.com/en-us/library/dd535534(v=SQL.100).aspx It was the second entry. You can also see the same thing in the entry in MSDN on statistics: http://msdn.microsoft.com/en-us/library/ms190397.aspx Again, search for table variable. It's the first entry when I did the search.

Recompiles will not give them updated statistics. They just don't have statistics. This is on purpose. They are providing a mechanism that won't cause recompiles in areas where that may be an issue. If you find that temp tables are causing excessive recompiles, you can experiment with using table variables. Just know that they're lack of statistics could radically change the execution plans used to access the data within them.

Subject: Synchronous vs Asynchronous
Posted by: Steven Willis (view profile)
Posted on: Tuesday, March 12, 2013 at 2:57 PM
Message: Good article Grant. You didn't mention that the default update statistics mode is synchronous. This can cause things to bog down if a table is getting a lot of changes. Perhaps you could discuss the advantages and/or disadvantages of auto-updating statistics asynchronously?

 

Subject: Rebuilding a Partitioned Index is no longer a FULL SCAN stats update
Posted by: srutzky (view profile)
Posted on: Wednesday, March 20, 2013 at 2:38 PM
Message: One thing that should be mentioned is that as of SQL Server 2012, doing a REBUILD of an index that is partitioned will NO LONGER include a stats update equivalent to FULLSCAN. This is noted in the MSDN page for ALTER INDEX in the Remarks section under "Rebuilding Indexes": http://msdn.microsoft.com/en-us/library/ms188388.aspx

I have tested this on tables with 1 million to 1 billion rows and the sample rate is less than 1%. You will need to do a manual UPDATE STATS.

Subject: Synchronous vs. Asynchronous
Posted by: Grant Fritchey (view profile)
Posted on: Thursday, July 18, 2013 at 4:45 AM
Message: That would be a good additional question to answer. If we update this I'll absolutely put it in there. Thanks for the suggestion.

Subject: Bad DB Design and Manual Stat Updates
Posted by: buddylee17 (view profile)
Posted on: Friday, August 23, 2013 at 5:51 PM
Message: Great topic Grant!

When index maintenance is regularly occurring, I feel there's a common link between bad design and the necessity to manually update statistics.

Unfortunately, many DBA's are trained that high fragmentation is the only indication that there's an issue. Typically, bad database design also goes hand in hand with bad sql as well. I've seen designs that my 6 year old could better. All the index maintenance in the world wouldn't keep that ship afloat as those table scans, hash joins, and sorts need very fresh stats and ad-hoc plans.

A few comments on partitioning:

Your section assumes that all tables are B-trees. I might replace all references to "clustered index" with "HOBT" as Heaps are fairly common on partitioned tables.

I'd also recommend manually updating statistics when switching partitions, as the switch doesn't always (actually I've never seen it) trigger the stats update on the table. You'll switch 50 million rows into a table, but plans from queries on that partition will show an estimated row count of 1.

Great article! Stats are fundamental to query performance.




 

Phil Factor
Searching for Strings in SQL Server Databases

Sometimes, you just want to do a search in a SQL Server database as if you were using a search engine like Google.... Read more...

 View the blog

Top Rated

Searching for Strings in SQL Server Databases
 Sometimes, you just want to do a search in a SQL Server database as if you were using a search engine... Read more...

Continuous Delivery and the Database
 Continuous Delivery is fairly generally understood to be an effective way of tackling the problems of... Read more...

The SQL Server Sqlio Utility
 If, before deployment, you need to push the limits of your disk subsystem in order to determine whether... Read more...

The PoSh DBA - Reading and Filtering Errors
 DBAs regularly need to keep an eye on the error logs of all their SQL Servers, and the event logs of... Read more...

MySQL Compare: The Manual That Time Forgot, Part 1
 Although SQL Compare, for SQL Server, is one of Red Gate's best-known products, there are also 'sister'... Read more...

Most Viewed

Beginning SQL Server 2005 Reporting Services Part 1
 Steve Joubert begins an in-depth tour of SQL Server 2005 Reporting Services with a step-by-step guide... Read more...

Ten Common Database Design Mistakes
 If database design is done right, then the development, deployment and subsequent performance in... Read more...

SQL Server Index Basics
 Given the fundamental importance of indexes in databases, it always comes as a surprise how often the... Read more...

Reading and Writing Files in SQL Server using T-SQL
 SQL Server provides several "standard" techniques by which to read and write to files but, just... Read more...

Concatenating Row Values in Transact-SQL
 It is an interesting problem in Transact SQL, for which there are a number of solutions and... Read more...

Why Join

Over 400,000 Microsoft professionals subscribe to the Simple-Talk technical journal. Join today, it's fast, simple, free and secure.