12 February 2013

SQL Server Statistics Questions We Were Too Shy to Ask

If you need to optimise SQL Server performance, it pays to understand SQL Server Statistics. Grant Fritchey answers some frequently-asked questions about SQL Server Statistics: the ones we somehow feel silly asking in public, and think twice before doing so.

Try as I might, I find it hard to over-emphasize the importance of statistics to SQL Server. Bad or missing statistics lead to poor choices by the optimizer: the result is horrific performance. Recently, for example, I watched a query go from taking hours to taking seconds just because we got a good set of statistics on the data. The topic of statistics and their maintenance is not straightforward. Lots of questions occur to people when I’m doing a presentation about statistics, and some get asked in front of the rest of the audience. Then there are other questions that get asked later on, in conversation. Here are some of those other questions….

What’s the difference between statistics on a table and statistics on an index?

There is no essential difference between the statistics on an index and the statistics on a table. They’re created at different points and, unless you’re creating the statistics manually yourself, they’re created slightly differently. The statistics on an index are created with the index. So, for an index created on pre-existing data, you’ll get a full scan against that data as part of the task of creating the index, which is also then used to create the statistics for the index. The automatically-created statistics on columns are also usually created against existing data when the column is referenced in one of the filtering statements (such as WHERE or JOIN … ON). But these are created using sampled data, not a full scan. Other than the source and type of creation, these two types of statistics are largely the same.
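As a quick illustration against AdventureWorks2012 (the index name and the search value here are my own inventions, and AUTO_CREATE_STATISTICS is assumed to be ON):

  -- Creating an index on existing data builds its statistics from a full scan
  CREATE INDEX IX_Address_City_Example ON Person.Address (City);

  -- Filtering on a column with no index prompts SQL Server to auto-create
  -- a sampled column statistic on AddressLine2
  SELECT AddressLine1
  FROM Person.Address
  WHERE AddressLine2 = N'Unit 9';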

How can the Query Optimizer work out how many rows will be returned from looking at the statistics?

This is the purpose of the histogram within the statistics. Here’s an example histogram from the Person.Address table in AdventureWorks2012:

[Figure: histogram output from DBCC SHOW_STATISTICS for Person.Address, showing the RANGE_HI_KEY, RANGE_ROWS, EQ_ROWS, DISTINCT_RANGE_ROWS and AVG_RANGE_ROWS columns]

If I were doing a search for the address ‘1313 Mockingbird Lane’ then the query optimizer is going to look at the histogram to determine two things:

  • Is it possible that this set of statistics contains this value?
  • If it does contain the value, how many rows are likely to be returned?

The RANGE_HI_KEY column shows the top of a set of data within the histogram, so a search for ‘1313’ would place it within the step represented by row ten, which has a RANGE_HI_KEY value of ‘137 Lancelot Dr.’ That answers the first question. The optimizer will then look at AVG_RANGE_ROWS to determine the average number of rows that match any given value within the step. So in this case, the optimizer will assume that 1.623188 rows are returned for the value. But these are all estimates; in reality, the database doesn’t contain the value ‘1313 Mockingbird Lane’ at all.
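You can look at the histogram for yourself with DBCC SHOW_STATISTICS; a minimal sketch against AdventureWorks2012 (substitute the name of your own index or statistics):

  DBCC SHOW_STATISTICS ('Person.Address',
      'IX_Address_AddressLine1_AddressLine2_City_StateProvinceID_PostalCode')
  WITH HISTOGRAM;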

When data changes, SQL Server will automatically maintain the statistics on indexes that I explicitly create, if that setting is enabled. Does it also maintain the statistics automatically created on columns?

As data changes in your tables, the statistics – all the statistics – will be updated according to the following thresholds:

  • When a table with no rows gets a row
  • When a table has 500 rows or fewer, and 500 rows are changed
  • When a table has more than 500 rows, and 500 rows plus 20% of the total are changed

By ‘change’ we mean a row being inserted, updated or deleted. So, yes, even the automatically-created statistics get updated and maintained as the data changes.
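If you’re on a build that supports it (SQL Server 2008 R2 SP2 or SQL Server 2012 SP1 onwards, as a reader points out in the comments below), you can watch the modification counter behind these thresholds with the sys.dm_db_stats_properties DMV; a sketch:

  SELECT s.name AS statistics_name,
         sp.last_updated,
         sp.rows,
         sp.modification_counter  -- rows changed since the last update
  FROM sys.stats AS s
  CROSS APPLY sys.dm_db_stats_properties(s.object_id, s.stats_id) AS sp
  WHERE s.object_id = OBJECT_ID('Person.Address');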

Where are the statistics actually stored? Do they take up much space? Can I save space by only having the essential ones?

The statistics themselves are stored within your database in a series of internal tables. You can view some of the information about them using the system views sys.stats and sys.indexes, but the most detail comes from the command DBCC SHOW_STATISTICS. The statistics take very little space. The header is a single row of information. The density vector has only three columns, with one row for each of the key columns that define the statistic. Then you have the histogram, which contains at most 200 steps. This means statistics do not require much room at all. While you can save a little bit of space by removing unneeded statistics, the savings are too small to ever be worthwhile.
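To see which statistics exist on a table, and whether each one was auto-created on a column or belongs to an index, you can query sys.stats; a minimal sketch:

  SELECT s.name AS statistics_name,
         s.auto_created,   -- 1 = created automatically for a column
         s.user_created    -- 1 = created by an explicit CREATE STATISTICS
  FROM sys.stats AS s
  WHERE s.object_id = OBJECT_ID('Person.Address');
  -- When both flags are 0, the statistics were created with an index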

How do I know when stats were last updated?

You can look at the header using DBCC SHOW_STATISTICS. This contains a bunch of general information about the statistics in question. This example is from the Person.Address table:

[Figure: header output from DBCC SHOW_STATISTICS for Person.Address, including the Updated date]

As you can see, the last time the statistics were updated was January 4, 2013 at 7:01AM.
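If you’d rather query the date than read DBCC output, the STATS_DATE() function returns the same information; a sketch:

  SELECT s.name AS statistics_name,
         STATS_DATE(s.object_id, s.stats_id) AS last_updated
  FROM sys.stats AS s
  WHERE s.object_id = OBJECT_ID('Person.Address');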

How reliable is the automated update of statistics in SQL Server?

You’d have to define what you mean by reliable. They are very reliable. They’re also sampled, and automated to update on the thresholds that we outlined earlier. If the data in your system is fairly well distributed, as is usually the case, then the sampled statistics will work well for you. By ‘well distributed’ I mean that you’ll get a consistent view of all the available data by pulling just a sample of it. Most systems, most of the time, will have reasonably well distributed data. But almost every system I’ve worked with has exceptions. There always seems to be that one rogue table, or that one index, with a very weird distribution of data, or one that gets modified frequently, but not frequently enough to trigger an automatic update of the statistics. In these situations the statistics can get stale or inaccurate. But the problem isn’t the automated process; it’s that this data is skewed. These are the situations where you’ll need to take manual control of your statistics. This should be an exceptional event in most systems.

Are there any scripts or tools that will help me maintain statistics?

SQL Server provides two basic commands to help you to maintain your statistics, sp_updatestats and UPDATE STATISTICS. Sp_updatestats will look at all the statistics within a database to see if any rows have been modified in the table that the statistics support. If one or more rows have been modified, then you’ll get a sampled update of the statistics. UPDATE STATISTICS will update the statistics in the way that you specify, against the object that you specify; whether it is a table, an index, or a specific set of statistics.
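As a sketch of both commands, with Person.Address standing in as an example target:

  -- Sampled update of every statistics object in the database that has had
  -- at least one row modified
  EXEC sp_updatestats;

  -- Full-scan update of all the statistics on a single table
  UPDATE STATISTICS Person.Address WITH FULLSCAN;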

Do we update statistics before or after we rebuild/reorganize indexes?

Just remember that, when you create an index, part of the task of creating that index is to create the statistics using a full scan of the data. Since a rebuild of an index is effectively a ‘drop and recreate’ of the index, the statistics are also recreated. I’ve frequently seen maintenance scripts that rebuild indexes and then update statistics using sp_updatestats. This basically replaces a good set of statistics based on a full scan with a less-accurate set of statistics based on random sampling. So if you are rebuilding indexes, I would not recommend updating the statistics since this is extra work for a less effective statistic.

Now when you reorganize an index, no modifications of any kind are made to the statistics. So if you’ve had enough modifications to the data that you feel you should also update the statistics, go ahead and do it for indexes that you’ve reorganized because you won’t be hurting any other work.

If you absolutely must update the statistics and rebuild the indexes and you’re not going to try to tightly control exactly which indexes and tables you do this on, then the best practice would be to update the statistics first and then rebuild the indexes second.
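A sketch of that ordering for a single table (Person.Address is just an example):

  -- Sampled update first: this covers the column statistics that the
  -- rebuild will not touch
  UPDATE STATISTICS Person.Address;

  -- Rebuild second: this recreates the index statistics from a full scan,
  -- overwriting the sampled versions for the indexes only
  ALTER INDEX ALL ON Person.Address REBUILD;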

Is using UPDATE STATISTICS WITH FULLSCAN the same as the statistics update that happens during an index rebuild?

Yes. The key phrase that you have to look at is WITH FULLSCAN. That indicates that the full data set is used to create or update the statistics. Since rebuilding an index is, to a degree, almost the same as recreating the index, you get a full scan of the data set as part of putting the index back together. This full scan also updates the statistics. If you run UPDATE STATISTICS WITH FULLSCAN, once again, it’s looking at all the data.

How do you determine when you need to manually update statistics?

This is one of the hardest questions about statistics to answer, because there is no hard-and-fast formula. The short answer is that you need to determine whether the statistics represent the distribution of the data accurately enough, and update them if they don’t. You need to update your statistics manually when the automatic processes are either not occurring frequently enough to provide you with a good set of statistics, or when the sampled nature of the automatic updates is causing your statistics to be inaccurate. The only way to know whether you need to do these things is to track the behavior of queries within your system in order to spot when the optimizer starts making bad choices. This is probably because the statistics no longer reflect the data in a way that leads to good execution plans. It’s very easy to say, but a lot of work to do.

How often should I run a manual update of my statistics?

The answer to that depends on many circumstances. I’ve worked on systems that never needed any manual intervention on the statistics at all. At other times I’ve experienced a problem where we were running a manual update of the statistics every two minutes (that was a horrifically broken database and indexing design, not a pattern to be emulated). More usually, you’ll be dealing with a table that gets lots of inserts, maybe with an index on a datetime column or an identity column, so that the new data falls outside the range of the statistics, but there’s not enough activity to cause the automatic update to fire. Then you’ll need to update statistics manually on a scheduled basis.

But how often? Often enough. You’ll need to determine that period based on your system and your circumstances. There’s no formula I can provide that will tell you precisely when to run a manual update on your statistics. Nor can I tell you how to sample your statistics. The default random sampling may be perfect for you, or you may need a more specific degree of sampling, right up to a full scan. You’ll need to experiment to understand what the right answer is in your circumstances.

Is there a way to change the sample rate for particular tables in SQL Server?

You can define how statistics sample the table when you create or update them. You can specify either the number of rows to be sampled or the percentage of the table to be sampled. You can even specify that 0 percent or 0 rows be sampled. This will update your statistics, but with no actual statistics data; it’s probably a dangerous choice. If you use sp_updatestats or UPDATE STATISTICS, you can specify, with the RESAMPLE option, that the sample rate you used when you created the statistics should be reused. If you controlled the sample rate directly, the next time you use RESAMPLE, that sample rate will be used again. This includes statistics created on columns and on indexes.
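For example, assuming Person.Address as the target (pick rates that suit your own data):

  -- Sample a percentage of the table
  UPDATE STATISTICS Person.Address WITH SAMPLE 50 PERCENT;

  -- Or sample a fixed number of rows
  UPDATE STATISTICS Person.Address WITH SAMPLE 10000 ROWS;

  -- Or reuse whatever sample rate was used last time
  UPDATE STATISTICS Person.Address WITH RESAMPLE;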

One point worth noting is that the automatic update of statistics will be a sampled update. If you have a very specific need for a particular sample rate on statistics, and there’s a good chance that the automatic maintenance could tax your system too much, you might consider turning off the automatic update for that table or set of statistics. You can do this by specifying the NORECOMPUTE option when you either UPDATE or CREATE STATISTICS; when creating an index, the equivalent is the STATISTICS_NORECOMPUTE option. But if you do this, and the data is changing, you need to plan for manual updates to those statistics.
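A sketch of both variants (the index name here is my own invention):

  -- Full scan now, then opt these statistics out of automatic updates
  UPDATE STATISTICS Person.Address WITH FULLSCAN, NORECOMPUTE;

  -- The equivalent option when creating an index
  CREATE INDEX IX_Address_PostalCode_Manual
  ON Person.Address (PostalCode)
  WITH (STATISTICS_NORECOMPUTE = ON);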

Can you create a set of statistics in SQL Server like you do in Oracle?

Oracle allows you to create custom statistics all the way down to creating your own histogram. SQL Server doesn’t give you that much control. However, you can create something in SQL Server that doesn’t exist in Oracle: filtered statistics. These are extremely useful when dealing with partitioned data, or data that is wildly skewed due to wide-ranging values or lots of nulls. Using AdventureWorks2012 as an example, I could create a set of statistics on multiple columns, such as TaxAmt and CurrencyRateID, in order to have a denser, more unique value for the statistics than would be created by the optimizer on each of the columns separately. The code to do that looks something like this (the statistics name and the filter are illustrative):
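  -- The statistics name and the filter predicate here are illustrative
  CREATE STATISTICS TaxAmtByCurrency
  ON Sales.SalesOrderHeader (TaxAmt, CurrencyRateID)
  WHERE CurrencyRateID IS NOT NULL
  WITH FULLSCAN;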

This may help the optimizer to make better choices when creating the execution plan, but you’ll need to test it in any given setting.

Can you have statistics on a View?

No, and yes. No, your basic view is nothing but a query. That query doesn’t have statistics. It’s the same as a SELECT query inside a stored procedure or a batch statement. But, you can create a construct called an indexed view, or materialized view, which is a clustered index based on the query that defines the view. That’s an index, so it gets statistics just like any other index. Further, if you run queries that reference the columns in a filtering clause in the new clustered index, statistics can be created on those columns. While this is, strictly speaking, not the same as statistics on a view, it’s as close as you can get within SQL Server.
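For illustration, here’s a minimal sketch of an indexed view in AdventureWorks2012 (the view and index names are my own):

  -- The view must be schema-bound before it can be indexed
  CREATE VIEW Person.vAddressCountsByCity
  WITH SCHEMABINDING
  AS
  SELECT City,
         COUNT_BIG(*) AS NumberOfRows  -- COUNT_BIG is required with GROUP BY
  FROM Person.Address
  GROUP BY City;
  GO

  -- The unique clustered index materializes the view; it gets statistics
  -- just like any other index
  CREATE UNIQUE CLUSTERED INDEX IX_vAddressCountsByCity
  ON Person.vAddressCountsByCity (City);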

Are statistics created on temporary tables?

Yes. The major difference between a table variable and a temporary table is that a temporary table has statistics. The rules for the creation and maintenance of these statistics are exactly the same as for a regular table within SQL Server. So if you reference a column in a temporary table in a filtering command in T-SQL, such as WHERE or JOIN, then a set of statistics will get created. Unfortunately, the creation of the statistics causes a statement recompile. This is a potential disadvantage of temporary tables: for small statements it’s a cheap operation, but for larger queries it can be very expensive. That’s why you have to be careful about how you work with your temporary tables.
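A quick sketch of that behavior (the temp table name is my own; run against AdventureWorks2012):

  CREATE TABLE #Addr (City nvarchar(30) NOT NULL);

  INSERT INTO #Addr (City)
  SELECT City FROM Person.Address;

  -- Filtering on City prompts auto-created statistics on the temp table,
  -- along with the statement recompile mentioned above
  SELECT COUNT(*) FROM #Addr WHERE City = N'Seattle';

  -- The statistics live in tempdb alongside the table
  SELECT name FROM tempdb.sys.stats
  WHERE object_id = OBJECT_ID('tempdb..#Addr');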

How does partitioning affect the statistics created by SQL Server?

Partitioning within SQL Server is defined through the creation of a clustered index. This clustered index has a full set of statistics based on all the data in the partition. If we’re talking about very large partitions and very large amounts of data, then there’s a good chance that the set of statistics may not be terribly accurate for the queries within your system. Oh, it will help you determine which partition to use very accurately, but within the partition you may still be seeing scans where a seek is possible. In order to help the optimizer, it’s a very good idea to create a set of manual statistics on each partition. This ensures that the data distribution of the partition is available to the optimizer to help it with making a good choice of query plan when executing your queries. For some additional details, read this overview from the SQL Server Customer Advisory Team (SQLCAT).
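As a sketch of the idea, a hypothetical sales table partitioned by order date could get one filtered statistics object per partition range, along these lines (every name here is illustrative):

  CREATE STATISTICS Stats_Orders_2012
  ON dbo.Orders (OrderDate)
  WHERE OrderDate >= '20120101' AND OrderDate < '20130101'
  WITH FULLSCAN;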

What kind of statistics are provided for SQL Server through a linked server?

None. The data on a linked server will have whatever statistics are provided by the database you’re connecting to through the linked server, on that server, but it won’t pass any statistics back to your system. If you need statistics on the data in the linked server then you’ll need to load that data into a table or temporary table (not a table variable) and create indexes and/or statistics on that table.
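Something along these lines, where every name is hypothetical:

  -- Pull the remote data into a local temporary table...
  SELECT OrderID, OrderDate, TotalDue
  INTO #RemoteOrders
  FROM RemoteSrv.SalesDB.dbo.Orders;  -- four-part linked server name

  -- ...then index it so the local optimizer has real statistics to work with
  CREATE INDEX IX_RemoteOrders_OrderDate ON #RemoteOrders (OrderDate);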

Can statistics be imported/exported?

Yes. If you look at the "Generate and Publish Scripts" wizard for a database, it is possible to script out not only the database, but also the statistics that define it. Within SQL Server Management Studio (SSMS), right-click the database in question, select "Tasks" from the context menu and then "Generate Scripts" from the sub-menu. This will launch the wizard. You can choose the objects you’re interested in and then click Next until you get to the "Set Scripting Options" step. Here, you want to click the ‘Advanced’ button. Scroll down and you’ll find the option "Script Statistics." You can choose to script out just the base statistics or to include the histogram, as you can see selected below.

[Figure: the "Script Statistics" option in the Generate Scripts wizard, with statistics and histograms selected]

You can then output the scripts. All the objects you selected are then generated out to a script which you can use to create a new database. If you take a look at the script, you can see one of two commands, either an UPDATE STATISTICS command for objects like indexes where a set of statistics are automatically created, or CREATE STATISTICS for sets of statistics that you created manually or were created automatically for you by SQL Server. Each of these has an additional option defined WITH STATS_STREAM and then a binary value:

[Figure: a generated script showing the WITH STATS_STREAM option followed by a binary value]

While SQL Server can generate this binary information, you can’t. If you look for the documentation for STATS_STREAM you won’t find it. Obviously, you can use this feature since it’s supplied by Microsoft. You can even generate your own STATS_STREAM value by using DBCC SHOW_STATISTICS WITH STATS_STREAM. But the documentation there reminds us: Not supported. Future compatibility is not guaranteed. So exercise caution when using this.
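For example, to see the stream for a single statistics object (the index name is from AdventureWorks2012):

  -- Undocumented, and future compatibility is not guaranteed
  DBCC SHOW_STATISTICS ('Person.Address', 'PK_Address_AddressID')
  WITH STATS_STREAM;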

Conclusion

Dealing with statistics is definitely one of the more frustrating tasks of maintaining your SQL Server instances. However, as you can see, you have quite a large number of options to get your statistics optimized on your server. Remember that, just as your data is not always generic, there is no generic solution for the maintenance of statistics. You’ll need to tailor the solution you use to the data and the queries that are running on your system.


Grant Fritchey


  • ChrisPage

    sp_updatestats
    You mention that sp_updatestats will invalidate stats generated during an index rebuild but I was under the impression that sp_updatestats only worked on stats that need updating?

  • Grant Fritchey

    re: sp_updatestats
    All it takes is a single row to be changed in any way and sp_updatestats will modify the statistics. Depending on your system, if you run a rebuild and then run sp_updatestats, nothing may happen, or, it’ll take those stats created from a full scan and then replace them with sampled statistics.

  • mbourgon

    Temp Tables vs Table Variables…
    Grant, great article. However, I was under the impression that both Table Variables and Temp Tables have statistics, but that the stats count for Table Variables was 1 unless you did a recompile after loading it.

  • mbourgon

    @Chrispage
    From the BOL: "sp_updatestats updates only the statistics that require updating based on the rowmodctr information in the sys.sysindexes catalog view". That should mean that it uses the standard 20% rule.

    However, the code for sp_updatestats looks like it does everything (since rowmodctr counts the number of rows changed).

    "IF ( ( @ind_rowmodctr <> 0 )
    OR ( ( @is_ver_current IS NOT NULL )
    AND ( @is_ver_current = 0 )"

  • Grant Fritchey

    sp_updatestats
    As mbourgon has already pointed out, you can look at the code for the procedure, and it looks for a single row of changes. One or more is all it takes.

  • Grant Fritchey

    Table Variables
    These do not have statistics. Not of any kind. Not ever. There is documentation for it all over MSDN. Do a search for table variables on this article: http://msdn.microsoft.com/en-us/library/dd535534(v=SQL.100).aspx It was the second entry. You can also see the same thing in the entry in MSDN on statistics: http://msdn.microsoft.com/en-us/library/ms190397.aspx Again, search for table variable. It’s the first entry when I did the search.

    Recompiles will not give them updated statistics. They just don’t have statistics. This is on purpose. They provide a mechanism that won’t cause recompiles in areas where that may be an issue. If you find that temp tables are causing excessive recompiles, you can experiment with using table variables. Just know that their lack of statistics could radically change the execution plans used to access the data within them.

  • Steven Willis

    Synchronous vs Asynchronous
    Good article Grant. You didn’t mention that the default update statistics mode is synchronous. This can cause things to bog down if a table is getting a lot of changes. Perhaps you could discuss the advantages and/or disadvantages of auto-updating statistics asynchronously?


  • srutzky

    Rebuilding a Partitioned Index is no longer a FULL SCAN stats update
    One thing that should be mentioned is that as of SQL Server 2012, doing a REBUILD of an index that is partitioned will NO LONGER include a stats update equivalent to FULLSCAN. This is noted in the MSDN page for ALTER INDEX in the Remarks section under "Rebuilding Indexes": http://msdn.microsoft.com/en-us/library/ms188388.aspx

    I have tested this on tables with 1 million to 1 billion rows and the sample rate is less than 1%. You will need to do a manual UPDATE STATS.

  • Grant Fritchey

    Synchronous vs. Asynchronous
    That would be a good additional question to answer. If we update this I’ll absolutely put it in there. Thanks for the suggestion.

  • buddylee17

    Bad DB Design and Manual Stat Updates
    Great topic Grant!

    When index maintenance is regularly occurring, I feel there’s a common link between bad design and the necessity to manually update statistics.

    Unfortunately, many DBAs are trained that high fragmentation is the only indication that there’s an issue. Typically, bad database design goes hand in hand with bad SQL as well. I’ve seen designs that my 6 year old could do better. All the index maintenance in the world wouldn’t keep that ship afloat, as those table scans, hash joins, and sorts need very fresh stats and ad-hoc plans.

    A few comments on partitioning:

    Your section assumes that all tables are B-trees. I might replace all references to "clustered index" with "HOBT" as Heaps are fairly common on partitioned tables.

    I’d also recommend manually updating statistics when switching partitions, as the switch doesn’t always (actually I’ve never seen it) trigger the stats update on the table. You’ll switch 50 million rows into a table, but plans from queries on that partition will show an estimated row count of 1.

    Great article! Stats are fundamental to query performance.

  • Yaniv Etrogi

    DMV: sys.dm_db_stats_Properties + trace flag 2371
    Great article,
    I would like to point out the new DMV, sys.dm_db_stats_properties. Starting with SQL Server 2008 R2 SP2 and SQL Server 2012 Service Pack 1, this DMV was introduced to more accurately track the number of rows changed in a table. You can use it to detect the number of rows changed in a table and decide if you wish to update statistics manually.

    In addition be aware of trace flag 2371:
    http://blogs.microsoft.co.il/yaniv_etrogi/2011/09/26/trace-flag-2371/

  • Jagdev

    Create Date
    I am looking for a query that shows when my statistics were created, especially hypothetical stats.

  • Grant Fritchey

    Create Date
    The date that statistics were created is a part of the stats and is listed in DBCC SHOW_STATISTICS. You can also see this in the DMVs around stats.

  • Jagdev

    Create Date
    But this gives the stats update date only with DBCC SHOW_STATISTICS; however, I am looking for information regarding the creation date.

  • Grant Fritchey

    Create Date
    Since stats are effectively recreated each and every time they get updated, the update date is the create date. An actual create date wouldn’t really mean anything for statistics. If it’s index level statistics, the create date is the same as the index itself, but, the stats have been recreated X number of times since then. There isn’t a create date that is only for stats.

  • Jagdev

    Create Date
    If this is the scenario for the stats create date, then I am not convinced by the MS approach.

    As per my understanding, stats are like other objects, so in case we modify/alter anything for SPs or tables, the create date remains the same.

    Anyhow, thanks for your answer. Hope MS can clarify it further.

  • Grant Fritchey

    Create Date
    Well, stats aren’t exactly objects. If you look at the system views, there is no object_id for stats. They belong to an object, but don’t have listings as objects. I suspect that Microsoft just hasn’t seen the need for a separate create date. The use case has to be pretty limited. I’ve only ever cared when they were last updated in comparison to the changes to the data within the tables.

  • Jagdev

    Create Date
    The problem is that I have some stats I created manually, and still hypothetical stats are coming up. So I just want to segregate when the hypothetical stats were created, so that I can get an exact idea of whether it was after my manual creation or before my script execution.

    I know you may be thinking this requirement is unnecessary; however, I have this requirement as of now.