01 September 2011

Statistics on Ascending Columns

It comes as rather a shock to find out that one of the commonest circumstances in an OLTP database, an ascending primary key with most querying on the latest records, can throw the judgement of the Query Optimiser to the extent that performance nose-dives. Fabiano once again puts on snorkel and goggles to explore the murky depths of execution plans to find out why.

First of all, a warning:

Much of what you will see here is ‘undocumented’. That means we can discuss it, but you can’t rely on it staying the same in a service pack or a new product release. You will also not be able to call Microsoft support for help or advice on the topic, or if you get an error when using the following techniques. Use with caution and with extra attention.

Introduction

A very common problem relating to distribution statistics is associated with what we call “ascending value columns” in a table. This generally happens when a large table has ascending values and the most recent rows are the ones most commonly being accessed. When data values in a column ascend, most new insertions are beyond the range covered by the distribution statistics. This can lead to poorly performing plans because, according to the statistics, a filter selecting only the recent data appears to exclude the entire table.

As we already know, a column which belongs to a statistic must accumulate a threshold number of modifications since the previous update before an automatic statistics update is triggered, and in certain cases this threshold is too high. In other words, a table column requires too many modifications before the distribution statistics are rebuilt.

This delay in auto-updating a statistic can be a problem for queries that are querying the newest data in the table. That is very likely to happen when we are using date or identity columns. As you know, the use of an outdated statistic can be very bad for performance because SQL Server will not be able to estimate precisely the number of rows a table will return, and is likely to use a poor execution plan.

To simulate this problem I’ll start by creating a table called Pedidos (which means Orders in Portuguese) and I’ll make the identity column a primary key.
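The original code appears only as screenshots, so here is a minimal sketch of the setup. The exact column list, names, and row generator are assumptions; the article only names the identity primary key and the order-date column Data_Pedido, and later relies on a non-clustered index on that date column:

```sql
USE tempdb;
GO
-- Demo table: identity primary key plus an order-date column.
-- The Valor column and the index name are assumptions for illustration.
CREATE TABLE dbo.Pedidos
(
    ID_Pedido   INT IDENTITY(1, 1) NOT NULL,
    Data_Pedido DATETIME           NOT NULL,
    Valor       NUMERIC(18, 2)     NOT NULL,
    CONSTRAINT PK_Pedidos PRIMARY KEY CLUSTERED (ID_Pedido)
);
GO
-- Non-clustered index on the order date; this is what makes the
-- seek + key lookup plan discussed later possible.
CREATE NONCLUSTERED INDEX ix_Data_Pedido ON dbo.Pedidos (Data_Pedido);
GO
-- Populate with historical orders spread over the past year.
INSERT INTO dbo.Pedidos (Data_Pedido, Valor)
SELECT DATEADD(DAY, -(v.number % 365), GETDATE()), 10.00
  FROM master.dbo.spt_values AS v
 WHERE v.type = 'P';
GO
```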

Here is what the data looks like:

[Screenshot: sample rows from the Pedidos table]

As you can see, the orders are added sequentially, which is what usually happens with this sort of table. The column Data_Pedido holds the date of the order.

Now let’s suppose a query that searches for the orders of the day; it would look like the following:

[Screenshot: the query and its execution plan]
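Since the query survives only as a screenshot, here is a reconstruction of what it would roughly look like; the exact predicate and table aliasing are assumptions:

```sql
-- Orders of the current day; the row estimate comes from the
-- distribution statistics, which do not yet cover today's values.
SELECT *
  FROM dbo.Pedidos
 WHERE Data_Pedido >= CONVERT(DATE, GETDATE())
OPTION (RECOMPILE);
```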

At the beginning of the day, SQL Server estimates that 0 rows will be returned (even though the plan shows 1, it is really estimating 0). But what will happen at the end of the day, when thousands of orders have been inserted?

It’s likely that an execution plan using the Key Lookup will not be a good option, since the lookup operation will require a lot of page reads.

Let’s simulate the problem by inserting 5001 new orders into the table. Because this is below the auto-update threshold, it will not trigger an automatic statistics update.
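A sketch of that insert (the row generator is an assumption; any source of 5001 rows with ascending dates will do):

```sql
-- Insert 5001 new orders with ascending dates; 5001 modifications is
-- below the auto-update threshold, so the statistics stay stale.
INSERT INTO dbo.Pedidos (Data_Pedido, Valor)
SELECT TOP (5001)
       DATEADD(SECOND, ROW_NUMBER() OVER (ORDER BY (SELECT NULL)), GETDATE()),
       10.00
  FROM master.dbo.spt_values AS a
 CROSS JOIN master.dbo.spt_values AS b;
```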

Now the table has lots of new orders, and each order was inserted in ascending order. That means the column Data_Pedido is increasing.

Let’s run the same query and check how many pages SQL Server has to read to execute this plan.

[Screenshot: the query, its execution plan and the page reads]
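To reproduce the page-read measurement (my reconstruction; the article shows it only as a screenshot):

```sql
SET STATISTICS IO ON;
-- Same query as before; the Messages tab reports logical reads per table.
SELECT *
  FROM dbo.Pedidos
 WHERE Data_Pedido >= CONVERT(DATE, GETDATE())
OPTION (RECOMPILE);
SET STATISTICS IO OFF;
```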

Note: Notice that I’m using the hint RECOMPILE to avoid query-plan reuse. Also, auto create statistics and auto update statistics are enabled on the tempdb database.

We can see here that the estimation of how many rows will be returned is very wrong. The 5001 insertions were not enough to trigger the auto-updated statistics, and the query optimizer still thinks that the table has no orders greater than GetDate().

What if we force a scan on the base table?

[Screenshot: the query with the FORCESCAN hint and its execution plan]
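The forced-scan version, reconstructed from context (the predicate is the same assumed one as before):

```sql
-- FORCESCAN makes the optimizer scan the base table instead of
-- using the non-clustered index seek plus key lookups.
SELECT *
  FROM dbo.Pedidos WITH (FORCESCAN)
 WHERE Data_Pedido >= CONVERT(DATE, GETDATE())
OPTION (RECOMPILE);
```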

Note: The hint FORCESCAN is new in SQL Server 2008 R2 SP1. You can read more about this hint at http://technet.microsoft.com/en-us/library/ms187373.aspx.

As we can see, a scan on the clustered index requires far fewer page reads: 200 pages for the clustered index versus 10017 for the non-clustered one, plus the lookups on the clustered index.

The estimate is still wrong because the statistics are outdated and therefore don’t represent the reality of what is in the table.

Branding

SQL Server can detect when the leading column of a statistics object is ascending and can mark or “brand” it as ascending. A statistics object whose leading column is ascending is branded “Ascending” after three updates of the statistics, where each update must see newly inserted values above the previous maximum; when the third such update occurs, SQL Server brands the statistics object as ascending.

It’s possible to check the statistics’ brand using trace flag 2388. When it is turned on, it changes the output of DBCC SHOW_STATISTICS so that we can see a column called “Leading column type” containing the brand of the column.

For instance:

[Screenshot: DBCC SHOW_STATISTICS output with Leading column type = Unknown]
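A sketch of the check; the statistics name PK_Pedidos is an assumption, so substitute the name of the statistics object on your ascending column:

```sql
-- With TF 2388 on, DBCC SHOW_STATISTICS adds a "Leading column type" column.
DBCC TRACEON (2388);
DBCC SHOW_STATISTICS ('dbo.Pedidos', PK_Pedidos);
DBCC TRACEOFF (2388);
```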

As we can see, the column now is “Unknown”. Let’s insert 10 rows with ascending orders and update the statistics to see what will happen.

[Screenshot: inserting ten ascending rows and updating the statistics]

As I said before, the statistics have to be updated three times to be branded as ascending, so let’s do it.

[Screenshot: after the third update, Leading column type shows Ascending]
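The whole cycle can be sketched as follows (the statistics name and the row generator are assumptions; the point is three rounds of “insert ascending values, then update the statistics”):

```sql
-- After the third update, the brand should change to Ascending.
DECLARE @i INT = 1;
WHILE @i <= 3
BEGIN
    -- Ten new rows with dates above everything inserted so far.
    INSERT INTO dbo.Pedidos (Data_Pedido, Valor)
    SELECT TOP (10) DATEADD(HOUR, @i, GETDATE()), 10.00
      FROM master.dbo.spt_values
     WHERE type = 'P';

    UPDATE STATISTICS dbo.Pedidos (PK_Pedidos);
    SET @i += 1;
END;

DBCC TRACEON (2388);
DBCC SHOW_STATISTICS ('dbo.Pedidos', PK_Pedidos);  -- check "Leading column type"
DBCC TRACEOFF (2388);
```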

Trace Flags 2389 and 2390

By default, the query optimizer keeps the information about the branding of statistics, but doesn’t make use of it. The optimizer won’t choose a different plan based on whether the column is ascending or not. To change this, you need to use the trace flags 2389 or 2390.

When trace flag 2389 is enabled, the statistics are branded as ascending, and you have a covering index on the ascending leading key of the statistic, the query optimizer will query the table to compute the highest value of the column. This value is then used in the estimation of how many rows will be returned for the predicate.

The trace flag 2390 works similarly to flag 2389, the main difference being that, with this flag set, the query optimizer doesn’t care if the column was branded as ascending or not. In other words, even if the column is marked as “Unknown” it will query the table to find the highest value.

To see this in practice we’ll use an undocumented query hint called QUERYTRACEON. With this query hint we can enable a trace flag at statement scope.

Here we have a command that is using the trace flags 2389 and 2390:

[Screenshot: the query with QUERYTRACEON 2389 and 2390 and its execution plan]
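The statement-level version, reconstructed from context (same assumed predicate as before):

```sql
-- Enable TF 2389 and 2390 for this statement only; the optimizer
-- probes the index for the actual maximum value of Data_Pedido.
SELECT *
  FROM dbo.Pedidos
 WHERE Data_Pedido >= CONVERT(DATE, GETDATE())
OPTION (RECOMPILE, QUERYTRACEON 2389, QUERYTRACEON 2390);
```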

As we can see, SQL Server now has a better estimate, and that was enough to avoid the bad plan that used the key lookup.

Note: Internally, QUERYTRACEON runs the DBCC TRACEON command, so you need sysadmin privileges to use this hint. A good alternative is to wrap the query in a stored procedure and run it as an administrator user.

Note: There is another brand for a statistics object called “Stationary”. The query optimizer will not trigger the query on the table to compute the value if the brand is stationary. In other words, the data in the leading column(s) is not ascending.

Conclusion

Remember that this is just an alternative way to fix a problem that is seen with large tables with ascending columns. The best option is to update your statistics periodically. In this case, it’s just a matter of creating a job that updates the statistics of the tables with ascending columns more frequently, perhaps several times a day. There is no fixed number of updates per day; you’ll need to figure out the appropriate frequency for your scenario. You also have query hints, plan guides and other alternatives to fix this problem.
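The scheduled-job alternative is just a regular statistics refresh, for example (table name from the demo; the FULLSCAN option is my choice, not prescribed by the article):

```sql
-- Job step run several times a day for tables with ascending keys;
-- FULLSCAN is optional but gives the most accurate histogram.
UPDATE STATISTICS dbo.Pedidos WITH FULLSCAN;
```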

Here are some questions I asked myself when I first read about these trace flags.

Are trace flags always good?

Probably not; you have to find the scenario in which to test them, and test, test, test.

Will SQL Server always query the table to find the highest value in the column?

Not always. It depends on the predicate and how you are querying the table.

Can I use the trace flag 2389 alone?

Yes, but it will work only for columns branded as ascending.

Can I use the trace flag 2390 alone?

You can, but it doesn’t make sense to do so, because it will stop working once the column is branded as ascending.

As with everything, you need to test it before you use it and it always depends on the scenario. It’s also worth mentioning that this method is not supported; so don’t call Microsoft if you have any doubts about this, or me either, come to think of it.

That’s all folks, see you.


Fabiano is fascinated by the SQL Server Query Processor and the way it works to optimize queries, procedures and functions. He graduated as a Technical Processor from Colégio Bezerra de Menezes, SP- Brazil, and has worked for several years with SQL Server, focusing in creating Data Warehouses and optimizing T-SQL codes for many companies in Brazil and Argentina. Fabiano is a SQL Server MVP, MCP for SQL Server 2000, MCTS and MCITP Data Base Developer for SQL Server 2005 and 2008. He also is actively involved in SQL Server community though forums such as MSDN and TechNet Brazil, writes articles for Simple-Talk and SQL Server Magazine Brazil, and he also presents online Webcasts and In-Person events for Microsoft Brazil. His blog is on http://blogfabiano.com


  • GrumpyOldDBA

    Cool
I’ve been convinced that ascending keys in tables that have high volumes of inserts could cause performance problems but despite my best efforts I could never find a way to prove it – which is tricky when someone asks you to explain why you want to change a key from ascending to descending.
    I’m very grateful – now I have a reference point next time. Many thanks, very interesting indeed.

  • Creative byte

    Cool
    Hi GrumpyOldDBA,
    The problem is not with the identity or the ascending value but the fact that SQL Server only updates statistics when 20% (by default) of the number of records on the table have been changed.
So if you use an identity with ascending values you can use the trace flags, or you can update statistics more frequently as Fabiano said in the article. If you don’t use ascending values then you can’t use the trace flags, because you don’t know whether the last record inserted is the one with the highest value, so the only solution is to update the statistics.

With ascending values you have two alternatives to solve the problem; without ascending values you have only one.

    And if the index is a clustered index using identity with ascending values the possibility of index fragmentation is very low because all the new records will be inserted at the end and not in the middle.

    “Example 2
    Consider a second table, t2, that has a cardinality of 1,000. For tables with greater than 500 rows, SQL Server will UPDATE STATISTICS when (500 + 20 percent) changes have been made. Doing the math, 20 percent of 1,000 is 200, so you can expect to see AutoStat start after approximately 700 modifications have been made to the table. ”
    from : Statistical maintenance functionality (autostats) in SQL Server
    http://support.microsoft.com/kb/195565/en-us

  • mcflyamorim

    Tks
    Hi GrumpyOldDBA thanks for the comment, I’m glad it will help you :-).

Creative byte, thanks a lot for the comment. Indeed, it only works for ascending columns, because SQL will get the MAX value, so if the data is not sequential it doesn’t make any sense to get the max value. I still think the best solution is to update the statistics of the table as many times as required to keep them really “updated”.
    Regards

  • denalipower

    Thx
    Very helpful article to understand inner workings. I also manually update statistics especially on large tables before it reaches 20%.

  • AlexK

    Common problem
    This is surely a common problem. Alternatively, we can use index covering with Data_Pedido as leading column – that will result in predictable performance even if the optimizer’s estimate is wrong.

  • Robert Matthew Cook

    Good Stuff
    Great details and very informative, thank you Fabiano!

  • Hardik Doshi

    Nice article
    Very nice article. Thanks Fabiano

  • biill-k

    Learned something new
    i definitely learned something new by reading this article. Good job!

  • hanspo

    Another new TF that might help here
    Also, note that in SQL Server 2008 R2 SP1 a new trace flag, 2371, has been added that dynamically reduces the 20% statistics recompute threshold on large tables. See http://blogs.msdn.com/b/saponsqlserver/archive/2011/09/07/changes-to-automatic-update-statistics-in-sql-server-traceflag-2371.aspx

  • mcflyamorim

    .
    Hi hanspo,
Thanks a lot for letting me know about this trace flag; I didn’t know about it, and for sure it is something to be considered in this scenario.

    Regards
    Fabiano

  • Anonymous

    humor
    FORCESCAN may be new to SQL Server 2008 R2 SP1, but baby boys have had it since the beginning.

  • byo

    Amazing as always
Great article (as always), Fabiano! I didn’t know the QUERYTRACEON query hint existed. The good thing is that your example demonstrates the problem in practice. I can say I learned something new.

    @hanspo: Thank you for that information and for the link. Very interesting new flag for large OLTPs.

Best regards,

    Andre Guerreiro Neto