Fabiano Amorim

Statistics on Ascending Columns

01 September 2011

It comes as rather a shock to find out that one of the commonest circumstances in an OLTP database, an ascending primary key with most querying on the latest records, can throw the judgement of the Query Optimiser to the extent that performance nose-dives. Fabiano once again puts on snorkel and goggles to explore the murky depths of execution plans to find out why.

First of all, a warning:

Much of what you will see here is ‘undocumented’. That means we can discuss it, but you can’t trust that it will not change in a service pack or a new product release. You also won’t be able to call Microsoft support for help or advice on the topic, or if you get an error when using the following techniques. Use with caution and extra attention.

Introduction

A very common problem relating to distribution statistics is associated with what we call “ascending value columns” in a table. This generally happens when a large table has ascending values and the most recent rows are the ones most commonly accessed. When the data values in a column ascend, most new insertions fall beyond the range covered by the distribution statistics. This can lead to poorly performing plans, since the optimizer inaccurately predicts from the statistics that a filter selecting recent data will return no rows at all.

As we already know, a column covered by a statistic has to accumulate a threshold number of modifications since the previous update in order to trigger the auto-update of statistics, and in certain cases this threshold is too high. In other words, a table column requires too many modifications before the distribution statistics are refreshed.

This delay in auto-updating a statistic can be a problem for queries that are querying the newest data in the table. That is very likely to happen when we are using date or identity columns. As you know, the use of an outdated statistic can be very bad for performance because SQL Server will not be able to estimate precisely the number of rows a table will return, and is likely to use a poor execution plan.
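As a rough guide (the exact rule is version-dependent), for a table with more than 500 rows the auto-update fires only after about 500 + 20 percent of the row count modifications, so our 50,000-row example needs roughly 10,500 changes. A quick, unsupported way to watch the pending modifications is the rowmodctr column of the deprecated compatibility view sys.sysindexes; this is just a sketch for inspection, not a monitoring recipe:

```sql
-- Sketch: see how many modifications have accumulated since the last
-- statistics update (rowmodctr is approximate on SQL Server 2005 and later).
SELECT o.name       AS table_name,
       si.name      AS index_name,
       si.rowmodctr AS mods_since_last_stats_update
FROM   sys.sysindexes AS si
JOIN   sys.objects    AS o ON o.object_id = si.id
WHERE  o.name = 'Pedidos';
```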

To simulate this problem I’ll start by creating a table called Pedidos (which means Orders in Portuguese) and I’ll make the identity column a primary key.

USE Tempdb
GO
SET NOCOUNT ON;
GO
IF OBJECT_ID('Pedidos') IS NOT NULL
  DROP TABLE [Pedidos]
GO
CREATE TABLE [dbo].[Pedidos]
    (
      [ID_Pedido] [int] IDENTITY(1, 1) NOT NULL,
      [ID_Cliente] [int] NOT NULL,
      [Data_Pedido] Date NOT NULL,
      [Valor] [numeric](18, 2) NOT NULL,
      CONSTRAINT [xpk_Pedidos] PRIMARY KEY CLUSTERED(ID_Pedido)
    )
GO

CREATE INDEX ix_Data_Pedido ON Pedidos(Data_Pedido)
GO

INSERT INTO Pedidos (ID_Cliente, Data_Pedido, Valor)
SELECT ABS(CONVERT(Int, (CheckSUM(NEWID()) / 10000000))),
       '18000101',
       ABS(CONVERT(Numeric(18,2), (CheckSUM(NEWID()) / 1000000.5)))
GO
-- Inserting 50000 rows in the table
INSERT INTO Pedidos WITH(TABLOCK) (ID_Cliente, Data_Pedido, Valor)
SELECT ABS(CONVERT(Int, (CheckSUM(NEWID()) / 10000000))),
       (SELECT DATEADD(d, 1, MAX(Data_Pedido)) FROM Pedidos),
       ABS(CONVERT(Numeric(18,2), (CheckSUM(NEWID()) / 1000000.5)))
GO 50000

SELECT * FROM Pedidos
GO

Here is what the data looks like:

Data showing sequential orders

As you can see, the orders are added sequentially, which is what usually happens with this sort of table. The order date (column Data_Pedido) is the date of the order.

Now let’s suppose that a query searching for the orders of the day looks like the following:

SET STATISTICS IO ON
SELECT * FROM Pedidos
WHERE Data_Pedido >= Convert(date, GetDate())
OPTION (RECOMPILE)
SET STATISTICS IO OFF

At the beginning of the day, SQL Server estimates that 0 rows will be returned (even though the plan shows 1, it is estimating 0). But what will happen at the end of the day, when thousands of orders have been inserted?

It’s likely that an execution plan using the Key Lookup will not be a good option since the lookup operation will demand a lot of pages to be read.

Let’s simulate the problem by inserting 5001 new orders into the table. Because this is below the auto-update threshold, it will not trigger the automatic update of statistics.

INSERT INTO Pedidos WITH(TABLOCK) (ID_Cliente, Data_Pedido, Valor)
SELECT ABS(CONVERT(Int, (CheckSUM(NEWID()) / 10000000))),
       GetDate(),
       ABS(CONVERT(Numeric(18,2), (CheckSUM(NEWID()) / 1000000.5)))
GO
INSERT INTO Pedidos (ID_Cliente, Data_Pedido, Valor)
VALUES  (ABS(CONVERT(Int, (CheckSUM(NEWID()) / 10000000))),
         (SELECT DateAdd(d, 1, MAX(Data_Pedido)) FROM Pedidos),
         ABS(CONVERT(Numeric(18,2), (CheckSUM(NEWID()) / 1000000.5))))
GO 5000

Now the table has lots of new orders, each inserted in ascending date order; the values in the column Data_Pedido keep increasing.

Let’s run the same query and check how many pages SQL Server has to read to execute this plan.

SET STATISTICS IO ON
SELECT * FROM Pedidos
WHERE Data_Pedido >= Convert(date, GetDate())
OPTION (RECOMPILE)
SET STATISTICS IO OFF
GO

Execution Plan

Note: Notice that I’m using the RECOMPILE hint to avoid query-plan reuse. Also, auto create statistics and auto update statistics are enabled on the tempdb database.

We can see here that the estimate of how many rows will be returned is very wrong. The 5001 insertions were not enough to trigger the auto-update of statistics, and the query optimizer still thinks that the table has no orders later than GetDate().

What if we force a scan on the base table?

SET STATISTICS IO ON
SELECT * FROM Pedidos WITH(FORCESCAN, INDEX(0))
WHERE Data_Pedido >= Convert(date, GetDate())
OPTION (RECOMPILE)
SET STATISTICS IO OFF

Execution Plan

Note: The hint FORCESCAN is new in SQL Server 2008 R2 SP1. You can read more about this hint at http://technet.microsoft.com/en-us/library/ms187373.aspx.

As we can see, the scan on the clustered index requires far fewer page reads: 200 pages for the clustered index versus 10017 for the non-clustered index, plus the lookups into the clustered index.

The estimate is still wrong because the statistics are outdated and therefore don’t represent the reality of what is in the table.

Branding

SQL Server can detect when the leading column of a statistics object is ascending and can mark, or “brand”, it as ascending. A statistics object is branded “ascending” after three updates of the statistics; each update has to be performed with new, ascending column values, so that on the third update SQL Server brands the statistics object as ascending.

It’s possible to check the statistics’ brand using trace flag 2388. When it is turned on, it changes the output of DBCC SHOW_STATISTICS to include a column called “Leading column type” that shows the brand of the column.

For instance:

-- Enable trace flag 2388
DBCC TRACEON(2388)
GO
-- Look at the branding
DBCC SHOW_STATISTICS (Pedidos, [ix_Data_Pedido])
GO
-- Disable trace flag 2388
DBCC TRACEOFF(2388)
GO

Column type unknown

As we can see, the column is currently branded “Unknown”. Let’s insert 10 rows with ascending order dates and update the statistics to see what happens.

-- Insert 10 rows
INSERT INTO Pedidos (ID_Cliente, Data_Pedido, Valor)
VALUES  (ABS(CONVERT(Int, (CheckSUM(NEWID()) / 10000000))),
         (SELECT DateAdd(d, 1, MAX(Data_Pedido)) FROM Pedidos),
         ABS(CONVERT(Numeric(18,2), (CheckSUM(NEWID()) / 1000000.5))))
GO 10
-- Update statistics
UPDATE STATISTICS Pedidos [ix_Data_Pedido] WITH FULLSCAN
GO
DBCC TRACEON(2388)
DBCC SHOW_STATISTICS (Pedidos, [ix_Data_Pedido])
DBCC TRACEOFF(2388)

Showing Null column type

As I said before, the statistics have to be updated three times to be branded as ascending, so let’s do it.

-- Insert 10 rows
INSERT INTO Pedidos (ID_Cliente, Data_Pedido, Valor)
VALUES  (ABS(CONVERT(Int, (CheckSUM(NEWID()) / 10000000))),
         (SELECT DateAdd(d, 1, MAX(Data_Pedido)) FROM Pedidos),
         ABS(CONVERT(Numeric(18,2), (CheckSUM(NEWID()) / 1000000.5))))
GO 10
-- Update statistics
UPDATE STATISTICS Pedidos [ix_Data_Pedido] WITH FULLSCAN
GO
-- Insert 10 rows
INSERT INTO Pedidos (ID_Cliente, Data_Pedido, Valor)
VALUES  (ABS(CONVERT(Int, (CheckSUM(NEWID()) / 10000000))),
         (SELECT DateAdd(d, 1, MAX(Data_Pedido)) FROM Pedidos),
         ABS(CONVERT(Numeric(18,2), (CheckSUM(NEWID()) / 1000000.5))))
GO 10
-- Update statistics
UPDATE STATISTICS Pedidos [ix_Data_Pedido] WITH FULLSCAN
GO
DBCC TRACEON(2388)
DBCC SHOW_STATISTICS (Pedidos, [ix_Data_Pedido])
DBCC TRACEOFF(2388)

Ascending Column Type

Trace Flags 2389 and 2390

By default, the query optimizer keeps the information about the branding of statistics, but doesn’t make use of it. The optimizer won’t choose a different plan based on whether the column is ascending or not. To change this, you need to use the trace flags 2389 or 2390.

When trace flag 2389 is enabled, the statistic is branded as ascending, and there is a covering index on the ascending leading key of the statistic, the query optimizer will query the table to compute the highest value of the column. This value is then used to estimate how many rows will be returned for the predicate.

The trace flag 2390 works similarly to flag 2389, the main difference being that, with this flag set, the query optimizer doesn’t care if the column was branded as ascending or not. In other words, even if the column is marked as “Unknown” it will query the table to find the highest value.

To see this in practice we’ll use an undocumented query hint called QUERYTRACEON. With this query hint we can enable a trace flag at statement scope.

Here we have a command that is using the trace flags 2389 and 2390:

SET STATISTICS IO ON
SELECT * FROM Pedidos
WHERE Data_Pedido >= Convert(date, GetDate())
OPTION(QUERYTRACEON 2389, QUERYTRACEON 2390, RECOMPILE)
SET STATISTICS IO OFF
GO

Execution Plan

As we can see, SQL Server now has a new estimate, and it was enough to avoid the bad plan that was using the key lookup.

Note: Internally, QUERYTRACEON runs the DBCC TRACEON command, so you’ll need sysadmin privileges to use this hint. A good alternative is to wrap the query in a stored procedure and run it as an administrative user.
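One minimal sketch of that workaround, assuming a database user called StatsAdmin mapped to a sysadmin login, and a database configured to allow the impersonated context to carry server-level permissions (for example, TRUSTWORTHY ON); the names and settings here are illustrative assumptions, not a supported recipe:

```sql
-- Hypothetical wrapper: lets a non-sysadmin caller run the query with the
-- trace flags. 'StatsAdmin' is an assumed user mapped to a sysadmin login.
CREATE PROCEDURE dbo.Pedidos_Do_Dia
WITH EXECUTE AS 'StatsAdmin'
AS
BEGIN
    SELECT *
    FROM   Pedidos
    WHERE  Data_Pedido >= CONVERT(date, GETDATE())
    OPTION (QUERYTRACEON 2389, QUERYTRACEON 2390, RECOMPILE);
END
GO
-- Callers only need EXECUTE permission on the procedure:
GRANT EXECUTE ON dbo.Pedidos_Do_Dia TO PUBLIC;
```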

Note: There is another brand for a statistics object, called “Stationary”. The query optimizer will not query the table to compute the highest value if the brand is stationary; in other words, the data in the leading column(s) is not ascending.

Conclusion

Remember that this is just an alternative way to fix a problem seen in large tables with ascending columns. The best option is to update your statistics periodically. In this case, it’s just a matter of creating a job that updates the statistics of the tables with ascending columns more frequently, up to several times a day. There is no fixed number of updates per day; you’ll need to figure out how many are appropriate for your scenario. You also have query hints, plan guides and other alternatives to fix this problem.
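The job step itself can be as simple as the following sketch (the table and statistics names come from this article’s example; the sampling choice is an assumption you would tune for your own tables):

```sql
-- Refresh the statistics on the ascending column several times a day, for
-- example from a SQL Server Agent job. FULLSCAN is exact but expensive:
UPDATE STATISTICS Pedidos [ix_Data_Pedido] WITH FULLSCAN;
-- Or, for very large tables, a sampled update is cheaper:
UPDATE STATISTICS Pedidos [ix_Data_Pedido] WITH SAMPLE 10 PERCENT;
```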

Here are some questions I asked myself when I first read about these trace flags.

Are trace flags always good?

Probably not. You have to find the scenario in which to test them, and then test, test, test.

Will SQL Server always query the table to find the highest value in the column?

Not always. It depends on the predicate and how you are querying the table.

Can I use the trace flag 2389 alone?

Yes, but it will work only for columns branded as ascending.

Can I use the trace flag 2390 alone?

You can, but it doesn’t make much sense to do so, because it will stop working once the column is branded as ascending.

As with everything, you need to test it before you use it and it always depends on the scenario. It’s also worth mentioning that this method is not supported; so don’t call Microsoft if you have any doubts about this, or me either, come to think of it.

That’s all folks, see you.

Fabiano Amorim

Author profile:

Fabiano is fascinated by the SQL Server Query Processor and the way it works to optimize queries, procedures and functions. He graduated as a Technical Processor from Colégio Bezerra de Menezes, SP, Brazil, and has worked for several years with SQL Server, focusing on creating data warehouses and optimizing T-SQL code for many companies in Brazil and Argentina. Fabiano is a SQL Server MVP, MCP for SQL Server 2000, and MCTS and MCITP Database Developer for SQL Server 2005 and 2008. He is also actively involved in the SQL Server community through forums such as MSDN and TechNet Brazil, writes articles for Simple-Talk and SQL Server Magazine Brazil, and presents online webcasts and in-person events for Microsoft Brazil. His blog is at http://blogfabiano.com

Search for other articles by Fabiano Amorim



Subject: Cool
Posted by: GrumpyOldDBA (view profile)
Posted on: Monday, September 05, 2011 at 1:07 AM
Message: I've been convinced that ascending keys in tables that have high volumes of inserts could cause performance problems, but despite my best efforts I could never find a way to prove it - which is tricky when someone asks you to explain why you want to change a key from ascending to descending.
I'm very grateful - now I have a reference point next time. Many thanks, very interesting indeed.

Subject: Cool
Posted by: Creative byte (not signed in)
Posted on: Monday, September 05, 2011 at 7:09 AM
Message: Hi GrumpyOldDBA,
The problem is not with the identity or the ascending value, but the fact that SQL Server only updates statistics when 20% (by default) of the records in the table have been changed.
So if you use an identity with ascending values you can use the trace flags, or you can update statistics more frequently, as Fabiano said in the article. If you don't use ascending values then you can't use the trace flags, because you don't know whether the last record inserted is the record with the highest value, so the only solution is to update the statistics.

With ascending values you have two alternatives to solve the problem; without ascending values you have only one.

And if the index is a clustered index using an identity with ascending values, the possibility of index fragmentation is very low, because all the new records will be inserted at the end and not in the middle.

"Example 2
Consider a second table, t2, that has a cardinality of 1,000. For tables with greater than 500 rows, SQL Server will UPDATE STATISTICS when (500 + 20 percent) changes have been made. Doing the math, 20 percent of 1,000 is 200, so you can expect to see AutoStat start after approximately 700 modifications have been made to the table. "
from : Statistical maintenance functionality (autostats) in SQL Server
http://support.microsoft.com/kb/195565/en-us

Subject: Tks
Posted by: mcflyamorim (view profile)
Posted on: Monday, September 05, 2011 at 10:33 AM
Message: Hi GrumpyOldDBA thanks for the comment, I'm glad it will help you :-).

Creative byte, thanks a lot for the comment. Indeed it will only work for ascending columns, because SQL Server will get the MAX value, and if the data is not sequential it doesn't make any sense to get the max value. I still think that the best solution is to update the statistics of the table as many times as it takes to keep them really "updated".
Regards

Subject: Thx
Posted by: denalipower (view profile)
Posted on: Tuesday, September 06, 2011 at 6:47 AM
Message: Very helpful article to understand inner workings. I also manually update statistics especially on large tables before it reaches 20%.

Subject: Common problem
Posted by: AlexK (view profile)
Posted on: Tuesday, September 06, 2011 at 1:46 PM
Message: This is surely a common problem. Alternatively, we can use index covering with Data_Pedido as leading column - that will result in predictable performance even if the optimizer's estimate is wrong.

Subject: Good Stuff
Posted by: Robert Matthew Cook (view profile)
Posted on: Tuesday, September 06, 2011 at 7:48 PM
Message: Great details and very informative, thank you Fabiano!

Subject: Nice article
Posted by: Hardik Doshi (not signed in)
Posted on: Thursday, September 08, 2011 at 11:48 PM
Message: Very nice article. Thanks Fabiano

Subject: Learned something new
Posted by: biill-k (view profile)
Posted on: Friday, September 09, 2011 at 12:34 PM
Message: i definitely learned something new by reading this article. Good job!

Subject: Another new TF that might help here
Posted by: hanspo (not signed in)
Posted on: Tuesday, September 13, 2011 at 2:02 AM
Message: Also, note that in SQL Server 2008 R2 SP1 a new trace flag, 2371, has been added that dynamically reduces the 20% statistics recompute threshold on large tables. See http://blogs.msdn.com/b/saponsqlserver/archive/2011/09/07/changes-to-automatic-update-statistics-in-sql-server-traceflag-2371.aspx

Subject: .
Posted by: mcflyamorim (view profile)
Posted on: Tuesday, September 13, 2011 at 3:24 PM
Message: Hi hanspo,
Thanks a lot for letting me know about this trace flag; I didn't know about it, and it is certainly something to be considered in this scenario.

Regards
Fabiano

Subject: humor
Posted by: Anonymous (not signed in)
Posted on: Friday, September 16, 2011 at 5:59 AM
Message: FORCESCAN may be new to SQL Server 2008 R2 SP1, but baby boys have had it since the beginning.

Subject: Amazing as always
Posted by: byo (view profile)
Posted on: Saturday, September 17, 2011 at 1:05 PM
Message: Great article (as always), Fabiano! I didn't know the QUERYTRACEON query hint existed. The good thing is that your example demonstrates the problem in practice. I can say I learned something new.

@hanspo: Thank you for that information and for the link. Very interesting new flag for large OLTPs.

Regards,

Andre Guerreiro Neto

 
