Click here to monitor SSC
  • Av rating:
  • Total votes: 14
  • Total comments: 5
Alex Kuznetsov

Tuning SQL Queries with the Help of Constraints

07 December 2010

The use of constraints is a valuable way of improving query performance as well as maintaining the integrity of the data, but this is, inevitably,  a trade-off: The data uses up more storage, and the modifications are slower and more difficult.  In SQL Programming, there are few 'best-practices' that are universally appropriate.

Sometimes a query may be very simple, yet tuning it will be quite a challenge if you are querying large amounts of data; it requires some 'out-of-the-box' thinking to make sure that the performance is reasonable. For example, suppose that we store data in the following tasks table:

 CREATE TABLE dbo.Tasks

    (

      TaskId INT NOT NULL IDENTITY,

      TaskDescription VARCHAR(50) NOT NULL ,

      StartedAt SMALLDATETIME NOT NULL ,

      FinishedAt SMALLDATETIME NOT NULL ,

      CONSTRAINT PK_Tasks_TaskId

           PRIMARY KEY ( TaskId ) ,

      CONSTRAINT CHK_Tasks_StartedAt_Before_FinishedAt

           CHECK ( StartedAt < FinishedAt )

    ) ;  

Imagine that we need to retrieve all the tasks that were happening at some particular point in time. It is simple, for example, to retrieve all the tasks that are happening at midnight of Dec 3rd, 2010.

SELECT  TaskId ,

        TaskDescription ,

        StartedAt ,

        FinishedAt

FROM    dbo.Tasks

WHERE   '20101203' BETWEEN StartedAt AND FinishedAt

However, if the table grows large, this query will perform very poorly for some dates, no matter which indexes we build on the table. The reason is simple: to satisfy the query, SQL Server may have to scan a large range of an index. For example, if we have a covering index with StartedAt as its first column, and we have a lot of tasks beginning from 1970, SQL Server will have to scan a range beginning from the earliest StartedAt and ending at midnight of Dec 3rd, 2010. This can require a lot of pages to be scanned, and so it will inevitably  be slow.

If all tasks are short, we can speed up the query

Suppose, however, that we know with absolute certainty that all tasks last no longer than 24 hours.  If this is the case, then  we can narrow the range that SQL Server has to scan from 40 years to just one day, as follows:

SELECT  TaskId ,

        TaskDescription ,

        StartedAt ,

        FinishedAt

FROM    dbo.Tasks

WHERE   '20101203' BETWEEN StartedAt AND FinishedAt

        AND StartedAt >= '20101202'

        AND FinishedAt <= '20101204' ;

Inevitably, this query runs dramatically faster than the previous one, but can we guarantee that it is correct, and that it always returns all the rows it should? In fact, the query is correct as long as no task stored in the table lasts more than 24 hours. We can enforce this business rule with a simple CHECK constraint. Our first attempt at doing this constraint looks OK, but allows tasks longer than 24 hours:

ALTER TABLE dbo.Tasks

   ADD CONSTRAINT CHK_Tasks_TaskDuration

   CHECK(DATEDIFF(hour, StartedAt, FinishedAt)<=24);

GO

SET LANGUAGE US_English;

GO

INSERT dbo.Tasks

        ( TaskDescription ,

          StartedAt ,

          FinishedAt

        )

VALUES  ( 'sample task' , -- TaskDescription - varchar(50)

          '2010-12-03 ' , -- StartedAt - datetime

          '2010-12-04 00:59'  -- FinishedAt - datetime

        )        

GO

SELECT  TaskDescription ,

        StartedAt ,

        FinishedAt

FROM    dbo.Tasks

        

GO

ALTER TABLE dbo.Tasks

   DROP CONSTRAINT CHK_Tasks_TaskDuration

However, our second version of the constraint does not allow tasks longer than 1440 minutes, which is exactly 24 hours:

ALTER TABLE dbo.Tasks

   ADD CONSTRAINT CHK_Tasks_TaskDuration

   CHECK(DATEDIFF(minute, StartedAt, FinishedAt)<=1440);

As long as we have this constraint, we can safely tell the optimizer to scan only a narrow range, only 24 hours of data – our query will read only a small part of the data and run fast.

If some tasks can last for a long time, we need a different approach

The query optimization we have just discussed can only be used if all tasks are short. If some tasks last for a long time, we have to scan a large part of an index. If there is no limit on tasks duration, we may have to scan even wider ranges of data. Either way, if we really need to speed up the query, we have to try out other approaches.

Suppose that instead of storing one long task as one row, we store that task as a sequence of short periods of time, each period being no longer than 24 hours, as follows:

CREATE TABLE dbo.Tasks

    (

      TaskId INT NOT NULL IDENTITY,

      TaskDescription VARCHAR(50) NOT NULL ,

      CONSTRAINT PK_Tasks_TaskId

           PRIMARY KEY ( TaskId )

    ) ;  

GO

 

CREATE TABLE dbo.TasksByDay

    (

      TaskId INT NOT NULL,

      StartedAt SMALLDATETIME NOT NULL ,

      FinishedAt SMALLDATETIME NOT NULL ,

      CONSTRAINT PK_TasksByDay_TaskId

           PRIMARY KEY ( TaskId, FinishedAt ) ,

      CONSTRAINT PK_TasksByDay_Tasks

           FOREIGN KEY ( TaskId )

           REFERENCES dbo.Tasks( TaskId ),

      CONSTRAINT CHK_TasksByDay_StartedAt_Before_FinishedAt

           CHECK ( StartedAt < FinishedAt ),

      CONSTRAINT CHK_TasksByDay_TaskDuration

          CHECK(DATEDIFF(minute, StartedAt, FinishedAt)<=1440)

    ) ;  

GO

If all our data is correct, we can safely tell the optimizer to scan just a narrow interval, just 24 hours worth of data, as discussed before. Let us keep the constraint that makes sure tasks cannot last longer than 24 hours, and add the columns and the constraints to make sure our periods of time have no gaps, as described in this article by Joe Celko:

CREATE TABLE dbo.TasksByDay

    (

      TaskId INT NOT NULL,

      StartedAt SMALLDATETIME NOT NULL ,

      FinishedAt SMALLDATETIME NOT NULL ,

      PreviousFinishedAt SMALLDATETIME NULL ,

      CONSTRAINT PK_TasksByDay_TaskId

           PRIMARY KEY ( TaskId, FinishedAt ) ,

      CONSTRAINT PK_TasksByDay_Tasks

           FOREIGN KEY ( TaskId )

           REFERENCES dbo.Tasks( TaskId ),

      CONSTRAINT CHK_TasksByDay_StartedAt_Before_FinishedAt

           CHECK ( StartedAt < FinishedAt ),

      CONSTRAINT CHK_TasksByDay_TaskDuration

          CHECK(DATEDIFF(minute, StartedAt, FinishedAt)<=1440),

      CONSTRAINT UNQ_TasksByDay_TaskId_PreviousFinishedAt

           UNIQUE ( TaskId, PreviousFinishedAt ) ,

      CONSTRAINT FK_TasksByDay_TaskId_PreviousFinishedAt

           FOREIGN KEY ( TaskId, PreviousFinishedAt )

           REFERENCES dbo.TasksByDay ( TaskId, FinishedAt ) ,

      CONSTRAINT CHK_TasksByDay_NoGaps

           CHECK ( PreviousFinishedAt = StartedAt )

    ) ;  

GO 

 This solution is not for all situations

As we have seen, to speed up a query we have added columns and constraints. After the change, the data uses up more storage, and the modifications are slower and more difficult.

In other words, the price tag for optimizing a query is quite hefty.

However, in some cases it is worth it. Depending on business requirements, some queries may have to return fast. Still, when we decide whether to use expensive solutions like this one, we need to carefully considered whether it is worth it.

Security Video vs. Blue-Ray Disc

The ways in which data is stored and consumed may be very different.

Let us consider security video scenario: the data must be written real time, and its storage must be cheap. Most of the time the data is disposed of without ever reading it even once. The quality of data must be acceptable, but it does not have to be perfect.

Blue-Ray Disc scenario is on the opposite extreme of the spectrum. The data must be of very high quality, and as such it is very expensive. It can take a lot of time to write it, to improve data quality, but reading must be fast and convenient. In many cases the data is read millions of times, which pays for the expensive and slow writing of it.

Most real life situation are somewhere in between these two extremes. When we decide whether to use an expensive solution like the one described in this article, we need to consider the actual business requirements. More specifically, we need to know if the business:

  • Needs 100% correctness of query results.
  • Needs fast response all the time

If the answer to both questions is yes, than the solution described in this article fits the bill and is recommended to use. Otherwise we can just use the original query and tolerate its slowness, or not use constraints and tolerate the possibility of incorrect data. Anyway, no one solution is best under all circumstances, which is usual.

Alex Kuznetsov

Author profile:

Alex Kuznetsov has been working with object oriented languages and databases for more than a decade. He has worked with Sybase, SQL Server, Oracle and DB2. Alex has published multiple articles on simple-talk.com and sqlblog.com and wrote a book entitled Defensive Database Programming with SQL Server. Currently he works in an agile team in Chicago. In his leisure time, Alex prepares for and runs ultramarathons.

Search for other articles by Alex Kuznetsov

Rate this article:   Avg rating: from a total of 14 votes.


Poor

OK

Good

Great

Must read
Have Your Say
Do you have an opinion on this article? Then add your comment below:
You must be logged in to post to this forum

Click here to log in.


Subject: Good Read
Posted by: TechMate (view profile)
Posted on: Sunday, December 12, 2010 at 8:18 AM
Message: Closing arguments are well-said. Very good article, thanks for the post

Subject: The GIS types can help for this
Posted by: Anonymous (not signed in)
Posted on: Tuesday, December 14, 2010 at 10:19 AM
Message: see http://stackoverflow.com/questions/1232770/fastest-way-for-this-query-what-is-the-best-strategy-given-a-date-range/1233139#1233139

You a "line" for your date range, then look for overlapping lines...

Subject: I knew about those GIS types and such
Posted by: Anonymous (not signed in)
Posted on: Tuesday, December 14, 2010 at 12:28 PM
Message: Thanks for the link, but I have participated in that discussion on stackoverlfow. The solution I described here allows much simpler selects against smaller amounts of data, and as such typically perform faster, and its performance is more stable.

Subject: DATEDIFF counts boundaries
Posted by: Craig Beere (not signed in)
Posted on: Tuesday, December 14, 2010 at 8:43 PM
Message: The reason the first constraint is that DATEDIFF counts the number of boundaries between two dates, not the elapsed time.
For example, DATEDIFF(year,31 dec 2010,1 jan 2011) returns 1 because when travelling from the first date to the second date along the timeline, exactly one year boundary is passed.
Better would be
CHECK ( FinishedAT <= DATEADD( day, StartedAt, 1 ) )
(I'm typing this from memory, syntax might be wrong).

Subject: Craig Beere's suggestion makes sense
Posted by: Anonymous (not signed in)
Posted on: Tuesday, December 14, 2010 at 8:51 PM
Message: Craig Beere's suggestion makes sense, we can run this and see for ourselves:

SELECT CASE WHEN ( '2010-12-04 00:59' <= DATEADD( day, 1, '2010-12-03' ) ) THEN '<=' ELSE '>' END

 

Phil Factor
Searching for Strings in SQL Server Databases

Sometimes, you just want to do a search in a SQL Server database as if you were using a search engine like Google.... Read more...

 View the blog

Top Rated

Searching for Strings in SQL Server Databases
 Sometimes, you just want to do a search in a SQL Server database as if you were using a search engine... Read more...

The SQL Server Sqlio Utility
 If, before deployment, you need to push the limits of your disk subsystem in order to determine whether... Read more...

The PoSh DBA - Reading and Filtering Errors
 DBAs regularly need to keep an eye on the error logs of all their SQL Servers, and the event logs of... Read more...

MySQL Compare: The Manual That Time Forgot, Part 1
 Although SQL Compare, for SQL Server, is one of Red Gate's best-known products, there are also 'sister'... Read more...

Highway to Database Recovery
 Discover the best backup and recovery articles on Simple-Talk, all in one place. Read more...

Most Viewed

Beginning SQL Server 2005 Reporting Services Part 1
 Steve Joubert begins an in-depth tour of SQL Server 2005 Reporting Services with a step-by-step guide... Read more...

Ten Common Database Design Mistakes
 If database design is done right, then the development, deployment and subsequent performance in... Read more...

SQL Server Index Basics
 Given the fundamental importance of indexes in databases, it always comes as a surprise how often the... Read more...

Reading and Writing Files in SQL Server using T-SQL
 SQL Server provides several "standard" techniques by which to read and write to files but, just... Read more...

Concatenating Row Values in Transact-SQL
 It is an interesting problem in Transact SQL, for which there are a number of solutions and... Read more...

Why Join

Over 400,000 Microsoft professionals subscribe to the Simple-Talk technical journal. Join today, it's fast, simple, free and secure.