Tuning SQL Queries with the Help of Constraints

The use of constraints is a valuable way of improving query performance as well as maintaining the integrity of the data, but this is, inevitably, a trade-off: the data uses up more storage, and modifications are slower and more difficult. In SQL programming, there are few 'best practices' that are universally appropriate.

Sometimes a query may be very simple, yet tuning it can be quite a challenge if you are querying large amounts of data; it takes some 'out-of-the-box' thinking to keep the performance reasonable. For example, suppose that we store data in the following tasks table:
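The table definition itself did not survive in this copy of the article. A minimal sketch of what such a tasks table might look like — the StartedAt and FinishedAt column names come from the surrounding discussion, everything else is an assumption:

```sql
-- Plausible reconstruction of the tasks table; StartedAt and FinishedAt
-- are named in the article, all other names are assumed.
CREATE TABLE dbo.Tasks
(
    TaskID     INT      NOT NULL CONSTRAINT PK_Tasks PRIMARY KEY,
    StartedAt  DATETIME NOT NULL,
    FinishedAt DATETIME NOT NULL,
    CONSTRAINT CHK_Tasks_ValidDates CHECK (StartedAt <= FinishedAt)
);

-- The covering index with StartedAt as its first column,
-- as discussed in the text.
CREATE INDEX IX_Tasks_StartedAt
    ON dbo.Tasks (StartedAt, FinishedAt);
```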

Imagine that we need to retrieve all the tasks that were in progress at some particular point in time. It is simple, for example, to retrieve all the tasks that were in progress at midnight of Dec 3rd, 2010.
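The query was likewise lost in extraction; it presumably looked something like this (the dbo.Tasks table and the @AsOf variable are illustrative assumptions):

```sql
-- All tasks in progress at midnight of Dec 3rd, 2010 (reconstruction).
DECLARE @AsOf DATETIME = '20101203';

SELECT TaskID, StartedAt, FinishedAt
FROM   dbo.Tasks
WHERE  StartedAt  <= @AsOf
  AND  FinishedAt >= @AsOf;
```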

However, if the table grows large, this query will perform very poorly for some dates, no matter which indexes we build on the table. The reason is simple: to satisfy the query, SQL Server may have to scan a large range of an index. For example, if we have a covering index with StartedAt as its first column, and we have a lot of tasks beginning from 1970, SQL Server will have to scan a range beginning from the earliest StartedAt and ending at midnight of Dec 3rd, 2010. This can require a lot of pages to be scanned, and so it will inevitably be slow.

If all tasks are short, we can speed up the query

Suppose, however, that we know with absolute certainty that all tasks last no longer than 24 hours. If this is the case, then we can narrow the range that SQL Server has to scan from 40 years to just one day, as follows:
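The rewritten query is missing from this copy; given the 24-hour guarantee, it would add a lower bound on StartedAt, roughly like this (names are assumptions carried over from the sketch of the table):

```sql
-- If no task lasts more than 24 hours, any task still running at @AsOf
-- must have started within the preceding day, so we can bound StartedAt
-- from below and scan just 24 hours of the index (reconstruction).
DECLARE @AsOf DATETIME = '20101203';

SELECT TaskID, StartedAt, FinishedAt
FROM   dbo.Tasks
WHERE  StartedAt  <= @AsOf
  AND  StartedAt  >= DATEADD(DAY, -1, @AsOf)
  AND  FinishedAt >= @AsOf;
```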

Inevitably, this query runs dramatically faster than the previous one, but can we guarantee that it is correct, and that it always returns all the rows it should? In fact, the query is correct as long as no task stored in the table lasts more than 24 hours. We can enforce this business rule with a simple CHECK constraint. Our first attempt at this constraint looks OK, but still allows tasks longer than 24 hours:
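The first constraint did not survive in this copy. Judging by Craig Beere's comment below, it most likely used DATEDIFF at day granularity, something like:

```sql
-- First attempt (reconstruction). DATEDIFF(DAY, ...) counts midnight
-- boundaries crossed, not elapsed time, so a task from
-- '2010-12-03 00:01' to '2010-12-04 23:59' (almost 48 hours)
-- still passes this check.
ALTER TABLE dbo.Tasks
    ADD CONSTRAINT CHK_Tasks_Duration
    CHECK (DATEDIFF(DAY, StartedAt, FinishedAt) <= 1);
```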

However, our second version of the constraint does not allow tasks longer than 1440 minutes, which is exactly 24 hours:
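The second version is also missing; a plausible reconstruction based on the 1440-minute figure in the text (constraint and table names are assumptions):

```sql
-- Second attempt (reconstruction), measured at minute granularity:
-- 1440 minutes = 24 hours, so tasks longer than a day are rejected.
-- (The flawed first attempt would need to be dropped first if it
-- had already been added.)
ALTER TABLE dbo.Tasks
    ADD CONSTRAINT CHK_Tasks_Duration_Minutes
    CHECK (DATEDIFF(MINUTE, StartedAt, FinishedAt) <= 1440);
```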

As long as we have this constraint, we can safely tell the optimizer to scan only a narrow range, only 24 hours of data – our query will read only a small part of the data and run fast.

If some tasks can last for a long time, we need a different approach

The query optimization we have just discussed can only be used if all tasks are short. If some tasks last for a long time, we have to scan a large part of an index. If there is no limit on task duration, we may have to scan even wider ranges of data. Either way, if we really need to speed up the query, we have to try out other approaches.

Suppose that instead of storing one long task as one row, we store that task as a sequence of short periods of time, each period being no longer than 24 hours, as follows:
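The modified schema is missing from this copy; one plausible shape for it, with every name here an assumption:

```sql
-- Hypothetical reconstruction: one long task stored as a chain of
-- periods, each period no longer than 24 hours.
CREATE TABLE dbo.TaskPeriods
(
    TaskID       INT      NOT NULL,
    PeriodNumber INT      NOT NULL,
    StartedAt    DATETIME NOT NULL,
    FinishedAt   DATETIME NOT NULL,
    CONSTRAINT PK_TaskPeriods PRIMARY KEY (TaskID, PeriodNumber),
    CONSTRAINT CHK_TaskPeriods_ValidDates
        CHECK (StartedAt <= FinishedAt),
    CONSTRAINT CHK_TaskPeriods_Duration
        CHECK (DATEDIFF(MINUTE, StartedAt, FinishedAt) <= 1440)
);
```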

If all our data is correct, we can safely tell the optimizer to scan just a narrow interval, just 24 hours' worth of data, as discussed before. Let us keep the constraint that makes sure tasks cannot last longer than 24 hours, and add the columns and the constraints to make sure our periods of time have no gaps, as described in this article by Joe Celko:
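The columns and constraints themselves were lost in extraction. A sketch of Celko's contiguous-periods technique, assuming the periods are stored in a hypothetical dbo.TaskPeriods table keyed by (TaskID, PeriodNumber) with StartedAt and FinishedAt columns:

```sql
-- Sketch of Celko's technique: each period records the end of the
-- previous one, and constraints force the chain to be gapless.
-- All names here are assumptions.
ALTER TABLE dbo.TaskPeriods
    ADD PreviousFinishedAt DATETIME NULL;  -- NULL only for the first period

-- The previous period must end exactly where this one starts: no gaps.
ALTER TABLE dbo.TaskPeriods
    ADD CONSTRAINT CHK_TaskPeriods_NoGaps
    CHECK (PreviousFinishedAt IS NULL OR PreviousFinishedAt = StartedAt);

-- ...and it must really be the end of a stored period of the same task.
ALTER TABLE dbo.TaskPeriods
    ADD CONSTRAINT UNQ_TaskPeriods_Finish UNIQUE (TaskID, FinishedAt);

ALTER TABLE dbo.TaskPeriods
    ADD CONSTRAINT FK_TaskPeriods_Previous
    FOREIGN KEY (TaskID, PreviousFinishedAt)
    REFERENCES dbo.TaskPeriods (TaskID, FinishedAt);
```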

This solution is not for all situations

As we have seen, to speed up a query we have added columns and constraints. After the change, the data uses up more storage, and the modifications are slower and more difficult.

In other words, the price tag for optimizing a query is quite hefty.

However, in some cases it is worth it. Depending on business requirements, some queries may have to return fast. Still, when we decide whether to use expensive solutions like this one, we need to consider carefully whether it is worth it.

Security Video vs. Blu-ray Disc

The ways in which data is stored and consumed may be very different.

Let us consider the security video scenario: the data must be written in real time, and its storage must be cheap. Most of the time the data is disposed of without ever being read even once. The quality of the data must be acceptable, but it does not have to be perfect.

The Blu-ray disc scenario is at the opposite extreme of the spectrum. The data must be of very high quality, and as such it is very expensive to produce. It can take a lot of time to write the data and to improve its quality, but reading must be fast and convenient. In many cases the data is read millions of times, which pays for the expensive and slow writing.

Most real-life situations are somewhere in between these two extremes. When we decide whether to use an expensive solution like the one described in this article, we need to consider the actual business requirements. More specifically, we need to know whether the business:

  • Needs 100% correctness of query results;
  • Needs a fast response all the time.

If the answer to both questions is yes, then the solution described in this article fits the bill and is worth using. Otherwise we can either use the original query and tolerate its slowness, or leave out the constraints and tolerate the possibility of incorrect data. As usual, no single solution is best under all circumstances.



  • TechMate

    Good Read
    Closing arguments are well-said. Very good article, thanks for the post

  • Anonymous

    The GIS types can help for this
    see http://stackoverflow.com/questions/1232770/fastest-way-for-this-query-what-is-the-best-strategy-given-a-date-range/1233139#1233139

    You make a “line” for your date range, then look for overlapping lines…

  • Anonymous

    I knew about those GIS types and such
    Thanks for the link, but I have participated in that discussion on stackoverflow. The solution I described here allows much simpler selects against smaller amounts of data, and as such typically performs faster, and its performance is more stable.

  • Craig Beere

    DATEDIFF counts boundaries
    The reason the first constraint fails is that DATEDIFF counts the number of boundaries crossed between two dates, not the elapsed time.
    For example, DATEDIFF(year,31 dec 2010,1 jan 2011) returns 1 because when travelling from the first date to the second date along the timeline, exactly one year boundary is passed.
    Better would be
    CHECK ( FinishedAt <= DATEADD( day, 1, StartedAt ) )
    (I’m typing this from memory, syntax might be wrong).

  • Anonymous

    Craig Beere’s suggestion makes sense
    Craig Beere’s suggestion makes sense; we can run this and see for ourselves:

    SELECT CASE WHEN ( '2010-12-04 00:59' <= DATEADD( day, 1, '2010-12-03' ) ) THEN '<=' ELSE '>' END