Click here to monitor SSC
Av rating:
Total votes: 25
Total comments: 8


Joe Celko
Contiguous Time Periods
22 November 2010

It is always better, and more efficient, to maintain referential integrity  by using constraints rather than triggers. Sometimes it is not at all obvious how to do this, and the history table, and other temporal data tables, presented problems for checking data that were difficult  to solve with constraints. Suddenly, Alex Kuznetsov came up with a good solution, and so now history tables can benefit from more effective integrity checking. Joe explains...

Alex Kuznetsov is the creator of a clever technique for creating a history table that I'll be describing in this article. It solves a common problem with declarative referential integrity constraints and needs to be known more widely, especially now that we have a DATE data type in SQL Server.

History tables reflect the nature of time in the ISO model. That is, time is a continuum and not a set of discrete points. Continua have an infinite number of elements, but subsets can be bounded. To show a duration in SQL, we store the starting point in time of a duration and the limit the duration with an end point that is not actually in the duration. This is called a half-open interval as we usually draw as a line with a solid dot on the starting point and an open dot on the end.

Half-open intervals have nice properties. You can subtract a half-open interval from the middle of another half-open interval and you get half-open intervals as a result. You can also abut half-open intervals and get a new half-open interval. That means I can chain them together and not worry about having a duplicated or missing point in a time-line.

In particular, we really like to have contiguous time periods in a history table. We would also prefer that this constraint be enforced with declarative instead of using procedural code in triggers, functions or procedures. Let's go right to a example and discuss the code. We are going to create a table of some kind of vague Tasks that are done one right after the other, without gaps or overlaps.

Abutting half-open intervals yield a new half-open interval:

It is worth mentioning that the new DATE data type is only 3 bytes, so reading it off a disk is cheap. You will see why this is important.

A Note on Temporal and Floating Point Data

I mentioned that time is a continuum. The real numbers are also a continuum, which we model with floating point numbers in computers. Technically, there is a difference between REAL and FLOAT in Standard SQL, but nobody cares. The reason that nobody cares is that the IEEE 754 Standard is burned into chips and universally understood.

Floating point math is fuzzy and has learned to live with its imprecision. When you compare two floats for equality, the software knows about an "epsilon" -- a small wiggle factor. The numbers are treated as equal if they differ by that epsilon; or less.

Temporal data in T-SQL used to be a prisoner of UNIX system clock ticks and could only go to three decimal seconds with rounding errors. The new ANSI/ISO data types can go to seven decimal seconds, have a true DATE and TIME data types. Since they are new, most programmers are not using them yet.

One of the idioms of T-SQL dialect has been to get just the date out of a DATETIME column. The traditional solution in the Sybase days was casting the column to FLOAT, taking a FLOOR() or CEILING() then casting it back. This depended on the internal representation of the DATETIME values. The rounding error was why T-SQL could not represent some times accurately. Here is the code for the date at zero hour.

CAST(FLOOR(CAST(@in_date AS FLOAT)) AS DATETIME)

To get the following day, we used:

CAST (CEILING(CAST(@in_date AS FLOAT)) AS DATETIME)

The reason was simple; a lot of UNIX machines had floating point hardware. Today the equally awful idiom is

DATEADD(DD, DATEDIFF(DD, 0, @in_date),0)

This leaves a DATETIME with a time of zero hour. Today you can simply write "CAST (my_date_column AS DATE)" which is a lot easier to read and maintain. It also gives you a real date.

Likewise, "CAST (my_date_column AS TIME)" will return just the time fields (yes, the ANSI Standards calls them fields to make sure your do not confuse them with columns or substrings).

Rounding DATETIME and DATETIME2 columns is done the same way. This gives you the ability to do the temporal version of the floating point epsilon. Insert the data with a high precision, but compare it at a courser precision. Here is a quick example to play with.

SELECT

CAST ('2010-10-26 12:12:12.8888888' AS DATETIME2(5)), -- nanoseconds

CAST ('2010-10-26 12:12:12.8888888' AS DATETIME2(3)),

CAST ('2010-10-26 12:12:12.8888888' AS DATETIME2(1)),

CAST ('2010-10-26 12:12:12.8888888' AS DATETIME2(0)); -- whole seconds

First Attempt: Single Time Table

Here is the DDL for the usual attempt that the less-experienced SQL Server developers often make to record a history in a table. I am assuming that we measure time in days and not have to worry about minutes or fractional seconds.

CREATE TABLE Tasks

(task_id INTEGER NOT NULL,

 task_score CHAR(1) NOT NULL,

 task_start_date DATE DEFAULT CURRENT_TIMESTAMP NOT NULL,

 PRIMARY KEY (task_id, task_start_date),

 etc);

The table has the starting date of the events and nothing else. The assumption is that the implicit task_end_date is the next task_start_time in temporal sequence. You can see the problems immediately. In order to compute simple durations, you need to use a self-join, which probably ought to be in a VIEW.

CREATE VIEW ContigousTasks (task_id, task_score, task_start_date, task_end_date)

AS

SELECT T1.task_id, MAX(T1.task_score), T1.task_start_date,

       DATEADD (DD, -1, MIN(T2.task_start_date))

  FROM Tasks AS T1

       LEFT OUTER JOIN

       Tasks AS T2

       ON T1.task_id = T2.task_id

         AND T1.task_start_date < T2.task_start_date

 GROUP BY T1.task_id, T1.task_start_date;

Here is some test data to play with:

INSERT INTO Tasks (task_id, task_score, task_start_date)

  VALUES(1, 'A', '2010-11-01'),

        (1, 'B', '2010-11-05'),

        (1, 'C', '2010-11-10'),

        (1, 'D', '2010-11-15'),

        (1, 'E', '2010-11-20');

The LEFT OUTER JOIN gives a NULL for the last row's task_end_date. But a self-join becomes expensive as a table get bigger and bigger. Computing date math is also expensive. More than that, gaps are undetectable and create false data. Likewise overlaps are hidden. A row is supposed to model a complete fact. These rows have only half a fact . You will be doing this self join constantly to re-assemble what is split apart.

A clustered index on (task_id, task_start_time) will help, but you still have to do computations and grouping.

Second Attempt: Simple History Table

Here is the DDL for a classic History table. I am assuming that we measure time in days and not have to worry about minutes or fractional seconds.

CREATE TABLE Tasks

(task_id INTEGER NOT NULL,

 task_score CHAR(1) NOT NULL,

 task_start_date DATE DEFAULT CURRENT_TIMESTAMP NOT NULL,

 task_end_date DATE, -- null means unfinished current task

 CONSTRAINT end_and_start_dates_in_sequence

   CHECK (task_start_date <= task_end_date),

 PRIMARY KEY (task_id, task_start_date),

 etc);

The half-open interval model requires that we know the starting time of an event. The ending time cannot be known if the event it still in process, so it is NULL-able. There is a simple CHECK() constraint to assure that the two dates are in order: But nothing prevents gaps or overlaps.

Even worse, we can have multiple NULL task_end_date rows. What we want is that each task has at most one NULL task_end_date. In Standard SQL, this can be forced with a CREATE ASSERTION statement. But in SQL Server, we can use a TRIGGER or a WITH CHECK OPTION than allow access to the table only thru VIEWs.

CREATE VIEW SafeTasks (task_id, task_score, task_start_date, task_end_date)

AS

SELECT task_id, task_score, task_start_date, task_end_date

  FROM Tasks

 WHERE 1 <= ALL (SELECT COUNT(*)

                   FROM Tasks

                  WHERE task_end_date IS NULL

                  GROUP BY task_id)

WITH CHECK OPTION;

A lot of SQL programmers do not know about the WITH CHECK OPTION, but combined with the INSTEAD OF trigger, you can do a lot of complex programming in a single block of DDL.

Kuznetsov's History Table

Kuznetsov's History Table gets around the weaknesses of the simple history table schema. It builds a temporal chain from the current row to the previous row. This is easier to show with code:

CREATE TABLE Tasks

(task_id INTEGER NOT NULL,

 task_score CHAR(1) NOT NULL,

 previous_end_date DATE, -- null means first task

 current_start_date DATE DEFAULT CURRENT_TIMESTAMP NOT NULL,

 CONSTRAINT previous_end_date_and_current_start_in_sequence

   CHECK (prev_end_date <= current_start_date),

 current_end_date DATE, -- null means unfinished current task

 CONSTRAINT current_start_and_end_dates_in_sequence

   CHECK (current_start_date <= current_end_date),

 CONSTRAINT end_dates_in_sequence

   CHECK (previous_end_date <> current_end_date)

 PRIMARY KEY (task_id, current_start_date),

 UNIQUE (task_id, previous_end_date), -- null first task

 UNIQUE (task_id, current_end_date), -- one null current task

 FOREIGN KEY (task_id, previous_end_date)  -- self-reference

   REFERENCES Tasks (task_id, current_end_date));

Well, that looks complicated! Let's look at it column by column. Task_id explains itself. The previous_end_date will not have a value for the first task in the chain, so it is NULL-able. The current_start_date and current_end_date are the same data elements, temporal sequence and PRIMARY KEY constraints we had in the simple history table schema.

The two UNIQUE constraints will allow one NULL in their pairs of columns and prevent duplicates. Remember that UNIQUE is not like PRIMARY KEY, which implies UNIQUE NOT NULL.

Finally, the FOREIGN KEY is the real trick. Obviously, the previous task has to end when the current task started for them to abut., so there is another constraint. This constraint is a self-reference that makes sure this is true. Modifying data in this type of table is easy, but requires some thought.. Fortunately, Alex has written a Simple Talk article to explain in more detail how it is done.

Disabling Constraints

Just one little problem with that FOREIGN KEY constraint. It will not let you put the first task into the table. There is nothing for the constraint to reference. In Standard SQL, we can declare constraints to be DEFERABLE with some other options. The idea is that you can turn a constraint ON or OFF during a session so the database can be in state that would otherwise be illegal. But at the end of the session all constraints have to be TRUE. or UNKNOWN.

In SQL Server, you can disable constraints and then turn them back on. It actually is restricted to disabling FOREIGN KEY constraint, and CHECK constraints. PRIMARY KEY, UNIQUE, and DEFAULT constraints are always enforced. The syntax for this is part of the ALTER TABLE statement. The syntax is simple:

ALTER TABLE <table name> NOCHECK CONSTRAINT [<constraint name> | ALL];

This is why you want to name the constraints; without user given names, you have to look up what the system gave you and they are always long and messy.. The ALL option will disable all of the constraints in the entire schema. Be careful with it.

To re-enable, the syntax is similar and explains itself:

ALTER TABLE <table name> CHECK CONSTRAINT [<constraint name> | ALL];

When a disabled constraint is re-enabled, the database does not check to ensure any of the existing data meets the constraints. So for this table, The body of a procedure to get things started would look like this:

BEGIN

ALTER TABLE Tasks NOCHECK CONSTRAINT ALL;

INSERT INTO Tasks (task_id, task_score, current_start_date, current_end_date, previous_end_date)

  VALUES (1, 'A', '2010-11-01', '2010-11-03', NULL);

ALTER TABLE Tasks CHECK CONSTRAINT ALL;

END;

You will want to confirm that the constraints are all working by trying to add bad data.

  1. There is only one first event per task. A UNIQUE will take care of that.
  2. There are no duplicates allowed.
  3. No events overlap in a task.
  4. There are no gaps allowed

Conclusion

You can avoid procedural code, but it is not always immediately obvious how. Start with a list of constraints and some test data. The use of the (current_start_date , current_end_date) has been the standard idiom for history tables in SQL. Until we got the DATE data type, the use of a BETWEEN predicate was a problem. The rounding errors that occurred in the fractional seconds could cause overlaps in the old DATETIME data type. The solution was to write two predicates:

 (my_time >= current_start_date

AND my_time < COALESCE (current_end_date, CURRENT_TIMESTAMP))

It works, but it hides the BETWEEN-ness of the three-way relationship, With the new DATE data type, the BETWEEN works just fine.

The previous_end_date column feels strange at first. It seems redundant. But I will argue that it is not. A row in a table should model a complete fact. That is why we have (current_start_date , current_end_date) pairs; together, they tell us about the duration of the task as a whole thing; there is no need to read another row to find the ending date.

 A Quick Note on Induction

 
Since one of the business rules is that we do not allow gaps, we need to have data for determining that. If I depended on the previous row to provide the previous_end_date, I would need self-joins or loops that eventually go back to the first event in the sequence of tasks. In short, it would be a lot of procedural code in a trigger. But I have a nice mathematical property to use: induction.

 
With the constraints, I can prove that I have no gaps using induction. t his done by proving that the first statement in the infinite sequence of statements is true, and then proving that if any one statement in the infinite sequence of statements is true, then so is the next one. If the idea of math bothers you, think about a chain of (n) dominoes set up for toppling. If I push one domino over, it topples. That is my starting point or induction basis. Assume that if I push any single domino over in the chain, then his neighbor also falls. That is the inductive hypothesis. So when I push the first domino in the chain, all of them fall down. 



This article has been viewed 8489 times.
Joe Celko

Author profile: Joe Celko

Joe Celko is one of the most widely read of all writers about SQL, and was the winner of the DBMS Magazine Reader's Choice Award four consecutive years. He is an independent consultant living in Austin, TX. He has taught SQL in the US, UK, the Nordic countries, South America and Africa.
He served 10 years on ANSI/ISO SQL Standards Committee and contributed to the SQL-89 and SQL-92 Standards.
He has written over 800 columns in the computer trade and academic press, mostly dealing with data and databases. He is the author of eight books on SQL for Morgan-Kaufmann, including the best selling SQL FOR SMARTIES.
Joe is a well-known figure on Newsgroups and Forums, and he is famous for his his dry wit. He is also interested in Science Fiction.

Search for other articles by Joe Celko

Rate this article:   Avg rating: from a total of 25 votes.


Poor

OK

Good

Great

Must read
 
Have Your Say
Do you have an opinion on this article? Then add your comment below:
You must be logged in to post to this forum

Click here to log in.


Subject: Are these errors?
Posted by: MisterMagoo (not signed in)
Posted on: Tuesday, November 23, 2010 at 5:21 PM
Message: Hi Joe, I liked this article, but have a couple of questions...

In the table creation code for Second Attempt: Simple History Table

Is this constraint round the wrong way? shouldn't the end date come after the start date?

CONSTRAINT end_and_start_dates_in_sequence
CHECK (task_end_date <= task_start_date)

Also, in Kuznetsov's History Table is the foreign key messed up? should the previous end date not be matched to the current start date?

MM

Subject: Is there a need to diable the Foreign Key constraint
Posted by: Arik (not signed in)
Posted on: Tuesday, November 23, 2010 at 5:24 PM
Message: Hi Joe,
Thank you for this article.
Is there really a need to disable the foreign key for the first insert?
previous_end_date is NULL so there is no violation.
I tried without disabling the constraint and had no problem. unless I was missing something.
Thanks,
Arik.

Subject: Are these errors?
Posted by: Celko (view profile)
Posted on: Wednesday, November 24, 2010 at 7:27 AM
Message: Oh Yeah, they are errors. Editor, can we fix the text? That is the good part about electronic publishing.

Subject: Thank you much!
Posted by: Alex_Kuznetsov (view profile)
Posted on: Saturday, November 27, 2010 at 4:08 PM
Message: Joe,

Thank you for describing the approach. I really appreciate it and feel honored.

Subject: Why do you say that BETWEEN works just fine ?
Posted by: Stefan_G (not signed in)
Posted on: Tuesday, November 30, 2010 at 1:55 AM
Message: You write:

Until we got the DATE data type, the use of a BETWEEN predicate was a problem. The rounding errors that occurred in the fractional seconds could cause overlaps in the old DATETIME data type. The solution was to write two predicates:
(my_time >= current_start_date
AND my_time < COALESCE (current_end_date, CURRENT_TIMESTAMP))
It works, but it hides the BETWEEN-ness of the three-way relationship, With the new DATE data type, the BETWEEN works just fine.

So, if I understand you correctly we should be able to write the following predicate:

my_time BETWEEN current_start_date COALESCE (current_end_date, CURRENT_TIMESTAMP)

But since when using half-open intervals current_start_date of one row is exactly equal to current_end_date of the previous row this will return two rows if my_time happens to exactly match an interval border.

Or am I missing something ?

Also, what does this have to do with "The rounding errors that occurred in the fractional seconds" ?

/SG

Subject: no gaps ?
Posted by: Anonymous (not signed in)
Posted on: Tuesday, November 30, 2010 at 2:55 AM
Message: The current constraints does not prevent creating gaps in the sequence.

To prevent gaps you need to modify the following constraint:

CONSTRAINT previous_end_date_and_current_start_in_sequence

CHECK (previous_end_date = current_start_date),

Subject: The design does allow gaps
Posted by: Seven24 (view profile)
Posted on: Wednesday, January 05, 2011 at 12:12 AM
Message: Insert Tasks(task_id, task_score, previous_end_date, current_start_date, current_end_date)
Values(1, 'A', Null, '2010-01-01', '2010-02-01')
, (1, 'A', '2010-02-01', '2010-03-01', '2010-04-01')

Any date between 2010-02-01 and 2010-03-01 would fail the test for gaps.

Forcing the previous_end_date = current_start_date also will not work. What is need is for previous_end_date_and_current_start_in_sequence constraint to defined as follows:

CONSTRAINT previous_end_date_and_current_start_in_sequence CHECK (previous_end_date = DateAdd(d, -1, current_start_date))

This presumes that a "gap" is defined as there existing a Date value greater than the previous_end_date and less than the current_start_date.

Subject: renabling check constraints without 'with check'....
Posted by: mjfii (view profile)
Posted on: Thursday, January 13, 2011 at 3:43 PM
Message:
the constraints become 'not trusted' when you renable them without the 'with check' clause; this will hinder the the query optimizer, e.g.

ALTER TABLE Tasks CHECK CONSTRAINT ALL;
vs.
ALTER TABLE Tasks with check CHECK CONSTRAINT ALL;

you can see this here:
select name,is_not_trusted from sys.check_constraints

 










Phil Factor
Automated Script-generation with Powershell and SMO
 In the first of a series of articles on automating the process of building, modifying and copying SQL Server... Read more...



 View the blog
SQL Source Control: The Development Story
 Often, there is a huge difference between software being easy to use, and easy to develop. When your... Read more...

How to Import Data from HTML pages
 It turns out that there are plenty of ways to get data into SQL Server from websites, whether the data... Read more...

SQL Scripts Manager: An Appreciation
 SQL Scripts Manager is Simple-Talk's present to its readers. William Brewer was an enthusiastic... Read more...

Hosted Team Foundation Server 2010 Review
 Team Foundation Server (TFS) has expanded its remit to support the whole software development process,... Read more...

The Parodist: A SQL Server Application
 Every year, we ask Phil Factor to celebrate the holiday season with an article on SQL Server... Read more...

Beginning SQL Server 2005 Reporting Services Part 1
 Steve Joubert begins an in-depth tour of SQL Server 2005 Reporting Services with a step-by-step guide... Read more...

Ten Common Database Design Mistakes
 Database design and implementation is the cornerstone of any data centric project (read 99.9% of... Read more...

Reading and Writing Files in SQL Server using T-SQL
 SQL Server provides several "standard" techniques by which to read and write to files but, just... Read more...

Beginning SQL Server 2005 Reporting Services Part 2
 Continuing his in-depth tour of SQL Server 2005 Reporting Services, Steve Joubert demonstrates the most... Read more...

Creating CSV Files Using BCP and Stored Procedures
 Nigel Rivett demonstrates some core techniques for extracting SQL Server data into CSV files, focussing... Read more...

Over 400,000 Microsoft professionals subscribe to the Simple-Talk technical journal. Join today, it's fast, simple, free and secure.

Join Simple Talk