Contiguous Time Periods

It is always better, and more efficient, to maintain referential integrity by using constraints rather than triggers. Sometimes it is not at all obvious how to do this, and history tables, along with other temporal data tables, present problems for checking data that are difficult to solve with constraints. Then Alex Kuznetsov came up with a good solution, and now history tables can benefit from more effective integrity checking. Joe explains...

Alex Kuznetsov is the creator of a clever technique for building a history table that I will be describing in this article. It solves a common problem with declarative referential integrity constraints and deserves to be more widely known, especially now that we have a DATE data type in SQL Server.

History tables reflect the nature of time in the ISO model. That is, time is a continuum and not a set of discrete points. Continua have an infinite number of elements, but subsets can be bounded. To show a duration in SQL, we store the starting point of the duration and limit it with an end point that is not actually in the duration. This is called a half-open interval, which we usually draw as a line with a solid dot at the starting point and an open dot at the end.

Half-open intervals have nice properties. You can subtract a half-open interval from the middle of another half-open interval and you get half-open intervals as a result. You can also abut half-open intervals and get a new half-open interval. That means I can chain them together and not worry about having a duplicated or missing point in a time-line.

In particular, we really like to have contiguous time periods in a history table. We would also prefer that this constraint be enforced declaratively, instead of with procedural code in triggers, functions or procedures. Let's go right to an example and discuss the code. We are going to create a table of vaguely defined Tasks that are done one right after the other, without gaps or overlaps.

Abutting half-open intervals yield a new half-open interval:

[Figure: two abutting half-open intervals combine into a single half-open interval]

It is worth mentioning that the new DATE data type is only 3 bytes, so reading it off a disk is cheap. You will see why this is important.

A Note on Temporal and Floating Point Data

I mentioned that time is a continuum. The real numbers are also a continuum, which we model with floating point numbers in computers. Technically, there is a difference between REAL and FLOAT in Standard SQL, but nobody cares. The reason that nobody cares is that the IEEE 754 Standard is burned into chips and universally understood.

Floating point math is fuzzy, and programmers have learned to live with its imprecision. When you compare two floats for equality, the software allows an “epsilon”, a small wiggle factor. The numbers are treated as equal if they differ by that epsilon or less.
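
A minimal sketch of that idea in T-SQL, with values and an epsilon of my own choosing:

-- two values count as equal when they differ by no more than the wiggle factor
DECLARE @a FLOAT = 1.0E0 / 3.0E0;
DECLARE @b FLOAT = 0.333333;
DECLARE @epsilon FLOAT = 0.001;

SELECT CASE WHEN ABS(@a - @b) <= @epsilon
            THEN 'equal within epsilon'
            ELSE 'not equal' END AS comparison_result;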

Temporal data in T-SQL used to be a prisoner of UNIX system clock ticks and could only go to three decimal places of seconds, with rounding errors. The new ANSI/ISO data types can go to seven decimal places of seconds, and we have true DATE and TIME data types. Since they are new, most programmers are not using them yet.

One of the idioms of the T-SQL dialect has been getting just the date out of a DATETIME column. The traditional solution in the Sybase days was casting the column to FLOAT, taking a FLOOR() or CEILING(), and then casting it back. This depended on the internal representation of the DATETIME values. The rounding error was why T-SQL could not represent some times accurately. Here is a sketch of the code for the date at zero hour, with my_date_column standing in for any DATETIME column:
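
-- round the internal float value down to midnight of the same day
CAST(FLOOR(CAST(my_date_column AS FLOAT)) AS DATETIME)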

To get the following day, we used the CEILING() version, something like this:
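
-- round the internal float value up to midnight of the following day
-- (a value that is already exactly at midnight stays where it is)
CAST(CEILING(CAST(my_date_column AS FLOAT)) AS DATETIME)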

The reason was simple: a lot of UNIX machines had floating point hardware. Today the equally awful idiom is something like this:
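
-- count the whole days from the zero date, then add them back to day zero
DATEADD(DAY, DATEDIFF(DAY, 0, my_date_column), 0)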

This leaves a DATETIME with a time of zero hour. Today you can simply write “CAST (my_date_column AS DATE)” which is a lot easier to read and maintain. It also gives you a real date.

Likewise, “CAST (my_date_column AS TIME)” will return just the time fields (yes, the ANSI Standard calls them fields to make sure you do not confuse them with columns or substrings).

Rounding DATETIME and DATETIME2 columns is done the same way. This gives you the ability to do the temporal version of the floating point epsilon. Insert the data with a high precision, but compare it at a coarser precision. Here is a quick example to play with; the sample values are mine:
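
-- store two readings at the full seven decimal places of seconds
DECLARE @t1 DATETIME2(7) = '2010-01-01 12:00:00.1234567';
DECLARE @t2 DATETIME2(7) = '2010-01-01 12:00:00.4567890';

SELECT CASE WHEN @t1 = @t2
            THEN 'same' ELSE 'different' END AS at_full_precision,
       -- compared at a coarser precision, both values land on 12:00:00
       CASE WHEN CAST(@t1 AS DATETIME2(0)) = CAST(@t2 AS DATETIME2(0))
            THEN 'same' ELSE 'different' END AS at_whole_seconds;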

First Attempt: Single Time Table

Here is the DDL for the usual attempt that less-experienced SQL Server developers often make when recording a history in a table. I am assuming that we measure time in days and do not have to worry about minutes or fractional seconds. This is only a sketch; the names follow the discussion below, and the real thing would carry more columns:
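
CREATE TABLE Tasks
(task_id INTEGER NOT NULL,
 task_start_date DATE NOT NULL, -- the implicit end is the next row's start
 PRIMARY KEY (task_id, task_start_date));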

The table has the starting date of the events and nothing else. The assumption is that the implicit task_end_date is the next task_start_date in temporal sequence. You can see the problems immediately. In order to compute simple durations, you need to use a self-join, which probably ought to be in a VIEW, something like this sketch (the view name is mine):
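
-- each row's end date is the earliest later start date for the same task
CREATE VIEW TaskDurations (task_id, task_start_date, task_end_date)
AS
SELECT T1.task_id, T1.task_start_date, MIN(T2.task_start_date)
  FROM Tasks AS T1
       LEFT OUTER JOIN Tasks AS T2
         ON T1.task_id = T2.task_id
        AND T1.task_start_date < T2.task_start_date
 GROUP BY T1.task_id, T1.task_start_date;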

Here is some test data to play with:
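
-- sample values (mine): three events for one task, each implicitly ending
-- when the next one starts
INSERT INTO Tasks (task_id, task_start_date)
VALUES (1, '2010-01-01'),
       (1, '2010-02-01'),
       (1, '2010-03-15');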

The LEFT OUTER JOIN gives a NULL for the last row's task_end_date. But a self-join becomes expensive as a table gets bigger and bigger. Computing date math is also expensive. More than that, gaps are undetectable and create false data. Likewise, overlaps are hidden. A row is supposed to model a complete fact. These rows have only half a fact. You will be doing this self-join constantly to re-assemble what was split apart.

A clustered index on (task_id, task_start_date) will help, but you still have to do computations and grouping.

Second Attempt: Simple History Table

Here is the DDL for a classic History table, again as a sketch. I am assuming that we measure time in days and do not have to worry about minutes or fractional seconds:
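
CREATE TABLE Tasks
(task_id INTEGER NOT NULL,
 task_start_date DATE NOT NULL,
 task_end_date DATE, -- NULL means the task is still in process
 PRIMARY KEY (task_id, task_start_date),
 CONSTRAINT end_and_start_dates_in_sequence
   CHECK (task_start_date <= task_end_date));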

The half-open interval model requires that we know the starting time of an event. The ending time cannot be known if the event is still in process, so it is NULL-able. There is a simple CHECK() constraint to assure that the two dates are in order, but nothing prevents gaps or overlaps.

Even worse, we can have multiple NULL task_end_date rows. What we want is that each task has at most one NULL task_end_date. In Standard SQL, this can be forced with a CREATE ASSERTION statement. But in SQL Server, we can use a TRIGGER, or define VIEWs WITH CHECK OPTION and allow access to the table only through those VIEWs.

A lot of SQL programmers do not know about the WITH CHECK OPTION, but combined with the INSTEAD OF trigger, you can do a lot of complex programming in a single block of DDL.
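
As a minimal sketch of what WITH CHECK OPTION buys you (the view name and values are mine, and this by itself does not enforce the one-open-row rule), a view that exposes only the open tasks will refuse any row that would not be visible through it:

CREATE VIEW OpenTasks
AS
SELECT task_id, task_start_date, task_end_date
  FROM Tasks
 WHERE task_end_date IS NULL
WITH CHECK OPTION;

-- rejected: the new row would not be visible through the view
INSERT INTO OpenTasks (task_id, task_start_date, task_end_date)
VALUES (2, '2010-05-01', '2010-06-01');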

Kuznetsov’s History Table

Kuznetsov’s History Table gets around the weaknesses of the simple history table schema. It builds a temporal chain from the current row to the previous row. This is easier to show with code:
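
-- A sketch reconstructed from the description that follows. The task_score
-- column stands in for whatever attribute the history is tracking. Because
-- the business rule is that the periods abut with no gaps, the CHECK()
-- between previous_end_date and current_start_date is written as an equality.
CREATE TABLE Tasks
(task_id INTEGER NOT NULL,
 task_score CHAR(1) NOT NULL,
 previous_end_date DATE,        -- NULL means this is the first task in the chain
 current_start_date DATE NOT NULL,
 current_end_date DATE,         -- NULL means the task is still in process
 PRIMARY KEY (task_id, current_start_date),
 UNIQUE (task_id, previous_end_date),
 UNIQUE (task_id, current_end_date),
 CONSTRAINT current_start_and_end_dates_in_sequence
   CHECK (current_start_date <= current_end_date),
 CONSTRAINT previous_end_date_and_current_start_in_sequence
   CHECK (previous_end_date = current_start_date),
 CONSTRAINT previous_period_must_exist
   FOREIGN KEY (task_id, previous_end_date)
   REFERENCES Tasks (task_id, current_end_date));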

Well, that looks complicated! Let's look at it column by column. The task_id explains itself. The previous_end_date will not have a value for the first task in the chain, so it is NULL-able. The current_start_date and current_end_date carry the same data elements, temporal sequence CHECK and PRIMARY KEY constraints that we had in the simple history table schema.

The two UNIQUE constraints will allow one NULL in their pairs of columns and prevent duplicates. Remember that UNIQUE is not like PRIMARY KEY, which implies UNIQUE NOT NULL.

Finally, the FOREIGN KEY is the real trick. Obviously, the previous task has to end when the current task starts for them to abut, so there is another constraint: a self-reference that makes sure this is true. Modifying data in this type of table is easy, but requires some thought. Fortunately, Alex has written a Simple Talk article to explain in more detail how it is done.

Disabling Constraints

Just one little problem with that FOREIGN KEY constraint. It will not let you put the first task into the table. There is nothing for the constraint to reference. In Standard SQL, we can declare constraints to be DEFERRABLE, with some other options. The idea is that you can turn a constraint ON or OFF during a transaction so the database can be in a state that would otherwise be illegal. But at the end of the transaction, all constraints have to be TRUE or UNKNOWN.

In SQL Server, you can disable constraints and then turn them back on. This is actually restricted to FOREIGN KEY constraints and CHECK constraints; PRIMARY KEY, UNIQUE, and DEFAULT constraints are always enforced. The syntax is part of the ALTER TABLE statement and it is simple:
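
-- disable just the chain constraint, using the name from the sketch above
ALTER TABLE Tasks NOCHECK CONSTRAINT previous_period_must_exist;

-- or disable all of the FOREIGN KEY and CHECK constraints on the table
ALTER TABLE Tasks NOCHECK CONSTRAINT ALL;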

This is why you want to name the constraints; without user-given names, you have to look up what the system gave you, and those names are always long and messy. The ALL option will disable all of the FOREIGN KEY and CHECK constraints on the table. Be careful with it.

To re-enable, the syntax is similar and explains itself:
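
-- re-enable without checking the rows that are already in the table
ALTER TABLE Tasks CHECK CONSTRAINT ALL;

-- or re-enable and validate the existing rows as well
ALTER TABLE Tasks WITH CHECK CHECK CONSTRAINT ALL;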

When a disabled constraint is re-enabled, the database does not check that the existing data meets the constraint. So, for this table, the body of a procedure to get things started would look something like this (the procedure and parameter names are mine):
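
CREATE PROCEDURE Start_Task_History
(@in_task_id INTEGER,
 @in_task_score CHAR(1),
 @in_start_date DATE)
AS
BEGIN
-- let the first row go in without a previous period to reference
ALTER TABLE Tasks NOCHECK CONSTRAINT previous_period_must_exist;

INSERT INTO Tasks
 (task_id, task_score, previous_end_date, current_start_date, current_end_date)
VALUES (@in_task_id, @in_task_score, NULL, @in_start_date, NULL);

-- turn the chain constraint back on for every later row
ALTER TABLE Tasks CHECK CONSTRAINT previous_period_must_exist;
END;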

You will want to confirm that the constraints are all working by trying to add bad data.

  1. There is only one first event per task; a UNIQUE constraint will take care of that.
  2. There are no duplicates allowed.
  3. No events overlap within a task.
  4. There are no gaps allowed.
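
For example, assuming the table already holds one row for task 1 with current_start_date = '2010-01-01' and current_end_date = '2010-02-01' (the values are mine), each of the following inserts should be rejected:

-- 1. a second "first event" for task 1: fails the UNIQUE constraint on (task_id, previous_end_date)
INSERT INTO Tasks (task_id, task_score, previous_end_date, current_start_date, current_end_date)
VALUES (1, 'C', NULL, '2010-04-01', NULL);

-- 2. a duplicate of an existing period: fails the PRIMARY KEY on (task_id, current_start_date)
INSERT INTO Tasks (task_id, task_score, previous_end_date, current_start_date, current_end_date)
VALUES (1, 'A', NULL, '2010-01-01', '2010-02-01');

-- 3. an overlap: claims the previous period ended on a date no row actually
--    ends on, so it fails the FOREIGN KEY
INSERT INTO Tasks (task_id, task_score, previous_end_date, current_start_date, current_end_date)
VALUES (1, 'B', '2010-01-15', '2010-01-15', '2010-03-01');

-- 4. a gap: starts a month after the previous period ended, so it fails the abutment CHECK()
INSERT INTO Tasks (task_id, task_score, previous_end_date, current_start_date, current_end_date)
VALUES (1, 'B', '2010-02-01', '2010-03-01', '2010-04-01');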

Conclusion

You can avoid procedural code, but it is not always immediately obvious how. Start with a list of constraints and some test data. The (current_start_date, current_end_date) pair has been the standard idiom for history tables in SQL. Until we got the DATE data type, the use of a BETWEEN predicate was a problem. The rounding errors that occurred in the fractional seconds could cause overlaps in the old DATETIME data type. The solution was to write two predicates:
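
-- my_time is whatever point in time is being tested
(my_time >= current_start_date
 AND my_time < COALESCE (current_end_date, CURRENT_TIMESTAMP))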

It works, but it hides the BETWEEN-ness of the three-way relationship. With the new DATE data type, the BETWEEN works just fine.

The previous_end_date column feels strange at first. It seems redundant. But I will argue that it is not. A row in a table should model a complete fact. That is why we have (current_start_date, current_end_date) pairs; together, they tell us about the duration of the task as a whole thing; there is no need to read another row to find the ending date.

A Quick Note on Induction

Since one of the business rules is that we do not allow gaps, we need to have data for determining that. If I depended on the previous row to provide the previous_end_date, I would need self-joins or loops that eventually go back to the first event in the sequence of tasks. In short, it would be a lot of procedural code in a trigger. But I have a nice mathematical property to use: induction.

With the constraints, I can prove that I have no gaps using induction. This is done by proving that the first statement in the infinite sequence of statements is true, and then proving that if any one statement in the infinite sequence is true, then so is the next one. If the idea of math bothers you, think about a chain of (n) dominoes set up for toppling. If I push one domino over, it topples. That is my starting point, or induction basis. Assume that if I push any single domino over in the chain, then its neighbor also falls. That is the inductive hypothesis. So when I push the first domino in the chain, all of them fall down.

  • MisterMagoo

    Are these errors?
    Hi Joe, I liked this article, but have a couple of questions…

    In the table creation code for Second Attempt: Simple History Table

    Is this constraint round the wrong way? shouldn’t the end date come after the start date?

    CONSTRAINT end_and_start_dates_in_sequence
    CHECK (task_end_date <= task_start_date)

    Also, in Kuznetsov’s History Table is the foreign key messed up? should the previous end date not be matched to the current start date?

    MM

  • Arik

Is there a need to disable the Foreign Key constraint
    Hi Joe,
    Thank you for this article.
    Is there really a need to disable the foreign key for the first insert?
    previous_end_date is NULL so there is no violation.
    I tried without disabling the constraint and had no problem. unless I was missing something.
    Thanks,
    Arik.

  • Celko

    Are these errors?
    Oh Yeah, they are errors. Editor, can we fix the text? That is the good part about electronic publishing.

  • Alex_Kuznetsov

    Thank you much!
    Joe,

    Thank you for describing the approach. I really appreciate it and feel honored.

  • Stefan_G

    Why do you say that BETWEEN works just fine ?
    You write:

    Until we got the DATE data type, the use of a BETWEEN predicate was a problem. The rounding errors that occurred in the fractional seconds could cause overlaps in the old DATETIME data type. The solution was to write two predicates:
    (my_time >= current_start_date
    AND my_time < COALESCE (current_end_date, CURRENT_TIMESTAMP))
    It works, but it hides the BETWEEN-ness of the three-way relationship, With the new DATE data type, the BETWEEN works just fine.

    So, if I understand you correctly we should be able to write the following predicate:

my_time BETWEEN current_start_date AND COALESCE (current_end_date, CURRENT_TIMESTAMP)

    But since when using half-open intervals current_start_date of one row is exactly equal to current_end_date of the previous row this will return two rows if my_time happens to exactly match an interval border.

    Or am I missing something ?

    Also, what does this have to do with “The rounding errors that occurred in the fractional seconds” ?

    /SG

  • Anonymous

    no gaps ?
The current constraints do not prevent creating gaps in the sequence.

    To prevent gaps you need to modify the following constraint:

    CONSTRAINT previous_end_date_and_current_start_in_sequence

    CHECK (previous_end_date = current_start_date),

  • Seven24

    The design does allow gaps
Insert Tasks(task_id, task_score, previous_end_date, current_start_date, current_end_date)
Values(1, 'A', Null, '2010-01-01', '2010-02-01')
, (1, 'A', '2010-02-01', '2010-03-01', '2010-04-01')

    Any date between 2010-02-01 and 2010-03-01 would fail the test for gaps.

Forcing the previous_end_date = current_start_date also will not work. What is needed is for the previous_end_date_and_current_start_in_sequence constraint to be defined as follows:

    CONSTRAINT previous_end_date_and_current_start_in_sequence CHECK (previous_end_date = DateAdd(d, -1, current_start_date))

    This presumes that a “gap” is defined as there existing a Date value greater than the previous_end_date and less than the current_start_date.

  • mjfii

re-enabling check constraints without 'with check'...

the constraints become 'not trusted' when you re-enable them without the 'with check' clause; this will hinder the query optimizer, e.g.

    ALTER TABLE Tasks CHECK CONSTRAINT ALL;
    vs.
    ALTER TABLE Tasks with check CHECK CONSTRAINT ALL;

    you can see this here:
    select name,is_not_trusted from sys.check_constraints