Declarative SQL: Using References

There are several ingenious ways of using SQL References to enforce integrity declaratively. Declarative Referential Integrity (DRI) is more effective than using procedural code in triggers, procedures or application layers because it uses the SQL paradigm, thereby making optimisation easier and providing clearer expression of the rules underlying the data.

In this series, there are …

A Bit of History

It’s time for another history lesson: a return once more to the late 1960s, with hip-hugger bellbottoms and psychedelic music. This is also when commercial disk drives first started to appear. Disk drives made random access possible in a way that you simply could not do on magnetic tape.

Charles Bachman invented a system called IDS (Integrated Data Store) at General Electric. IBM had a hierarchical file system called IMS (Information Management System), but it lacked the flexibility of Bachman’s more general network data model. Bachman’s model was so successful for commercial products that it won him the ACM Turing Award and pretty much determined how databases were built for the next few decades. That view was navigational, not relational.

Bachman’s talk for his ACM award was entitled “The Programmer as Navigator”, and that describes the sort of products that came out of his model. The records are strung together by access paths made up of pointer chains. The programmer imagined that he was on a scooter, traveling on those paths and picking up the data as he went. We still processed data one record at a time, but the way we fetched that data was very different from tape files and sequential files.

In the 70s, CODASYL (“Conference/Committee on Data Systems Languages”, the creators of COBOL) attempted to standardize some of the terms and to get the ANSI X3H2 database standards committee to rubber-stamp what they were doing. This led to a standard, called NDL (Network Database Language), that was passed, never implemented, and then left to expire. We never expected it to become a product; we wanted to get the terminology into a standard, so that various products could then be compared objectively.

The method used in all of these early network database products involved linked lists based on pointers. Hashing, RAID, and other more exotic access methods were simply not implemented.

Unfortunately, a lot of the old network terminology and data modeling still exists in the cultural mindset of new SQL programmers. The term “link”, for example, refers to a pointer which can be traversed in either direction. A “slave – master” or “child – parent” link was the term for a pointer that could be traversed in only one direction. A “junction” table was a pointer structure that went from base file records to many different subordinate files. A “chain” was a linked list of subordinate records in sorted order that had an end-of-chain marker of some kind.

Pointers used to be implemented as integers in the hardware. This is probably why IDENTITY and the other “pseudo-pointers” in bad SQL are INTEGER, SMALLINT, and BIGINT columns. The reason for this exact numeric data type was that, before relational databases, we did arithmetic on pointers, adding and subtracting constants to manipulate them inside the storage of the machine.

When we got to the C programming language, pointer arithmetic became possible but a really, really bad idea. Ask anyone who cheerfully overwrote the operating system on a DEC PDP-11 with it.

These links, chains, slaves, or whatever names they had in a particular vendor’s terminology, were not part of the data. They were part of the access method used to get to the data.

References and Pointers

In RDBMS and SQL, we do not have pointers; we have references. A reference is not a pointer; it is a very different, more abstract concept. There are only two kinds of tables in this model: referenced and referencing. When you say this out loud, you have to be careful, because the two words sound so much alike, but they are descriptive. The referencing table has to match a subset of columns in the referenced table. In the full ANSI standard there are a lot of extra options on matching criteria, but they are seldom used and do not exist in SQL Server. SQL Server uses a simple grouping equivalence relation for matching.

The referencing table has a FOREIGN KEY constraint which lists the columns of a UNIQUE or PRIMARY KEY constraint in the referenced table, in the order they were declared in the referenced table. It is simply good practice that columns do not change names from table to table. Oh, the referencing and referenced tables can be the same table, but this kind of self-reference in SQL Server will probably require disabling the constraint, doing an action, and then turning the constraint back on. In the full ANSI version, constraints can be declared DEFERRABLE or NOT DEFERRABLE, INITIALLY DEFERRED, and so on; it can be pretty complicated, but it saves a lot of procedural code.

Pointers are passive; references, in contrast, are active. In the early SQL products, we used triggers because we had no choice. Remember that everything was still built on old file systems. Approximately 80% to 95% of the triggers enforced referential integrity with explicit procedural code.

Today, we have declarative referential integrity (DRI) actions in SQL. The SQL engine looks for a database event, defined as a deletion from, an update to, or an insertion into a table. The DRI actions do not care about insertions, only deletions and updates. New SQL programmers are occasionally surprised when they find rows disappearing or changing in tables that they have not touched. This can mean that another session has fired a DRI action and they had no way of knowing it.

The syntax is a clause added after the list of columns involved in the referenced table. The default assumption is that you will use the PRIMARY KEY, but it is good practice to be explicit. The basic syntax, in outline, is:
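FOREIGN KEY (<referencing column list>)
 REFERENCES <referenced table> (<referenced column list>)
 [ON UPDATE <referential action>]
 [ON DELETE <referential action>]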

A FOREIGN KEY can have only one update rule and only one deletion rule. Here is a painfully detailed explanation of each.

  • ON UPDATE CASCADE: any change to a referenced column in the referenced table causes the same change to the corresponding referencing column in matching rows of the referencing table.
  • ON UPDATE SET NULL: any change to a referenced column in the referenced table causes the corresponding referencing column in matching rows of the referencing table to be set to NULL. All the columns that make up the foreign key are set to NULL.
  • ON UPDATE SET DEFAULT: any change to a referenced column in the referenced table causes the corresponding referencing column in matching rows of the referencing table to be set to its default value.
  • ON UPDATE NO ACTION (the default): there is no referential update action; the referential constraint only specifies a constraint check. The ON UPDATE NO ACTION does not perform its constraint check until the entire set of rows to be updated has been processed.
  • ON UPDATE RESTRICT: any change to a referenced column in the referenced table is prohibited if there is a matching row. SQL Server does not have this option.

The deletion rule has the same options.

  • ON DELETE CASCADE: if a row of the referenced table is deleted, then all matching rows in the referencing table are deleted.
  • ON DELETE SET NULL: if a row of the referenced table is deleted, then all referencing columns in all matching rows of the referencing table are set to NULL.
  • ON DELETE SET DEFAULT: if a row of the referenced table is deleted, then all referencing columns in all matching rows of the referencing table are set to the column’s default value.
  • ON DELETE NO ACTION (the default): there is no referential delete action; the referential constraint only specifies a constraint check.
  • ON DELETE RESTRICT: it is prohibited to delete a row of the referenced table if that row has any matching rows in the referencing table. SQL Server does not have this option.

To be honest, the cascades are the most used options in this repertoire. They enforce a data-modeling principle: an entity is classified as weak or strong. A strong entity exists by itself, on its own merits; a weak entity exists only because it is protected by a strong entity. For example, an Orders table is a strong entity that has to exist for the weak entity, Order_Details, to exist.

If we delete an order, then we expect that all of its details will also be deleted. In a badly designed system, you might be able to delete the order, but the order detail rows will continue to exist in their table; these unmatched rows are orphans.
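Here is a minimal sketch of that strong/weak pair; the table and column names are illustrative.

CREATE TABLE Orders
(order_nbr INTEGER NOT NULL PRIMARY KEY,
 order_date DATE NOT NULL);

CREATE TABLE Order_Details
(order_nbr INTEGER NOT NULL
  REFERENCES Orders (order_nbr)
  ON UPDATE CASCADE
  ON DELETE CASCADE,
 sku INTEGER NOT NULL,
 order_qty INTEGER NOT NULL,
 PRIMARY KEY (order_nbr, sku));

-- deleting an order deletes its details in the same statement; no orphans
DELETE FROM Orders
 WHERE order_nbr = 42;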

The bad news is that you will not get information about the cascaded action. The statement is done on the referenced table, so the message deals only with that table. You do not get a message about the referencing table(s).

But what if the weak entity has even weaker entities subordinated to it? The action continues to cascade down the referential chain. This means that if you have a spider’s web of DRI actions, they can go very deep, take some time to complete, and tie up your machine. This is actually a small trade-off for data integrity. Otherwise, you would have to put this logic in the application layers, handle all of the table locking yourself, and, if the action fails, do the ROLLBACK manually.

In the ANSI standards, table A can reference table B and vice versa. This is probably not a good idea, because SQL Server is very sensitive to cycles in DRI chains. Consider how you would resolve the situation where table A cascades to table B and also to table C, but table B also cascades to table C. Which action (A or B) has an effect on table C? Since SQL is supposed to be a set-oriented language, they both should happen at the same time. Obviously, this does not work when both actions hit the same row and column, so it is disallowed.

Another version of the same problem would be if table A cascades to table B twice; that is, we have two separate foreign key references in A to table B. SQL Server disallows this; some other SQL products will allow it, but only as long as there is no contradiction. Do not write code like this. It is just too hard to maintain.

SQL Server worries about cycles more than it should. Here is a skeleton that you should play with to get a feel for how the cascades work (a rough reconstruction; the constraint names are illustrative, and the referencing columns are left NULL-able so that you can also try SET NULL later):
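CREATE TABLE strong_items
(strong_id INTEGER NOT NULL PRIMARY KEY);

CREATE TABLE weak_items
(weak_id INTEGER NOT NULL PRIMARY KEY,
 strong_id INTEGER
  CONSTRAINT fk_weak_to_strong
  REFERENCES strong_items (strong_id)
  ON DELETE CASCADE);

-- two cascade paths reach this table from strong_items:
-- one direct, one through weak_items
CREATE TABLE weaker_items
(weaker_id INTEGER NOT NULL PRIMARY KEY,
 weak_id INTEGER
  CONSTRAINT fk_weaker_to_weak
  REFERENCES weak_items (weak_id)
  ON DELETE CASCADE,
 strong_id INTEGER
  CONSTRAINT fk_weaker_to_strong
  REFERENCES strong_items (strong_id)
  ON DELETE CASCADE);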

This gives an error message like the following:
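Msg 1785, Level 16, State 0
Introducing FOREIGN KEY constraint 'fk_weaker_to_strong' on table 'weaker_items' may cause cycles or multiple cascade paths. Specify ON DELETE NO ACTION or ON UPDATE NO ACTION, or modify other FOREIGN KEY constraints.
Msg 1750, Level 16, State 0
Could not create constraint. See previous errors.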

Here, weaker_items references both strong_items and weak_items, giving two cascade paths. But this version will compile, since it has no cycles:
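-- a single chain of cascades: strong_items -> weak_items -> weaker_items
CREATE TABLE strong_items
(strong_id INTEGER NOT NULL PRIMARY KEY);

CREATE TABLE weak_items
(weak_id INTEGER NOT NULL PRIMARY KEY,
 strong_id INTEGER
  CONSTRAINT fk_weak_to_strong
  REFERENCES strong_items (strong_id)
  ON DELETE CASCADE);

CREATE TABLE weaker_items
(weaker_id INTEGER NOT NULL PRIMARY KEY,
 weak_id INTEGER
  CONSTRAINT fk_weaker_to_weak
  REFERENCES weak_items (weak_id)
  ON DELETE CASCADE);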

Take this skeleton and replace the CASCADE with SET NULL, SET DEFAULT, and NO ACTION when you play with it. Finally, put some actions in a transaction and try a ROLLBACK. It is a good idea to draw a referential diagram (circles for tables, arrows for references) when things get tricky. As soon as you see a cycle (directed or undirected), try again.

Finally, we can have a self-reference. The best example of how to do this is Alexander Kuznetsov’s contiguous dates constraint idiom.
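Here is a condensed sketch of the idea; the table and column names are illustrative, adapted from his pattern. Each row carries the end date of the previous period; a CHECK constraint forces it to abut the current start date, and a self-referencing FOREIGN KEY forces it to be the finish date of a real row for the same task.

CREATE TABLE Tasks
(task_id INTEGER NOT NULL,
 started_at DATE NOT NULL,
 finished_at DATE NOT NULL,
 previous_finished_at DATE, -- NULL only for the first period
 CONSTRAINT pk_tasks PRIMARY KEY (task_id, started_at),
 CONSTRAINT uk_tasks UNIQUE (task_id, finished_at),
 CONSTRAINT chk_contiguous
  CHECK (previous_finished_at = started_at),
 CONSTRAINT fk_tasks_self
  FOREIGN KEY (task_id, previous_finished_at)
  REFERENCES Tasks (task_id, finished_at));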

In full ANSI/ISO Standard SQL, we could have declared the constraints to be INITIALLY DEFERRED, but doing this in SQL Server takes a little work. We have to use a procedure with explicit code in it.
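A minimal sketch of the workaround, reusing the Tasks table above: disable the constraint inside a transaction, make the multi-row change, and then re-enable it WITH CHECK so that SQL Server re-validates every row.

BEGIN TRANSACTION;

ALTER TABLE Tasks NOCHECK CONSTRAINT fk_tasks_self;

-- ... the multi-row change that would violate the constraint mid-flight ...

ALTER TABLE Tasks WITH CHECK CHECK CONSTRAINT fk_tasks_self;

COMMIT;

If the re-validation fails, the ALTER TABLE raises an error and the transaction can be rolled back, just as a deferred constraint would fail at COMMIT.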

Conclusion

While there are limitations, I hope I have convinced you that DRI actions are a better choice than trying to do this with procedural code in triggers, procedures or application layers. This is one of the fundamental differences between a file system (passive dumb storage) and a database (active, smart storage).

  • Anonymous

    SQL Server does not have this option?
    I’m slightly confused by this one… surely the SQL Server default is actually ‘restrict’? I never specify any of these options but delete statements always terminate if they would create orphan rows in the referencing table.

  • Ross Presser

    You need a proofreader.
    "Non-deplorable"?? Seriously, Joe, a spellcheck isn’t good enough when you’re going to publish. Get an editor.

  • Anonymous

    You spelled rows rose

    Pretty nice article, but you have more than spelling errors

    Thanks

  • Anonymous

    Proofreading Guys
    You "proofreading" guys must be the people that make me intentionally leave out a colon in the label before the text box. You have a need to find something wrong, are not smart enough to make an intelligent contribution, so I make you feel better by letting you point out that I omitted the colon. Maybe Joe was doing this just for you. You should thank him.

  • Anonymous

    Proofreading Guys
    Or maybe you just listened to Thumper’s mother. You did see Bambi, right? "If you can’t say something nice, don’t say anything at all."

  • Steve K

    “Obviously this does not work, so it is disallowed.”
    "Consider how you would resolve the situation where table A cascades to table B and also to table C, but table B also cascades to table C."

    It is not obvious that this does not work. In fact, it should be relatively simple to make this work. The only limitation should be that the referencing columns in C are distinct* (i.e., the columns in C which refer to A don’t refer to B as well). This is logical because breaking this rule would require either repeating unique keys between A and B or creating a multi-use column in C, which would be better off separated to begin with. Then each key’s columns can be updated simultaneously and independently.

    The only potential conflict would be where one key has ON DELETE CASCADE and the other has ON (action) SET (something), but since logically these two actions happen at the same time (this is declarative programming after all), the net result would be that the row is deleted. While this is something to be careful about, any competent developer would see this as a result of creating the ON DELETE CASCADE rule.

    *Sadly, SQL Server doesn’t enforce distinct referencing columns (as of SQL Server 2014). It will happily allow you to create a set of columns in table C which references column sets in both table A and table B. The only case where I can see this would be useful is if you wanted to restrict C to a subset of the intersection of A and B where neither A nor B is constrained to be a subset of the other, but also allow rows in C with no relation (i.e., NULL in the referencing column set). I struggle to find a real-world scenario where this would be the case.

  • ChrisCarsonSQL

    Sad truth about DRI
    Joe, your analysis is excellent, but one thing I’ve seen continually is that many many installations simply refuse to use DRI, and it’s enforced via standards.

    Many large shops have very little control over the knowledge level of their developers, and they opt to "keep things simple" by disallowing any background processes like DRI to enforce data integrity.

    These shops are content to enforce RI via code, often with spectacular failures as a result.

    I’d appreciate some feedback on ways to evangelize for this to potential clients.

    Common arguments against using DRI include too much background overhead, additional complexity that new developers may not see, and insufficient documentation once the DRI is in place.

  • Robert young

    Bank of Allen
    Allen Holub (my coding heeero) dealt with the issue, although not specific to RDBMS, in his "Bank of Allen" series on OOD/OOP in the mid-to-late 90s, here:
    http://www.drdobbs.com/what-is-an-object/184410076

    "The only recourse is to change the ROMs in every ATM in the world (since there’s no telling which one Bill will use), to use 64-bit doubles instead of 32-bit floats to hold account balances, and to 32-bit longs to hold five-digit PINs. That’s an enormous maintenance problem, of course. "

    Holub illustrates the problem with client code doing what should be done on the server/datastore, so it’s not specific to RDBMS, but I am among those who contend that the RM/RDBMS is the first implementation of "objects": they are data and method encapsulated. The coder class of folks simply refuse to cede control/LoC to the datastore. Too much moolah to lose. Their managers are clueless, and desirous of keeping their bloated headcount, so they demur.

  • Anonymous

    Bloated headcount
    Good code and design does not lead to reduced headcount, it leads to more and better software. Bad design may take a lot of people to maintain, but it is on its way to the ashcan.

  • Joe Celko

    disallowing any background processes like DRI to enforce data integrity.
    "Against stupidity the gods themselves contend in vain." – Friedrich Schiller.

    >> Common arguments against using DRI include too much background overhead, additional complexity that new developers may not see, and insufficient documentation once the DRI is in place.<<

    If you have insufficient documentation for the business rules, you are screwed. But those diagrams with the pretty arrows and boxes for RI seem to be easy enough for management.

    Maybe a simple example with Orders & Order_Details that fails with the orphans? My wife is currently working with a retail app that has to handle returns & refunds outside the database. Instead of re-stocking being automatic, it has to run overnight in batch processing. It is full of orphans and duplicate orders! ARRGH!

  • Robert young

    Bring Me The Head of Alfredo Garcia
    — Good code and design does not lead to reduced headcount, it leads to more and better software. Bad design may take a lot of people to maintain, but it is on its way to the ashcan.

    In an Eden of development, certainly. But when it comes to software development where there is a RDBMS involved, the RM/DRI replaces some (or, a lot) of LoC. And, if your wallet is filled by a LoC metric, you’re not going to embrace the RDBMS-centric model. I’ve worked in everything from half-a-dozen-staff startups to Fortune 100 corporations with multiple thousands of people over dozens of locations. The fact remains: more than 3 is a bureaucracy, whether public or private. And in any bureaucracy, counting coup is the currency that matters. Budget and headcount is that currency. A product that doesn’t get you sued is "good enough". You will create it in the manner which also increases your stature in the bureaucracy. Managers seek to increase their span of control. Being more efficient doesn’t serve that purpose.

    One need only look back, not all that long, at the mess created by mortgage companies and banks seeking to expand their mortgage portfolios. One would think they’d have survival as Job One, but they didn’t. They all tried to "out mortgage" their competition, funded by The Giant Pool of Money. Short term advantage always, in our kind of capitalism, destroys long term (national, too) wellbeing.