Declarative SQL: Using UNIQUE Constraints

In SQL, you can express the logic of what you want to accomplish without spelling out the details of how the database should do it. Nowhere is this more powerful than in constraints. SQL is declarative, and Joe Celko demonstrates, in his introduction to Declarative SQL, how you can write portable code that performs well and executes some complex logic, merely by creating unique constraints.

Edited: 21 December 2015

In this series, there are …

A Bit of History

Back in the days of punch cards and magnetic tapes, the only possible ‘file’ structure was a physically sequential one. Usually the records (not rows!) in a ‘file’ were sorted on a subset of the fields (not columns!). In RDBMS, tables (not ‘files’!) have no ordering. You fetch a row using a key and then access the columns within each row by name, not by a relative position in a record. RDBMS is much more abstract.

But let’s go back to the early days of SQL. Nobody had a product built for a relational database from scratch. We had to build on the existing ‘file’ systems, and the IBM products in particular. Sequential ‘files’ (punch cards, tapes or disk ‘files’) depend on knowing the sorted order of the ‘file’, they use it to locate records; a physical read/write head in the hardware has to be moved in a physical ‘file’ by counting records. Think about trying to do random access on a mag tape. The tape drives cannot spin fast enough for this approach, but if you know if a record is ahead or behind your read/write head’s current position you can make this work. I can merge two or more tapes easily. Even disk ‘files’, back then, were sorted so that we could use ISAM (Indexed Sequential Access Method).

The PRIMARY KEY construct came directly from the ‘file’ sort key. This keyword told the early SQL engines about how to lay the data on the physical storage. This is why original Sybase/SQL Server used a clustered index by default on a PRIMARY KEY . Welcome to the ISAM model!

Even Dr. Codd fell into this mindset when SQL was being created. He initially did not realize that a key is a key by its nature and none of them are ‘more key-ish’ than any other key. Then we added hashing and other access methods to disk system;. By the time that Dr. Codd had dropped the PRIMARY KEY from the Relational Model, SQL was already set in its ways.

To be a valid relational table, you must have a key. If the table has only one key, then we like to declare it as the PRIMARY KEY for documentation: But the underlying ‘file’ systems had no requirement for sorting or for UNIQUE ness of records. In fact, a lot of machine time back then was spent on sorting and dropping duplicates records from the ‘files’. So SQL followed its roots and allows an SQL table to be a pile (slang for an SQL table without a key, but unlike a simple flat ‘file’, it has strong typing and perhaps constraints).

A key is a subset of columns of a table such that it is UNIQUE for every row in the table. This is straight out of a textbook. But a UNIQUE value does not make a data element subset a key. To be a PRIMARY KEY , subset cannot have any NULL values. We’ll get into what happens with a UNIQUE constraint when there are NULLs in some of the columns.

What is NOT a Key

You cannot use an IDENTITY table property as a key. It is not a column. It is a count of the physical insertion attempts to one table, on one disc in one schema. Not even successful inserts! A rollback does not re-set the internal counter on that table. In English this is like trying to identify your car by giving the insurance company parking space number you happen to be using today. They want the VIN number instead. Why? Because a VIN identifies the automobile itself no matter where is it is in the Universe.

The GUID and UUID also do not work. The ‘G’ stands for global and the ‘U’ stands for universal. They are intended to locate things that are external to the schema, somewhere in cyberspace. They are not by definition, attributes inside the table or the schema.

Programmers who use GUID, UUID, identity or other auto increment machine generated UNIQUE values are usually faking the old pointer chains and record numbers they had in ‘file’ systems from the 1970s or the early network databases. I call these things exposed physical locators or fake pointers.

UNIQUE and PRIMARY KEY Constraints

There are some important differences between UNIQUE and PRIMARY KEY constraints. There can be only one PRIMARY KEY per table but many UNIQUE columns. A PRIMARY KEY is automatically declared to have a NOT NULL constraint on it, but a UNIQUE column can have a NULL in a column unless you explicitly add a NOT NULL constraint. Adding the NOT NULL whenever possible is a good idea, as it makes the column into a proper relational key. I also add NOT NULL to PRIMARY KEY to document the table and to be sure it stays there when the key changes.

‘File’ system programmers understand the concept of a PRIMARY KEY , but for the wrong reasons. Their mindset is a sequential ‘file’, which can have only one key because that key is used to determine the physical order of the records within the ‘file’. There is no ordering in a table; the term PRIMARY KEY in SQL has to do with defaults in referential actions, which we will discuss later.

Super Keys

A super key is a key with too many columns in it. Somewhere in it, you can find a proper subset of columns that is already a key. So why would you want to waste resources on such a thing? In theory, it is a bad idea, but in practice it can be handy. If your SQL engine allows you to include non-indexed columns in an index, but carry them in the leaf nodes of the tree, you can get what is called a “covering index” for the queries. Instead of having to get the non-indexing columns from the base table, you already have it covered.

For example, we know that the model year of something is not going to be a key and serial numbers are keys. But if we almost always want to have the model year for other thing, we might use:

This has to be done with an index and not a UNIQUE constraint, which does not have the extra syntax. Do not confuse this with a multiple-column key.

NULLs in UNIQUE Constraints

SQL has two kinds of equivalence operation.

The first is a version of the usual equal (=) for numeric, string and temporal values. Well, not quite the usual operator because all data types are defined as NULL-able in SQL. The rule is that any theta operator (=, >, <, =>, =<., etc) returns UNKNOWN when applied to a NULL. This is why a NULL is not equal to anything, even itself.

The second equivalence operator is grouping, as in GROUP BY. We debated this in the original X3H2 Database Standards; should each NULL form its own group or not? We decided to put all the NULLs into one group. This principle appears is several places in SQL constructs, including the UNIQUE constraint.

Multiple-Column Keys

There is also a multiple-column form of the <UNIQUE specification> <column list>; it means that the combination of those columns is UNIQUE . It says nothing about each column separately. For the math majors, we can write some rules. Notice which way the implications go.

The multiple-column uniqueness has some basic flavors. Here is classification of them:

Co-ordinates

The most familiar coordinate system are (longitude, latitude) geographical data or spatial data with (x, y) or (x, y, z) grids. The defining characteristic of a coordinate system is that you need to have all the columns in subset to get any information. Each column is independent of the others.

Please do not think that the dimensions have to be spatial. In fact, any physical entity has to exist in time, as well as space.

Trees

The most familiar trees system are organizational charts, parts explosions and hierarchies The defining characteristic of a tree system is that you do not need to have all the columns in subset to get information. Consider the pair (branch_office , department_name ) as the key of a table. So we can have (‘Chicago’, ‘Accounting’), (‘Atlanta’, ‘Accounting’), (‘Chicago’, ‘HR’), etc. The branch_office column is information about where the company does business in itself.

Quorums

This is a weird situation. Given a set of (n) columns, we can make a key from a subset of any (k < n) columns. This is not quite the same as a super key, which has key and non-key members in its subset. There is a famous programming problem called the eight queens problem. Given a chessboard and eight queens, you are supposed to place the queens on the board in such a way that none of them capture any other queen. This means that each “board_rank ‘ (a thru h) has one and only one queen, and each ‘file’ (1 thru 8) also has one and only one queen. To locate a particular, I can use the usual (“board_rank “, ‘file’) pair, but because of the restrictions we could use (“board_rank ‘) or “board_file “) as a key.

One solution for the Eight Queens:

Any of the following statements will remove the same row. I only need 1 of 3 columns to locate a unique row.

Once you have the row out, you can try to replace it with these statements and results:

The constraints allow only one value for the board_file in this simple table. But the insertion cannot figure out missing value for “board_file ” automatically, nor can it be done in more general cases.

Reducing Redundant & Incomplete Keys

My favorite example is a teacher’s schedule kept in a skeleton table like this:

That choice of a primary key is the most obvious one — use all the columns. The uniqueness rules we want to enforce are simple to understand.

  1. A teacher is in only one room each period.
  2. A teacher teaches only one class each period.
  3. A room has only one class each period.
  4. A room has only one teacher in it each period.

A sample row might look like this. You get senior citizen bonus points if you remember “Room 222” from the last 1960’s.

Stop reading and see what you come up with for an answer. Okay, now consider using one constraint for each rule in the list, thus.

We know that there are four ways to pick three things from a set of four things; it is called a combination.

I could drop the PRIMARY KEY as redundant if I have all four of these constraints in place. But what happens if I drop the PRIMARY KEY and then one of the constraints?

I can now insert these rows in the second version of the table:

This gives me a very tough sixth period teaching load since I have to be in two different rooms at the same time. Things can get even worse when another teacher is added to the schedule:

Ms. Shields and I are both in room 223, trying to teach different classes at the same time. The trick is to combine the rules

If a teacher is in only one room each period, then given a period and a teacher I should be able to determine only one room, i.e. room is functionally dependent upon the combination of teacher and period. Likewise, if a teacher teaches only one class each period, then class is functionally dependent upon the combination of teacher and period. The same thinking holds for the last two rules: class is functionally dependent upon the combination of room and period, and teacher is functionally dependent upon the combination of room and period.

With the constraints that were provided in the first version, you will find that the rules are not enforced. For example, I could enter the following rows:

These rows violate rule #1 and rule #2.

However, the UNIQUE constraints first provided in Schedule_2 do not capture this violation and will allow the rows to be entered.

The constraint …

… is checking the complete combination of teacher, room, and period, and since (‘Mr. Celko’, 222, 6) is different from (‘Mr. Celko’, 223, 6), the DDL does not find any problem with both rows being entered, even though that means that Mr. Celko is in more than one room during the same period.

does not catch its associated rule either since (‘Mr. Celko’, ‘Database 101’, 6) is different from (‘Mr. Celko’, ‘Database 102’, 6), and so, Mr. Celko is able to teach more than one class during the same period, thus violating rule two. It seems that we would also be able to add the following row:

… which violates rules #3 and #4.

Try to imagine the task of enforcing this with procedural code. This is why I say that most of the work in SQL is done in the DDL.

Conclusions

UNIQUE constraints can provide a great deal of conditional logic to maintain rules. These patterns are completely portable across relational databases: The indexes they create can be used by the optimizer to improve performance. But the real advantage is that they eliminate the need for procedural code in the database and the application layers.

In the second part of this article, we will discuss using the REFERENCES clause to further eliminate the need for procedural code in the database and the application layers.

Tags: , ,

  • 19385 views

  • Rate
    [Total: 43    Average: 4.6/5]
  • pgeerkens

    Natural Key constraint?
    What if we imagine the existence of a constraint "NATURAL KEY" that imposed UNIQUEness, required constituent columns to be NOT NULL, and could be specified multiple times on a single table. Could we not now allow the PRIMARY KEY constraint to return to its roots in the physical design domain, whence it originated and still retains a vestigial value? Much of the angst over the PRIMARY KEY constraint seems to me to arise from the origin of the phrase itself in the logical-design domain, yet being co-opted into the physical domain of the SQL Server engines.

  • tonyrogerson

    No mention of surrogate keys?
    You fail to mention that auto-generated numbers are a good means of providing a non-mutable (i.e. stable) method of detaching the natural key from coding logic on joins. The natural key ought only to be used for look up – specified from the user application.

    Consider the real problem of using Social Security Number as the natural key in your database schema, consider the 5 tables that might foreign key with the base table. You have SSN located in 6 tables.

    Now consider privacy requirements. What an absolute nightmare it would be to anonymise data for say a test or development environment.

    What a nightmare it is when legislation is passed preventing you from storing the SSN unencrypted. You then end up in a position where you need to decrypt key columns so you can use them – yes what a nighmare of world that would be if we didn’t use surrogate keys.

    We’ve had this conversation in the past, surrogate keys are fine in the RM as defined by both Codd and Date so long as they are non-mutable which in SQL Server terms the IDENTITY property provides.

  • Celko

    NATURAL JOIN
    ANSI standard SQL has a natural join is one of the infix join operators. It works by matching columns on the quality in both tables which have the same column name. There is no ON clause.

    It is really a bad idea. If either of the tables changes the column name or adds new columns read, recompiling the SQL will result in a different join.

  • Celko

    Encryption and perfect hashing.
    The SSN would appear in one and only one table, where it identifies the recipient. The other tables would be foreign key references back to that one base table. In T-SQL, you do have to change all five tables, when the key changes. However, that is a dialect problem. In Sybase SQL anywhere and other products.

    The foreign key is not a repetition of a value; it is a pointer chian back to the base table where the referenced value appears once. This model of RDBMS storage comes from the junction tables in the old network databases. This model makes joins and cascades incredibly fast. It also allows the referenced table data to be fully encrypted. The encryption has to be perfect (i.e. each value in the referenced table maps to one value in the referencing tables). Fortunately perfect hashing has been a hot topic in database research for many years and you can even find minimal perfect hashing algorithms for databases.

  • Robert young

    a cluster duck
    Quack? there is a very good benefit to a clustered index: it provides a true performance improvement (less so with SSD, btw) for read/write/update, if one knows enough to set Phil Factor (by its various names across engines) optimally. kind of like the embedded access path of IMS/xml and such. the whole point, beyond simple identity, of a PK is just this access path (particularly with compound PKs). in cases where the PK is different from the clustered index, one should think a bit on why that has happened. likely answer: one’s data structure isn’t quite right.

    now, if only the coding class will just admit that "logic" really, really is better executed within the data store… we’d certainly need fewer coders.

  • Anonymous

    Eight Queens
    The eight queens problem also includes a diagonal check. The solution given is incorrect as the queen on a1 and h1 would capture each other.

    https://en.wikipedia.org/wiki/Eight_queens_puzzle

  • David

    Lies, Lies, and Celko
    "You cannot use an IDENTITY table property as a key. It is not a column" – "The GUID and UUID also do not work"

    Uhhhh, I guess every database I’ve ever seen in the last 20 years doesn’t exist.

    One more Celko article that has zero basis in reality or practicality.

  • Celko

    Only 20 years 🙂
    It is quite possible (well, actually, highly likely) that you have been working with bad systems for 20 years. Much of my work for the last 40 years has been repairing such databases.

    A key is a subset of the columns of a table that are 1) NOT NULL 2) unique for each row in the table. The important thing is that the columns are actual attributes of the entity being modeled. Where is things like identity or other external physical locators (like the count of insertion attempts in one database system, on one machine in one table) lacks two important properties. Validation: does the data look like it is, in the same domain as the column? Verification: there is an external method of determining if the entity actually exist.

    For example, consider a friend of mine who had a car in Los Angeles. The records were in a network database (which is what you are mimicking) and there was no check to see if the VIN numbers were duplicated or not. When he went to register it again, he was told that the car had been seized in a drug bust and the DMV wanted to confiscate it ..again. According to their database, they had already destroyed the vehicle once, and wanted to do it again. Oh, and there was a small matter of trying to arrest him.

    A fundamental problem with pointers (or tricks to mimic them) is that a pointer can point to anything. And they do and they do 🙂

  • Robert young

    RBAR holds the silo together
    viz, Celko: Much of my work for the last 40 years has been repairing such databases.

    CS training remains in love with low level looping, so flat/sequential files are the definitional datastore. alas, no SQL engine demands Organic Normal Form&trade;, so any byte dump is accepted. bad as they are, IMS/IDMS make structural requirements of their "schemas". too bad they’re silly schemas.