SQL and the Snare of Three-Valued Logic

The whole subject of the Three-Valued (also known as ternary, trivalent or 3VL) Logic of SQL tends to trip people up. This is hardly surprising in view of the fact that it involves an esoteric Polish mathematician and because it behaves differently in the DDL (Data Declaration Language) and the DML (Data Manipulation Language). In response to requests, Joe Celko comes to the rescue and makes it all seem simple.

At the end of my article on Relational Division, one of the comments, by a reader named Dean, asked if the sentence “There ain’t no planes in this hangar that I can’t fly!” is a double negative or a triple?   And he asked for an article on Boolean logic next.

Here it is!  But the problem is that SQL has a three valued logic (3VL) and not the two values of Mr. George Boole.  In fact, SQL logic is complicated. 

NULLs

Like many problems, this starts with NULLs.  Dr. Codd put the concept into the first version of the Relational Model.  It is not a value; it is a “place holder” for a value.  The history of the NULL is interesting.  When Dr. Codd and Chris Date were business partners, Date was opposed to NULLs and this was a great debate in those days. 

Later Dr. Codd did a second version of the Relational Model which has two kinds of NULLs.  One kind of NULL marks values which are missing because the value is unknown, and the other kind marks values that are missing because the attribute is missing.  For example, my hat size exists (I have a head) but might not be known so it is the first kind of NULL; my hair color does not exist since I am bald as a cue ball, so it is the second kind of NULL.  FirstSQL is the only product which implemented Dr. Codd’s second model. 

SQL followed Dr. Codd’s first model and has a single NULL. But much like FORTRAN follows algebra; SQL had to make adjustments from the mathematical model.  In the Relational Model, a NULL has no data type.  In SQL, the compiler has to know about storage requirements so you can write “CAST (NULL AS <data type>)” to pass that information. 

NULLs have certain basic properties:

  1. All SQL data types are NULL-able.  This is one reason why IDENTITY is a table property and not a data type.  The other reason is that a table can have only one IDENTITY column in it.  The count of PHYSICAL insertion attempts (NOT  successes) is not an attribute; it is audit meta-data and has no place in RDBMS. 

    This is why the first implementations of BIT which were assembly language bits were not a data type.  When BIT became a numeric data type, then things were Kosher. 
     

  2. NULLs propagate.  If you use a NULL in an expression, then result is a NULL.  In numeric expressions, we had questions about priorities, but propagation is vital.  In particular:
    “NULL / 0” = NULL or “division by zero” error?

    “0 / NULL” = NULL or 0, since this can only resolve to zero for any value.  No, I am not going to tell you; test it yourself in Query Analyzer or SSMS!  But can you figure it out from principles? 

TRUE, FALSE and UNKNOWN

A comparison between known values gives you a result of TRUE or FALSE.  This is Boolean logic.  If you get a chance look at some of the Syd Harris cartoons about George Boole at S. Harris Computer Cartoons

But that is classic two-valued logic (2VL); all the values are known.  When you do a comparison with a NULL, you cannot get a Boolean (i.e. TRUE or FALSE) result. 

This is where we invented the logical value UNKNOWN.  Well, re-discovered it.  There were already a lot of multi-valued logics in the mathematical literature.  Some of them are based on discrete values and some on continuous values (i.e. fuzzy logic).  SQL looks like the system developed by Polish logician Jan Åukasiewicz (the L-bar is a “W” sound in English; you can get the rest of it at Wikipedia: Jan Åukasiewicz).  Programmers know him as the guy who invented Polish Notation and inspired Reverse Polish Notation for HP calculators and stack architecture (aka zero address machines) computers like the old Burroughs machines.  SQL has three logical operations, which are found in all programming languages.  These are extended to three values:  

OR

TRUE

FALSE

UNKNOWN

TRUE

TRUE

TRUE

TRUE

FALSE

TRUE

FALSE

UNKNOWN

UNKNOWN

TRUE

UNKNOWN

UNKNOWN

AND

TRUE

FALSE

UNKNOWN

TRUE

TRUE

FALSE

UNKNOWN

FALSE

FALSE

FALSE

UNKNOWN

UNKNOWN

UNKNOWN

UNKNOWN

UNKNOWN

NOT

TRUE

FALSE

FALSE

TRUE

UNKNOWN

UNKNOWN

In the Åukasiewicz multi-valued logic systems, the AND, OR and NOT are almost the same as in SQL’s three valued logic.  The general case is based on the following Polish notation formulas in which 1 is TRUE, 0 is FALSE and fractions are the other values.

  • Cpq = 1          for p <= q
  • Cpq = 1 - p + q  for (p > q)
  • Np = 1 - p

N is the negation operator, C is the implication operator.

From this pair, we define all other operators

  • Opq = CCpqq     Or
  • Apq = NONpNq    And
  • Epq = ACpqCqp   Equivalence
  • Mp = CNpp       maybe (not FALSE)

Then Tarski and Lukasiewicz had a lot of rules of inference to prove theorems.  David McGoveran pointed out that SQL is not really a logic system because it lacks rules for deductions and proofs.  The idea is that the UNKNOWN might be resolved to {TRUE, FALSE}, so can we sometimes determine the result regardless of how one value resolves.  If we cannot determine a result, then return UNKNOWN. 

Let me sum up.  SQL is not a formal logical system.  We have no inference rules and what we informally call “predicates” are actually “search conditions” in the ANSI/ISO Standards.  What we have is a collection of look-up tables to compute a value of {TRUE, FALSE, UNKNOWN}. 

This is a subtle difference for anyone who is not a math major, but an inference system is like those axioms, postulates and theorems you had in High School geometry.  You need an implication operation.  In regular 2VL, this is shown as a double tailed right arrow ” =>” and it is defined by the look-up table in 2VL, where the column is the first operand:

IMPLIES

TRUE

FALSE

TRUE

TRUE

FALSE

FALSE

TRUE

TRUE

Or informally, a TRUE premise cannot imply a FALSE conclusion.  A FALSE premise can imply anything.  This 2VL model can also be written as:

  • (A IMPLIES B) = NOT (A AND NOT B)

But if you apply De Morgan’s Law, it can also be written as

  • (A IMPLIES B) = (NOT(A) OR B)

Now expand these 2VL rules into 3VL rules.

IMPLIES-1

TRUE

FALSE

UNKNOWN

TRUE

TRUE

FALSE

UNKNOWN

FALSE

TRUE

TRUE

UNKNOWN

UNKNOWN

UNKNOWN

FALSE

UNKNOWN

IMPLIES-2

TRUE

FALSE

UNKNOWN

TRUE

TRUE

TRUE

TRUE

FALSE

TRUE

TRUE

TRUE

UNKNOWN

UNKNOWN

UNKNOWN

UNKNOWN

Oops!  They are not the same.  The premise “X IMPLIES {FALSE, UNKNOWN}” could argue that a FALSE premise X can imply anything, TRUE, FALSE or UNKNOWN.  Or that propagation of unknown values applies, so a FALSE premise can lead to a TRUE conclusion. Make a decision and design a logical system for yourself. 

We avoided it in the SQL Standards. We did not do implication and the other rules.  This is a problem for those of us doing an optimizer for SQL. 

BIT Flags versus Predicates

One of the major problems for those people learning SQL is that they have not un-learned a mindset based originally on punch cards and magnetic tapes instead of RDBMS.  Even if they never saw a punch card or a magnetic tape, they think of data processing as a sequence of procedural steps that depend on sequentially ordered data.  At each step the data is moved to another file, scratch tape or temporary table.

Did you ever think about the people who use a singular table name instead of a collective or plural name?  In a file system, you process one record at time.  Calling the current record “Employee” makes sense.  But in RDBMS, we work with the entire table (set) all at once.  Therefore we call that table “Personnel” instead. 

We used bit flags in the Dark Ages because we lacked storage and computing power.  We would discover a fact that was TRUE at one point in time and then set a flag to mark that state within the system.  A classic example is a deletion flag at the start of a tape file record.  When the system read it, it would skip forward on the tape to the next record.  Eventually, the deleted records would be removed when the data was copied over to a new tape. 

An embarrassing story of my own was writing a small restaurant scheduling program in BASIC on an early Personal Computer.  I used a flag to show if the staff member was at least 18 years old which was the legal age to consume and serve alcoholic beverages at the time.  The legal age then changed to 21 and my program cheerfully scheduled an entire shift of now under-aged servers on the night that the state Alcoholic Beverage Control agency decides to check on the new law.  My client was not happy. 

If I had SQL back then, I would have used the employee birthdates to set up the shifts.  A simple two keystrokes in a VIEW declaration would have done it. 

SQL is a “predicate language” by its nature in the same way that FORTRAN is an “algebra language” by its nature.  Because the data is shared, and in constant motion, you cannot ever trust that the flags will actually show the current state of the data.  You have to test it when you do something. Silly bit-flags work in file systems because nobody else can physically read the tape as it goes thru its sequential process. 

DDL versus DML

As if all of this was not weird enough, SQL’s 3VL behaves differently in the DDL (Data Declaration Language) and the DML (Data Manipulation Language). 

When you declare a column in a table in the DDL, you have the options of making in NOT NULL, giving it a DEFAULT value, using it in a CHECK() constraint and/or adding DRI actions.  Those options are one of the many reasons that a column in RDBMS is different from a field in a file system. 

When I use a CHECK() constraint in the DDL and some or all of the columns are NULL-able, the search condition can return UNKNOWN.  This is fine because DDL accepts {TRUE, UNKNOWN} as the same and will permit the data to persist. 

We did that because adding the extra search conditions to test for NULLs would be a screaming nightmare.  Try it. 

In DML, we are not so forgiving.  DML sees {FALSE, UNKNOWN} as the same and will reject the data, keeping only that which tests TRUE. 

This can be weird.  You put something into a table and you need to work extra hard to get it back out because of NULLs and 3VL.  That is why we SQL pros keep telling you to not make a column NULL-able unless it makes sense in the data model. 

Shorthand

SQL is a higher level language so it has shorthands in it.  But they can all be expanded out into combinations of {AND, OR, NOT}.  Beginners in SQL often do not know how to use them and so tend to stick with the familiar {AND, OR, NOT} constructs they know from earlier languages like FORTRAN or BASIC. This is a bad way to write code.  It hides the higher level logic and in better SQL engines, the shorthands are optimized.  

BETWEEN

This search condition is equivalent to two comparisons:

(x BETWEEN a AND b) = ((a < = x) AND (x < = b))

The early optimizers simply expanded it out this way.  More sophisticated optimizers look at it as a 3-ary logical operator and optimize for the “between-ness” that seems to be expected

IN()

The IN()n search condition has two forms.    

<expression> [NOT] IN (<expression list>)
<expression> [NOT] IN (<single column select stmt>)

This is a bit tricky.  With either version we get a single column of expressions and compare the left hand expression to each of those values with an OR.  In effect,

x IN (1, 2, 3)

…is shorthand for …

((x = 1) OR (x = 2) OR (x = 3))

Likewise,

x NOT IN (1, 2, 3)

… is shorthand for …

NOT ((x = 1) OR (x = 2) OR (x = 3))

Which can be re-written with DeMorgan’s law to:

((x <> 1) AND (x <> 2) AND (x <> 3))

Now try to follow the same rule with

x IN (1, 2, NULL)

…is shorthand for…

((x = 1) OR (x = 2) OR (x = NULL))
((x = 1) OR (x = 2) OR UNKNOWN)  -- propagate NULLs

Let’s assume that one of the ORs is TRUE. So the other is FALSE

(TRUE OR FALSE OR UNKNOWN) 
(TRUE OR UNKNOWN) 
(TRUE) 

The only way this can return FALSE would be for x to be something other than {1, 2}.  The only way for it to return UNKNOWN would be for x to be NULL. 

Repeat the exercise with

x NOT IN (1, 2, NULL)

…is shorthand for…

NOT ((x = 1) OR (x = 2) OR (x = NULL))
NOT ((x = 1) OR (x = 2) OR UNKNOWN) -- propagate NULLs
((x <> 1) AND (x <> 2) AND NOT UNKNOWN) -- DeMorgan
(UNKNOWN) -- definition of AND

So if this was used in DDL, rows would be allowed in the table, but will always be rejected by the DML.

SOME|ANY()

There is an underused generalization of the IN() of the form:

<expression> <theta op> [SOME | ANY] (<single column select stmt>)

This is shorthand for a chain of ORs:

(<expression> <theta op> <expression_1>
  OR <expression> <theta op> <expression_2>
  OR...
<expression> <theta op> <expression_n>)

There is no difference between SOME and ANY, but sometimes the search condition reads a little better to a human. 

The IN() search condition is the case where the <theta op> is equality.  All of the 3VL problems you had with IN() are also here.

ALL

There is really, really underused (almost unknown) search condition that generalized a chain of ANDs:

<expression> <theta op> ALL (<single column select stmt>)

This is shorthand for a chain of ANDs:

(<expression> <theta op> <expression_1>
AND <expression> <theta op> <expression_2>
AND...
<expression> <theta op> <expression_n>)

The ALL search condition does not play well with the extrema functions (e.g. MAX, MIN). 

It is counter-intuitive at first that these two search condition are not the same in SQL:

x >= (SELECT MAX(y) FROM Foobar)
x >= ALL (SELECT y FROM Foobar)

But you have to remember the rules for the extrema functions — they drop out all the NULLs before returning the greater, greatest, or least values.  The (<single column select stmt>) does not drop NULLs, so you can get them in the results.  This can leave a NULL to give us an UNKNOWN. 

EXISTS()

The EXISTS() search condition is on of the very few Boolean operator we have in SQL.  It returns {TRUE, FALSE} and never UNKNOWN. This is because it works at the table level and not at the column level.  Compare this to COUNT(*) which returns the cardinality of the whole table as opposed to COUNT(<expression>).  They look alike but are totally different. 

Summary

I tell people that they need to learn to think in sets to get good with SQL.  Give up your sequential, procedural record-at-a-time mindset and trade up to declarative set processing!  See the light!  Leave the darkness! 

I probably ought to spend more time on 3VL since it is also something they never saw before.  And it is not easy. 

Tags: , , ,

  • 32898 views

  • Rate
    [Total: 1    Average: 1/5]
  • Anonymous

    And again.. and again…
    “All SQL data types are NULL-able. This is one reason why IDENTITY is a table property and not a data type. The other reason is that a table can have only one IDENTITY column in it.”

    Another rant on identity?

  • caderoux

    I think you have your AND table incorrect?
    ;
    WITH logic
    AS (
    SELECT 0 AS val
    ,’FALSE’ AS txt
    UNION
    SELECT 1 AS val
    ,’TRUE’ AS txt
    UNION
    SELECT NULL AS val
    ,’UNKNOWN’ AS txt
    )
    SELECT l.txt + ‘ OR ‘ + r.txt AS expression_or
    ,CASE WHEN l.val = 1
    OR r.val = 1 THEN ‘TRUE’
    WHEN NOT (
    l.val = 1
    OR r.val = 1
    ) THEN ‘FALSE’
    ELSE ‘UNKNOWN’
    END AS result_or
    ,l.txt + ‘ AND ‘ + r.txt AS expression_and
    ,CASE WHEN l.val = 1
    AND r.val = 1 THEN ‘TRUE’
    WHEN NOT (
    l.val = 1
    AND r.val = 1
    ) THEN ‘FALSE’
    ELSE ‘UNKNOWN’
    END AS result_and
    FROM logic AS l
    CROSS JOIN logic AS r

  • caderoux

    Apologies for the reformatting of my previous message…
    Apologies for the reformatting of my previous message…

  • caderoux

    This version does the PIVOT, so the output is easier to read

  • caderoux

    Well, you can see it properly formatted here: http://snipt.org/Xli
    Well, you can see it properly formatted here: http://snipt.org/Xli

  • Tony Davis

    code formatting
    caderoux,

    I apologize for the difficulty in pasting formatted code. The best option we can offer is to use our code prettifier at:

    simple-talk.com/prettifier

    You paste in using the Source Code tab, set your options using the left menu (you must use “Forums – FONTs and PRE for posting to the blogs and forums), click Prettify, then grab the HTML from the “Source HTML” tab and pasting it into the HTML tab of the blog editor.

    I did this for your previous post above.

    Cheers,

    Tony.

  • Hugo Kornelis

    Disappointing…
    A very disappointing article.

    The rant about IDENTITY is completely useless, since it has nothing to do with the subject of the article. It’s also incorrect. As far as I know, nobody ever suggested identity to be a data type. It’s a column property that can be applied to non nullable columns only.

    The 3VL truth tables presented by Joe Celko are incorrect. The correct result for “false AND unknown” as well as “unknown AND false” is of course “false” – not “unknown” as it these tables state!

    Joe then completely loses it when he comes to the IMPLIES operator. Besides the difference he highlighted, I count three more differences between the two tables. One even in a cell that does not involve any “unknwon” value at all. So have we jsut invalidated DeMorgan’s law, or has Joe been very sloppy? We can all relax. I’ve invested the five minutes to work out all nine results for both variations of the IMPLIES operator and there were no differences at all. The difference highlighted by Joe might be caused by his incorrect assessment of “unknown and false” (I didn’t check), but some of the other differences -at least the one for “true IMPLIES false”!- can only be caused by extreme sloppiness.

    Then comes another of Joe’s rants, that is only vagulely related to the subject at hand. He chooses to completely ignore the fact that “boolean” as a (non numeric!!) datatype is part of the SQL-2003 standard (and as far as I know has not been removed for newer versions). And the example he gives is just silly. A customer apparently has chosen to save some costs by not storing employees birthdates (yeah, right). And then, when a law changes that relates to age of employees, he again chooses to save money by not revealuating some data that has been derived off that non-stored birthdate. And then, he holds the database guy responsible for the consequences? Get out of it! – I call shenanigans on this one.

    At this point, I was unable to stomach any firthher reading of this article. I glanced over it and didn’t see any more errors, but I would not be surprised if better reading would uncover a few more.

    For a good discussion about NULLs, search the web. I’m sure there are many available, and I’m also sure that most of them are far better than this one.

    Best, Hugo

  • Dean

    Nil Carborundum
    Very interesting. Thanks.

    Nil carborundum.

  • Keith Macdonald

    Thanks, and 3VL elsewhere
    Joe

    Thanks for a good reminder, it’s the principal that is important, never mind the snipers 🙂

    I still remember the surprise when we first started coding with the MS-Windows 2.0 SDK (25 years ago?) and found the “Yes, No, Cancel” dialog box. “Oh! We’ve got to code for 3VL!” In a way it’s good to know that the issue isn’t 100% unique to SQL.

    Regards
    Keith

  • Bruce W Cassidy

    Rant?
    Wow. A good discussion of 3VL and how gnarly it is, and people get hung up on identity being a table property?

    Good article, Joe. And as you implied, I should remember to use ANY() and ALL() more often!

  • James Birch

    False and Unknown
    Hugo is correct that the AND 3VL table is incorrect. False and Unknown do indeed evaluate to False. A simple test to prove this:

    SELECT 1
    WHERE NOT (0=1 AND 0=NULL)

    What’s more disappointing is that this 6 year old article, which is still in the top 5 for Google searches on SQL 3 valued logic, has never been corrected.

    https://en.wikipedia.org/wiki/Three-valued_logic contains the correct table.

    Regards,

    James