You’ve probably read that SQL is a language based on sets and predicates. Well, that sounds impressive but what does it really mean? It means that we ought to have some predicates that use sets as their operands. And we do; but a lot of them are a bit obscure and not often used. All of them can be written with the [NOT] EXISTS () predicate and other logical operators in correlated subqueries. While that is perfectly OK, the results can be a bit hard to read. Let us examine some of these features and see how we can use them.
The [NOT] EXISTS() Predicate
I am going to assume by now that you have seen an EXISTS()
predicate in SQL. , but perhaps you do not yet fully appreciate it. It existed at the very start of the language, and it has a nice intuitive meaning. This is because it is one of the few predicates that we have, perhaps the only one, that evaluates to either TRUE
and FALSE
, but never UNKNOWN
.
The current BNF is very simple.
<exists predicate> ::= EXISTS <table subquery>
Originally, the <table subquery>
had to be a one column table. This is because the EXISTS()
predicate was defined in the same part of the standard that gave us the “<expression <comp op> ALL <table subquery>
” and “<expression <comp op> [SOME|ANY] <table subquery>
” predicate definitions (we will get to them in another section so be patient). Originally, comparison operators were defined only for scalars; currently standard SQL allows row-based comparisons.
The single column restriction does not make any sense for the EXISTS()
predicate; it is clearly a table-based function. If the cardinality of <table subquery>
is greater than zero, then the result of the <exists predicate>
is TRUE
; otherwise, the result of the <exists predicate> is FALSE
.
This is why there was a popular piece of folklore that the best way to write EXISTS()
predicate was “EXISTS (SELECT <constant> FROM…)
” or “EXISTS (SELECT <expression> FROM…)
” and it actually did work that way in some of the very early SQL engines; they actually materialized the query expression table! However, using “EXISTS (SELECT * FROM…)
” defined the asterisk as a single undefined column. It is the preferred choice today, but we see it as standing for an entire row, not a column. Hey, it is hard to write good, clear BNF for a new language! The truth is that internally optimizers quickly got rid of the materialization, and simply evaluated the table expression until they got a row.
It was also confusing in those days to realize that a NULL
exists. There were proposals to make a table subquery of all NULLs
return an UNKNOWN
result from EXISTS()
. We decided against it, on the grounds that existence is at a “higher level” than exact, known values. Imagine that you have a paper bag and cannot see what is in it, but you can still pick it up and know of it has some kind of contents.
Quantified Comparison Predicates: SOME ANY and ALL
These predicates have been in the language from the very early days. They were our attempts to get the universal quantifier (“) which means “for All” and existential quantifier () which means “there exists” from formal logic into SQL. A surprising number of SQL programmers do not even know they exist.
The current definition of these predicates allows row comparisons, but that has not been implemented in SQL Server. Let us stick to the basic, original scalar value syntax that is in SQL Server.
1 2 3 4 5 6 7 |
<quantified comparison predicate> ::= <value expression> <quantified comparison predicate part 2> <quantified comparison predicate part 2> ::= <comp op> <quantifier> <table subquery> <quantifier> ::= <all> | <some> <all> ::= ALL <some> ::= SOME| ANY |
The easiest way to think of this is that we are using an abbreviation to distribute the comparisons over a set of AND
-ed or OR
-ed simple comparison predicates.
The predicate “<value expression> <comp op> ANY <table expression>
” is equivalent to taking each row, call it “s” and assume that they are numbered from 1 to n, and testing <value expression> <comp op>
s with OR
s between the expanded expressions:
1 2 3 4 |
((<value expression> <comp op> s1) OR (<value expression> <comp op> s2) ... OR (<value expression> <comp op> sn)) |
When you get a single TRUE
result, the whole predicate is TRUE
.
As long as table S has cardinality greater than zero and one non-NULL
value, you will get a result of TRUE
or FALSE
. The keyword SOME
is the same as ANY
; it is just a matter of style and readability. Likewise, “<value expression> <comp op> ALL <table expression>
” takes each row s
of table S
and tests <value expression> <comp op>
s with AND
s between the expanded expressions:
1 2 3 4 |
((<value expression> <comp op> s1) AND (<value expression> <comp op> s2) ... AND (<value expression> <comp op> sn)) |
When you get a single FALSE
result, the whole predicate is FALSE
. As long as table S
has cardinality greater than zero and all non-NULL
values, you will get a result of TRUE
or FALSE
.
That sounds reasonable so far. Now let EmptyTable
be an empty table (no rows, cardinality zero) and NULLTable
be a table with only NULL
s in its rows (but cardinality greater than zero). The rules for SQL say that “<value expression> <comp op> ALL NullTable
” always returns UNKNOWN
, and likewise “<value expression> <comp op> ANY NullTable
” always returns UNKNOWN
. This makes sense, because every row comparison test in the expansion would return UNKNOWN
, so the series of OR
and AND
operators would behave in the usual way.
This is easier to see with actual values, we can use a table with (10, NULL, 10)
and value expression of 10 to get:
1 2 3 4 5 |
(s1 = 10) AND (s2 = 10) AND (s3 = 10) (10 = 10) AND (NULL = 10) AND (10 = 10) (TRUE) AND (UNKNOWN) AND (TRUE) (UNKNOWN) AND (TRUE) (UNKNOWN) |
See how SQL’s three valued logic can bite us? In the DML, and unknown is rejected, but in the DDL an unknown is accepted.Likewise, the table (10, NULL, 25) yields
1 2 3 4 5 |
(s1 = 10) AND (s2 = 10) AND (s3 = 10) (10 = 10) AND (NULL = 10) AND (25 = 10) (TRUE) AND (UNKNOWN) AND (FALSE) (UNKNOWN) AND (FALSE) (UNKNOWN) |
Repeat the exercise with ORs:
1 2 3 4 5 |
(s1 = 10) OR (s2 = 10) OR (s3 = 10) (10 = 10) OR (NULL = 10) OR (10 = 10) (TRUE) OR (UNKNOWN) OR (TRUE) (UNKNOWN) OR (TRUE) (TRUE) |
Likewise, the row (10, NULL, 25) yields
1 2 3 4 5 |
(s1 = 10) OR (s2 = 10) OR (s3 = 10) (10 = 10) OR (NULL = 10) OR (25 = 10) (TRUE) OR (UNKNOWN) OR (FALSE) (TRUE) OR (FALSE) (TRUE) |
However, “<value expression> <comp op> ALL EmptyTable
” always returns TRUE
and “<value expression> <comp op> ANY EmptyTable
” always returns FALSE
. Most people have no trouble seeing why the ANY
predicate works that way; you cannot find a match, so the result is FALSE
. But most people have lots of trouble seeing why the ALL
predicate is TRUE
. This convention is called existential import in formal logic. If I were to walk into a bar and announce that I can beat any pink elephant in the bar, that would be a true statement. The fact that there are no pink elephants in the bar merely shows that the problem is reduced to the minimum case.
This was actually a major issue in the early days of symbolic logic. Lewis Carroll believed in existential import, which means if you say “all men are mortal” you imply “some men (at least one) exists” but historically logic went against them. That is because we define them in terms of AND
and OR
to give us some very nice mathematical properties.
1) ∃x P(x) = ¬∃ x ¬P(x)
2) ∃x P(x) = ¬∀ ¬P(x)
This rule lets us use the [NOT] EXISTS()
predicate in some cases. The “Table1.x <comp op> ALL (SELECT y FROM Table2 WHERE <search condition>)
” predicate converts to:
1 2 3 4 |
... NOT EXISTS (SELECT * FROM Table1, Table2 WHERE Table1.x <comp op> Table2.y AND NOT <search condition>)... |
The “Table1.x <comp op> ANY (SELECT y FROM Table2 WHERE <search condition>)"
predicate converts to
1 2 3 4 |
... EXISTS (SELECT * FROM Table1, Table2 WHERE Table1.x <comp op> Table2.y AND <search condition>) ... |
Of the two quantified predicates, the ALL
predicate is used more. The ANY predicate is more easily replaced and more naturally written with an EXISTS() predicate or an IN predicate. In fact, the standard defines the IN() predicate as shorthand for “= ANY” and the NOT IN predicate as shorthand for “<> ANY”, which is how most people would construct them in English.
The IN() predicate first appeared in the Pascal programming language, so it is also more natural for programmers. We generally teach the IN() predicate without mentioning that it is an ANY predicate in disguise. Stealing from Pascal, the IN() predicate can have a simple list of expressions (they do not have to be scalars!) Or a sub query. The sub query may or may not be correlated.
The ALL
Predicate and Extrema functions
It is counter-intuitive at first that these two predicates are not the same in SQL:
x >= (SELECT MAX(y) FROM Table1)
x >= ALL (SELECT y FROM Table1)
but you have to remember the rules for the extrema functions — they drop out all the NULL
s before returning the greatest or least values. The ALL
predicate does not drop NULL
s, so you can get them in the results.
However, if you know that there are no NULL
s in a column or are willing to drop the NULL
s yourself, then you can use the ALL
predicate to construct single queries to do work that would otherwise be done by two queries. For example, in order to find which manager handles the largest number of products, you would first construct a derived table :
1 2 3 4 5 6 7 8 |
SELECT manager_name FROM (SELECT manager_name, COUNT(*) FROM Personnel GROUP BY manager_name) AS Product_Cnts (manager_name, product_cnt) WHERE product_cnt = (SELECT MAX(product_cnt) FROM Product_Cnts); |
But Alex Dorfman found a single query solution instead:
1 2 3 4 5 6 7 |
SELECT manager_name, COUNT(*)AS product_cnt FROM Personnel GROUP BY manager_name HAVING COUNT(*) + 1 > ALL (SELECT DISTINCT COUNT(*) FROM Personnel GROUP BY manager_name); |
The use of the SELECT DISTINCT
in the subquery is to guarantee that we do not get duplicate rows when two managers handle the same number of products. You can also add a “.. WHERE dept IS NOT NULL
” clause to the subquery to get the effect of a true MAX()
aggregate function.
UNIQUE Predicate
While SQL Server has the ANSI standard UNIQUE
constraint, it does not have the UNIQUE
predicate. The syntax is quite simple and can be computed by comparing COUNT (*)
to COUNT (<value expression >)
.
1 |
UNIQUE <table subquery> |
If there are no two rows in <table subquery>
such that the value of each column in one row is non-NULL
and is not distinct from the value of the corresponding column in the other row, then the result of the <unique predicate> is TRUE
; otherwise, the result of the <unique predicate>
is FALSE
.
This is based on the GROUP BY
equivalence operator. Essentially you are saying the make-believe HAVING
clause has a COUNT(*) = 1
. And optimizer does not even have to look at the table if it has unique indexes on the appropriate columns, so implementation should be pretty fast in modern SQL engines.
MATCH Predicate
While not available in SQL Server, the MATCH predicate is conceptually useful for other advanced constructs. It involves matching rows against a table. ANSI standard SQL uses it in CASE expressions and declarative referential integrity constraints. It is very hard to fake in SQL server. Here is the syntax
1 2 3 |
<match predicate> ::= <row value predicand> <match predicate part 2> <match predicate part 2> ::= MATCH [UNIQUE] [SIMPLE | PARTIAL | FULL] <table subquery> |
Obviously the <row value predicand> has to have the same structure as the table to which it is matched. If neither SIMPLE, PARTIAL, nor FULL is specified, then SIMPLE is implicit. The idea is to take a template and match the table so subquery against a row value. The the matching options handle NULL
s slightly differently. A full match must have a NULL
in the corresponding columns or an exact matching value. A partial match gives the benefit of the doubt to the NULL
s (think of the CHECK() constraint in DDL). A simple match follows the usual rules for row equivalence in DDL. This is probably easier to see with actual example
column_1 |
column_2 |
column_3 |
10 |
10 |
10 |
10 |
10 |
10 |
10 |
NULL |
10 |
10 |
10 |
20 |
Here are some examples of <match predicate>
expressions, all of which are TRUE
:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 |
... `.. ... WHERE ROW(10, 10, 10) MATCH SIMPLE (SELECT * FROM Table_1) ... ... WHERE ROW(10, NULL, 10) MATCH UNIQUE SIMPLE (SELECT * FROM Table_1) ... ... WHERE ROW(NULL, NULL, NULL) MATCH PARTIAL (SELECT * FROM Table_1) ... ... WHERE ROW(10, NULL, 10) MATCH PARTIAL (SELECT * FROM Table_1) ... ... WHERE ROW(NULL, NULL, NULL) MATCH UNIQUE PARTIAL (SELECT * FROM Table_1) ... ... WHERE ROW(NULL, NULL, NULL) MATCH FULL (SELECT * FROM Table_1) ... ... WHERE ROW(10, 10, 10) MATCH FULL (SELECT * FROM Table_1) ... |
And here are examples that are FALSE
:
1 2 3 4 5 6 7 8 9 10 11 |
... WHERE ROW(10, 10, 10) MATCH UNIQUE SIMPLE (SELECT * FROM Table_1) ... ... WHERE ROW(10, NULL, 10) MATCH UNIQUE PARTIAL (SELECT * FROM Table_1) ... ... WHERE ROW(10, NULL, 10) MATCH FULL (SELECT * FROM Table_1) ... ... WHERE ROW(10, 10, 10) MATCH UNIQUE FULL (SELECT * FROM Table_1) ... |
Conclusion
Set-oriented predicates can greatly simplify the answering of many real-life business questions. They are definitely more than mathematical curiosities. There is more to set-oriented predicates in SQL than just the simple IN() and EXISTS(). It will be worthwhile for you to sit down and make up some sample data so you can play with it. Learn how the other features that you might not have known about actually work in this language.