Click here to monitor SSC
  • Av rating:
  • Total votes: 22
  • Total comments: 9
Joe Celko

Improving Comparison Operators and Window Functions

01 April 2011

It is dangerous to assume that your data is sound.  SQL already has intrinsic ways to cope with missing, or unknown data in its comparison predicate operators, or Theta operators. Can SQL be more effective in the way it deals with data quality? Joe Celko  describes how the SQL Standard could soon evolve to deal with data in ways that allow aggregation and windowing in cases where the data quality is less than perfect

Dr. Codd introduced the term "theta operators" in his early papers for what a programmer would have called a comparison predicate operator. The large number of data types in SQL makes doing comparisons a little harder than in other programming languages. Values of one data type have to be promoted to values of the other data type before the comparison can be done. SQL is a strongly typed language which means a lot of type castings are not possible – you cannot turn Christmas into a number and find its square root.

The comparison operators are overloaded and will work for numeric, character, and temporal data types. The symbols and meanings for comparison operators are shown in the table below. Thus, the symbol <= means “at most” when the items are numeric, means “collates before or equal to” with character data types and means “no earlier than” with temporal data types.

The comparison operators will return a logical value of TRUE, FALSE or UNKNOWN. The values TRUE and FALSE follow the usual rules and UNKNOWN is always returned when one or both of the operands is a NULL.

SQL Programmers today do not remember the early days when NULLs were controversial. No procedural programming language had anything like this. The closest thing was in spreadsheets where various proprietary conventions existed for missing data.

The early Sybase SQL Server implementation allowed you to write “<expression> = NULL” and have it treated as “<expression> IS NULL”; this became a configuration option. But at the same time, “<expression> <non-equality operator> NULL” would work correctly and return UNKNOWN.

To make things even more confusing, SQL has two different equivalence relations. Let me give you a little math. An equivalence relation takes elements of a set and partitions them into “equivalence classes”. Each element belongs to one and only one class. Given any two elements from the set, you use the equivalence operator to determine if the elements are in the same class or not. The rules are:

a = a; idempotent property


(a = b) ⇔ (b = a); symmetric property


(a = b) ∧(b = c) ⇒ (a = c); transitivity property

Obviously, good old “equals” is such a relationship. But so does a Modulus function. Think about defining the old Pascal Boolean EVEN() function defined as MOD (x, n) where (n=2).

EVEN(a) = EVEN(a);

(EVEN(a) = EVEN(b)) ⇔(EVEN(b) = EVEN(a));

(EVEN(a) = EVEN(b))∧ (EVEN(b) = EVEN(c)) ⇒(EVEN(a) = EVEN(c));

It partitions integers into odd or even numbers. You could do this for any value of (n).

In SQL, when you do a GROUP BY, you get a partitioning, and the NULLs are all put into one group. This was debated in the ANSI X3H2 Committee. If we had used strict equality, each NULL would be its own class and things would be a mess. So we invented grouping. Grouping is handy for many queries and not just for aggregate functions. It has the nice property of getting us back to two valued logic (2VL) and we like that.

This has lead programmers to write expressions to treat NULLs as members of the same equivalence class outside of aggregations. The most common idioms you will see are:

((<exp1> = <expr2) OR (<exp1> IS NULL = <expr2> IS NULL))

or

(COALESCE (<exp1>, <absurd value>) = COALESCE (<exp2>, <absurd value>)

You have to be sure that <absurd value> is just that; something that cannot occur in either expression. Neither of these idioms is easy to read or to write. And until the optimizer gets smarter, they do not compile as well as you might like.

The SQL:2003 Standards added a verbose but useful theta operator.

 <expression 1> IS NOT DISTINCT FROM <expression 2>

is logically equivalent to

 ((<expression 1> IS NOT NULL

  AND <expression 2> IS NOT NULL

  AND <expression 1> = <expression 2>)

 OR (<expression 1> IS NULL AND <expression 2> IS NULL))

The following the usual pattern for adding NOT into SQL constructs,

 <expression 1> IS DISTINCT FROM <expression 2>

is a shorthand for

 NOT (<expression 1> IS NOT DISTINCT FROM <expression 2>)

This double negative was because the IS NOT DISTINCT FROM was defined first. I have no idea why.

The practical problems of missing or unknown data can be deemed to be solved by the use of NULLs as the convention in SQL. Hooray for us! But now the problem is data quality.

Data Quality as The Next Problem

Let me refer you to the work of Larry English, David Loshin, Jack Olson and Tom Redman. There is a good list of books on this topic here.

The basic message is that we have databases full of bad data. This is not a great surprise to anyone who still has a job in IT. I propose that we add a new theta operator to SQL in the next SQL Standard to help with this problem. Let me give the syntax and then explain it:

 <fix the mess predicate> ::= <expression 1> SHOULD [NOT] HAVE BEEN <expression 2>

As usual, we will follow the ANSI SQL Standard convention that:

 <expression 1> SHOULD NOT HAVE BEEN <expression 2>

is equivalent to:

NOT (<expression 1> SHOULD HAVE BEEN <expression 2>)

Implementation details are left to SQL optimizer writers; we Standards writers are only obligated to provide unambiguous BNF syntax and a narrative description of the effects that uses the “standard speak” requirements.

In the case of SQL, we decided we wanted an LALR(1) grammar. You are probably thinking that you should have been sober when Professor Celko got to that stuff in his compiler writing class. But since I was not sober either, I can forgive you.

The predicate describes itself. But let's look at this as compared to an equivalence relation. It is idempotent:

a SHOULD HAVE BEEN a

This says that we trust the data; it looks good. The symmetric and transitivity properties do not apply. Constants can appear on either side, so this is asymmetric, like the LIKE predicate.

The first purpose of the predicate is to allow us to use bad data in queries, insert, update and delete statements. However it is useful in DDL CHECK() constraints.

Windowed Aggregate Functions

The rule for NULLs is that they are dropped from Aggregate functions. That is to compute the SUM([ALL | DISTINCT] <expression>), AVG([ALL | DISTINCT] <expression>), MIN([ALL | DISTINCT] <expression>), MAX([ALL | DISTINCT] <expression>) and COUNT([ALL | DISTINCT] <expression>) first drop all of the NULLs, then drop redundant duplicate values if DISTINCT is specified and finally the function is performed.

The regular aggregate functions can now take a window clause.

<aggregate function>

OVER([PARTITION BY <column list>]

         [ORDER BY <sort column list>]

      [<window frame>]

      [FUDGE TO <expression> [<fudge factor>]] --- proposed extension

     )

 

 <fudge factor> ::=  PLUS OR MINUS <expression>

They are easier to explain with the help of a diagram.

The Window Clause

The window clause is also called the OVER() clause informally. It is an extension for aggregate functions. The idea is that the table is first broken into partitions with the PARTITION BY subclause. The partitions are then sorted by the ORDER BY clause. An imaginary cursor sits on the current row where the windowed function is invoked. A subset of the rows in the current partition is defined by the number of rows before and after the current row by the [ROW | RANGE] subclause. Finally, the subset is passed to an aggregate or ordinal function to return a scalar value.

PARTITION BY Subclause

A set of column names specifies the partitioning, which is applied to the rows that the preceding FROM, WHERE, GROUP BY, and HAVING clauses produced. If no partitioning is specified, the entire set of rows composes a single partition and the aggregate function applies to the whole set each time. Though the partitioning looks a bit like a GROUP BY, it is not the same thing. A GROUP BY collapses the rows in a partition into a single row. The partitioning within a window, though, simply organizes the rows into groups without collapsing them.

ORDER BY Subclause

The ordering within the window clause is like the ORDER BY clause in a CURSOR. It includes a list of sort keys and indicates whether they should be sorted ascending (ASC) or descending (DESC). The important thing to understand is that ordering is applied within each partition.

<sort specification list>::= <sort specification> [{,<sort specification>}...]

 

<sort specification>::= <sort key> [<ordering specification>] [<null ordering>]

 

<sort key>::= <value expression>

 

<ordering specification>::= ASC | DESC

 

<null ordering>::= NULLS FIRST | NULLS LAST

The NULLS FIRST and NULLS LAST options are trickier than they first look, but T-SQL does not have them yet so do not panic yet.

Window Frame Subclause

The tricky one is the window frame. T-SQL does not have it yet and many MVPs want Microsoft to catch with DB2 and Oracle. Here is the BNF, but you really need to see code for it to make sense.

<window frame clause>::=  <window frame units> <window frame extent>

[<window frame exclusion>]

 

<window frame units>::= ROWS | RANGE

RANGE works with a single sort key. It is for data that is a little fuzzy on the edges, if you will. If ROWS is specified, then sort list is made of exact numeric with scale zero.

<window frame extent>::= <window frame start> | <window frame between>

 

<window frame start>::=

UNBOUNDED PRECEDING | <window frame preceding> | CURRENT ROW

 

<window frame preceding>::= <unsigned value specification> PRECEDING

If the window starts at UNBOUNDED PRECEDING, then the lower bound is always the first row of the partition; likewise, CURRENT ROW explains itself. The <window frame preceding> is an actual count of preceding rows.

<window frame bound>::= <window frame start> | UNBOUNDED FOLLOWING | <window frame following>

 

<window frame following>::= <unsigned value specification> FOLLOWING

If the window starts at UNBOUNDED FOLLOWING, then the lower bound is always the last row of the partition; likewise, CURRENT ROW explains itself. The <window frame following> is an actual count of following rows.

<window frame between>::=

BETWEEN <window frame bound 1> AND <window frame bound 2>

 

<window frame bound 1>::= <window frame bound>

 

<window frame bound 2>::= <window frame bound>

FUDGE TO Subclause

The windowed functions need to be extended to handle bad data, just as we are doing with the SHOULD HAVE BEEN comparison operator. This is handled by the proposed FUDGE TO subclause.

Imagine that you want to do a sum and to keep it simple let's ask "What is two plus two?"

A Liberal Arts major will tell you that it is 22. A social worker will tell you that they don't know the answer but that they are very glad that we had the opportunity to discuss it. They are both useless.

An engineer will tell you the answer is somewhere between 3.999 and 4.001 because of floating point errors. A SQL programmer will tell you it is four because of integer math. These are useful, but only if you have clean data.

The new FUDGE TO subclauses will look at the normal result value then will adjust it to whatever your boss wants it to be. This uses a fudge factor value arrive at the target values. This is an old informal concept used by accountants and bureaucrats. Formalizing it is an implementation detail.

Future Work

We hope that the research that IBM put into their Watson project (www.ibm.com/watson) can be put into SQL. Watson is the super computer has defeated the two greatest human champs in the history of the television quiz show "Jeopardy” this year.

The goal is to have the compiler read the code and read the comments. When they disagree. It compiled the comments instead of the code.

To celebrate April Fool's day, Joe wrote this specially-commissioned article that mixes the profound with the silly.
 We will be giving away two big prizes for the winners who correctly identify all  the April Fool jokes in this article. Please email your solutions to editor@simple-talk.com with the words 'April Fool' in the subject line. The two winners will be drawn out of the hat containing the names of the correct entries and will get either a SQL Developer Bundle license worth £895 or a SQL DBA Bundle license worth £825 (please state your preferences) . The next six correct entries will get a NET Reflector VS Pro license worth £60. Winners for the competition must be subscribers to either SQLServerCentral or Simple-Talk. Joe Celko and Phil Factor will, between them, decide on what constitutes a correct entry.  Entries by April 3rd 2011.

Joe Celko

Author profile:

Joe Celko is one of the most widely read of all writers about SQL, and was the winner of the DBMS Magazine Reader's Choice Award four consecutive years. He is an independent consultant living in Austin, TX. He has taught SQL in the US, UK, the Nordic countries, South America and Africa.
He served 10 years on ANSI/ISO SQL Standards Committee and contributed to the SQL-89 and SQL-92 Standards.
He has written over 800 columns in the computer trade and academic press, mostly dealing with data and databases. He is the author of eight books on SQL for Morgan-Kaufmann, including the best selling SQL FOR SMARTIES.
Joe is a well-known figure on Newsgroups and Forums, and he is famous for his his dry wit. He is also interested in Science Fiction.

Search for other articles by Joe Celko

Rate this article:   Avg rating: from a total of 22 votes.


Poor

OK

Good

Great

Must read
Have Your Say
Do you have an opinion on this article? Then add your comment below:
You must be logged in to post to this forum

Click here to log in.


Subject: follow uo article
Posted by: Celko (view profile)
Posted on: Thursday, March 31, 2011 at 7:28 PM
Message:
I was hoping to explain a proposal for "SELECT [VAGUE | DISTINCT] <expression list>" as a replacement for the existing "SELECT {ALL | DISTINCT] < expression list>" syntax. However, we were unable to get the precise language we needed to submit a proposal to, I don't know, somebody. Maybe we will do it later, sometime.

Subject: Can we also add IS NOT NEARLY AS MUCH to the language?
Posted by: Anonymous (not signed in)
Posted on: Friday, April 01, 2011 at 7:15 AM
Message: Example:

150,000.00 IS NOT NEARLY AS MUCH AS 150,000,000.00

That would really help me.

Subject: Eureka!
Posted by: Anonymous (not signed in)
Posted on: Friday, April 01, 2011 at 7:28 AM
Message: Why not a READ MY MIND operator

Subject: follow uo article
Posted by: Celko (view profile)
Posted on: Friday, April 01, 2011 at 7:28 AM
Message:
I was hoping to explain a proposal for "SELECT [VAGUE | DISTINCT] <expression list>" as a replacement for the existing "SELECT {ALL | DISTINCT] < expression list>" syntax. However, we were unable to get the precise language we needed to submit a proposal to, I don't know, somebody. Maybe we will do it later, sometime.

Subject: multiple NULL meanings
Posted by: DEK46656 (not signed in)
Posted on: Friday, April 01, 2011 at 7:52 AM
Message: Personally, I would like to see multiple definitions of NULL added: one for missing data, one for missing fields from outer JOINs, and one more for “I have no idea”. Then you could have implicit and explicit conversions between them.

Subject: READ MY MIND
Posted by: Celko (view profile)
Posted on: Friday, April 01, 2011 at 8:59 AM
Message: The READ MY MIND statement is not a database programmer command. It is clearly intended for end users and management, and needs to be implemented at a higher system level, across the back ends, front ends and browsers.

Subject: Converting Christmas into a number and taking a square root
Posted by: Anonymous (not signed in)
Posted on: Friday, April 01, 2011 at 9:04 AM
Message: I believe I have successfully resolved the issue of conveting Christmas into a number and taking its square root. What's more, it's a one-line query:

select sqrt(convert(int, convert(char(8), convert(datetime, '2011-12-25'), 112)))

which clearly returns 4484.55404694826.

Subject: Converting Christmas into a number and taking a square root
Posted by: Anonymous (not signed in)
Posted on: Friday, April 01, 2011 at 9:16 AM
Message: That's MSSQL T-SQL BTW. Probably impossible in Oracle.

Subject: Anon's SQRT()
Posted by: HLogic (view profile)
Posted on: Friday, April 01, 2011 at 2:04 PM
Message: I'm sure Anon meant to use CAST() as CONVERT() is not ANSI...

...and my result differed, hmmm...
SELECT SQRT(CAST(CAST('2011-12-25' AS datetime) AS decimal));

 

Phil Factor
Searching for Strings in SQL Server Databases

Sometimes, you just want to do a search in a SQL Server database as if you were using a search engine like Google.... Read more...

 View the blog

Top Rated

SQL Server XML Questions You Were Too Shy To Ask
 Sometimes, XML seems a bewildering convention that offers solutions to problems that the average... Read more...

Continuous Delivery and the Database
 Continuous Delivery is fairly generally understood to be an effective way of tackling the problems of... Read more...

The SQL Server Sqlio Utility
 If, before deployment, you need to push the limits of your disk subsystem in order to determine whether... Read more...

The PoSh DBA - Reading and Filtering Errors
 DBAs regularly need to keep an eye on the error logs of all their SQL Servers, and the event logs of... Read more...

MySQL Compare: The Manual That Time Forgot, Part 1
 Although SQL Compare, for SQL Server, is one of Red Gate's best-known products, there are also 'sister'... Read more...

Most Viewed

Beginning SQL Server 2005 Reporting Services Part 1
 Steve Joubert begins an in-depth tour of SQL Server 2005 Reporting Services with a step-by-step guide... Read more...

Ten Common Database Design Mistakes
 If database design is done right, then the development, deployment and subsequent performance in... Read more...

SQL Server Index Basics
 Given the fundamental importance of indexes in databases, it always comes as a surprise how often the... Read more...

Reading and Writing Files in SQL Server using T-SQL
 SQL Server provides several "standard" techniques by which to read and write to files but, just... Read more...

Concatenating Row Values in Transact-SQL
 It is an interesting problem in Transact SQL, for which there are a number of solutions and... Read more...

Why Join

Over 400,000 Microsoft professionals subscribe to the Simple-Talk technical journal. Join today, it's fast, simple, free and secure.