Click here to monitor SSC
Av rating:
Total votes: 32
Total comments: 1


Joe Celko
Procedural, Semi-Procedural and Declarative Programming in SQL
15 February 2010

A lot of the time, the key to making SQL databases perform well is to take a break from the keyboard and rethink the way of  approaching the problem; and rethinking in terms of a set-based declarative approach. Joe takes a simple discussion abut a problem with a UDF to illustrate the point that ingrained procedural reflexes can often prevent us from seeing simpler set-based techniques.

I have spent many years trying to persuade people to use declarative rather than procedural code in SQL. One of my books is THINKING IN SETS, with the oversized subtitle "Auxiliary, Temporal, and Virtual Tables in SQL" to explain what topics are covered. Most programmers discover that it is too big a big leap from a procedural mindset to a declarative mindset, and so don't quite make the transition all at once. Instead they evolve from a procedural paradigm to a variety of semi-procedural programming styles.

This is just the way that we learn; a step at a time, not a leap of insight all at once. The first motion pictures, for example, were shot with a fixed position camera aimed at a stage. That is how people had seen stage plays for several thousand years. W. D. Griffith was to movies as Dr. Codd was to databases. Griffith made the first two-reeler in 1910; nobody had thought about more than one reel before that. Then in 1911 and 1912 he and his cameraman, Billy Bitzer, started moving the camera and lens while filming. He gave us new camera angles such as the close-up and soft focus.

In 2010 February, I came across an example, in a Newsgroup discussion, of  programmers making the steps, but not the leap. The code details aren't relevant to the point I'm making so I am going to gloss over them.

The thread starts with a posting about a User-Defined Function (UDF) that is not working. His opening paragraph was:

“I have the code below to take a set of INTEGER values and return a VARCHAR based on the combination of inputs. However, I'm getting an error on line 6, which is the first line where the word CASE pops up. Plenty of CASE statements have passed through my hands before, so I'm lost of why this one is wrong.”

What followed was a CASE expression with BETWEENs and ORs and CASE within CASE constructs. It took pairs of (x, y) and produced an answer from a set of three values, call them {'a', 'b', 'c'}. Again, the coding details are not my point. The body of the function could just as well have been a complicated mathematical expression.

Two repliers pointed out that CASE is an expression and not a statement in SQL. They also pointed out that he was  returning a VARCHAR(1) instead of a CHAR(1).  The CASE function can be confusing to anyone who has been initially trained with a procedural language which has IF-THEN-ELSE statement constructs.

His code looked like this skeleton:

CREATE FUNCTION FnFindFoobar (@x INTEGER, @y INTEGER)
RETURNS VARCHAR
WITH EXECUTE AS CALLER
AS
BEGIN
<< horrible CASE expression with x and y >>;
END;

The clean up and quick fix was:

CREATE FUNCTION Find_Foobar (@in_x INTEGER, @in_y INTEGER)
RETURNS CHAR(1)
WITH EXECUTE AS CALLER
AS
BEGIN
RETURN (<< horrible CASE expression with x and y >>);
END;

Someone else then asked if he had considered pre-calculating the CASE expression results and populating a table with them. This was good advice, since the number of (x, y) pairs involved came to a few thousand cases. There is no point in dismissing this solution when the look-up table is as small as this one. Read-only tables this size tend to be in main storage or cache, so they can shared among many sessions, and you aren't going to save much on memory by choosing a different method.

But the person who made this suggestion went on to add “You can use the table with your UDF or you could use it without the UDF.”, but he did not explain what the differences are. They are important. Putting the data in the read-only tables this size will tend to keep it in main storage or cache, If you are really that tight for primary and/or secondary storage that you cannot fit a ~5K row table in your hardware, buy some chips and disks. They are so cheap today. Now the data be shared among many sessions. The table and its indexes can be used by the optimizer. In SQL Server you can include the single column foobar in the index to get a covering index and performance improvement.

But if you choose to lock the data inside the procedural code of a UDF, can it be shared? Do computations get repeated with each invocation? What about indexes? Ouch! A UDF pretty-much locks things inside. Standard SQL/PSM has a [NOT] DETERMINISTIC option in its procedure declarations. This tells the compiler whether the procedure or function is always going to return the same answer for the same arguments.

 note about Standard SQL terms: a parameter is the formal place holder in the parameter list of a declaration and an argument is the value passed in the invocation of the procedure

 A non-deterministic function has to be computed over and over again, every time the UDF is called; if the Query Optimiser doesn't know for certain whether a a procedure or function is deterministic, it has to assume it isn't and go the long route.

Here is the skeleton of what was posted.

-- Create table
CREATE TABLE Foobar
(x INTEGER NOT NULL,
y INTEGER NOT NULL,
foobar CHAR(1) NOT NULL,
PRIMARY KEY CLUSTERED (x, y));

-- Populate table with recursive CTEs and proprietary syntax
WITH X_CTE
AS
(SELECT 100 AS x
UNION ALL
SELECT x + 1
FROM X_CTE
WHERE x < 300),

Y_CTE
AS
(SELECT 1 AS y
UNION ALL
SELECT y + 1
FROM Y_CTE
WHERE y < 100)

INSERT INTO Foobar (x, y, foobar)
SELECT x, y, << horrible CASE expression with x and y >> AS foobar
FROM X_CTE
CROSS JOIN
Y_CTE;

This is a nice trick, but it is worth noting that it is highly proprietary. No other SQL has ever allowed SELECT without a FROM. Standard SQL requires a RECURSIVE keyword, and you cannot use CTEs with any statement but a SELECT. It is easy enough to re-write this into portable Standard SQL, using a table of integers called Series (I used to call it Sequence, but that is now a reserved word in Standard SQL. If you have a better name, please tell me). This is the most common SQL auxiliary table; experienced SQL programmers create it and then a Calendar table at the start of almost new project.

INSERT INTO FooLookup (x, y, foobar)
SELECT X_CTE.x, Y_CTE.y, << horrible CASE expression >> AS foobar
FROM (SELECT seq
FROM Series
WHERE seq BETWEEN 100 AND 300) AS X_CTE(x)
CROSS JOIN
(SELECT seq
FROM Series
WHERE seq BETWEEN 1 AND 100) AS Y_CTE(y);

Recursion, a procedural tool, is expensive. But that is not my point. The first thought was to use a procedural tool and not a data driven approach to get that CROSS JOIN. See what I mean a mindset? This is the semi-procedural guy going back to what he knows. He almost got to a declarative mindset.

Now let's go on with the rest of the skeleton code for the function:

CREATE FUNCTION Find_Foobar
(@in_x INTEGER, @in_y INTEGER)
RETURNS CHAR(1)
WITH EXECUTE AS CALLER
AS
BEGIN
RETURN
COALESCE
((SELECT foobar
FROM Find_Foobar
WHERE x = @in_x
AND y = @in_y), 'A');
END;

The reason for COALESCE() is that 'A' is a default value in the outer CASE expression, but also a valid result in various THEN and ELSE clauses inside inner CASE expressions. The scalar query will return a NULL if it cannot find an (x, y, foobar) row in the table. If we know that the query covers the entire (x, y) universe, then we did not need the COALESCE() and could have avoided a UDF completely.

Now, let's think about declarative programming. In SQL that means constraints in the table declaration in the DDL. This skeleton has none except the PRIMARY KEY. Aiee! Here is a problem that you find with magazine articles and newsgroup postings; It is so easy to skip over the constraints when you provide a skeleton table . You did not need them when you declared a file, do you? What one can forget is that the three SQL sub-languages (DDL, DML and DCL) work together. In particular, the DDL constraints are used by the DML optimizer to provide a better execution strategy..

The << horrible CASE expression >> implied expectations for x and y. We were given lower limits (100 and 1), but the upper limits were open after a small range of (x, y) pairs. I think we can assume that the original poster expected the vast majority of cases (or all of them) to fall in that small range and wanted to handle anything else as an error. In the real world, there is usually what Jerry Weinberg called “reasonableness checking” in data. The principle is also known as Zipf's Law or the “look for a horse and not a zebra” principle in medicine.

The simple first shot would be to assume we always know the limits and can simply use:

CREATE TABLE FooLookup
(x INTEGER NOT NULL
CHECK (x BETWEEN 100 AND 300),
y INTEGER NOT NULL
CHECK (y BETWEEN 1 AND 100),
foobar CHAR(1)
DEFAULT 'A'
NOT NULL
CHECK (foobar) IN ('A', 'B', 'C'),
PRIMARY KEY (x, y));

The DEFAULT 'A' subclause will take care of situation where we did not have an explicit value for foobar. This avoids the COALESCE(). But what if one of the parameters can be anything? That is easy; drop the CHECK() and add a comment. What if one of the parameters is half open or has a huge but sparse space? That is, we know a lower (upper) limit, but not the matching upper (lower) limit. Just use a simple comparison, such as CHECK (y >= 1), instead of a BETWEEN.

A common situation, which was done with nested CASE expression in the original, is that you know a range for a parameter and what the results for the other parameter within that range are. That might be easier to see with code. Here is a CASE expression for some of the possible (x,y) pairs:

CASE
WHEN x BETWEEN 100 AND 200
THEN CASE
WHEN y IN (2, 4, 6, 8) THEN 'B'
WHEN y IN (1, 3, 5, 7, 9) THEN 'C'
END
WHEN x BETWEEN 201 AND 300
THEN CASE
WHEN y IN (2, 4, 6, 8, 99) THEN 'C'
WHEN y IN (3, 5, 7, 9, 100) THEN 'B'
END
ELSE 'A'
END

This is the DML version of a constraint. It lives only in the INSERT. UPDATE, INSERT or SELECT statement where it appears. What we really want is constraints in the DDL so that all statements, present and future, use it. The trick is to create the table with low and high values for each parameter range; a single value is shown with the low and high values equal to each other.

CREATE TABLE FooLookup
(low_x INTEGER NOT NULL,
high_x INTEGER NOT NULL,
CHECK (low_x <= high_x),
low_y INTEGER NOT NULL,
high_y INTEGER NOT NULL,
CHECK (low_y <= high_y),
foobar CHAR(1) NOT NULL
CHECK (foobar) IN ('A', 'B', 'C'),
PRIMARY KEY (x, y));

CASE expression now becomes this table:

low_x high_x low_y high_y foobar
==================================
100 200 2 2 'B'
100 200 6 6 'B'
100 200 8 8 'B'
100 200 1 1 'C'
100 200 3 3 'C'
100 200 5 5 'C'
100 200 7 7 'C'
100 200 9 9 'C'
201 300 2 2 'C'
201 300 4 4 'C'
201 300 6 6 'C'
201 300 8 8 'C'
201 300 99 99 'C'
201 300 3 3 'B'
201 300 5 5 'B'
201 300 7 7 'B'
201 300 9 9 'B'
201 300 100 100 'B'
301 9999 101 9999 'A'
-9999 99 -9999 0 'A'

As a safety device, put the default 'A' in ranges outside the rest of the table. I used -9999 and 9999 for the least and greatest limits, but you get the idea.

The query has to use BETWEENs on the high and low limits:

SELECT F.foobar, ..
FROM FooLookup AS F, ..
WHERE my_x BETWEEN F.low_x AND F.high_x
AND my_y BETWEEN F.low_y AND F.high_y
AND ..;

Is this always going to be the best way to do something? Who knows? Test it.



This article has been viewed 11883 times.
Joe Celko

Author profile: Joe Celko

Joe Celko is one of the most widely read of all writers about SQL, and was the winner of the DBMS Magazine Reader's Choice Award four consecutive years. He is an independent consultant living in Austin, TX. He has taught SQL in the US, UK, the Nordic countries, South America and Africa.
He served 10 years on ANSI/ISO SQL Standards Committee and contributed to the SQL-89 and SQL-92 Standards.
He has written over 800 columns in the computer trade and academic press, mostly dealing with data and databases. He is the author of eight books on SQL for Morgan-Kaufmann, including the best selling SQL FOR SMARTIES.
Joe is a well-known figure on Newsgroups and Forums, and he is famous for his his dry wit. He is also interested in Science Fiction.

Search for other articles by Joe Celko

Rate this article:   Avg rating: from a total of 32 votes.


Poor

OK

Good

Great

Must read
 
Have Your Say
Do you have an opinion on this article? Then add your comment below:
You must be logged in to post to this forum

Click here to log in.


Subject: Better Name for Sequence
Posted by: An Phu (not signed in)
Posted on: Tuesday, February 23, 2010 at 3:59 PM
Message: >I used to call it Sequence, but that is now a reserved word in Standard SQL. If you have a better name, please tell me

How about Number0 and a view, Number1?

 










Phil Factor
Automated Script-generation with Powershell and SMO
 In the first of a series of articles on automating the process of building, modifying and copying SQL Server... Read more...



 View the blog
Using SQL Test Database Unit Testing with TeamCity Continuous Integration
 With database applications, the process of test and integration can be frustratingly slow because so... Read more...

SQL Source Control: The Development Story
 Often, there is a huge difference between software being easy to use, and easy to develop. When your... Read more...

How to Import Data from HTML pages
 It turns out that there are plenty of ways to get data into SQL Server from websites, whether the data... Read more...

SQL Scripts Manager: An Appreciation
 SQL Scripts Manager is Simple-Talk's present to its readers. William Brewer was an enthusiastic... Read more...

Hosted Team Foundation Server 2010 Review
 Team Foundation Server (TFS) has expanded its remit to support the whole software development process,... Read more...

Beginning SQL Server 2005 Reporting Services Part 1
 Steve Joubert begins an in-depth tour of SQL Server 2005 Reporting Services with a step-by-step guide... Read more...

Ten Common Database Design Mistakes
 Database design and implementation is the cornerstone of any data centric project (read 99.9% of... Read more...

Reading and Writing Files in SQL Server using T-SQL
 SQL Server provides several "standard" techniques by which to read and write to files but, just... Read more...

Beginning SQL Server 2005 Reporting Services Part 2
 Continuing his in-depth tour of SQL Server 2005 Reporting Services, Steve Joubert demonstrates the most... Read more...

Creating CSV Files Using BCP and Stored Procedures
 Nigel Rivett demonstrates some core techniques for extracting SQL Server data into CSV files, focussing... Read more...

Over 400,000 Microsoft professionals subscribe to the Simple-Talk technical journal. Join today, it's fast, simple, free and secure.

Join Simple Talk