Click here to monitor SSC
  • Av rating:
  • Total votes: 30
  • Total comments: 11
Joe Celko

Nesting Levels in SQL

27 January 2014

The 'Structured' part of SQL denotes the fact that queries can be nested inside each other in such a way that, wherever you can use a table, you can use a table expression. Such derived tables can provide powerful magic, to which is added CTEs and Lateral Tables. Joe Celko explains.

The ancestor of SQL was an IBM project named Sequel, which was a shorthand for “Structured English Query Language”; the term “Structured” in the name came from two sources. The first was the “Structured Revolution” at the time. Dijkstra, Youron, DeMacro and the other early pioneers of modern Software Engineering had just given us “Structured Programming” and “Structured Analysis” so everything had to be Structured. Younger people will remember when everything had to be “Object Oriented”, then “Big Data” then “in the Cloud” and whatever the current fad is at this reading.

And for the record, the ANSI/ISO Standard language is called “Ess Que El” and not “SEQUEL” in spite of the fact that we all screw up. The informal rules in the Standards world is that an ISO Standard made up of letters is spelled out, but a US Federal standard is pronounced as a word, no matter how weird. his is why a Physician's Standards Review Organization is called a “Piss Row” in Medicate/Medicaid terminology. Another rule is that the French will fight to keep the initials of a standard in French order, not English. Did you know that “ISO” is actually named “International Organization for Standardization”?

The second source of the “structured” term was the ability to nest queries inside each other via joins, set operations and other table level operations. This is actually a mathematical property called Orthogonality. In English, it is why operations done on numbers produce numbers, so you can use parentheses to write mathematical expressions. Any place in which you can use a number, you can use a numeric expression. Any place in which you can use a table, you can use a table expression.

In SQL there is a hierarchy of data in which the outer level is the schema or database. The database is made up of tables (which can be base or virtual tables). Each table is made up of a set of rows. These rows have no ordering, but all have the same structure, so it is proper set. Each row is made up of columns. The columns are scalar values drawn from a domain; a domain is set of values of one data type that has rules of its own. These rules include the type of scale upon which it is built and reasonable operations with it.

I tell people to use set diagrams (improperly called Venn Diagram; Euler invented them) when they want to use a doodle to help them think of a query. But most programmers doodle flow diagrams because they grew up with flowcharts, DFDs and similar mind tools. Let me show you what I mean.

Derived Tables

A derived table is a table expression embedded in a containing statement. It has to be placed inside parentheses and it can optionally be given a correlation name and its columns can also optionally given names.

(<table expression>) [[AS] <correlation name> [(<derived column list>)]]

The derived table will act as if it is materialized during the duration of the statement that uses it. Notice the phrase "act as if" in that last sentence. The optimizer is free to re-arrange the statement anyway that it wished so long as the results are the same as the original statement. If there is no (<derived column list>), then the derived table exposes the tables and their columns that created the derived table.

Once the query expression becomes a derived table with a correlation name, the internal structure is hidden. If no (<derived column list>) is given, then it inherits the column names from the component tables. Watch out; if the expression has two or more tables that have common column name, you have to to use <table name>.<column name> syntax to avoid ambiguity.

Materialization is not an easy choice. If one statement is using a derived table, it might be better to integrate it into that statement like a text macro. But if many statements are using the same derived table, it might be better to materialize it once, put it in primary or secondary storage and share it. This is the same decisions the SQL engine had to make with VIEWs. But the derived tables are not in the schema level where the optimizer can find them and keep statistics about them. It takes a pretty smart optimizer to filter them out for materialization.

This is why it is better to put a derived table definition into a VIEW when it is re-used often.

Column Naming Rules

Derived tables should follow the same ISO-11179 Standard naming rules. A table is a table. The keyword "AS" is not required, but it is a good programming practice and so is naming the columns. If you do not provide names, then the SQL engine will attempt to do it for you. The table name will not be accessible to you since it will be a temporary internal reference in the schema information table. The SQL engine will use scoping rules to qualify the references in the statement -- and what you said might not be what you meant. Likewise, columns in a derived table inherit their names from the defining table expression. But only if the defining table expression creates such names. For example, the columns in a UNION, EXCEPT or INTERSECT statement have no names unless you use the AS clause.

When you have multiple copies of the same table expression in a statement, you need to tell them apart with different correlation names. For example, given a table of sports players, we want to show a team captain and team co-captain.

SELECT T1.team_name,

       T1.last_name AS captain,

       T2.last_name AS cocaptain

  FROM Teams AS T1, Teams AS T2

 WHERE T1.team_name = T2.team_name

   AND T1.team_position = 'captain'

   AND T2.team_position = 'cocaptain';

I have found that using a short abbreviation and a sequence of integers for correlation names works very well. This also illustrates another naming rule. The player's last name is used in two different roles in this query, so we need to rename the column to the role name (if it stands by itself without qualification) or use the role name as a prefix (i.e. use "boss_emp_id" and "worker_emp_id" to qualify each employee's role in this table).

Scoping Rules

A derived table can be complete in itself and without a scoping problem at all. For example, consider this query:

SELECT O.order_nbr, B.box_size

 FROM Orders AS O,

   (SELECT box_size, packing_qty)
     FROM Boxes)

     AS B(box_size, packing_qty)

 WHERE O.ship_qty <= B.packing_qty;

The derived table "B" has no outer references and it can be retrieved immediately while another parallel processor works on the rest of the query. Another form of this kind of derived table is a simple scalar subquery:

SELECT O.order_nbr AS over_sized_order

 FROM Orders AS O

 WHERE O.ship_qty > (SELECT MAX(packing_qty) FROM Boxes);

The scalar subquery is computed; the one row, one column result table is converted into a unique scalar value and the WHERE clause is tested. If the scalar subquery returns an empty result set, it is converted into a NULL. Watch out for that last case, since NULLs have a data type in SQL and in some weird situations you can get casting errors.

When a table expression references correlation names in which they are contained, you have to be careful. The rules are not that much different from any block structured programming language. You work your way from the inside out.

This is easier to explain with an example. We have a table of pilots and the planes they can fly (dividend); we have a table of planes in the hangar (divisor); we want the names of the pilots who can fly every plane (quotient) in the hangar. This is Chris Date's version of Relational Division; the idea is that a divisor table is used to partition a dividend table and produce a quotient or results table. The quotient table is made up of those values of one column for which a second column had all of the values in the divisor.

SELECT DISTINCT pilot_name

 FROM PilotSkills AS PS1

 WHERE NOT EXISTS

    (SELECT *

     FROM Hangar AS H

     WHERE NOT EXISTS

        (SELECT *

         FROM PilotSkills AS PS2

         WHERE PS1.pilot_name = PS2.pilot_name

          AND PS2.plane_name = H.plane_name));

 

The quickest way to explain what is happening in this query is to imagine a World War II movie where a cocky pilot has just walked into the hangar, looked over the fleet, and announced, "There ain't no plane in this hangar that I can't fly!", which is bad English but good logic.

Notice that PilotSkills appears twice in the query, as PS1 and as PS2. Go to the innermost "SELECT.. FROM.." construct. We have a local copy of PilotSkills as PS2 and outer references to tables H and PS1. We find that H is a copy of the Hangar table one level above us. We find that PS1 is a copy of the PilotSkills table two levels above us.

If we had written "WHERE pilot_name = PS2.pilot_name" in the innermost SELECT, the scoping rules would have looked for a local reference first and found it. The search condition would be the equivalent of "WHERE PS2.pilot_name = PS2.pilot_name", which is always TRUE since we cannot have a NULL pilot name. Oops, not what we meant!

It is a good idea to always qualify the column references with a correlation name. Hangar did not actually need a correlation name since it appears only once in the statement. But do it anyway. It makes the code a little easier to understand for the people that have to maintain it -- consistent style is always good. It protects your code from changes in the tables. Imagine several levels of nesting in which an intermediate table gets a column that had previously been an outer reference.

Exposed Table Names

The nesting in SQL has the concept of an "exposed name" within a level. An exposed name is a correlation name, a table name that is not followed by a correlation name, or a view name that is not followed by a correlation name. The exposed names must be unique.

Here are some examples to demonstrate scoping rules.

SELECT ..

 FROM (SELECT * FROM A WHERE A.x = 1)

    INNER JOIN

    (SELECT * FROM B WHERE B.x = 2)

 WHERE .. ;

 Tables A and B can be referenced in the outer WHERE clause. These are both exposed names.

SELECT ..

 FROM (SELECT * FROM A WHERE A.x = 1)

    INNER JOIN

    (SELECT * FROM B WHERE B.x = 2) AS X(..)

 WHERE .. ;

But only Table X can be referenced in the outer WHERE clause. The correlation name X is now an exposed name.

 SELECT ..

 FROM (SELECT *

     FROM A

     WHERE A.x

        = (SELECT MAX(xx) FROM C))

    INNER JOIN

    (SELECT * FROM B WHERE B.x = 2)

 WHERE .. ;

Table C is not exposed to any other SELECT statement.

Common Table Expressions (CTE)

ANSI/ISO Standard SQL-99 added the Common Table Expression or CTE. It is also a query expression that is given a name, just like a derived table. The difference is that they appear before the SELECT statement in which they are used.

They are just a very useful bit of syntactic sugar. The only thing to remember is that each CTE is exposed in the order they appear in the WITH clause. That means the n-th CTE in the list can reference the first thru (n-1)-th CTEs. 

LATERAL Tables

If you have worked with blocked structured procedural languages, you will understand the concept of a "Forward Reference" in many of them. The idea is that you cannot use something before it is created in the module unless you signal the compiler. The most common example is a set of co-routines in which Routine A calls Routine B, then Routine B calls Routine A and so forth. If Routine A is declared first, then calls to B have to have an additional declaration that tells the compiler Routine B will be declared later in the module.

ANSI/ISO Standard SQL has a LATERAL derived table construct which lets one table reference another table as the same level. T-SQL does not have this feature yet, but you can use APPLY to get some of the same results.

 This might be easier to demonstrate than to explain. The following example is valid:

SELECT D1.dept_nbr, D1.dept_name, E.sal_avg, E.emp_cnt

 FROM Departments AS D1,

     (SELECT AVG(E.salary), COUNT(*)

        FROM Personnel AS P

       WHERE P.dept_nbr

        = (SELECT D2.dept_nbr

               FROM Departments AS D2

              WHERE D2.dept_nbr = P.workdept)

    ) AS E (sal_avg, emp_cnt);

Notice that the Departments table appears as D1 and D2 at two levels -- D1 is at level one and D2 is a level three.

The following example is not valid because the reference to D.dept_nbr in the WHERE clause of the nested table expression references the Personnel table via P.dept_nbr that is in the same FROM clause.

-- Error, Personnel and Departments are on the same level

SELECT D.dept_nbr, D.dept_name, E.sal_avg, E.emp_cnt

  FROM Departments AS D,

  (SELECT AVG(P.salary), COUNT(*)

     FROM Personnel AS P

    WHERE P.dept_nbr = D.dept_nbr) AS E(sal_avg, emp_cnt);

To make the query valid, we need to add a LATERAL clause in front of the subquery. Notice the order of Personnel and Departments. This exposes the department number so that a join can be done with it in the derived table labeled E.

 

+

-- with a lateral clause

SELECT D.dept_nbr, D.dept_name, E.sal_avg, E.emp_cnt

  FROM Departments AS D,

  LATERAL (SELECT AVG(P.salary), COUNT(*)

     FROM Personnel AS P

     WHERE P.dept_nbr = D.dept_nbr) AS E(sal_avg, emp_cnt);

Programming Tips

Let me finish with some heuristics.

  1. Use aliases that tell the guy maintaining this code what tables are referenced. Do not simply use  A,B,C, .. as names. This comes from the 1950's when files were referenced by the name of the tape drive that held them.
  2. Petty print” the SQL so that the indentation tells the reader what the query structure is. This is a simple task for any of many tools.
  3. Never use “SELECT *” in production code; it will eventually bite you when a table changes. You have a text editor and it is easy to cut & paste a list of column names.
  4. When you have three or more levels of nesting, think about a CTE. The code will be more readable.
  5. When you have two or more references to a query expression think about a CTE. The code will be more readable, but in many SQLs it will also help the optimizer make a decision about materialization (T-SQL is still behind on this).

Any feedback?

Joe Celko

Author profile:

Joe Celko is one of the most widely read of all writers about SQL, and was the winner of the DBMS Magazine Reader's Choice Award four consecutive years. He is an independent consultant living in Austin, TX. He has taught SQL in the US, UK, the Nordic countries, South America and Africa.
He served 10 years on ANSI/ISO SQL Standards Committee and contributed to the SQL-89 and SQL-92 Standards.
He has written over 800 columns in the computer trade and academic press, mostly dealing with data and databases. He is the author of eight books on SQL for Morgan-Kaufmann, including the best selling SQL FOR SMARTIES.
Joe is a well-known figure on Newsgroups and Forums, and he is famous for his his dry wit. He is also interested in Science Fiction.

Search for other articles by Joe Celko

Rate this article:   Avg rating: from a total of 30 votes.


Poor

OK

Good

Great

Must read
Have Your Say
Do you have an opinion on this article? Then add your comment below:
You must be logged in to post to this forum

Click here to log in.


Subject: Interesting read
Posted by: Bill Ross (not signed in)
Posted on: Monday, February 03, 2014 at 5:09 AM
Message: Nice article.

Subject: formatting
Posted by: Anonymous (not signed in)
Posted on: Monday, February 03, 2014 at 1:11 PM
Message: The formatting in IE8 goes haywire starting with the Column Naming rules / teams examples.

Subject: Re: Formatting
Posted by: Andrew Clarke (view profile)
Posted on: Tuesday, February 04, 2014 at 3:53 AM
Message: Try it again. Yeah, we goofed, but had a bit of a problem finding a machine with IE8 on it. It should all work now!

Subject: Yourden
Posted by: Tod C (not signed in)
Posted on: Wednesday, February 05, 2014 at 4:38 PM
Message: I think the spelling of Yourden is with a "d" in there.

Subject: Nice one
Posted by: ChrisPage (view profile)
Posted on: Tuesday, February 11, 2014 at 5:38 AM
Message: When I started out in the Oracle world everyone was saying SQL and then sometime in 1990 all the Oracle consultants I was working with were told by their masters to say sequel (they were required to do 1 day a month training back in Reading and they literally came back one month preaching sequel not sql), most of us seem to have picked up the habit but it still grates. Good article too :-)

Subject: Nesting Levels
Posted by: LotusShiv (view profile)
Posted on: Tuesday, February 11, 2014 at 12:11 PM
Message: Nice article. FYI, you mention its better to use proper aliases for tables used (in your Programming Tips at the end) and yet I see you using P, D etc., for Personnel, Department when something like Prsl and Dept or something like that might make sense, just a suggestion. I would have appreciated if you talked a little bit more about CTE rather than just sort of giving a brief definition.

Subject: Feedback
Posted by: Celko (view profile)
Posted on: Tuesday, February 11, 2014 at 2:14 PM
Message: 1) It should be Yourdon! Arrgh! Two typos!

2)I will defend short alias names with the Law of Proximity. This is principle that when words or other visual elements are "close" to each, either in space, color or semantics, they are seen as a grouping. Since the table name and the alias start with the same first letter, it is easy for a Latin alphabet user to associate "Departments" with "D" , but not so easy to figure out "A" (sequentially lettering the names in the order they appear in the FROM clause), "Chrysanthemums" (unrelated name with a personal meaning for the programmer) or to read the full name over and over.

CTEs are worth an article by themselves.

Subject: CTE
Posted by: dishdy (view profile)
Posted on: Tuesday, February 11, 2014 at 3:01 PM
Message: The only thing I find annoying about the use of CTE in Oracle is that if one of the CTE tables is not used it gets signaled as an error. During development this can occur frequently. In SQL Server this does not happen.

Subject: Oracle CTE
Posted by: Celko (view profile)
Posted on: Tuesday, February 11, 2014 at 3:22 PM
Message: Did not know that! DB2 and Oracle do more optimizations with them (in-line expansion versus materialization ,etc) while SQL Server is just in-line expansion.

Subject: Yeah, it does... but...
Posted by: Jeff Moden (view profile)
Posted on: Wednesday, February 12, 2014 at 8:29 AM
Message: ==============================================
disdy wrote:
The only thing I find annoying about the use of CTE in Oracle is that if one of the CTE tables is not used it gets signaled as an error. During development this can occur frequently. In SQL Server this does not happen.
==============================================

I've found (very thankfully) that "problem" in SQL Server, as well. I have "problem" in quotes because I don't understand why anyone would write a CTE ("Subquery refactoring" in Oracle) or any other code code that consumes resources without the results of the code being used. It doesn't make sense.

Subject: It's during development...
Posted by: dishdy (view profile)
Posted on: Wednesday, February 12, 2014 at 8:41 AM
Message: It's during development that it is annoying. The final result will always be coherent, i.e. all CTE tables will always be referenced at least once.
But during development you frequently want to quickly look at the result of just one of the CTE tables. This is what I'm referring to.

 

Phil Factor
Searching for Strings in SQL Server Databases

Sometimes, you just want to do a search in a SQL Server database as if you were using a search engine like Google.... Read more...

 View the blog

Top Rated

SQL Server XML Questions You Were Too Shy To Ask
 Sometimes, XML seems a bewildering convention that offers solutions to problems that the average... Read more...

Continuous Delivery and the Database
 Continuous Delivery is fairly generally understood to be an effective way of tackling the problems of... Read more...

The SQL Server Sqlio Utility
 If, before deployment, you need to push the limits of your disk subsystem in order to determine whether... Read more...

The PoSh DBA - Reading and Filtering Errors
 DBAs regularly need to keep an eye on the error logs of all their SQL Servers, and the event logs of... Read more...

MySQL Compare: The Manual That Time Forgot, Part 1
 Although SQL Compare, for SQL Server, is one of Red Gate's best-known products, there are also 'sister'... Read more...

Most Viewed

Beginning SQL Server 2005 Reporting Services Part 1
 Steve Joubert begins an in-depth tour of SQL Server 2005 Reporting Services with a step-by-step guide... Read more...

Ten Common Database Design Mistakes
 If database design is done right, then the development, deployment and subsequent performance in... Read more...

SQL Server Index Basics
 Given the fundamental importance of indexes in databases, it always comes as a surprise how often the... Read more...

Reading and Writing Files in SQL Server using T-SQL
 SQL Server provides several "standard" techniques by which to read and write to files but, just... Read more...

Concatenating Row Values in Transact-SQL
 It is an interesting problem in Transact SQL, for which there are a number of solutions and... Read more...

Why Join

Over 400,000 Microsoft professionals subscribe to the Simple-Talk technical journal. Join today, it's fast, simple, free and secure.