Click here to monitor SSC
  • Av rating:
  • Total votes: 8
  • Total comments: 3
Joe Celko

The Adjacency List Model for Trees - Part I

08 February 2011

In the early days of System R at IBM, one of the arguments against a relational database was that SQL could not handle hierarchies like IMS could, and would therefore not be practical for large databases. It might have a future as an ad hoc query language, but that was the best that could be expected of it.

In a short paper, Dr. E. F. Codd described a method for showing hierarchies in SQL that consisted of a column for the boss_name and another column for the employee in the relationship. It was a direct implementation in a table of the Adjacency List Model (ALM) of a graph. Oracle was the first commercial database to use SQL and the sample database that comes with their product, nicknamed the "Scott/Tiger" database in the trade because of its default user and password codes, uses an adjacency list model in a combination Personnel/Organizational chart table. The organizational structure and the personnel data are mixed together in the same row.

This model stuck for several reasons other than just Dr. Codd and Oracle's seeming endorsements. It is probably the most natural way to convert from an IMS database or from a procedural language to SQL if you have been a procedural programmer all of your life.

The Simple Adjacency List Model

In Oracle's "Scott/Tiger" personnel table, the "linking column" is the employee identification number of the immediate boss_name of each employee. The president of the company has a NULL for his boss_name. Here is an abbreviated version of such a hybrid Personnel/Organizational chart table.

CREATE TABLE Personnel_Org

(emp_name VARCHAR(10) NOT NULL,

 boss_name VARCHAR(10), -- null means root

 salary_amt DECIMAL(6,2) DEFAULT 0.00 NOT NULL,

... );

 

Personnel_Org

emp_name   boss_name salary_amt

===============================

'Albert'   NULL      1000.00

'Bert'    'Albert'    900.00

'Chuck'   'Albert'    900.00

'Donna'   'Chuck '    800.00

'Eddie'   'Chuck'     700.00

'Fred '   'Chuck'     600.00

The use of a person's name for a key is not a good programming practice, but ignore that for now since it will make the discussion easier. We should have a table for the personnel, an organizational chart table and a job assignment table to match people (or vacancies ) with positions. But people just don't do it.

There is a horrible truth about the Simple Adjacency List Model that nobody noticed. It is not a normalized schema. The short definition of normalization is that all data redundancy has been removed and it is safe from data anomalies. I coined the phrase that a normalized database has "one thing, one way, one place, one time" as a mnemonic for characteristics we want in a data model.

The "one thing" rule means that the table models either one and only one set of entities of the same kind, or it models one and only one relationship among entities; it does not model both entities and relationships.

Now consider that this table includes information about the node (the salary of the employee) as well as who their boss (boss_name) is in each row. This means that you have a mixed table of entities (personnel) and relationships (organization) and thus its rows are not properly formed.

Since both emp_name and boss_name are drawn from the same domain, they ought to use the same data type and constraints. There is nothing to compel this here, but we could use a REFERENCE to a Personnel table in a proper ALM. Again, not an inherent problem, but an example of bad DDL.

The third characteristic of a normalized table is that each entity instance appears "in one place" in the schema -- that is, it belongs in one row of one table, from which it can be referenced. The last characteristic of a normalized table is that each fact appears "one time" in the schema. That is, you want to avoid data redundancy. Both of these conditions are violated and we can have anomalies.

UPDATE Anomalies

Let's say that 'Chuck' decides to change his name to 'Charles', so we have to update the Personnel_Org table:

UPDATE Personnel_Org

   SET emp_name = 'Charles'

 WHERE emp_name = 'Chuck';

But that does not work. We want the table to look like this:

Personnel_Org

emp_name boss_name salary_amt

 =============================

'Albert'   NULL      1000.00

'Bert'    'Albert'    900.00

'Charles' 'Albert'    900.00 change as employee

'Donna'   'Charles'   800.00 ◄ change as boss_name #1

'Eddie'   'Charles'   700.00 ◄ change as boss_name #2

'Fred '   'Charles'   600.00 ◄ change as boss_name #3

Four rows are affected by this UPDATE statement. If a Declarative Referential Integrity REFERENCES clause was used, then an ON UPDATE CASCADE clause with a self-reference could make the three "boss_name role" changes automatically. Otherwise, the programmer has to write two UPDATE statements.

BEGIN TRANS

UPDATE Personnel_Org

   SET emp_name = 'Charles'

 WHERE emp_name = 'Chuck';

UPDATE Personnel_Org

   SET boss_name = 'Charles'

 WHERE boss_name = 'Chuck';

END;

or if you prefer, one UPDATE statement which hides the logic in a faster, but convoluted, CASE expression.

UPDATE Personnel_Org

   SET emp_name = CASE WHEN emp_name = 'Chuck'

                  THEN 'Charles',

                  ELSE emp_name END,

       boss_name = CASE WHEN boss_name = 'Chuck'

                   THEN 'Charles',

                   ELSE boss_name END

 WHERE 'Chuck' IN (boss_name, emp_name);

But as you can see, this is not a simple change of just one fact.

INSERT Anomalies

The Simple Adjacency List Model has no constraints to preserve subordination. Therefore, you can easily corrupt the Personnel_Org with a few simple insertions, thus.

-- make a cycle in the graph

INSERT INTO Personnel_Org VALUES ('Albert', 'Fred', 100.00);

Obviously, you can create cycles by inserting an edge between any two existing nodes.

DELETE Anomalies

The Simple Adjacency List Model does not support inheritance of subordination. Deleting a row will split the tree into several smaller trees, as for example.

DELETE FROM Personnel_Org WHERE emp_name = 'Chuck';

Suddenly 'Donna', 'Eddie' and 'Fred' find themselves disconnected from the organization and no longer reporting indirectly to 'Albert' anymore. In fact, they are still reporting to 'Chuck', who does not exist any more! Using an ON DELETE CASCADE referential action or a TRIGGER could cause the entire subtree to disappear. While burying your slaves with you was fine for the Pharaoh of Egypt, it is probably a bad surprise for Chuck's former subordinates.

Structural Anomalies

Finally, we need to preserve the tree structure in the table. We to be sure that there is only one NULL in the structure, but the Simple Adjacency List Model does not protect against no root, multiple roots or cycles.

-- self-reference

INSERT INTO Personnel_Org (boss_name, emp_name) VALUES (a, a);

 

-- simple cycle

INSERT INTO Personnel_Org (boss_name, emp_name) VALUES (c, b);

INSERT INTO Personnel_Org (boss_name, emp_name) VALUES (b, c);

The problem is that the Adjacency List Model is actually a general model for any graph. A tree is a special case of a graph, so you need to restrict the Adjacency List Model a bit to be sure that you do have only a tree.

Fixing the Adjacency List Model

In fairness, I have been kicking a straw man. These flaws in the Simple Adjacency List Model can be overcome with a redesign of the schema.

Firstly, the Personnel list and the Organizational chart could and should be modeled as separate tables. The Personnel table contains the facts about the people (entities) who we have as our personnel and the Organizational Chart tells us how the job positions within company are organized (relationships), regardless of whom -- if anyone -- holds what position. It is the difference between the office and the person who holds that office. Finally, you need a job assignment table to get them together.

CREATE TABLE Personnel

(emp_nbr INTEGER NOT NULL PRIMARY KEY,

 emp_name VARCHAR(10) DEFAULT '{{vacant}}' NOT NULL,

 emp_email VARCHAR(255) NOT NULL,

 birth_date DATE NOT NULL,

...);

I am assuming that we have several dummy employees named '{{vacant}}' with a dummy employee number. It makes reports and queries easier when there is a unique place marker that can be upgraded to a real person later.

The information about the positions within the company goes into a second table, thus.

CREATE TABLE Job_Positions

(position_nbr INTEGER NOT NULL PRIMARY KEY,

 job_title VARCHAR(30) NOT NULL PRIMARY KEY,

salary_amt DECIMAL (12,4) NOT NULL CHECK (salary_amt >= 0.00),

...);

Then we need a table to put personnel in to positions

CREATE TABLE Job_Assignments

(position_nbr INTEGER NOT NULL

REFERENCES JobPositions (position_nbr),

emp_nbr INTEGER NOT NULL

REFERENCES Personnel (emp_nbr),

 PRIMARY KEY (position_nbr, emp_nbr),

 ..);

Then the Organizational Chart table will show the hierarchical relationship among the positions as an ALM. Yes, this is getting complicated for a short article trying to demonstrate principles. But it is important to talk about how to do it right.

Pretend that the SALM uses emp_nbr in two roles; emp_nbr for the subordinate and boss_name for the supervisor. Now let's add constraints to it.

Trees have certain properties:

  1. The tree has one and only one root node
  2. The root node has in-degree = 0 (i.e. no parent node)
  3. All non-root nodes have in-degree = 1 (i.e. exactly one parent)node
  4. The number of edges is one less than the number of nodes in the tree

Let's declare the table with

CREATE TABLE Personnel_Org

(emp_name CHAR(5) NOT NULL UNIQUE, -- one parent

 boss_name CHAR(5), -- null means root

 salary_amt DECIMAL(10,2) NOT NULL

CHECK (salary_amt >= 0.00),

CHECK (boss_name <> emp_nbr)); -- shortest cycle check

The most obvious constraint is to prohibit a single node cycle in the graph, like node "D: in the diagram. That is the "CHECK (boss_name <> emp_nbr)" in the DDL. You do not see this done practice. The UNIQUE(emp_nbr) constraint limits an employee to one and only one boss.

Checking for a single root node can be done with:

CHECK((SELECT COUNT(*) FROM Personnel_Org WHERE boss_name IS NULL) = 1)

We know that the number of edges in a tree is the number of nodes minus one so this is a connected graph. That constraint looks like this:

CHECK ((SELECT COUNT(*) FROM Personnel_Org) -1 -- edges

        = (SELECT COUNT(boss_name) FROM Personnel_Org)) -- nodes

While this CHECK() syntax is valid ANSI/ISO Standard SQL, it is not T-SQL. These things have to be put into TRIGGERs. Here is some skeleton code:

CREATE TRIGGER Valid_Tree_Check

ON AdjacencyListModel

FOR UPDATE, INSERT, DELETE

AS

BEGIN

IF (SELECT COUNT(*) FROM Personnel_Org WHERE boss_name IS NULL) <> 1

BEGIN

 RAISERROR('multiple or missing root', 16, 1);

 ROLLBACK;

END;

 

IF (SELECT COUNT(*) FROM Personnel_Org) -1 -- edges

    <> (SELECT COUNT(boss_name) FROM Personnel_Org) -- nodes

BEGIN

 RAISERROR('invalid edge and node counts', 16, 1);

 ROLLBACK;

END;

END; --Valid_Tree_Check

The nice part about a TRIGGER is that you get an exact error message to help scrub the data. Now consider this table and the matching graph.

 emp_name boss_name   

====================

'Albert'   NULL  

'Bert'    'Albert'

'Chuck'   'Albert'

'Donna'   'Eddie'

'Eddie'   'Donna'

It has one root and the right edge-node relationship, but it is not a tree.

The point is that it is hard to validate an Adjacency List Model for a tree without doing a traversal. But traversals are procedural programming and we want something declarative that an optimizer might be able to use some day. Are there any other bits of graph theory that might help? Well, nodes come in three flavors: the single root node, the leaf node and the guys int he middle who don't have a neat sounding name. They can be defined by:

Finding the root node in an Adjacency List Model is easy; look for a NULL boss-name.

Finding the leaf nodes is a bit more work. But they are the ones without subordinates.

(SELECT emp_name FROM Personnel_Org

 EXCEPT

 SELECT boss_name FROM Personnel_Org)

AS Leaf_Nodes(emp_name)

Finding the middle nodes could be done by removing the root and the leaf nodes. But with a little set theory, we see they are the ones that are both a subordinate and a supervisor -- a classic set intersection.

(SELECT emp_name FROM Personnel_Org

 INTERSECT

 SELECT boss_name FROM Personnel_Org)

AS Middle_Nodes(emp_name)

This is enought for now. In part II, I will do more integrity checking and get into basic queries.

Joe Celko

Author profile:

Joe Celko is one of the most widely read of all writers about SQL, and was the winner of the DBMS Magazine Reader's Choice Award four consecutive years. He is an independent consultant living in Austin, TX. He has taught SQL in the US, UK, the Nordic countries, South America and Africa.
He served 10 years on ANSI/ISO SQL Standards Committee and contributed to the SQL-89 and SQL-92 Standards.
He has written over 800 columns in the computer trade and academic press, mostly dealing with data and databases. He is the author of eight books on SQL for Morgan-Kaufmann, including the best selling SQL FOR SMARTIES.
Joe is a well-known figure on Newsgroups and Forums, and he is famous for his his dry wit. He is also interested in Science Fiction.

Search for other articles by Joe Celko

Rate this article:   Avg rating: from a total of 8 votes.


Poor

OK

Good

Great

Must read
Have Your Say
Do you have an opinion on this article? Then add your comment below:
You must be logged in to post to this forum

Click here to log in.


Subject: cycles
Posted by: sirmaelstrom (view profile)
Posted on: Monday, February 13, 2012 at 4:53 PM
Message: How does one prevent circular references? Does this need to be done in business logic or is there an efficient way of detecting a cycle in a change then rolling back with a trigger?

Subject: Nicely Done
Posted by: Jeff Moden (view profile)
Posted on: Sunday, June 03, 2012 at 1:38 AM
Message: I realize this article is more than a year old but thought I'd say, "Nicely Done". This article explains a lot about Adjacency Lists no matter which RDBMS you may be using. Thanks, Joe.

Subject: May have to take that back
Posted by: Jeff Moden (view profile)
Posted on: Sunday, June 03, 2012 at 1:47 AM
Message: I knew I'd seen the exact text from this article before. Joe just copied it from one of his own books. I can now understand why someone gave it 1 star before I gave it 5.

 

Phil Factor
Searching for Strings in SQL Server Databases

Sometimes, you just want to do a search in a SQL Server database as if you were using a search engine like Google.... Read more...

 View the blog

Top Rated

Continuous Delivery and the Database
 Continuous Delivery is fairly generally understood to be an effective way of tackling the problems of... Read more...

The SQL Server Sqlio Utility
 If, before deployment, you need to push the limits of your disk subsystem in order to determine whether... Read more...

The PoSh DBA - Reading and Filtering Errors
 DBAs regularly need to keep an eye on the error logs of all their SQL Servers, and the event logs of... Read more...

MySQL Compare: The Manual That Time Forgot, Part 1
 Although SQL Compare, for SQL Server, is one of Red Gate's best-known products, there are also 'sister'... Read more...

Highway to Database Recovery
 Discover the best backup and recovery articles on Simple-Talk, all in one place. Read more...

Most Viewed

Beginning SQL Server 2005 Reporting Services Part 1
 Steve Joubert begins an in-depth tour of SQL Server 2005 Reporting Services with a step-by-step guide... Read more...

Ten Common Database Design Mistakes
 If database design is done right, then the development, deployment and subsequent performance in... Read more...

SQL Server Index Basics
 Given the fundamental importance of indexes in databases, it always comes as a surprise how often the... Read more...

Reading and Writing Files in SQL Server using T-SQL
 SQL Server provides several "standard" techniques by which to read and write to files but, just... Read more...

Concatenating Row Values in Transact-SQL
 It is an interesting problem in Transact SQL, for which there are a number of solutions and... Read more...

Why Join

Over 400,000 Microsoft professionals subscribe to the Simple-Talk technical journal. Join today, it's fast, simple, free and secure.