Click here to monitor SSC
  • Av rating:
  • Total votes: 6
  • Total comments: 3
Joe Celko

The SQL of Membership: Equivalence Classes & Cliques

28 July 2014
Equivalence Classes & Cliques

It is awkward to do 'Graph databases' in SQL to explore the sort of relationships and memberships in social networks  because equivalence relations are classes (a set of sets) rather than sets. However one can explore graphs in SQL if the relationship has all three of the mathematical properties needed for an equivalence relationship. 

Equivalence Classes & Cliques

We keep harping that SQL is based on sets, but how many of us ever go back and re-read any old text books on set theory? In particular, one concept gets forgotten is equivalence relations on sets, which create equivalence classes. These classes are disjoint and we can put an element from a set into one of them with some kind of rule.

Let's use ~ as the notation for an equivalence relation. The definition is that it has three properties:

  • The relation is reflexive: A ~ A. This means the relation applies to itself.
  • The relation is symmetric: : if  (A ~ B) (B ~ A).  This means when the relation applies to two different elements in the set, it applies both ways.
  • The relation is transitive: (A ~ B) ∧ (B ~ C)  (A ~ C). This means we can deduce all elements in a class by applying the relation to any known members of the class. This is pretty important for programming, as we will see.

So, if this is such a basic set operation, why isn't it in SQL? The problem is that it produces classes, not a set! A class is a set of sets, but a table is just a set. The notation for each class is a pair of square bracket that contain some representative of each set. Assume we have ...

  • a Î A
  • b Î B

These expressions are all the same:

  • A ~ B
  • [a] = [b]
  • [a] Ç [b] ≠ Æ

A common example is the set Z of integers and any MOD() operator will give you equivalence classes. The MOD(n, 2) operation gives you one class consisting of all even numbers, and the other consisting of all odd numbers. The nice part is that you can compute the MOD() function with arithmetic.

An equivalence classes can also be defined by a set of sets. This is an important concept in graph theory and graph database. If you have not seen it, Google “Six Degrees of Kevin Bacon”. The idea is that any individual involved in the Hollywood film industry can be linked through his or her film roles to Kevin Bacon within six steps.

Let's steal a sample set from a SQL Forum posting of friends, where ~ now means “is a friend of” and that we can use “foaf” -- “Friend of a friend” -- to build cliches.

  • {Fred, Jessica, John, Mary}
  • {Albert, Nancy, Peter}
  • {Abby}
  • {Frank, Joe}

A complete graph means that each node is connected to every other node by one edge. If a subgraph is complete, it is actually called a Clique in graph theory!

 The complete graph Kn of order n is a simple graph with n vertices in which every vertex is adjacent to every other. The complete graph on n vertices has n(n-1)/2 edges, which corresponding to all possible choices of pairs of vertices. But this is not an equivalence relation because it does not include the reference of each node to itself. Remember A~A? Think about Abby in this set, assuming that she likes herself, in spite of her lack of social skills.

Obviously, if you have a clique with (k) nodes in it, you can a complete subgraph of any size (j <k). A maximal clique is a clique that is not a subset of any other clique (some authors reserve the term clique for maximal clique.

SQL seems to have two equivalence relations that are major parts of the language. The first is plain old vanilla equal (=) for scalar values in the base data types. SQL is just like any other programming language, but we also have NULLs to consider. Oops! We all know that (NULL = NULL) is not true. So we have to exclude NULLs or work around them.

The other “almost” equivalence relation is GROUP BY in which the class is the grouping columns. A quick example would be “SELECT city_name, state_code FROM Customers GROUP BY state_code;” and this is still not quite right because we have to do some kind of aggregation on the non-grouping columns in the query.

Graphs in SQL

The idiom for graphs in SQL is to use a table with the nodes involved and a second table with pairs of nodes (a, b) to model the edges, which model our ~ relation.. This is classic RDBMS; the first table is entities (e.g. cities on a map, electronic components in a device, etc) and the second table is a relationship (e.g. roads between cities, wiring between components, etc).

The bad news is that when you use it for equivalence relations, it can get big. A class with one member is has one row to show the edge back to itself. A class with two members {a, b} has {(a, a), (b, b), (a, b), (b, a)} to show the ~ properties as edges. Three members gives us six rows; { (a, a), (b, b), (c, c), (a, b), (a, c), (b, a), (b, c), (c, a), (c, b)}. The general formula is (n * (n-1)) + n, which is pretty close to (n!).

First, load the table a few facts we might already know:

CREATE TABLE Friends

(lft_member VARCHAR(20) NOT NULL,

 rgt_member VARCHAR(20) NOT NULL,

 PRIMARY KEY (lft_member, rgt_member))

 

INSERT INTO Friends (lft_member, rgt_member)

VALUES

('John', 'Mary'),

('Mary', 'Jessica'),

('Peter', 'Albert'),

('Peter', 'Nancy'),

('Abby', 'Abby'),

('Jessica', 'Fred'),

('Joe', 'Frank');

Reflexive Rows

The first thing we noticed is that only Abby has a reflexive property row. We need to add those rows and can do this with basic set operators. But it is not that easy, as you can see with this insertion statement:

INSERT INTO Friends

SELECT X.lft_member, X.rgt_member

 FROM

((SELECT lft_member, lft_member FROM Friends AS F1

UNION

SELECT rgt_member, rgt_member FROM Friends AS F2)

EXCEPT

SELECT lft_member, rgt_member FROM Friends AS F3)

AS X (lft_member, rgt_member);

The use of table aliases is tricky. You have to be sure that the SQL engine will construct a table in such a way that you do not get scoping problems. The UNION and EXCEPT operators are used to assure that we do not have primary key violations.

Abby

Abby

Albert

Albert

Frank

Frank

Fred

Fred

Jessica

Jessica

Jessica

Fred

Joe

Joe

Joe

Frank

John

Mary

John

John

Mary

Mary

Mary

Jessica

Nancy

Nancy

Peter

Peter

Peter

Nancy

Peter

Albert

Symmetric Rows

Look at the update table and there are no { (a, b), (b, a)} pairs. We need to add this second property to the relation table. This follows the pattern of set operators we just used:

INSERT INTO Friends

SELECT X.lft_member, X.rgt_member

 FROM

(

(SELECT F1.rgt_member, F1.lft_member

  FROM Friends AS F1

 WHERE F1.lft_member <> F1.rgt_member)

EXCEPT

(SELECT F2.lft_member, F2.rgt_member FROM Friends AS F2)

) AS X (lft_member, rgt_member);

This will give us:

Abby

Abby

Albert

Peter

Albert

Albert

Frank

Joe

Frank

Frank

Fred

Jessica

Fred

Fred

Jessica

Mary

Jessica

Jessica

Jessica

Fred

Joe

Joe

Joe

Frank

John

Mary

John

John

Mary

Mary

Mary

John

Mary

Jessica

Nancy

Peter

Nancy

Nancy

Peter

Peter

Peter

Nancy

Peter

Albert

 If you do  quick GROUP BY query, you see that Abby is seriously anti-social with a count of 1 , but Mary and Jessica look very social with a count of 3. This is not quite true because the property is the dreaded “Rock-paper-scissors-lizard-Spock” relationship."

Transitive Rows

Transitivity goes on forever. Well, until we have what mathematicians call a closure. This means we have gotten a set of elements that have all of the valid relationships. In some cases, these sets are infinite. But in database, we can only have insanely large tables.  Let's pull a subset that is part oi a clique of 4 friends.

Fred

Jessica

Fred

Fred

Jessica

Mary

Jessica

Jessica

Jessica

Fred

John

Mary

John

John

Mary

Mary

Mary

John

Mary

Jessica

Now apply the transitive relation to get some of the missing edges of a graph:

 

INSERT INTO Friends

SELECT X.lft_member, X.rgt_member

 FROM

((SELECT F1.lft_member, F2.rgt_member

  FROM Friends AS F1, Friends AS F2

 WHERE F1.rgt_member = F2.lft_member)

EXCEPT

(SELECT F3.lft_member, F3.rgt_member FROM Friends AS F3)

) AS X (lft_member, rgt_member);

This is a simple statement using the definition of transitivity, without a universal quantifier or loop on it. It will add these rows to this subset: 

Fred

Mary

Jessica

John

John

Jessica

Mary

Fred

When you look at Mary and Jessica, you have all of their friends. However, Fred and John do not know that they are friends. So we invoke the statement again and get those two rows. If you try it a third time, there are zero rows added.

The first thought is this sounds like a job for a recursive statement. But it is not that easy. If the original graph has a cycle in it, you can hang in an infinite loop if you try to use a recursive CTE. The assumption is that each clique has a spanning graph in the pairs in. Oops! New term: a spanning graph is a sub-graph that includes the nodes and some or all of the edges of the graph. A complete graph is the spanning graph has all the possible edges, so each node is directly connected to any other node.

Cliques

Now change your mindset from graphs to sets. Let's look at the size of the cliques:

WITH X (member, clique_size)

AS

(SELECT lft_member, COUNT(*) FROM Friends GROUP BY lft_member)

SELECT * FROM X ORDER BY clique_size;

Abby

1

Frank

2

Joe

2

Nancy

3

Peter

3

Albert

3

Fred

4

Jessica

4

John

4

Mary

4

What we want to do is to assign a number to each clique. This sample data is biased by the fact that the cliques are all different sizes. You cannot simply use the size to assign a clique_nbr.

Create this table load it with the names.

CREATE TABLE Cliques

(clique_member VARCHAR(20) NOT NULL PRIMARY KEY,

 clique_nbr INTEGER);

We can start by assigning everyone their own clique, using whatever your favorite method is:

INSERT INTO Cliques

 VALUES

('Abby', 1),

('Frank', 2),

('Joe', 3),

('Nancy', 4),

('Peter', 5),

('Albert', 6),

('Fred', 7),

('Jessica', 8),

('John', 9),

('Mary', 10);

Everyone is equally a member of their clique in this model. That means we could start with anyone to get the rest of their clique. The update is very straight forward. Take any clique number, find the member to whom it belongs and use that name to build that clique. But which number to we use?  We could use the MIN, MAX or a random number in the clique; I will use the MAX for no particular reason

 I keep thinking that there is a recursive update statement that will do this in one statement. But I know it will not port (Standard SQL has no  recursive update statement right now. We would do it with SQL/PSM or a host language) and I think it would be expensive. Recursion will happen for each member of the whole set, but if we consolidate a clique for one person, we have removed his friends from consideration. 

The worst situation would be a bunch of hermits, so the number of clique would be the cardinality of set. That is not likely or useful in the real world. Let's put the update in the body of a loop.

BEGIN

DECLARE @clique_loop INTEGER;

DECLARE @mem CHAR(10);

SET @clique_loop = (SELECT MAX(clique_nbr) FROM Cliques);

 

WHILE @clique_loop > 0

BEGIN

SET @mem = (SELECT clique_member FROM Cliques WHERE clique_nbr = @clique_loop);

UPDATE Cliques

   SET clique_nbr

       = (SELECT clique_nbr

            FROM Cliques

           WHERE clique_member = @mem)

 WHERE clique_member

        IN (SELECT rgt_member

              FROM Friends

             WHERE lft_member = @mem);

SET @clique_loop = @clique_loop - 1;

SELECT * FROM Cliques;

END;

END;

As a programming assignment, replace the simple counting loop with one that does not do updates with clique_loop values that were removed in the prior iteration. This can save a lot of work in a larger social network than the sample data used here.

Adding Members

 Now that we have a Cliques table, we can do a simple insertion if the new guy belongs to one clique. However, we can have someone who knows a people in different cliques. This will merge the two cliques into one.

CREATE PROCEDURE Clique_Merge

@clique_member_1 CHAR(10),

 @clique_member_2 CHAR(10)

AS

UPDATE Cliques

  SET clique_nbr

     =(SELECT MAX(clique_nbr)

         FROM Cliques

        WHERE clique_member

            IN (@clique_member_1, @clique_member_2))

 WHERE clique_nbr 

       =(SELECT MIN (clique_nbr)

         FROM Cliques

        WHERE clique_member

            IN (@clique_member_1, @clique_member_2));

Deleting a member or moving him to another clique is trivial.

Conclusion

This article is full of handy tricks to use with SQL. But cliques and equivalence classes are better handled with a graph database than with SQL. This is no surprise; that is what they were meant to do! The graph database can also handle relationships that do not have all three of the mathematical properties needed for an equivalence relationship. 

If any readers want to add to this set of tricks, please do.

Joe Celko

Author profile:

Joe Celko is one of the most widely read of all writers about SQL, and was the winner of the DBMS Magazine Reader's Choice Award four consecutive years. He is an independent consultant living in Austin, TX. He has taught SQL in the US, UK, the Nordic countries, South America and Africa.
He served 10 years on ANSI/ISO SQL Standards Committee and contributed to the SQL-89 and SQL-92 Standards.
He has written over 800 columns in the computer trade and academic press, mostly dealing with data and databases. He is the author of eight books on SQL for Morgan-Kaufmann, including the best selling SQL FOR SMARTIES.
Joe is a well-known figure on Newsgroups and Forums, and he is famous for his his dry wit. He is also interested in Science Fiction.

Search for other articles by Joe Celko

Rate this article:   Avg rating: from a total of 6 votes.


Poor

OK

Good

Great

Must read
Have Your Say
Do you have an opinion on this article? Then add your comment below:
You must be logged in to post to this forum

Click here to log in.


Subject: Reading list
Posted by: Joe Celko (not signed in)
Posted on: Friday, August 1, 2014 at 9:21 AM
Message: Dover Publications is a great source for books on math and they have a good offering in Graph Theory.

“A First Course in Graph Theory” by Gary Chartrand, Ping Zhang (ISBN: 978-0486483689)
“Graph Theory” by Ronald Gould (ISBN: 978-0486498065)
“Introduction to Graph Theory” by Richard J. Trudeau (ISBN: 978-0486678702)
“Introductory Graph Theory” by Gary Chartrand (ISBN: 978-0486247755)

Subject: wow
Posted by: oxenskiold (not signed in)
Posted on: Friday, August 1, 2014 at 4:54 PM
Message: What a reminder!

Subject: Graph Traversal
Posted by: Josh Ashwood (not signed in)
Posted on: Friday, August 8, 2014 at 12:30 AM
Message: Nice article Joe.

 
Simple-Talk Database Delivery

DLM
Patterns & Practices Library

Visit our patterns and practices library to learn more about database lifecycle management.

Find out how to automate the process of building, testing and deploying your database changes to reduce risk and make rapid releases possible.

Get started

Phil Factor
Schema-Based Access Control for SQL Server Databases

Access-control within the database is important for the security of data, but it should be simple to implement. It... Read more...

 View the blog

Top Rated

A Start with Automating Database Configuration Management
 For a number of reasons, it pays to have the up-to-date source of all the databases and servers that... Read more...

Archiving Hierarchical, Deleted Transactions Using XML
 When you delete a business transaction from the database, there are times when you might want to keep a... Read more...

The Mindset of the Enterprise DBA: Harnessing the Power of Automation
 After you have done the necessary groundwork of standardizing and centralizing your database... Read more...

Rollback and Recovery Troubleshooting; Challenges and Strategies
 What happens if your database deployment goes awry? Do you restore from a backup or snapshot and lose... Read more...

MySQL Compare: The Manual That Time Forgot, Part 1
 Although SQL Compare, for SQL Server, is one of Red Gate's best-known products, there are also 'sister'... Read more...

Most Viewed

Beginning SQL Server 2005 Reporting Services Part 1
 Steve Joubert begins an in-depth tour of SQL Server 2005 Reporting Services with a step-by-step guide... Read more...

Ten Common Database Design Mistakes
 If database design is done right, then the development, deployment and subsequent performance in... Read more...

SQL Server Index Basics
 Given the fundamental importance of indexes in databases, it always comes as a surprise how often the... Read more...

Concatenating Row Values in Transact-SQL
 It is an interesting problem in Transact SQL, for which there are a number of solutions and... Read more...

Reading and Writing Files in SQL Server using T-SQL
 SQL Server provides several "standard" techniques by which to read and write to files but, just... Read more...

Why Join

Over 400,000 Microsoft professionals subscribe to the Simple-Talk technical journal. Join today, it's fast, simple, free and secure.