Click here to monitor SSC
  • Av rating:
  • Total votes: 49
  • Total comments: 5
Joe Celko

SQL GROUP BY Basics

08 February 2016

It sometimes pays to go back and look at what you think you already know about SQL. Joe Celko gives a quick revision of the GROUP BY and HAVING clauses in SQL that are the bedrock of any sort of analysis of data, and comes up with some nuggets that may not be entirely obvious.

The GROUP BY and HAVING clauses in SQL goes all the way back to the start of the language, in its simplest form. The original idea was that you would build a working table using the SELECT … FROM … WHERE clauses, and then partition that result set into groups. A group is defined as a set of rows with the same values for the grouping columns. The non-grouping columns had to be group properties (usually aggregate functions), formulas or computations done on these or constants (which is technically a group property).

Each group is then reduced down to a single row. The important concept here is the original table no longer exists. However, if you are old enough to have to work with the very early versions of SQL Server, the Sybase syntax did not get rid of the original table. This led to some problems. It exposed the original row data with the aggregated data.

The original HAVING clause was then applied to each of the rows in the new grouped table. This meant that it was done after the WHERE clause that built the original table.

Later we added Windows clauses, extensions to the GROUP BY in the form of ROLLUP, CUBE and GROUPING SETS. But there is still quite a bit you can do with the original simple GROUP BYHAVING constructs. They have the advantage of completing their work in a single table scan, while the newer constructs may need temporary storage or multiple passes through the data. The originals are portable and predictable. They are now well-optimized after all these decades.

 Basic Aggregate Functions

In the original SQL, we had a limited set of simple aggregate functions. Their general format was '<function name> ([DISTINCT| ALL] <expression>)”. The first thing these functions do is to remove all of the NULLs from the data. The function could be applied to an empty set, if only NULLs were returned. What is the sum or average of an empty set? Following the usual conventions of SQL, empty sets return NULL, not zero. Zero is an actual value; that was a very big debate in the Middle Ages, when Europeans started seeing zero showing up in those newfangled Hindu Arabic numerals

SQL has a convention that we can put keywords inside the parameters  in some of the intrinsic system functions. The aggregate functions have the option of DISTINCT or ALL; if nothing is given, then ALL is assumed.

SUM ([DISTINCT| ALL] <numeric expression>)
= arithmetic total of a numeric expression.

AVG ([DISTINCT| ALL] <numeric expression>)
= arithmetic average or mean of a numeric expression.

 MIN ([DISTINCT| ALL] <expression>)
= depending on the expression, the smallest numeric value, the least recent temporal value, or the lowest collation value in a string.

 MAX ([DISTINCT| ALL] <expression>)
= depending on the expression, the greatest numeric value, the temporal value for the wrist in the future, or the highest collation value in a string.

 COUNT ([DISTINCT| ALL] <expression>)
= the cardinality of the set of expression values, regardless of type.

 COUNT ([DISTINCT| ALL] *)
= the cardinality of the table, taken as a whole set.

The DISTINCT keyword says remove all redundant duplicates (i.e. keep one copy) from the set of expressions being aggregated. The ALL keyword says to retain all the rows (this is like UNION versus UNION ALL and so forth in other parts of the language). In practice, nobody actually uses the ALL option; in fact, a lot of the newer SQL programmers do not even know it exists.

It may come as a surprise that <expression> as a parameter to an aggregate function does not have to be just a simple column name. However, it does have some limitations, based on scoping. This is why you cannot nest simple aggregate functions inside each other. Think about SUM (AVG (x)) and what it means. First, we would compute the average. It would be a single value (or a NULL). The sum of a single value is just that single value (or a NULL).

Also, the expression has to be of an appropriate data type. The SUM () and AVG () are clearly numeric. It would not make much sense to take an average of a result set of dates or strings, would it? However, the scale and precision of the results are implementation defined in ANSI/ISO standards. You need to know your product; Oracle, DB2, and SQL Server are not the same in the way they handle this.

The MIN () and MAX () are called the ‘extrema’ functions. They work for all data types in SQL but with some caveats. Technically, the DISTINCT and ALL options exist, but DISTINCT is insanely redundant. If you use it, I am pretty sure any optimizer will catch it. As an aside, when the MIN () or MAX () is computed, the optimizer will take the first value as the max (or min) and compare to the following values; if it finds a greater (or lesser) value, that becomes the new max (or min). This is fast and simple scan. But you will see SQL written with ROW_NUMBER() OVER(..) followed by filters for a row number equal to one to do this same task! All that does is force a useless sort on the data,

Numeric values have a well-defined order, temporal data has a well-defined order, but strings depend on their collation to determine their order. Even the same language can have several collations; German is one example. Even worse, in SQL varying length strings have to be padded out with trailing blanks before they are compared. Thus, 'Joe', 'Joe ' and 'Joe ' are all equal. But which one represents that group?

This is why we like to design encodings that use only the basic ASCII subset that ISO requires for all Unicode languages. But that is another topic.

As usual, NULLs pose their own problem. Technically a NULL is not equal anything, not even another NULL. Following that model, each NULL in the result set should form its own group. We actually debated that in ANSI X3H2 in the early days.

The decision was to have both equality (=) and grouping. When grouping is used, all NULLs are treated as a single group. Well, it is technically called an equivalence class for anybody that had a course in set theory. This was also one reason that we dropped NULLs from the result set. This convention is actually pretty handy in practice.  

Group Characteristics

It is often useful to look at groupings that have some characteristic in common without actually looking at the particulars. This is done in the HAVING clause, which deals with the group characteristics. Doing the work in the HAVING clause can save complicated joins, CASE expressions and subqueries.

The most useful tools are usually the simplest. The SIGN (), properly called the ‘signum’, the absolute value, ABS (), and simple math are probably the most useful. Here are some examples:

HAVING COUNT (DISTINCT <expression>) = COUNT (<expression>)

We know <expression> has all distinct values, otherwise the left hand side would be smaller than the right hand side count.

 HAVING COUNT (*) = COUNT (<expression>);

There are no NULLs in the results. The cardinality on the left side would count them, but the right hand side would not.

 HAVING MIN (<expression>) = MAX (<expression>)

This tell us that the group has only one value. This is actually quite handy. Variations on this idea are:

 HAVING MIN (SIGN (<expression>)) = MAX (SIGN (<expression>))

We now know the <expression> is all positive, all negative or all zero values.

 HAVING MIN (ABS (<expression>)) = 0;

Tells us we have at least one zero.

 Programming Technique For Grouping

 The basic programming technique for grouping is fairly simple and a recipe to follow. First, write CTE or select statement that gets the groups. That sounds kind of obvious, but it is a little more complicated. Most of the time, the grouping attributes are in the original set.

The common example would be geographic codes, such as state codes. These are nice discrete values that make obvious groups. They are probably well understood by the people using the data.

 Sometimes the groups may have to be computed. For example, if you have information about a birthdate, you might want to partition them into age groups (0 to 10, 11 to 14, 15 to 18, 18 to 25 and finally the horrible 65+ years of age). This still gives you nice sharp categories. However, you now have to go to a join on the table with the range of each group to get your working table

However, if we use approximate numeric values for the groups got all kinds of worries about rounding errors and things on the boundaries. This is the worst possible situation.

Once you decide what your groups are, we go to step number two. This is simply do the group by. Technically we do not actually need a GROUP BY clause. If it is not given, then the entire result set is treated as if it is a single group. Originally the group by was a simple list of columns from the query. The columns in this list did not have to appear in the SELECT statement (this was a myth from the very early days of SQL).

The result of the SELECT.. FROM.. WHERE.. clauses is partitioned into disjoint subsets, which all have the same value in each column or are all NULLs. This is what set theory calls an equivalence class and there is lots of them. The mod (n) function is another example, which gives us groups defined by 0, 1, 2, .., (n -1) values.

At one point in the history of the ANSI/ISO standards, we got interested in OLAP processing and added cubes, rollups, grouping sets and a grand total. All of these can be defined using the GROUPING SETS construct, but it actually gets pretty awkward to do it that way.

 HAVING Clause

Finally, we worry about using the having clause which actually can in some ways be the simplest of all of this. It is really just a WHERE clause that applies to the grouped table. We are back to simple search conditions again!

We do not actually need the HAVING clause; if it is not there. We simply stop the point we have done the group by. But this is where we get some real power to pick out grouped character grouping characteristics.

 Example Queries

A simple example of this technique will give you an idea how it works. Here a skeleton table for the example.

CREATE TABLE Sales

(salesman_name VARCHAR (10) NOT NULL,

 product_name VARCHAR (10) NOT NULL

   CHECK (product_name IN ('scissors', 'paper', 'stone')),

 sale_date DATE DEFAULT CURRENT_TIMESTAMP NOT NULL,

 sale_amt DECIMAL(6,2) NOT NULL

   CHECK (sale_amt >= 0.00)

PRIMARY KEY (salesman_name, product_name, sale_date));

INSERT INTO Sales

VALUES

('Fred', 'scissors', '2016-01-01', 100.00),

('Fred', 'stone', '2016-01-01', 75.98),

('Fred', 'paper', '2016-01-01', 40.00),

('John', 'paper', '2016-01-02', 35.85),

('John', 'paper', '2016-01-03', 0.00),

('John', 'stone', '2016-01-01', 100.00),

('John', 'stone', '2016-01-02', 5.00),

('John', 'stone', '2016-01-04', 4.98),

('Mary', 'paper', '2016-01-11', 45.95),

('Mary', 'scissors', '2016-01-10', 0.00),

('Mary', 'stone', '2016-01-12',20.25),

('Vlad', 'stone', '2016-01-11', 23.95),

('Vlad', 'stone', '2016-01-12', 100.00),

('Vlad', 'stone', '2016-01-13', 75.98),

('Vlad', 'stone', '2016-01-14', 16.85);

Our first problem is to find the salesman who sold all three products

SELECT salesman_name

  FROM Sales

GROUP BY salesman_name

HAVING COUNT (DISTINCT product_name) = 3;

This will give us 'Mary' and “Fred' as the salesman who sold the whole catalog. Now, tell me if we have a salesman who has sold only one product. We do not care which product, we are looking for a specialized guy.

SELECT salesman_name, MAX(product_name) AS single_product_name

  FROM Sales

GROUP BY salesman_name

HAVING MIN(product_name) = MAX(product_name);

But why do it this way? Why not use “HAVING COUNT (DISTINCT product_name) = 1 instead?  This would look like the first query. The optimizer has to do a scan to count the product names, but the table statistics in every SQL compiler I know about keeps at least the min, max, and row count for computing execution plans. Once I see the min (or max) value for that column, I know that I have found the min (or max) value for each salesman and can ignore the rest of his values.

 Try another one: Do we have salesmen who sold over $100.00 or products? The obvious way is:

SELECT salesman_name

  FROM Sales

 GROUP BY salesman_name

HAVING SUM(sale_amt) >= 100.00;

We get 'Fred', 'Vlad' and  'John' as the result set. But there is an alternative.

SELECT salesman_name

  FROM Sales

GROUP BY salesman_name

HAVING SIGN(SUM(sale_amt) - 100.00) >= 0;

 

In this case, this trick is probably not worth it. The next example applies the techniques in a more complex situation. This example was actually taken from a posting on a SQL forum. Yes, it is a skeleton, so be kind. However, it can be used to demonstrate how to get things into a form where we can use some of the tricks with the GROUP BY.

CREATE TABLE Foo

(generic_id CHAR (2) NOT NULL PRIMARY KEY,

 vague_name VARCHAR (20) NOT NULL);

 

INSERT INTO Foo

 VALUES ('10', 'Albert'), ('11', 'Bob'), ('12', 'Chuck');

 

CREATE TABLE Bar

(generic_id CHAR (2) NOT NULL

 REFERENCES Foo (generic_id),

 creation_date DATE DEFAULT CURRENT_TIMESTAMP NOT NULL,

 PRIMARY KEY (generic_id, creation_date),

 assembler_flag BIT);

 

INSERT INTO Bar

VALUES

 ('10', '2015-01-01', 1),

 ('10', '2015-01-02', 1),

 ('11', '2015-01-01', 0),

 ('11', '2015-01-12', 1),

 ('12', '2015-01-01', 0),

 ('12', '2015-02-02', 0);

 

What I want is the latest creation_date for a given name where the assembler_flag is zero or if there is no zero assembler_flag for the name then the latest creation_date for assembler_flag = 1.

SELECT T.vague_name, T.creation_date, T.assembler_flag

  FROM (SELECT Foo.vague_name, Bar.creation_date, Bar.assembler_flag,

               ROW_NUMBER()

               OVER(PARTITION BY Foo.generic_id

                    ORDER BY Bar.assembler_flag, Bar.creation_date DESC)

               AS rn

          FROM Bar, Foo

         WHERE Foo.generic_id = Bar.generic_id) AS T

WHERE T.rn = 1;

 The results are summarized here.

Albert

'2015-01-02'

all assembler_flags = 1

 1

Bob

'2015-01-01'

all assembler_flags = 0

 0

Chuck

'2015-02-02'

mixed assembler_flags

 0

 I named the column “assembler_flag” to stress that we do not use BIT flags in SQL; that was assembler language and this is a predicate based language. But did you notice that Bob and Chuck are treated the same way? We lost information about Chuck having mixed flags. We cannot handle NULL flags in this query, either. We also spent time partitioning, sorting and materializing a row number.

 The usual trick for finding if a grouping has only one value is to use “HAVING (MIN (assembler_flag) = MAX (assembler_flg))”, but SQL Server gives an error (Msg 8117, Level 16, State 1, Line 6; Operand data type bit is invalid for min operator)!  This is yet another reason not to write 1960's assembly flag code in SQL. The BIT data type in SQL Server was originally a true BIT (0, 1), then it became an exact numeric data type (0, 1, NULL) in later releases. But it does not do aggregate functions the way you might expect.

 We still have to handle NULLs without business rules from the poster. Let's assume that a NULL becomes zero. This gives us the first hack at an answer, with a CTE to handle the NULL problem.

WITH Build_Groups

AS

 (SELECT Foo.vague_name, Bar.creation_date,

         CAST (COALESCE(Bar.assembler_flag, 0) AS INTEGER) AS flg

    FROM Foo, Bar

   WHERE Foo.generic_id = Bar.generic_id)

 

SELECT vague_name, MAX(creation_date) AS recent_creation_date

 FROM Build_Groups

 GROUP BY vague_name

HAVING MIN(flg) = MAX(flg);

This will give you Albert and Chuck (each has only one flag value), but not Bob (mixed flags). Back to the drawing board! Well, keyboard.

WITH Build_Groups

AS

(SELECT Foo.vague_name, Bar.creation_date,

        CAST (COALESCE (Bar.assembler_flag, 0) AS INTEGER) AS flg

   FROM Foo, Bar

  WHERE Foo.generic_id = Bar.generic_id)

 

SELECT vague_name,

       CASE WHEN SUM (flg) = COUNT (flg) THEN 'ones'

            WHEN SUM (flg) = 0 THEN 'zeros'

             ELSE 'mixed' END

       AS flag_class

  FROM Build_Groupso90io90io90io90ib n

 GROUP BY vague_name;

The result can be summarized like this:

Albert

'2015-01-02'

all assembler_flags = 1

Bob

'2015-01-01'

all assembler_flags = 0

Chuck

 '2015-02-02'

mixed assembler_flag = 0

 The important thing to notice in this example is that the CASE expression WHEN clauses have to be at the same level of aggregation as the groups. That is why I am using SUM() and COUNT() in the when clauses. Trivially, constants like 'mixed' are considered to be at all levels of aggregation, as are computations done on the aggregates.

Joe Celko

Author profile:

Joe Celko is one of the most widely read of all writers about SQL, and was the winner of the DBMS Magazine Reader's Choice Award four consecutive years. He is an independent consultant living in Austin, TX. He has taught SQL in the US, UK, the Nordic countries, South America and Africa.
He served 10 years on ANSI/ISO SQL Standards Committee and contributed to the SQL-89 and SQL-92 Standards.
He has written over 800 columns in the computer trade and academic press, mostly dealing with data and databases. He is the author of eight books on SQL for Morgan-Kaufmann, including the best selling SQL FOR SMARTIES.
Joe is a well-known figure on Newsgroups and Forums, and he is famous for his his dry wit. He is also interested in Science Fiction.

Search for other articles by Joe Celko

Rate this article:   Avg rating: from a total of 49 votes.


Poor

OK

Good

Great

Must read
Have Your Say
Do you have an opinion on this article? Then add your comment below:
You must be logged in to post to this forum

Click here to log in.


Subject: Something missing
Posted by: puzsol (view profile)
Posted on: Monday, February 15, 2016 at 11:18 PM
Message: Hi Joe,

Ignoring the crazy table name in the last statement ("Build_Groupso90io90io90io90ib"), I wasn't sure how that query was meant to produce the same result as the windowed statement (ie where was the date coming from?)... but with a little fiddling, this seems to work:

;with Build_Groups as
(
select f.vague_name,
max(b.creation_date) as creation_date,
cast(coalesce(b.assembler_flag,0) as int) as assembler_flag
FROM Foo f
inner join Bar b on b.generic_id = f.generic_id
group by f.vague_name, coalesce(b.assembler_flag,0)
), Group_Flag as
(
select g.vague_name,
case when sum(assembler_flag) = count(assembler_flag) then 1 else 0 end as assembler_flag
from Build_Groups g
group by g.vague_name
)
select g.*
from Build_Groups g
inner join Group_Flag f on f.vague_name = g.vague_name and g.assembler_flag = f.assembler_flag
order by g.vague_name

Was that the sort of thing you intended?

NOTE: In my execution plan, the rownumber statement is more efficient - not sure how it would be with a larger result set.

Subject: Awesome
Posted by: Anonymous (not signed in)
Posted on: Tuesday, February 16, 2016 at 1:05 AM
Message: Awesome. Great tutorial.



Subject: Very Helpful
Posted by: Impressive (view profile)
Posted on: Tuesday, February 16, 2016 at 11:39 AM
Message: This article came right on time and has helped me understand much better (SQL Newbie here)

Subject: Update
Posted by: puzsol (view profile)
Posted on: Tuesday, February 16, 2016 at 8:43 PM
Message: Not happy with the speed of the grouping attempt in my previous post, I had another go:

select f.vague_name,
coalesce(r0.creation_date, r1.creation_date) as creation_date,
coalesce(r0.flag, r1.flag) as assembler_flag
from Foo f
left join
(
select generic_id, max(creation_date) as creation_date, 0 as flag
from Bar
where assembler_flag is null or assembler_flag = 0
group by generic_id
) r0 on r0.generic_id = f.generic_id
left join (
select generic_id, max(creation_date) as creation_date, 1 as flag
from Bar
where assembler_flag = 1
group by generic_id
) r1 on r1.generic_id = f.generic_id

Which appears to be faster than the windowed statement on this small table.... which I think shows Joe's point that we should be able to do better than the windowed statement for performance.

Subject: BIT flags
Posted by: schematech (not signed in)
Posted on: Wednesday, February 17, 2016 at 3:22 AM
Message: Hi Joe - Interesting article, but can you explain what you meat by "...to stress that we do not use BIT flags in SQL"? I use BIT columns all over the place to indicate a True/False attribute for an object. Is that not the same thing?
Bob

 
Simple-Talk Database Delivery

DLM
Patterns & Practices Library

Visit our patterns and practices library to learn more about database lifecycle management.

Find out how to automate the process of building, testing and deploying your database changes to reduce risk and make rapid releases possible.

Get started

Phil Factor
How to Build and Deploy a Database from Object-Level Source in a VCS

It is easy for someone who is developing a database to shrug and say 'if only my budget would extend to buying fancy... Read more...

 View the blog

Top Rated

Clone, Sweet Clone: Database Provisioning Made Easy?
 One of the difficulties of designing a completely different type of development tool such as SQL Clone... Read more...

Database Lifecycle Management: Deployment and Release
 So often, the unexpected delays in delivering database code are more likely to happen after the... Read more...

The PoSh DBA: Assigning Data to Variables Via PowerShell Common Parameters
 Sometimes, it is the small improvements in a language that can make a real difference. PowerShell is... Read more...

Issue Tracking for Databases
 Any database development project will be hard to manage without a system for reporting bugs in the... Read more...

Releasing Databases in VSTS with Redgate SQL CI and Octopus Deploy
 You can still do Database Lifecycle Management (DLM) workflows in the hosted version of Team foundation... Read more...

Most Viewed

Beginning SQL Server 2005 Reporting Services Part 1
 Steve Joubert begins an in-depth tour of SQL Server 2005 Reporting Services with a step-by-step guide... Read more...

Ten Common Database Design Mistakes
 If database design is done right, then the development, deployment and subsequent performance in... Read more...

Temporary Tables in SQL Server
 Temporary tables are used by every DB developer, but they're not likely to be too adventurous with... Read more...

Concatenating Row Values in Transact-SQL
 It is an interesting problem in Transact SQL, for which there are a number of solutions and... Read more...

SQL Server Index Basics
 Given the fundamental importance of indexes in databases, it always comes as a surprise how often the... Read more...

Why Join

Over 400,000 Microsoft professionals subscribe to the Simple-Talk technical journal. Join today, it's fast, simple, free and secure.