26 April 2007

SQL Server Grouping Workbench

A gentle lesson about GROUP BY on the Nursery Slopes develops gradually into a wild ride off-piste amongst the pine-trees.

This Workbench is about using the GROUP BY clause in a SQL Statement. It is designed so it can be pasted straight into SSMS or the Query Analyser, though to run the examples, you’ll need the data file too. Just click the “CODE DOWNLOAD” link at the bottom of this article. Also included in the download bundle is the original SQL file, if you need it.

Contents

Introduction

Like all these workbench articles, the whole point is to try different things out, improve on the code, and learn in a practical way. If you're like us, and find you can only take in information after some real hands-on practice, then this article is for you.

If you think that it is too elementary, skip to the end of the article and test your knowledge by answering the questions. (Yes, the editor is giving away prizes for the best answers he receives. Insist on the glorious USB Red-Gate pendant.)

Aggregations in SQL are very simple to do, but can cause a lot of confusion due to the word ‘Group’. The result you get from a grouping does not consist of the individual records but a collection or ‘aggregation’ of records.

A pack of playing cards is a group of 52 playing cards. If you were to group a pack of cards by colour, you’d have two ‘rows’ of 26 cards; if you grouped by suit, there would be four ‘rows’ with 13 in each. You could group them by their rank/court attribute, or in other ways. The point is that the result is always an aggregate, even if there is only one member in the collection.

Each row in the result of a GROUP BY represents a collection, or grouping. You can get the values for the number of records in each collection, or grouping, or the sum, average, and so on of numeric values, or the max or min of strings, but to ask for individual values makes no sense if you haven’t grouped by that value.

Sometimes, you will use an aggregate function in a SELECT Statement without using a GROUP BY clause. There is an implicit grouping of the entire result into one row, which is why you can only use aggregate functions in such a case, and will get one row returned.

GROUP BY statements are the bedrock of reporting in SQL.

The built-in aggregate functions are…

Common ones: COUNT, COUNT_BIG, SUM, AVG, MIN and MAX

Statistical functions: STDEV, STDEVP, VAR and VARP

Checksum functions: CHECKSUM_AGG

(You can use SUM and AVG with numeric columns only – int, smallint, tinyint, bigint, decimal, numeric, float, smallmoney and money. You cannot use MIN and MAX with bit datatypes. You cannot use aggregate functions other than COUNT(*) with text and image datatypes.)

We’ll start with some simple manoeuvres and then finish off with a few wheelies and handbrake-turns.

As sample data, we’ll take a list of all oil-producing countries in the world and their oil-production.

Create a new database, or use your ‘practice’ development database, and then create this.
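Something along these lines will do. The table and column names are the ones the rest of the workbench uses; the data types are merely reasonable guesses of ours, so adjust to taste:

```sql
CREATE TABLE OilProducers (
    OilProducer_ID INT IDENTITY(1,1) PRIMARY KEY,
    Country VARCHAR(100) NOT NULL,      -- name of the oil-producing country
    BarrelsPerDay INT NOT NULL,         -- oil production in barrels per day
    Continent VARCHAR(40) NOT NULL,
    Region VARCHAR(60) NOT NULL
)
```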

And we’ll get our data in. Download the data file, and load the data as follows:

Removing Duplicates

Before we get stuck into reporting, I’ll illustrate how simple it is to remove duplicates from tables using grouping.

Of course, if you are going to remove duplicates, you will really want to move them to another table for forensic analysis to find out how it happened and whether they are valid entries. Here is our Duplicates table:
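A sketch of such a table: it mirrors the columns of OilProducers, keeps the ID each row had in the original table, and (an extra touch of our own) records when the row was quarantined:

```sql
CREATE TABLE Duplicates (
    Duplicate_ID INT IDENTITY(1,1) PRIMARY KEY,
    OilProducer_ID INT,                     -- the ID the row had in OilProducers
    Country VARCHAR(100),
    BarrelsPerDay INT,
    Continent VARCHAR(40),
    Region VARCHAR(60),
    WhenMoved DATETIME DEFAULT GETDATE()    -- when the duplicate was moved here
)
```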

Let’s maliciously create some duplicates!
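These are the INSERT statements from the article (a reader also quotes them in the comments at the foot of the page):

```sql
INSERT INTO OilProducers (Country, BarrelsPerDay, Continent, Region)
    SELECT Country, BarrelsPerDay, Continent, Region
    FROM OilProducers WHERE Country LIKE 'A%'
INSERT INTO OilProducers (Country, BarrelsPerDay, Continent, Region)
    SELECT Country, BarrelsPerDay, Continent, Region
    FROM OilProducers WHERE Country BETWEEN 'A' AND 'E'
```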

You can then easily see these duplicates by:
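Grouping by country and keeping only the groups with more than one member does it:

```sql
SELECT 'First_ID' = MAX(OilProducer_ID), 'No.InTable' = COUNT(*)
FROM OilProducers
GROUP BY Country
HAVING COUNT(*) > 1
```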

Notice that we use a HAVING clause. This is because we are selecting only those groupings that contain more than one row in the original table. (We could have checked only part of the original table by using a WHERE clause.) A WHERE clause cannot contain an aggregate function, of course, but a HAVING clause can.

So we move them to another table to check them to make sure. We wrap this up in a transaction just in case anything goes wrong, as we don’t want to lose a record that might turn out to be legit.
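The original listing used status variables for its error checks; a simplified sketch of the idea, assuming the Duplicates table keeps the original OilProducer_ID, would be the following. Every copy after the first occurrence of each country (the one with the lowest ID) is moved:

```sql
DECLARE @Err INT
BEGIN TRANSACTION
-- copy every row that isn't the first occurrence of its country
INSERT INTO Duplicates (OilProducer_ID, Country, BarrelsPerDay, Continent, Region)
    SELECT OilProducer_ID, Country, BarrelsPerDay, Continent, Region
    FROM OilProducers
    WHERE OilProducer_ID NOT IN
          (SELECT MIN(OilProducer_ID) FROM OilProducers GROUP BY Country)
SET @Err = @@ERROR
-- then remove those rows from the main table
DELETE FROM OilProducers
    WHERE OilProducer_ID IN (SELECT OilProducer_ID FROM Duplicates)
SET @Err = @Err + @@ERROR
-- only commit if neither statement hit an error
IF @Err = 0 COMMIT TRANSACTION ELSE ROLLBACK TRANSACTION
```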

Now look in the duplicate table:

And check the main table for duplicates with the SQL I’ve already given.

Simple Reporting

What about getting a summary of oil production per continent?
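A query on these lines does it:

```sql
SELECT Continent, SUM(BarrelsPerDay) AS [Barrels Per Day]
FROM OilProducers
GROUP BY Continent
```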

We are grouping by continent and get back seven records, each of which represents a collection of oil producers in the same continent.

…or by region ordered by production?
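The same shape, with an ORDER BY on the aggregate:

```sql
SELECT Region, SUM(BarrelsPerDay) AS [Barrels Per Day]
FROM OilProducers
GROUP BY Region
ORDER BY SUM(BarrelsPerDay) DESC
```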

Grand totals are easy:
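In this sketch, WITH ROLLUP supplies the total row and GROUPING() pushes it to the end:

```sql
SELECT Region = CASE WHEN GROUPING(Region) = 1 THEN 'Total' ELSE Region END,
       [Barrels Per Day] = SUM(BarrelsPerDay)
FROM OilProducers
GROUP BY Region WITH ROLLUP
ORDER BY GROUPING(Region), SUM(BarrelsPerDay) DESC
```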

We used WITH ROLLUP to compute the total, and the GROUPING() function to make sure the total row sorts to the end. ROLLUP can be very useful for running sums and running averages.

…but someone is bound to ask for just the top five:
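TOP obliges:

```sql
SELECT TOP 5 Region, SUM(BarrelsPerDay) AS [Barrels Per Day]
FROM OilProducers
GROUP BY Region
ORDER BY SUM(BarrelsPerDay) DESC
```

Swap DESC for ASC and you have the bottom five instead.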

…or maybe the bottom five!

Or, more tiresomely, for the top five, the others as an ‘Others’ row, and the sum total!
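One way to do it on SQL 2000 (our reconstruction, not necessarily the article’s own listing): classify each row as belonging to a top-five region or to ‘Others’, then roll the lot up:

```sql
SELECT Region = CASE WHEN GROUPING(ListedRegion) = 1 THEN 'Total'
                     ELSE ListedRegion END,
       [Barrels Per Day] = SUM(BarrelsPerDay)
FROM (
    SELECT BarrelsPerDay,
           ListedRegion = CASE WHEN Region IN
                 (SELECT TOP 5 Region FROM OilProducers
                  GROUP BY Region ORDER BY SUM(BarrelsPerDay) DESC)
               THEN Region ELSE 'Others' END
    FROM OilProducers
) AS Classified
GROUP BY ListedRegion WITH ROLLUP
ORDER BY GROUPING(ListedRegion),
         CASE WHEN ListedRegion = 'Others' THEN 1 ELSE 0 END,
         SUM(BarrelsPerDay) DESC
```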

And then there will be the guy who wants a list of just those regions who produce more than the average.
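A HAVING clause can compare each region’s total with the average regional total, computed in a subquery:

```sql
SELECT Region, SUM(BarrelsPerDay) AS [Barrels Per Day]
FROM OilProducers
GROUP BY Region
HAVING SUM(BarrelsPerDay) >
       (SELECT AVG(RegionTotal)
        FROM (SELECT SUM(BarrelsPerDay) AS RegionTotal
              FROM OilProducers GROUP BY Region) AS Totals)
ORDER BY SUM(BarrelsPerDay) DESC
```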

Or possibly the five countries closest to the average:
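Sort by distance from the mean and take the first five (the 1.0 * avoids integer averaging):

```sql
SELECT TOP 5 Country, BarrelsPerDay
FROM OilProducers
ORDER BY ABS(BarrelsPerDay - (SELECT AVG(1.0 * BarrelsPerDay)
                              FROM OilProducers))
```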

Or conceivably, the region closest to the average:

Grouping and Pivot Tables

So let’s get a bit harder and do a pivot table relating regions and continents with column and row totals.
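A cross-tab in the classic CASE style; only three of the continents are written out in this sketch, and the real thing needs a column per continent:

```sql
SELECT Region = CASE WHEN GROUPING(Region) = 1 THEN 'Total' ELSE Region END,
       Africa = SUM(CASE WHEN Continent = 'Africa' THEN BarrelsPerDay ELSE 0 END),
       Asia   = SUM(CASE WHEN Continent = 'Asia'   THEN BarrelsPerDay ELSE 0 END),
       Europe = SUM(CASE WHEN Continent = 'Europe' THEN BarrelsPerDay ELSE 0 END),
       Total  = SUM(BarrelsPerDay)                  -- row total
FROM OilProducers
GROUP BY Region WITH ROLLUP
ORDER BY GROUPING(Region), Region
```

If you go on to take averages of the pivoted columns, make the ELSE branch NULL rather than 0, so that empty cells don’t drag the average down (a reader makes the same point in the comments).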

The ‘order by grouping’ trick ensures that the total comes in the right place on the last row!

Enumerating the Group members

This is all OK, but a lot of people want, and expect, a list of all the constituents of their grouping in the result. They don’t like DBAs laughing and saying it isn’t possible. There are now CLR aggregation routines that do it, but here is a method that works on SQL 2000. Not a cursor in sight! (It works a lot better in 2005 with VARCHAR(MAX).)
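The SQL 2000 trick is a scalar UDF that builds the list by concatenating into a variable (the concatenation order isn’t formally guaranteed by the engine, but it works in practice; on 2005 you would declare VARCHAR(MAX) instead of VARCHAR(8000)):

```sql
CREATE FUNCTION dbo.CountryList (@Continent VARCHAR(40))
RETURNS VARCHAR(8000)
AS BEGIN
    DECLARE @list VARCHAR(8000)
    -- COALESCE handles the first item, which has no preceding comma
    SELECT @list = COALESCE(@list + ', ', '') + Country
    FROM OilProducers
    WHERE Continent = @Continent
    ORDER BY Country
    RETURN @list
END
GO
SELECT Continent,
       SUM(BarrelsPerDay) AS [Barrels Per Day],
       dbo.CountryList(Continent) AS Countries
FROM OilProducers
GROUP BY Continent
```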

Now this technique could get quite interesting if combined with ‘ranging’ where you can impose categories onto the data of your choice (Date ranges are very common).

Ranging

By ranging, I mean imposing arbitrary value ranges and grouping by them. You can, of course, use a helper table to do this much more elegantly.
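The inline version just repeats a CASE expression in the SELECT and in the GROUP BY; the band boundaries below are arbitrary ones of our own choosing:

```sql
SELECT [Production Range] =
         CASE WHEN BarrelsPerDay < 100000  THEN 'under 100,000'
              WHEN BarrelsPerDay < 1000000 THEN '100,000 to 999,999'
              ELSE '1,000,000 and over' END,
       [No. of Countries] = COUNT(*),
       [Barrels Per Day]  = SUM(BarrelsPerDay)
FROM OilProducers
GROUP BY CASE WHEN BarrelsPerDay < 100000  THEN 'under 100,000'
              WHEN BarrelsPerDay < 1000000 THEN '100,000 to 999,999'
              ELSE '1,000,000 and over' END
```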

The code above is much more efficient than it looks, but why not make a User-defined function to do it?

So let’s combine ranging and enumeration!

We can list all the countries that fall in each range category….

Grouping as a utility

One can use grouping in a variety of ways that have nothing to do with reporting. Here is an example of using GROUP BY to help produce a table by splitting a delimited list. It needs the Number Helper Table (you’ll have to refer to the ‘Helper Table Workbench’ to find out about that).
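To give the flavour of it, here is the standard Numbers-table split (our sketch rather than the article’s listing; it assumes a helper table Numbers with a Number column holding the integers from 1 upwards):

```sql
DECLARE @list VARCHAR(8000)
-- wrap the list in delimiters so every item has a comma on each side
SET @list = ',' + 'Algeria,Angola,Egypt' + ','
SELECT Item = SUBSTRING(@list, Number + 1,
                        CHARINDEX(',', @list, Number + 1) - Number - 1)
FROM Numbers
WHERE Number < LEN(@list)
  AND SUBSTRING(@list, Number, 1) = ','   -- each comma marks the start of an item
```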

Questions

Send your answers to editor@simple-talk.com. The first three correct entries will receive a much-coveted Simple-Talk goodie bag (polo shirt, USB key, bottle opener, beermat, pen).

  1. How would you get the count of the number of rows in a table with NULLS in a particular column, using GROUP BY, but without using a COALESCE or CASE statement?
  2. What is the GROUPING() function useful for?
  3. Can a WHERE clause contain an aggregate function?
  4. When would you need to use a HAVING clause?
  5. What does the ROLLUP do? How would you use it?
  6. Can you use UDFs (user-defined scalar-valued functions) in GROUP BY clauses?

See also other Workbenches at Simple-Talk

Robyn Page’s SQL Server DATE/TIME Workbench, Robyn Page
Date calculation and formatting in SQL Server can be surprisingly tricky. Robyn Page’s “hands-on” workbench will lead you through the minefield.

Robyn Page’s SQL Server String Manipulation Workbench, Robyn Page
String searching and manipulation in SQL Server can be error-prone and tedious…unless you’re armed with the techniques described in Robyn’s string manipulation workbench…

SQL Server Error Handling Workbench, Grant Fritchey
Grant Fritchey steps into the workbench arena, with an example-fuelled examination of catching and gracefully handling errors in SQL 2000 and 2005, including worked examples of the new TRY..CATCH capabilities.

Robyn Page’s SQL Server Cursor Workbench, Robyn Page
The topic of cursors is the ultimate “hot potato” in the world of SQL Server. Everyone has a view on when they should and mainly should not be used. By example and testing Robyn Page proves that, when handled with care, cursors are not necessarily a “bad thing”.

Robyn Page’s SQL Server Data Validation Workbench, Robyn Page
Robyn Page provides essential techniques for ensuring the validity of the data being entered into your SQL Server tables.

Robyn Page’s Excel Workbench, Robyn Page and Phil Factor
The need to produce Excel reports from SQL Server is very common. Here, Robyn Page and Phil Factor present practical techniques for creating and manipulating Excel spreadsheets from SQL Server, using linked servers and T-SQL. The pièce de résistance is a stored procedure that uses OLE Automation…

Robyn Page’s SQL Server Security Workbench, Robyn Page and Phil Factor
Robyn Page and Phil Factor present practical T-SQL techniques for controlling access to sensitive information within the database, and preventing malicious SQL injection attacks.

SQL Server Helper Table Workbench, Robyn Page and Phil Factor
Cursors and iterations are both renowned for slowing down Transact SQL code, but sometimes seem unavoidable. In this workbench, Robyn Page and Phil Factor demonstrate some set-based techniques for string manipulation and time interval-based reporting, which use helper tables instead.

Keep up to date with Simple-Talk

For more articles like this delivered fortnightly, sign up to the Simple-Talk newsletter

Downloads



Robyn Page has worked as a consultant with Enformatica and USP Networks with a special interest in the provision of broadcast services over IP intranets. She was also a well known actress, being most famous for her role as Katie Williams, barmaid and man-eater in the Television Series Family Affairs. She is currently having a career break to raise a young family.

Phil Factor (real name withheld to protect the guilty), aka Database Mole, has 20 years of experience with database-intensive applications. Despite having once been shouted at by a furious Bill Gates at an exhibition in the early 1980s, he has remained resolutely anonymous throughout his career.

View all articles by Robyn Page and Phil Factor

  • Anonymous

    Average in pivot
    Be careful with zeros and nulls when pivoting using a case statement if you are going to be taking averages — you would want to have the “else” part be null instead of zero so that the “else” rows aren’t included in the count for determining the average.

  • Anonymous

    wahhh…………………..
    Hai Hai,

    I am beginners. Its really help me a lot. Thanks you so much. Please give me more samples/links like above.

  • Anonymous

    On the spot
    Just the reference I needed for my current project. Thank you!

    Best regards
    LB

  • Anonymous

    What about @OtherErrorHappened?
    In the Duplicates section, you look for errors when you delete the dupes from the original table, but then you don’t check to see if there were any errors in @OtherErrorHappened before you commit.

    Why is this?

  • dolson

    Wrong link
    Great article.

    FYI: The link for “SQL Server Helper Table Workbench” goes to the “SQL Server Security Workbench” page:
    sql-server-security-workbench-part-1

  • Anonymous

    Did we move
    Great article, looking at the data, not sure when the United States moved to asia

  • Robyn Page

    Re: @OtherErrorHappened , Did We Move etc.
    Oops. Phil took something out temporarily in order to test it, and forgot to put it back in! Or was it me? (senior moment)
    Thanks too to an anonymous person who pointed out a confusing typo in a comment where we put GROUP BY instead of HAVING. (brain fade)
    The Oil-Producers data came from several sources and needed a lot of cleaning, but I can’t use the excuse of Continental Drift for the States wandering to Asia. I know we’re insular, but this is ridiculous! (blond moment)

  • Phil Factor

    Re: Did We Move etc. Continental drift
    Source files fixed. The United States is safely back in North America. Apologies. We’d like to take Britain out of Europe as easily but that is another issue!

  • Anonymous

    Removing duplicates – maybe this is better idea than loop etc 😉

  • Robyn Page

    Re: Removing Duplicates
    Dunno, but it is certainly cleverer!

  • Anonymous

    Duplicates
    Loops should be avoided.

  • Anonymous

    Can’t download the textfile!!!!!
    Sorry! We cannot display this page because of an error.
    We have alerted the Simple Talk team, and will fix the problem as soon as we can.

    In the mean time, please try exploring other parts of our site.

    If you repeatedly get this error, please contact editor@simple-talk.com

  • Phil Factor

    Re: Can’t download the textfile!!!!!
    Whatever it was seems to have stopped happening. We’ll get it checked out.

  • iman770

    please help me !
    hi; i iman from iran
    i start new project with SQL
    i need code about text search in database with any format in for exam .txt or .doc or .pdf
    i want this code for persian language .
    thanks

  • Anonymous

    Variance Calculation
    I was looking for some help with a matrix style report I am trying to develop.

    Example, months accross the top, columns would be product grouping, then year. The detail would be a summation of invoice value. I can achieve this no problem. The issue is when I want to get a variance between the two years, i.e. Jan 2006 vs 2007 as row, underneath each month.

    Any help would be greatly appreciated.

    Kind Regards

  • Phil Factor

    Re: Variance Calculation
    I’m not sure what precisely you mean by variance. This is a statistical term that applies to a list of numbers, but here you just have the data that applies to just two dates. Do you mean the difference between the two? (Normally given as a percentage)

  • Anonymous

    Re:Variance Calculation
    By Variance I mean. There will be sales data for say Jan 2007 and Jan 2006 and so on throughout the year. In my matrix example the months would run along the top and the two years as rows. So I trying to calculate the variance between Jan 2007 and Jan 2007 and display in a row. I hope that makes more sense. Thanks Mark

  • Anonymous

    Some of the ‘duplicates’ code needs to be updated…
    you need to change the code:

    –let’s maliciously create some duplicates!
    INSERT INTO OilProducers(Country,BarrelsPerDay,continent, region)
    SELECT Country,BarrelsPerDay,continent, region
    FROM OilProducers WHERE country LIKE ‘A%’
    INSERT INTO OilProducers(Country,BarrelsPerDay ,continent, region)
    SELECT Country,BarrelsPerDay,continent, region
    FROM OilProducers WHERE country BETWEEN ‘A’ AND ‘E’
    –you can then easily see these duplicates by
    SELECT ‘First_ID’=MAX(OilProducer_ID), ‘No.InTable’=COUNT(*)
    FROM OilProducers
    GROUP BY country
    HAVING COUNT(*)>1

    TO:

    INSERT INTO Duplicates(Country,BarrelsPerDay,continent, region)
    SELECT Country,BarrelsPerDay,continent, region
    FROM OilProducers WHERE country LIKE ‘A%’
    INSERT INTO Duplicates(Country,BarrelsPerDay ,continent, region)
    SELECT Country,BarrelsPerDay,continent, region
    FROM OilProducers WHERE country BETWEEN ‘A’ AND ‘E’

    SELECT ‘First_ID’=MAX(OilProducer_ID), ‘No.InTable’=COUNT(*)
    FROM Duplicates
    GROUP BY country
    HAVING COUNT(*)>1

  • Anonymous

    aaargh – ignore comment above 😉

    you don’t need to change the code 🙂

  • Anonymous

    Nice
    Its Nice article and posting of Removing Loops is nice

  • Anonymous

    FTS
    iman770, look into Full-Text Search in Books Online. It does exactly what you’re asking.

  • Anonymous

    Top n and others
    Hi!
    In the query “or, more tiresomely, for the top five, the others as an ‘Others’ row AND the sum total!” I want to show the columns “Continent” and “Country” as well, how can I do that?

    Thanks for great tips!
    Lasse

  • Sarah Grady

    Anonymous Comments Disabled
    Due to a large volume of spam, anonymous comments have been disabled.