Robert Chipperfield

Should you always use a clustered index?

Published Monday, December 17, 2007 5:50 PM

In one of his recent editorials, Steve Jones of SQLServerCentral wondered about whether every table evar should have a primary key, and in doing so, mentioned the often-quoted advice of also making sure every table has a clustered index.

If you already know about the difference between a clustered and non-clustered index, you can safely skip the next couple of paragraphs. If not, here goes:

Take the example of a phone book. The actual data - that is, the name, address and phone number records - is ordered by the name. If you want to look up Joe Bloggs's phone number, you open the book somewhere near the middle, maybe see the names there start with "M", but "Bloggs" is before "M", so you go a bit earlier in the book. You keep narrowing it down until you find the entry labelled Bloggs, and that's it - all the data for that record is right there. That's a bit like a clustered index.

On the other hand, a book might have a table of contents, sorted alphabetically. If you want to find out about llamas, you search the contents for llamas, which probably then gives you a page number, at which point you go to the page, and there's the data about llamas. The difference here is that you've had to do an extra bit of indirection - following the page number pointer - in order to get to the data. You can probably now see that while you can have as many tables of contents, ordered in any way you like, one set of data can only be physically arranged in one way. This means you can have many non-clustered indexes, but only one clustered index on a table.
OK, back to the original point of the post. In SQL Server, if you've only got a non-clustered index on a table, but no clustered index, then the "pointers" in the non-clustered index actually point to the physical location in the file on disk where the data resides. On the other hand, if you've also got a clustered index, then the non-clustered index contains instead the key into the clustered index. This means that whenever you access a record through the non-clustered index, you must first traverse that, then traverse the clustered index as well - more expensive than simply going to the row directly.

So, I was wondering, how expensive is this extra indirection? In the true spirit of micro-benchmarks, the following data should be treated with an appropriate amount of suspicion.

Firstly, I created six tables, each with three integer columns named [one], [two] and [three] (logical!). The test would be to read five million rows from each of these tables, ordered by the value in column one. The data in columns [one] and [two] was random and uncorrelated, and the data in [three] was equal to that in [one]. Before inserting the data, I created the following sets of indexes:

  • None at all!
  • A clustered index on the "wrong" column ([two])
  • A clustered index on the "right" column ([one])
  • A clustered index on [two] and a non-clustered index on [one]
  • A clustered index on [three] and a non-clustered index on [one]
  • A non-clustered index on [one]

After doing this, I restarted the instance of SQL Server in order to clear any cache it might have, then read the contents of the tables using a quick C# application I knocked up. Some of the results really surprised me:
Read 5000000 rows
ClusteredWrongColumn: 15241.072ms
Read 5000000 rows
ClusteredRightColumn: 4229.28ms
Read 5000000 rows
ClusteredAndNonclustered: 19376.368ms
Read 5000000 rows
NonclusteredAndSimilarClustered: 10259.92ms
Read 5000000 rows
NonclusteredOnly: 11011.792ms
Read 5000000 rows
NoIndexes: 17856.96ms

Re-running the application without restarting SQL Server changed the results somewhat, presumably due to caching (the RAM footprint of the instance had gone from 400MB before the first run to 1.3GB afterwards):
Read 5000000 rows
ClusteredWrongColumn: 17715.984ms
Read 5000000 rows
ClusteredRightColumn: 3837.68ms
Read 5000000 rows
ClusteredAndNonclustered: 17120.752ms
Read 5000000 rows
NonclusteredAndSimilarClustered: 14614.512ms
Read 5000000 rows
NonclusteredOnly: 10588.864ms
Read 5000000 rows
NoIndexes: 13752.992ms

So, having the clustered index on the right column obviously beats having it on the wrong column, and having a non-clustered index beats having none at all. So far so good.

Having the non-clustered index be on "similar" (OK, identical, but I bet it'd work with not-quite-identical-but-close data as well) data to the clustered was also clearly better than when they were totally uncorrelated, and this also makes sense: the page cache will get much less thrashed than if you're jumping around all over the place in the clustered index rather than reading it more-or-less sequentially.

As I suspected, a non-clustered index without a clustered index underneath it performed significantly better: ten seconds rather than fifteen to twenty.

However, there was one bit that really surprised me: the ClusteredAndNonclustered result was similar in the second run, and significantly worse in the first run, than the ClusteredWrongColumn result. I guess SQL Server must have done an in-memory sort of the table when the non-clustered index didn't exist, which meant fewer disk reads in order to get the non-clustered index when it did exist. That'd also explain the much closer results in the second run.

What was the point of this post? Curiousity mainly, but it's also worth realising that a clustered index isn't always the best thing in the world for every table, especially if you're searching over a wide variety of very different fields, rather than just doing lookups on one field for the majority of the time. You need to look at each case on its own, and understand exactly what the queries are you're running before making a decision.

Finally, I should mention that this isn't the whole story. My tests used a load of data inserted into an empty database, then queried. Your databases probably have a whole lot more update and delete operations going on, and they can have a big impact on the way data ends up being laid out on disk - yet another variable to consider!

Comments

 

GSquared said:

Another thing to test on the same subject would be "covering indexes".  Those should perform best (at least, for Select), but also take up more disk space.

Another factor to consider is, if you are constantly updating, inserting and deleting data, each index, including the clustered one (if any), has to be re-ordered, etc.  A heap table (no clustered index) is generally fastest for those operations.
December 17, 2007 3:45 PM
 

RobertChipperfield said:

Thanks for the comment. I did mean to look at covering indexes as well, but unfortunately forgot to when I was creating the test cases. Theoretically they should be as good as a clustered index, as you say.

One thing you do start hitting if you have a heap table is that if you do updates of variable width data, you can start getting forwarding records, at which point you do then get an extra indirection again.
December 18, 2007 4:33 AM
 

GSquared said:

Yeah, heap tables have their own issues.
December 18, 2007 11:33 AM
 

gbn said:

You cannot defrag a table without a clustered index (well, you can shrink and compact but not actually reorganise).

The non-clustered index pointing to clustered index also reduces a lot of internal operations (Consistency in ACID) because the clustered index stays more static than the actual data...
December 19, 2007 4:53 AM
 

jakenm said:

There's an interesting Technet "Best Practices" article that discusses the performance differences between tables organized with clustered indexes versus heaps:

http://www.microsoft.com/technet/prodtechnol/sql/bestpractice/clusivsh.mspx

I personally use clustered indexes on all my tables, and I try to pick a column that is unique, increasing, and fairly static, which helps avoid some of the performance issues that can arise with clustered indexes.  As with all things SQL Server, it really depends on your particular situation, but IMO avoiding heaps is a net gain in most situations.
December 20, 2007 5:52 PM
 

RobertChipperfield said:

Wow, that's a really impressive article! Thanks for the pointer...
December 21, 2007 3:38 AM
 

BrianTowers said:

One instance where it never occurred to me to use a clustered index is in the association table used to define a many-to-many relationship between two tables.

If the unique index for table A is Aid and for table B is Bid then my association table defining the many-to-many relationship between A and B has 2 columns, PKAid and PKBid with indexes defined on both. Surely there is no point in having a clustered index on such a table?
January 23, 2008 7:54 AM
 

RobertChipperfield said:

Hi Brian,

Well, "I am not a DBA" :-), but I think some of the time there might be. If you're doing a lot of queries where one of the columns is always specified in a WHERE clause, then it might be a good thing to put the clustered index on that column. If you're sometimes specifying one, and other times specifying the other, then maybe it won't be so good as you'd have to pick one of them to index.

The other gotcha is that if you do create a clustered index on a non-unique column, SQL Server also has to create a hidden "uniquifier" field, so all clustered index keys do end up unique, even if the key you see isn't. This hits the performance of both the clustered and non-clustered indexes, though I've not measured by how much. I'm guessing this will be the case if you used a clustered index on an associations table like you describe.

Rob
January 23, 2008 8:45 AM
You need to sign in to comment on this blog

About RobertChipperfield

I'm a software engineer at Red Gate, where I've worked since September 2006. I've worked on a wide range of products, including SQL Doc, SQL Data Compare, SQL Log Rescue, SQL Multi Script, and ANTS Profiler.

Outside of work, I enjoy amateur radio, electronics, and of course the usual assortment of computer-related technologies, from hardware all the way through to high-level software.


















<December 2007>
SuMoTuWeThFrSa
2526272829301
2345678
9101112131415
16171819202122
23242526272829
303112345
Identity Columns
 When Nigel Rivett takes us on a tour of the apparently innocuous subject of Identity Columns in TSQL,... Read more...

The Why and How of .NET Profiling
 Amirthalingam Prasanna gives a simple and practical guide about why you need to profile your .NET... Read more...

Execution Plan Basics
 Every day, out in the various discussion boards devoted to Microsoft SQL Server, the same types of... Read more...

NET Performance Cribsheet
 Robyn and Phil tackle the topic of how to make .NET applications perform well. As usual, they try to... Read more...

SQL Code Layout and Beautification
 William Brewer takes a look at the whole topic of SQL Code layout and beautification, an important... Read more...