Click here to monitor SSC

Brad M McGehee

Focus on SQL Server
Check out my Simple-Talk articles.
Add to Technorati Favorites      Add to Google     

 RSS Feed

     Twitter      View Brad McGehee's profile on LinkedIn

Reasons Why You May Not Want to Use a Heap

Published Friday, February 27, 2009 10:57 AM

For a book project I am working on, I have started compiling a list of reasons why heaps aren't usually a good choice when designing a database. Now, I didn't say that heaps are always bad, but in most cases, I think they should be avoided (read the list to see why).

As I compiled the list, in some cases I have had the opportunity to verify that they are true, but in other cases, I have not had the time to verify them. I would like your input on the list. Is there anything on the list that is not true, or only partially true (under some circumstances, but not others), and what other reason are there to avoid using heaps that have I left out?

I look forward to your feedback.

  1. If non-clustered indexes are not added to a heap, then all queries against a heap will require table scans to retrieve the requested data. If the heap is large, then these queries will be very resource intensive and hurt SQL Server's overall performance.
  2. Since the data in a heap is unordered, performing a table scan on a heap can cause a lot of extra I/O activity because inefficient random reads, not efficient sequential reads, are more the norm.
  3. While a non-clustered index can be added to a heap to speed up some queries, when the non-clustered index is non-covering, the use of a RID bookmark lookup is required. A RID lookup means that once the record(s) to be returned by the query are identified in the non-clustered index, additional reads (the RID bookmark lookup) must be made of the related rows in the heap, so that all of the data requested by the query is returned. This is not very I/O efficient, especially if many rows are returned. At some point, it may be faster for SQL Server to do a table scan than it is to use a non-clustered index when returning many rows. On the other hand, if the non-clustered index is covering, then the non-clustered can be used to immediately return the data to the user without having to lookup anything in the heap.
  4. Heaps cannot be replicated using SQL Server replication.
  5. If you want to create an XML index on an XML data column, a clustered index must exist on the table.
  6. If you want to create a spatial index on a spatial data column (GEOMETRY or GEOGRAPHY), a clustered index must exist on that table.
  7. If a heap has a non-clustered index on it (as the primary key), and data is inserted into the table, two writes have to occur. One write for inserting the row, and one write for updating the non-clustered index. On the other hand, if a table has a clustered index as the primary key, inserts take only one write, not two writes. This is because a clustered index, and its data, are one in the same. Because of this, it is faster to insert rows into a table with a clustered index as the primary key than it is to insert the same data into a heap that has a non-clustered index as its primary key. This is true whether or not the primary key is monotonically increasing or not.
  8. When data is updated in a heap, and the updated row is larger than the old row and can't fit into the old space, a forwarding record is inserted into the original location that points to the new location of the page. If this happens a lot, then there is a lot of space wasted in a database maintaining the forwarding records. This also contributes to additional I/O activity as both the pointer, and the row, have to be read.
  9. Even if data updated in a heap is not larger than the old row (the updated data is smaller or the same size than the original data), updating a heap with a non-clustered primary key is slower than updating the same table that has a clustered index as the primary key. This is because updating a table with a clustered index is less write intensive than updating a heap with a non-clustered index as its primary key.
  10. If a row is deleted from a heap with a non-clustered index as its primary key, it is slower than deleting the same row from the same table with a clustered index as its primary key. This is because it takes more I/O to perform this task on a heap than on a clustered index.
  11. When data is deleted from a heap, the data on the page is not compressed (reclaimed). And should all of the rows of a heap page are deleted, often the entire page cannot be reclaimed. This not only wastes space, it contributes to fragmentation of the data pages within a database.
  12. If you take two identical tables, one that is a heap with a non-clustered index as its primary key, and a table that has a clustered index as its primary key, the heap with the non-clustered index will be substantially larger, wasting valuable space and increasing disk I/O.
  13. The ALTER INDEX rebuild and reorganize options cannot be used to defragment and reclaim space in a heap (but they can used to defragment non-clustered indexes on a heap). If you want to defragment a heap in SQL Server 2005, you have three options: 1) create a clustered index on the heap, then drop the clustered index; 2) Use SELECT INTO to copy the old table to a new table; or 3) use BCP or SSIS to move the data from the old table to a new table.

 

 

Share this post :

Comments

 

Greg Linwood said:

It would've been handy if you numbered your bullets, but here are a few corrections:

1) Heap tables CAN be replicated (replication only requires a PK, not a CIX)

2) Heap tables CAN be rebuilt, the correct command is ALTER TABLE, not ALTER INDEX

3) In your first point, you mention that all queries that access a heap without an index perform a table scan - this is fairly obvious really, but it's equally true to say that a table stored on a clustered index without an index on the column being filtered in a query would result in the same outcome (a table scan)

4) In your third point, you say that a bookmark lookup must be used to make use of a non-clustered index on a heap. This is totally incorrect!! Bookmark lookups ONLY occur when a non-clustered index doesn't cover a query's requirements ON A CLUSTERED INDEX. This makes clustered indexes MUCH LESS EFFICIENT when queries aren't covered by a NCIX on a CIX vs on a HEAP!!

There are many other inaccuracies & mis-information in this blog but it's hard to address them all given that your bullets aren't numbered. I hope you find time to tidy this blog up.

Regards,
Greg Linwood
March 2, 2009 3:24 AM
 

Adam Machanic said:

Agreed with most of what Greg said, except for part of point #4. The term "bookmark lookup" is outdated as of SQL Server 2005, but in SQL Server 2000 one can happen either with a NCIX backed by a heap or a NCIX backed by a CIX.  In SQL Server 2005 the verbiage is improved so we now have a "lookup" if there is a CIX or an "RID lookup" if there is a heap.

But Greg's point about performance is close to the mark; RID lookups are at a minimum 3x faster than lookups from a logical reads perspective, depending on the size of the table (this is determined by the index depth). Greg even has a blog post on this subject so I'm going to assume he was tired after too much partying at the MVP summit when he posted the reply here :-)

With regard to your list in general, I think it would have been much more interesting to go the other way: Reasons TO use a heap.  Virtually every SQL Server "tips 'n tricks" document available tells readers not to use heaps (a recommendation that is probably good for newbies but way too strict and general to be of any value to more advanced users).  If you want to provide something of quality for the readers, educate them on when they should use it.  Most of them are already programmed not to.  Make them think.
March 4, 2009 9:51 AM
 

zenon said:

I don't know where to begin; I'm simply overwhelmed by the, the... "reasoning" of the bullet points.

You need to ask a couple of questions to begin with like "What is a heap?" and "what is a clustered index?" and "what is a non-clustered index?"

A heap definition depends upon the version of SQL Server. 2000 heaps are not 2005 heaps and 2005 hepas are not 2008 heaps(?! not sure on the 2008 heaps being different yet. likely not but MS have changed its mind about heaps again with 2008.)

In 2000, the way the data is laid out is different from the way it is laid out than in 2005. For example, 2000 doesn't differentiate between numeric data and character data and 2005 does in the way it is laid down on a heap. AND there is the whole BLOB discussion besides that is different from 2000 and 2005. Each one of these things defines how a heap will respond to an ugly query.

Now a non-clustered index doesn't order the data true but it defines a subset of the data by providing a set of pointers where the data is found. If the index is "tuned" to the queries of the app. it will respond by providing a smaller source set than the heap. Thus non-clustered index quality depends entirely upon whether or not they match the queries the tables and views receive.

Clustered indexes, the love-children of ignorant developers and inexperienced designers, re-order the data. If you don't have non-clustered indexes that map to the queries then there is no difference between a universal scan of the clustered index and a table scan. But you will think you are doing a good job because the execution plan says it is using the index. But so what? A scan is a scan is a scan.

So your bullet points are nice and everything but they demonstrate to me a profound misapprehension of the basic operation of indexes in either 2000 or 2005.  But you make me feel a lot better about the job prospects if this is the likes of my competition during this depression.
March 10, 2009 10:28 AM
 

carolamrt said:

I'm the newbie you all are referring to.  When I created my tables I didn't know the difference.  I upsized Access databases to SQL Server 2005 and ran.  Now I have four tables with something like 18 million records per table and none are clustered.  I just ran through my database clustering the other smaller tables (before I read the comments to the original blog).  And now I have these last four tables on my to do list, assuming they will take all night per table.  I am still using Access as the front end.   Don't hit me.

But should I cluster or should I not cluster?  These are typical customer/invoice tables where I do have lots of indexes.  Purpose of tables is a historical data warehouse.  Nobody is data entering.  All data is imported via nightly process.  

Is there a nice reliable piece of information describing when to heap and when to cluster?  

Thanks.
March 10, 2009 12:06 PM
 

Ray Herring said:

If my screen name was Zenon I would not be so smug about future employment.
Clustered Indexes are not "the love childern of ignorant developers", they are tools that are available to to DBA to assist in tuning a database.  They have just as much of a place as Triggers, Non-Clustered indexes, Statistics, and even Cursors.
Greg and Adam addressed Brad's article with some respect and spoke to specific points.  You, on the other hand, include personal attacks, disrespectful language, broad generalizations, and sophistry in a feeble attempt to enhance your own stature at the expense of Brad's quest for knowledge.  About the only positive thing I can attribute to your post is that it made me finally join Simple Talk.

BTW, I have inherited an application with many of those lovely heaps and I can tell you that they are not generally superior to a well designed table with Clustered and non-clustered indexes.  In one case, no amount of tuning on a non-clustered index could overcome the fact that the underlying heap was so fragmented than a non-clustered index seek took longer than a table scan.
March 10, 2009 2:34 PM
 

Greg Linwood said:

Adam - what do you mean by "...but in SQL Server 2000 one can happen either with a NCIX backed by a heap..."?

This is wrong - Bookmark Lookups CANNOT happen on a NCIX backed by a heap on ANY version of SQL Server.

As for terminology, the term "Bookmark" was invented in the SQL 7 release specifically to describe the change in NCIX implementation when NCIXs were created on CIXs.

Wheter SQL 2005's showplan / graphical display uses this term or not doesn't change the implementation - Bookmark lookups occur today exactly as they did in SQL 7 + versions, regardless any change in terminology. A sceptic might say that the deprecation of this term is an attempt to hide the inefficiency (c:

Cheers,
Greg
March 11, 2009 6:06 AM
 

Adam Machanic said:

Greg: Sorry, but you're mistaken.  I just did the following on a SQL Server 2000 instance:

--
create table #x ( i int, j int )
create index ix_x on #x (i)

set statistics profile on

select *
from #x
where i = 1
--

Result:

--
select *   from #x  where i = 1
 |--Bookmark Lookup(BOOKMARK:([Bmk1000]), OBJECT:([tempdb].[dbo].[#x______000000012314]))
      |--Index Seek(OBJECT:([tempdb].[dbo].[#x________000000012314].[ix_x]), SEEK:([#x].[i]=1) ORDERED FORWARD)
--

March 11, 2009 9:28 AM
 

Adam Machanic said:

Zenon:

"In 2000, the way the data is laid out is different from the way it is laid out than in 2005. For example, 2000 doesn't differentiate between numeric data and character data and 2005 does in the way it is laid down on a heap."

This is simply not correct.  You might want to take some time to read through "Inside SQL Server 2000."
March 11, 2009 9:34 AM
 

bradmcgehee said:

I would like to thank those who provided constructive feedback on this blog post. As I mentioned, the goal of the post was to get feedback on different reasons why a heap may not be a good choice, and that the list of reasons I provided was from various sources that I had collected (and that I had not personally verified them all). The purpose of this blog post was not to provide a definitive list, but to act as a starting point to creating such a list.

In regards to the feedback about creating a list of why you might want to use a heap, that is outside the scope of this blog post. I'll save that for another day.

I neglected to mention that the focus of this list was for SQL Server 2005/2008, not SQL Server 2000. Because of this, I will only refer to feedback in regards to SQL Server 2005/2008.

So let's get back to some of the constructive feedback provided.

In regard to point #1, feedback was given that this point is obvious. Yes, it is obvious to experienced DBAs. I have included this point so the list is inclusive.

In regard to point #3, feedback was given that it was "totally incorrect." In this case, I think the point was misread, or perhaps I did not write it precisely enough. I have altered this point to make it clearer, so there is no confusion. I have underlined that text I added for clarification.

In regard to point #4, this point is incorrect, as the feedback suggested, and should be removed from the list (I have crossed it off the list in the original blog). Transactional replication does require a unique key, but it can clustered or non-clustered. On the other hand, adding a clustered index to the table (so that it is not a heap) offers many of the other advantages of a clustered index that are included in this list.

In point #13, I said that a heap cannot be rebuilt using the ALTER INDEX command. This is true. Feedback was provided that a heap can be rebuilt using the ALTER TABLE instead. This is true for SQL Server 2008 only, not in SQL Server 2005. For example, "ALTER TABLE heaptablename REBUILD" can be used to rebuild a heap in SQL Server 2008. I have added this new information to the original post, as underlined text. Thanks for bringing this to my attention, as I had not noticed this new feature in SQL Server 2008 before.

In regard to the use of the term "bookmark lookup,” I often used it as a generic term for "Key Lookup" and "RID Lookup". While this is less precise, I chose to do this because I didn't want to take the time to explain the difference between the two. On the other hand, if you feel it is important to always distinguish between the two in order to always be technically accurate, then this is good feedback.

There were some additional comments that there were other points that were not correct in this list. If you believe this is the case, please tell me what they are so I can correct them. The more precisely you can provide the feedback, the easier it will be for everyone to understand what you are talking about.
March 16, 2009 1:10 PM
You need to sign in to comment on this blog

About bradmcgehee

Brad M. McGehee is a MCITP, MCSE+I, MCSD, and MCT (former) with a Bachelor’s degree in Economics and a Masters in Business Administration. Currently the Director of DBA Education for Red Gate Software, Brad is an accomplished Microsoft SQL Server MVP with over 16 years SQL Server experience, over 7 years training experience, and has been involved in the industry since 1982. Brad is a frequent speaker at SQL PASS, European PASS, SQL Server Connections, SQLTeach, devLINK, SQLBits, SQL Saturdays, TechFests, Code Camps, SQL Server user groups, and other industry seminars, where he shares his 16 years of cumulative knowledge and experience. In 2009, Brad made 33 public presentations to a total of 1,853 attendees, in six different countries. In 2010, Brad made 31 public presentations to a total of 3,156 attendees in two different countries. Brad was the founder of the popular community site SQL-Server-Performance.Com, and operated it from 2000 through 2006, where he wrote over one million words on SQL Server topics. A well-respected and trusted name in SQL Server literature, Brad is the author or co-author of more than 15 technical books and over 275 published articles. His most recent books include How to Become an Exceptional DBA (2nd Edition), Brad's Sure Guide to SQL Server 2008: The Top Ten New Features for DBAs, Mastering SQL Server Profiler, and Brad’s Sure Guide to SQL Server Maintenance Plans. These books are available free in PDF format at: http://www.sqlservercentral.com/Books/. He blogs at www.bradmcgehee.com.
<February 2009>
SuMoTuWeThFrSa
25262728293031
1234567
891011121314
15161718192021
22232425262728
1234567
How to Kill a Company in One Step or Save it in Three
 The majority of companies that suffer a major data loss subsequently go out of business. Wesley David... Read more...

Migrating from OCS 2007 R2 to Lync: Part 4
 Having migrated the rest of our users and legacy resources across and started getting ready to... Read more...

Automated Script-generation with Powershell and SMO
 In the first of a series of articles on automating the process of building, modifying and copying SQL... Read more...

Seth Godin: Big in the IT Business
 Seth Godin has transformed our understanding of marketing in IT. He invented the concept of 'permission... Read more...

Using SQL Test Database Unit Testing with TeamCity Continuous Integration
 With database applications, the process of test and integration can be frustratingly slow because so... Read more...