Click here to monitor SSC
Av rating:
Total votes: 31
Total comments: 3


Jonathan Lewis
Oracle to SQL Server, Crossing the Great Divide, Part 3
23 June 2010

We soon learn, in SQL Server, that heaps are a bad thing, without necessarily understanding how or why. Jonathan Lewis is an Oracle expert who doesn't  like to take  such strictures for granted, especially when they don't apply to Oracle. Jonathan discovers much about how SQL Server places data, and concludes from his experiments that heaps perform badly in SQL Server because you cannot specify a fill factor for them.

At the start of the previous installment of this series, I noted that good, scalable database performance is largely dependent on putting the data in the right place, being able to access it efficiently, and avoiding unnecessary overheads. Having looked at ways of generating data of various types and patterns at reasonable volumes, the time has come to investigate how SQL Server handles the placing of data.

Pages and Extents

In Oracle, we talk about data segments. A simple object corresponds to a single data segment, each partition of a partitioned object (or sub-partition of a composite partitioned object) is a separate data segment; segments, in turn, comprise a collection of extents; and an extent is a contiguous set of blocks (pages) in a file.

I started my research into similar data structures in SQL Server. There is no such thing as a "segment" in SQL Server, but a search in Books Online for the word "Extents" proved to be a good choice. The first hit was an article on Understanding Pages and Extents, which was an excellent starting point for learning how SQL Server allocates space and stores data.

I won't repeat the details of my journey through the manuals, but instead will summarize some of the highlights (with comments about Oracle in italics):

  • A page is the smallest unit of storage and is a fixed 8KB
    Oracle can use blocks of 2KB, 4KB, 8KB, 16KB or even, on some platforms, 32KB. A single database can use multiple block sizes. In most cases the default, and best choice, for a platform is 8KB, and mixing block sizes is rarely a good idea.
  • An extent is a collection of eight consecutive pages
    Oracle allows for variable extent sizes, and has various strategies for specifying the extent sizes that should be used for an object or collection of objects.
  • Certain "mapping" pages are used to track the allocation status of other pages, providing information such as which extents have been allocated, the objects to which they have been allocated, which pages in which extents are used, and which pages are free. This leads to references to pages such as Global Allocation Map (GAM) pages, Shared Global Allocation Map (SGAM) pages and Index Allocation Maps (IAM) pages.
    Oracle has two dramatically different ways for dealing with extent allocation ("dictionary managed" and locally managed", and two ways of dealing with block usage ("freelist managed" and "bitmap managed"), but I won't go into details.
  • A single extent may hold pages from several objects. This is a mechanism to avoid wasting space but is only used when an object is very small.
    In Oracle, all the blocks in an extent belong to a single object, but a single extent can be as small as two blocks, and in the most recent version of Oracle an object doesn't allocate an extent until you insert some data into it.

A couple of thoughts that follow on from the above observations: there is an "allocation unit" for space at the operating system level in Windows, which defaults to 4KB on my little laptop. Clearly it would make sense to match the size of the allocation unit to the size to the database page. In fact, since I also noticed that SQL Server does a read-ahead for an entire extent when possible, there is a case for creating an allocation unit of 16KB, 32KB, or 64KB (if these are possible) and making sure you align the database page boundaries with the O/S allocation unit boundaries.  (On the down side, perhaps a larger extent size would result in a "read / fill / write" operation when SQL Server wanted to write a single 8KB block into a 32KB extent.)

Data Storage in Heaps and B-Trees

Moving on from extents to objects, just two clicks away at Table and Index Organization, we find that:

  • A table is always partitioned, although it may (and probably usually does) have only one partition.
  • Each partition consists of many logical components which may be Heaps or (clustered) B-trees – the "HoBT". In other words, someone using SQL Server is likely to think of a "table" as something which includes its indexes.
    In Oracle, the separation of tables and indexes is more clear-cut, to the extent that a partitioned table may have local or global indexes. In other words, a "single partition" of an Oracle index may cover every partition of a partitioned table, or be partitioned using a different strategy from the table partitioning.
  • Each HoBT may hold three types of pages – "row data", "LOB" or "row overflow"
    In Oracle, overflow data may be "chained", holding rows too long for a single block, or "migrated", holding rows that have been updated and have to move because there is not enough space in the current block for the new length of the row. Nevertheless, "overflow" is not treated differently from any other row data. The LOB columns for a table are, like indexes, given their own dedicated segments.

With a little background, then, it's time for some technical investigation. How well can I track where the data goes? In the previous article, I made a few comments about the differences in the amount of space used by Oracle and SQL Server for storing the same data; for the same million rows of data, a heap in Oracle used a about 8% more pages, while the non-clustered primary key index used about 7% less. Since data location is key to performance, I'm going to try and find out what they do differently.

We'll start with the data creation script shown in Listing 1 (see the previous article for the reasons behind the structure of this script). The script is based on a template that I use for generating all sorts of volumes and patterns of data, but in this case creates only 5,000 rows in a heap table with a non-clustered non-unique index:

CREATE TABLE test_table

    (

      id INT ,

      random_data INT ,

      update_date DATE ,

      vc_small VARCHAR(10) ,

      vc_padding VARCHAR(100)

    ) ;

go

 

CREATE INDEX bt_i_rand ON test_table(random_data) ;

go

 

DECLARE @div INT = 50 ;

DECLARE @mod INT = 100 ;

DECLARE @limit INT = @div * @mod ;

DECLARE @driver INT = 1000 ;

 

WITH    generator

          AS ( SELECT   1 AS id

               UNION ALL

               SELECT   id + 1

               FROM     generator

               WHERE    id < @driver

             )

    INSERT  INTO test_table

            SELECT  id ,

                    ABS(xx % @mod) ,

                    NULL ,

                    NULL ,

                    REPLICATE('x', 100)

            FROM    ( SELECT TOP ( @limit )

                                @driver * ( g1.id - 1 ) + g2.id id ,

                                CAST(NEWID() AS VARBINARY) xx

                      FROM      generator g1

                                CROSS JOIN generator g2

                    ) iv

    OPTION  ( MAXRECURSION 0, FORCE ORDER ) ;

Listing 1: Creating test_table and populating it with data

During my research into space allocation in SQL Server, I found a script that reported the space used by a table (and its indexes), by querying the sys.allocation_units view. I adapted this query for my own requirements, as shown in Listing 2 (the STR commands are there simply to make the output tidy in SQLCMD).

SELECT  SUBSTRING(tab.name, 1, 16) table_name ,

        tab.object_id object_id ,

        prt.index_id index_id ,

        SUBSTRING(alu.type_desc, 1, 12) alloc_type ,

        alu.data_space_id ,

        STR(alu.total_pages, 8, 0) tot_pages ,

        STR(alu.used_pages, 8, 0) used_pages ,

        STR(alu.data_pages, 8, 0) data_pages

FROM    sys.schemas sch

        INNER JOIN sys.tables tab ON tab.schema_id = sch.schema_id

        INNER JOIN sys.partitions prt ON prt.object_id = tab.object_id

        INNER JOIN sys.allocation_units alu
                                  ON alu.container_id = prt.partition_id

WHERE   sch.name = 'DBO'

ORDER BY tab.name ,

        prt.partition_id ,

        prt.index_id ,

        alu.allocation_unit_id

go

Listing 2: Reporting on space allocation in test_table

After creating a new database (called testdata), and creating the one (heap) table and index shown in Listing 1, the output of the query was as follows:

table_name    object_id   index_id   alloc_type   data_space_id  tot_pages  used_pages  data_pages

---------------- ----------- ----------- ------------ ------------- --------- ---------- ---------

test_table    2105058535  0          IN_ROW_DATA  1              81         80          79

test_table    2105058535  2          IN_ROW_DATA  1              25         19          17

The index_id is zero for the heap table and 2 for the index. If I had created the index as a clustered index then there would be just one line in this report, with an index_id of 1.

We have 81 pages allocated to the heap table, of which 79 are used for data, and 25 pages allocated to its index, of which 17 are used for data.

In the previous article, I hinted that there appears to be no equivalent in heap tables for the "fill factor" that is an option for indexes (in Oracle, the PCTFREE parameter, which applies to tables and indexes, is effectively the same as "100 – fill factor"). Unfortunately, no-one took the hint so it's time to find out the hard way how well filled are the table blocks, which means dumping the actual pages...

Investigating Table Blocks using DBCC IND and DBCC PAGE

I had to find a way of listing the pages allocated to the table, and then dump them. Tricky, but I had noticed that whenever you want to do anything subtle in SQL Server you need to use the DBCC command, so I tried a Google search for "DBCC PAGE", and soon found a collection of interesting articles written by Paul Randall, the first of which gave me the syntax of both DBCC PAGE command, following which I soon found the "DBCC IND" command.

  • DBCC IND( database, table, index_id )
  • DBCC PAGE ( database, file_id, block_id, level)

I started by investigating the heap, with a call to DBCC IND, shown in Listing 3.

DBCC ind('testdata', 'test_table',0)

Listing 3: Calling DBCC IND

Despite what may be suggested by the "ind", this produces a list of the blocks in the table:

PageFID   PagePID     IAMFID   IAMPID    ObjectID    IndexID   PartitionNumber  PartitionID

------- ----------- ------ ----------- ----------- ----------- --------------- -----------------

1         154         NULL     NULL      2105058535  0         1               72057594038779904

1         153         1        154       2105058535  0         1               72057594038779904

1         157         1        154       2105058535  0         1               72057594038779904

1         158         1        154       2105058535  0         1               72057594038779904

 

iam_chain_type       PageType IndexLevel NextPageFID NextPagePID PrevPageFID PrevPagePID

-------------------- -------- ---------- ----------- ----------- ----------- -----------

In-row data          10       NULL       0           0           0           0

In-row data          1        0          0           0           0           0

In-row data          1        0          0           0           0           0

In-row data          1        0          0           0           0           0

This tells us that file 1, block 153 is the first data block in the table. So let's dump it, as shown in Listing 4, and see what's in it. Note that in SSMS, you will first need to turn on a trace flag (3604), using DBCC TRACEON (3604).

DBCC PAGE (testdata,1,153,3)

Listing 4: Dumping page 153

The level 3 dump gives us a full symbolic dump, from which I've extracted a few lines:

PAGE: (1:153)

 

...{page header was here}

 

Slot 0 Offset 0x60 Length 124

 

Record Type = PRIMARY_RECORD         Record Attributes =  NULL_BITMAP VARIABLE_COLUMNS

Record Size = 124     

 

Slot 0 Column 1 Offset 0x4 Length 4 Length (physical) 4

 

id = 1                              

 

Slot 0 Column 2 Offset 0x8 Length 4 Length (physical) 4

 

random_data = 32                    

 

Slot 0 Column 3 Offset 0x0 Length 0 Length (physical) 0

 

update_date = [NULL]                

 

Slot 0 Column 4 Offset 0x0 Length 0 Length (physical) 0

 

vc_small = [NULL]                   

 

Slot 0 Column 5 Offset 0x18 Length 100 Length (physical) 100

 

vc_padding = xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

 

Slot 1 Offset 0xdc Length 124

 

Record Type = PRIMARY_RECORD         Record Attributes =  NULL_BITMAP VARIABLE_COLUMNS

Record Size = 124  

 ...

Slot 63 Offset 0x1ee4 Length 124

 

Record Type = PRIMARY_RECORD         Record Attributes =  NULL_BITMAP VARIABLE_COLUMNS

Record Size = 124          

The block held 64 rows (slots 0 to 63) and each one was 124 bytes, for a total of 7,936. Add a couple of bytes for each pointer in the table of "Offsets", and the total gets to 8,064. Add the block header (which I believe is 96 bytes) and there are only 32 bytes of free space in the block.

This is rather bad news if you're hoping to update any of the data in the table. Unless there's a "fill factor" for heap tables, this default 100% fill gives anyone using SQL Server a fairly compelling reason for using nothing but clustered indexes, unless there are some tables that aren't going to need any updates.

To demonstrate the problem of updates, let's perform the following simple update of test_table, as shown in Listing 5.

UPDATE  test_table

SET     vc_small = 'xxxxxxxxxx' ;

Listing 5: Updating the test_table table

This is what slot 3 (amongst others) looks like after I've executed the update and then re-run the level 3 dump command.

Slot 3 Offset 0x1f2 Length 9

 

Record Type = FORWARDING_STUB        Record Attributes =                  Record Size = 9

 

Memory Dump @0x4176C1F2

 

00000000:   040e0100 00010008 00†††††††††††††††††........               

Forwarding to  =  file 1 page 270 slot 8  

Slots 0, 1, and 2 show vc_small = xxxxxxxxxx, but between them the three rows have taken up 30 of the 32 bytes that I had estimated as free space, so there's no room in the block for the update to slot 3 and the row has been copied to another block, leaving nothing but a pointer behind. The space freed up by this "row migration" allowed several more updates to take place in the block before it was full again, and I ended up with a total of 5 forwarding stubs in the block.

To an Oracle database designer, if there is no specific need to use a clustered index, to group data in a pattern that's different from the order that reflects the natural order of arrival of the data, then a heap table is the obvious structure for the table. However, when you implement a heap table in SQL Server, any updates to the data will have a negative impact on performance because of the way that rows will move to different blocks. This, presumably, is one reason why there seems to be such enthusiasm for creating a clustered index on an identity column, as it keeps the data collected in order of arrival but allows you to leave some space in each block for updates by specifying an appropriate fill factor.

What about non-clustered indexes on heap tables? How do they cope with row movement? Here are the first few lines of output from using the DBCC IND command on our index:

PageFID PagePID     IAMFID IAMPID      ObjectID    IndexID     PartitionNumber PartitionID

------- ----------- ------ ----------- ----------- ----------- --------------- ------------------

1       156         NULL   NULL        2105058535  2           1               72057594038845440

1       155         1      156         2105058535  2           1               72057594038845440

1       172         1      156         2105058535  2           1               72057594038845440

1       173         1      156         2105058535  2           1               72057594038845440

 

iam_chain_type       PageType IndexLevel NextPageFID NextPagePID PrevPageFID PrevPagePID

------------------- -------- ---------- ----------- ----------- ----------- -----------

In-row data          10       NULL       0           0           0           0

In-row data          2        0          1           248         0           0

In-row data          2        1          0           0           0           0

In-row data          2        0          1           215         1           210

You'll notice from the PagePID that the index and the table start off sharing the same extent (the table uses blocks 153, 154, 157, 158 and 159 and the index is using blocks 155 and 156).

Block 156 is the IAM page (PageType = 10) for the index, and block 172 is the root block (IndexLevel = 1, so it's not a leaf block). Lets' dump block 155, the first leaf block and take a look at the index entries:

FileId      PageId    Row  Level random_data (key) HEAP RID (key) KeyHashValue   

------ ----------- ------ ------ ----------------- -------------- ----------------

     1         155      0      0                 0 0x020100000100 (03003c52c4d8) 

     1         155      1      0                 0 0x050100000100 (0600e2cd409d) 

...

     1         155    143      0                 3 0x0D0100000100 (1100bb6857c6) 

     1         155    144      0                 3 0x990000000100 (9d00b21bdb69) 

     1         155    145      0                 3 0x9F0000000100 (a300191bb3a4) 

...

     1         155    307      0                 6 0xBE0000000100 (c50036fde6b2) 

     1         155    308      0                 6 0xBF0000000100 (c600ed093b03) 

 

(309 rows affected)

I've shown the start, the end, and a patch in the middle, from the list of leaf entries. You'll notice that each entry consists of key value (my random_data column), a HEAP RID (row identifier), which seems to be the block address for the key. Since the HEAP RID is also labeled as "key", I assume that it has been appended to the real key value as an aid to ordering the appearance of duplicates; certainly the RIDs appear to be sorted within key. Interestingly, the byte-ordering of the HEAP RID means that it could, in principle, introduce a greater degree of random I/O than necessary – it looks as if rows with the same key in adjacent blocks may appear at non-adjacent positions in the index.

I've printed a section from the index which shows a few occurrences of the random_data value of 3 because it's reporting one of the rows in the table block I dumped: the HEAP RID = 0x990000000100 matches file 1, block 153 (it looks like first 4 bytes is the block id, last two is the file id).

There's a problem though: slot 47 of the table block is the slot that holds the value 3, and I can't see the reference to slot 47 anywhere in the block dump of the leaf block. Does this mean that index look-ups are only accurate to the table block, leaving SQL Server to search the whole table block for the matching key? It seems unlikely, especially when you think of the possible consequences this would have for finding migrated rows.

Fortunately, by switching the dump level to 2 (raw dump row details) we can find the following in the file:

Slot 144, Offset 0x60, Length 16, DumpStyle BYTE

Record Type = INDEX_RECORD           Record Attributes =  NULL_BITMAP     Record Size = 16

Memory Dump @0x4146C060

00000000:   16030000 00990000 0001002f 00020000 †.........../....         

Note the "2f"; converted to decimal, that's the number 47 that we needed for the slot identifier. You might also note that this version of the dump doesn't include anything that looks like the KeyHashValue. Just as in Oracle, some of the dump files report information that is derived at the time of the dump and some of the real information is missing from the dump.

Having tracked down the link from an index leaf to the table row, it takes just a few minutes to check that when a row is migrated from the heap table, the index entry is not updated; it points to the forwarding stub, rather than pointing to the new location of the row (Oracle adopts the same strategy).

Tackling one structure at a time is probably enough. I shall take a look at clustered indexes (unique and non-unique) in the next article.

Conclusions

Unless I've missed something obvious, the main conclusion that I've reached about SQL Server and heap tables is that you cannot specify the equivalent of a fill factor for a heap table. This means that updates to heap tables will probably result in an unreasonable amount of row movement and so leading to an unreasonable performance overhead when you query the data, because you will have to follow a forwarding link to find the row. The absence of a fill factor is (by itself) enough of a threat to make heap tables in SQL Server fairly undesirable.



This article has been viewed 7904 times.
Jonathan Lewis

Author profile: Jonathan Lewis

Jonathan Lewis is well-known in the Oracle world as a freelance consultant with 22 years of experience with the Oracle RDBMS engine. His specialist skills are in the area of physical database design, and solving performance issues. Despite the differences in the software, he finds that the fundamental principles of solving performance issues don't really seem to change as you move from Oracle to SQL Server.

Jonathan is the author of 'Cost Based Oracle – Fundamentals' published by Apress, and 'Practical Oracle 8i – Designing Efficient Databases' published by Addison-Wesley, and has contributed to three other books about Oracle. His blog is at http://jonathanlewis.wordpress.com, where his first note on SQL Server 2008 appears at: http://jonathanlewis.wordpress.com/2010/02/04/sql-server/

Search for other articles by Jonathan Lewis

Rate this article:   Avg rating: from a total of 31 votes.


Poor

OK

Good

Great

Must read
 
Have Your Say
Do you have an opinion on this article? Then add your comment below:
You must be logged in to post to this forum

Click here to log in.


Subject: Storage Whitepaper Suggestion
Posted by: williamd (view profile)
Posted on: Tuesday, June 29, 2010 at 7:55 AM
Message: Johnathan, thanks for the interesting series. It is nice to see what someone from the "dark-side" sees when trying SQL Server. The (sometimes not so) subtle differences in thinking that go into the two products are quite suprising.

You mentioned that pairing the allocation unit size to the storage model of SQL Server could be advantageous. This is in fact a recommended practice, as far as I am aware. There is also a common gotcha in disk partition alignment under windows, that can cause significant performance boosts when adhered to (not just for SQL Server).

May I suggest that you take a look at the SQLCAT website sqlcat.com (SQL Server Customer Advisory Team) to see a wide range of whitepapers concerning SQL Server.

Particularly interesting is their I/O considerations for SQL Server which is hosted at MSDN : http://msdn.microsoft.com/en-us/library/ee410782(SQL.100).aspx

I look forward to the remainder of your articles to see what else you find.

Subject: Fillfactor
Posted by: Emtucifor (view profile)
Posted on: Friday, July 02, 2010 at 7:54 PM
Message: By the time I saw your comment on the other article, it was closed for more comments, perhaps you had already written this one?

Anyway, the hint for setting fillfactor upon index creation is something like WITH (FILLFACTOR = 90)... I don't remember exactly. But since a heap has no index, I'm not sure where that would go. (I would look this up myself but I have to run and wanted to post before closing down my workstation, sorry about that...) So I recommend looking at the CREATE TABLE and CREATE INDEX statements for any reference to the FILLFACTOR hint.

Subject: Heaps vs. Clustered indexes - and more on the fill factor.
Posted by: Jonathan Lewis (view profile)
Posted on: Saturday, July 03, 2010 at 1:43 PM
Message: @williamd,

Thanks for the reference to the document on storage. It was really good example of how to write an intelligent technical analysis that covered the pros and cons of different options.

I have to say, though, that one of the other items I found on the SQLCAT site was amazingly misleading (though probably not by intent). It was something I found while searching the site to see if they had anything to say about heap tables.What I found was a comparison (dated March 2007) between tables organized with clustered indexes and heaps:
http://sqlcat.com/whitepapers/archive/2007/12/16/comparing-tables-organized-with-clustered-indexes-versus-heaps.aspx

The paper produced the "general observations" that clustered indexes were good and heap tables were bad. But the observations weren't in the slightest bit general - they were about a data set and series of tests which looked as if they had been deliberately chosen to emphasize the benefits you could get from using a clustered index in the right circumstances.


@emtucifor,
I read the manual pages for CREATE TABLE and CREATE INDEX before I made the comment that SQL Server has a fill factor for indexes that doesn't seem to be available for heap tables. It was the absence of an explict fill factor for heaps that made me search for an option to dump data blocks so that I could see if there was an implicit fill factor.

In fact I had an important thought shortly after SimpleTalk published this article and did a couple of extra tests that showed that SQL Server appears to have a fill factor of 95% for heap tables (i.e. 5% space left for updates).

Unfortunately it looks as if SQL Server checks this limit AFTER statement execution, which means that code that uses array processing could easily cause blocks to be filled well past 95%. By comparison Oracle checks the PCTFREE (100 - fillfactor) before each row - which is why heap tables can be more effective in Oracle than they are in SQL Server. (The important thought was that Oracle and SQLServer don't have to do things the same way at the same point in the code.)

 










Phil Factor
Automated Script-generation with Powershell and SMO
 In the first of a series of articles on automating the process of building, modifying and copying SQL Server... Read more...



 View the blog
Using SQL Test Database Unit Testing with TeamCity Continuous Integration
 With database applications, the process of test and integration can be frustratingly slow because so... Read more...

SQL Source Control: The Development Story
 Often, there is a huge difference between software being easy to use, and easy to develop. When your... Read more...

How to Import Data from HTML pages
 It turns out that there are plenty of ways to get data into SQL Server from websites, whether the data... Read more...

SQL Scripts Manager: An Appreciation
 SQL Scripts Manager is Simple-Talk's present to its readers. William Brewer was an enthusiastic... Read more...

Hosted Team Foundation Server 2010 Review
 Team Foundation Server (TFS) has expanded its remit to support the whole software development process,... Read more...

Beginning SQL Server 2005 Reporting Services Part 1
 Steve Joubert begins an in-depth tour of SQL Server 2005 Reporting Services with a step-by-step guide... Read more...

Ten Common Database Design Mistakes
 Database design and implementation is the cornerstone of any data centric project (read 99.9% of... Read more...

Reading and Writing Files in SQL Server using T-SQL
 SQL Server provides several "standard" techniques by which to read and write to files but, just... Read more...

Beginning SQL Server 2005 Reporting Services Part 2
 Continuing his in-depth tour of SQL Server 2005 Reporting Services, Steve Joubert demonstrates the most... Read more...

Creating CSV Files Using BCP and Stored Procedures
 Nigel Rivett demonstrates some core techniques for extracting SQL Server data into CSV files, focussing... Read more...

Over 400,000 Microsoft professionals subscribe to the Simple-Talk technical journal. Join today, it's fast, simple, free and secure.

Join Simple Talk