Click here to monitor SSC
Av rating:
Total votes: 32
Total comments: 5


Jonathan Lewis
Oracle to SQL Server: Crossing the Great Divide, Part 2
26 May 2010

A well-known Oracle expert records faithfully his struggles with the  unfamiliar : SQL Server.  He now sets himself the task of creating a table with a million rows of random data.  As one might expect, it is the lack of familiarity with the workarounds and tricks of SQL Server that trips him up.  His journey brings us fresh insights, and a glimpse at the alternative-universe of Oracle.

Data Generation Strategies

Good, scalable, database performance is largely about putting the data in the right place, being able to access it efficiently, and avoiding unnecessary overheads. At the end of the previous article, I had given myself the task of building a large data set quickly so that I could start examining some of the strategies for data placement in SQL Server 2008 to identify their strengths and weaknesses.

Initially all I wanted to do was emulate the action of an SQL statement I often use in Oracle to generate a few million rows of data, and this article is the story of my wanderings through the SQL Server landscape to find a reasonable, though not necessarily perfect, way of generating the type of data I needed.

A Million Row Table: the Oracle way

Although this was discussed in some detail in Part 1, Listing 1 demonstrates, in Oracle code, the sort of data generation code I was aiming for.

execute dbms_random.seed(0)

 

create table t1

as

with generator as (

    select      rownum      id

    from        dual

    connect by

                rownum <= 1000

)

select

    rownum                                                id,

    trunc((rownum-1)/50)                            clustered,

    mod(rownum,20000)                               scattered,

    trunc(dbms_random.value(0,20000))               randomized,

    trunc(sysdate) + dbms_random.value(-180, 180)   random_date,

    dbms_random.string('l',6)                       random_string,

    lpad(rownum,10,0)                               vc_small,

    rpad('x',100,'x')                               vc_padding

from

    generator   g1,

    generator   g2

where

    rownum <= 1000000

;

 

alter table t1 add constraint t1_pk primary key(id);

create index t1_clu on t1(clustered);

create index t1_sca on t1(scattered);

create index t1_ran on t1(randomized);

create index t1_dat on t1(random_date);

create index t1_str on t1(random_string);

Listing 1: Generating a million row table in Oracle

This example is going to give me 1,000,000 rows of data in a simple heap table, with six B-tree indexes, one of which is built implicitly by the addition of the primary key constraint. I have a primary key that has been generated by the pseudo-column rownum, a column of randomly generated strings that will be upper case ('U') and six characters in length, a column of randomly generated dates (with time to the nearest second) covering 360 days and centered on "today", and three numeric columns that are going to hold 20,000 distinct integer values with 50 rows of each value (although that's an approximation for the column called "randomized"). I needed numeric columns with different patterns in their data distribution because data location has a big impact on the work needed to access the data. I've also included a couple of simple character columns, one small, one larger, to make the row lengths look as if they might represent some realistic data and to give me something that isn't indexed that I can select or update.

Running this code on a small 2.0 GHz laptop, it takes about one minute to create the table, and about eight seconds for each index (mostly time spent re-reading table blocks). If I remove all the columns that make use of the use dbms_random package the time to create the table drops to about 7 seconds, so in practice I make sure that I don't generate any columns or indexes that I don't really need, for any given test.

Finally, it's worth noting that Oracle, unlike SQL Server, has to be explicitly instructed to collect statistics about a table after it has been created, although it does collect statistics on the indexes as you create them. As such, I've missed out a step in my code that would be critical to Oracle's Cost Based Optimizer.

In part 1 I outlined most of the problems with translating this code to SQL Server so I won't cover them again here. Many of them were little variations in syntax, function names and features and were relatively easy to overcome by diligent searching of the manuals. However, three of the issues were especially relevant to the strategies I attempted:

  • SQL Server doesn't have an equivalent of the dbms_random procedure, to generate random data
    I've yet to overcome this one; essentially I'll have to find one or write my own. However, I decided to worry about that later when I really need some randomized data. In this article, I'll just work on ideas for generating the scattered and clustered data patterns. (Update: since writing the article I've discovered that using the newid() function is a popular way to generate random data).
  • SQL Server doesn't recognize the create as select… syntax
    I have to do "create table", "insert as select ..." instead, which is a minor detail and easy to address. In addition, in the discussion on the previous article a couple of readers pointed out the option for a simple "select...into…from" to create and populate a table. This mechanism copies not null constraints but doesn't handle other constraints, and doesn't allow you to define a key as you create the table. However, that's okay with me because at present I'm not too worried about looking closely at indexes.
  • SQL Server doesn't have a connect by clause
    I had to find a convenient, alternative way of generating a rowsource to play with. As you can  see, my strategy with Oracle is to use common table expression (subquery factoring in Oracle speak) to generate a fairly small volume of data, then introduce a Cartesian join (cross product) to expand that to a much larger volume; I'll explain why later.

The following sections describe the various approaches that I attempted. Unless stated otherwise, the code is run using SQLCMD.

Strategy 1: Row-by-Row

I started with a strategy that I knew was probably a very bad idea (it certainly would be in Oracle): create an empty table and use a program loop to insert one row at a time. This is an easy way to get the counter that I need for my mod() and trunc() games.

I got a little surprise here in that the only loop construct in T-SQL seems to be WHILE...END. I also had a little trouble finding an assignment operator; initially, it seemed as if the only option was to SELECT into variables, but eventually I found the SET operator. Running a few timing tests, the two options seem to perform identically, so maybe they're effectively the same piece of code behind the scenes, anyway.

Stripped to the minimum, Listing 2 shows what I came up with.

CREATE TABLE big_table

    (

      id INT ,

      clustered_data INT ,

      scattered_data INT ,

      vc_small VARCHAR(10) ,

      vc_padding VARCHAR(100)

    ) ;

go

 

DECLARE @div INT = 50

DECLARE @mod INT = 200

DECLARE @loop INT = 1

DECLARE @limit INT = @div * @mod

 

BEGIN

    WHILE @loop <= @limit

        BEGIN

            INSERT  INTO big_table

                    ( id ,

                      clustered_data ,

                      scattered_data ,

                      vc_small ,

                      vc_padding

                 )

            VALUES  ( @loop ,

                      FLOOR(( @loop - 1 ) / @div) , -- trunc()

                      ( @loop - 1 ) % @mod ,        -- mod()

                      STR(@loop, 10, 0) ,           -- to_char()

                      REPLICATE('x', 100)           -- rpad()

                 )

 

            SET @loop = @loop + 1

        END

END

Go

Listing 2: The row-by-row approach

The functions I've used are the closest equivalents to the Oracle functions that I've referenced in the comments, but they're not exactly identical. I've set up a few variables, rather than fixing constants into the script; after all, I'm hoping to create a template that I can tweak easily to generate different volumes and patterns.

In this case, I restricted myself to 50 distinct values, 200 rows of each, for a total of just 10,000 rows. Performance was, as expected, dire: the code took 19.5 seconds to complete. Regardless of which RDBMS you're using, doing a big job in lots of little pieces is almost certain to be a very bad idea.

Interestingly, my code above is suffering from the overheads of handling some messaging for every single insert in the loop. In my reading, I stumbled across the NOCOUNT option, and by adding the SET NOCOUNT ON command to the start of Listing 2, I reduced the run time to 5.5 seconds. This is still slow, of course, but gives an interesting insight into how many little details you need to learn before you can criticize a product!

A note on NOCOUNT option: SQLCMD vs. SSMS
When running my code from SSMS, rather than SQLCMD, the run time was still 5.5 seconds regardless of whether or not I set the NOCOUNT option, and I did check that the "one rows processed" messages still appeared in the message window when NOCOUNT was off. So SSMS must be using a different method to SQLCMD?

Strategy 2: Sys.objects, a CTE and row_number

The sys.all_objects view is always going to hold a reasonable number of rows, so why not use that, doing a cross join to itself, to generate a lot of data? I could put the driving query into a common table expression (CTE) before doing the cross join, to make the code look a little simpler. I need a sequential number, so I can play around with the row_number() analytic function inside the CTE and then use a little arithmetic in the cross join.

Listing 3 puts this approach into practice. It seems to work quite well and generated 1,000,000 rows in about 13.5 seconds, running at close to 100% CPU. The definition of big_table is the same as it was in Listing 2.

DECLARE @div INT = 50 ;

DECLARE @mod INT = 20000 ;

DECLARE @limit INT = @div * @mod ;

DECLARE @driver INT = 1000 ;

 

WITH    generator

          AS ( SELECT TOP ( @driver )

                        row_number() OVER ( ORDER BY schema_id ) id

               FROM     sys.all_objects

             )

    INSERT  INTO big_table

            SELECT TOP ( @limit )

                    @driver * ( g1.id - 1 ) + g2.id ,

                    FLOOR(( @driver * ( g1.id - 1 ) + g2.id - 1 ) / @div) ,

                    ( @driver * ( g1.id - 1 ) + g2.id - 1 ) % @mod ,

                    STR(@driver * ( g1.id - 1 ) + g2.id, 10, 0) ,

                    REPLICATE('x', 100)

            FROM    generator g1

                    CROSS JOIN generator g2

    OPTION  ( FORCE ORDER ) ;

Listing 3: Cross joining sys.all_objects into a CTE

There are a couple of things I don't like about this code, though. First, it depends on a SYS object and it's possible that I'm able to see that at present purely because I'm operating as a privileged user. There's also the potential problem that the number of rows in the view may be smaller than I need for bigger tests; in my current test database it's less than 2,000 rows.

Another thing I'm not keen on is putting in that FORCE ORDER hint, which I've included to make sure that the optimizer joined the two copies of generator in the right order to allow my arithmetic to produce the expected results). I don't really know how the optimizer is going to deal with the CTE; the execution plan happens to work out the way I want at present but that might just be luck, and could change if I ask for fewer or more rows.

For a little extra clarity I like the idea of removing the repetition of the messy bit of arithmetic, involving the two ids, by putting in an inline view. I'll demonstrate that in the next example.

A note on "insert...with...select ..." instead of "with...insert...select..."
My first attempt at using a CTE to create some data ended in failure("Msg 102, Level 15, State 1, Server HPBASE, Line 2, Incorrect syntax near 'generator'". I had used the Oracle pattern, "insert...with...select ...", where the CTE is part of the select statement, rather than "with...insert...select...".

Strategy 3: A Recursive CTE

The CTE looks like a good starting point for generating a lot of data, but I'd like to avoid dependencies on anything that may be outside the permission set of the current user.

So how about a recursive CTE (a feature that didn't appear in Oracle until 11g) that doesn't query any objects at all?

DECLARE @div INT = 50 ;

DECLARE @mod INT = 20000 ;

DECLARE @limit INT = @div * @mod ;

DECLARE @driver INT = 1000 ;

 

WITH    generator

          AS ( SELECT   1 AS id

               UNION ALL

               SELECT   id + 1

               FROM     generator

               WHERE    id < @driver

             )

    INSERT  INTO big_table

            SELECT  id ,

                    FLOOR(( id - 1 ) / @div) ,

                    ( id - 1 ) % @mod ,

                    STR(id, 10, 0) ,

                    REPLICATE('x', 100)

            FROM    ( SELECT TOP ( @limit )

                                @driver * ( g1.id - 1 ) + g2.id id

                      FROM      generator g1

                                CROSS JOIN generator g2

                    ) iv

    OPTION  ( MAXRECURSION 0, FORCE ORDER ) ;

Listing 4: The recursive CTE approach

This solution performs reasonably well, generating a million rows in about 13 seconds, which is the best yet. However, I had many false starts before reaching this solution, and in fact temporarily discarded the idea completely because I couldn't make it work. On the first couple of very small test runs the code seemed to work, but it crashed when I finally set @driver to 1,000 to generate my million rows, producing the error message:

Msg 530, Level 16, State 1, Line 8

The statement terminated. The maximum recursion 100 has been exhausted before statement completion.

At the time, I didn't consider the possibility that the recursion level was a soft limit that could be set by a hint and it wasn't until a few weeks after my first attempt that I heard about the hint, during a conversation with a SQL Server expert. The recursion level can be set to any value up to 32,767, after which a setting of zero means "no limit".

The recursion level wasn't the only little problem I had with the query: on my first attempt to use the inline view I ran into this error:

Msg 156, Level 15, State 1, Line 1

Incorrect syntax near the keyword 'option'.

When I removed the option() clause, the error changed to:

Msg 102, Level 15, State 1, Line 32

Incorrect syntax near ';'

It took me 15 minutes staring at the text before I noticed that I hadn't given the inline view an alias; this is not a requirement in Oracle and it actually causes an Oracle error to be reported in some cases if you do include one. It always seems to be the smallest things that waste the most time when you're in an unknown environment!

A note on ANSI join support in SQL Server vs. Oracle
T SQL Server syntax for ANSI joins is far better than Oracle's. It was a nasty surprise when I first started to write some ANSI code and found that things like the aliases for inline joins and columns, which SQL Server allows, simply didn't work in Oracle.

Before we move on to the next strategy, you may be wondering why Listing 4 (and my original Oracle source) uses a small recursive CTE with a cross join rather than just using a single large CTE in a statement like that shown in Listing 5 (which gets rid of the FORCE ORDER hint).

WITH    generator

          AS ( SELECT   1 AS id

               UNION ALL

               SELECT   id + 1

               FROM     generator

               WHERE    id < @driver    -- set to something BIG

             )

    INSERT  INTO big_table

            SELECT  id ,

                    FLOOR(( id - 1 ) / @div) ,

                    ( id - 1 ) % @mod ,

                    STR(id, 10, 0) ,

                    REPLICATE('x', 100)

            FROM    generator g1

    OPTION  ( MAXRECURSION 0 ) ;

Listing 5: A simpler, but much slower, CTE approach

The answer is that recursion is expensive in most coding environments, so I had simply assumed it probably would be in SQL Server 2008. In my Oracle example, the basic cost of generating 1,000,000 rows jumped from 7 seconds to 11 seconds when I switched from a small join to a single use of the CTE; the SQL Server 2008 example jumped from 13 seconds to 48 seconds.

Strategy 4: A Table-Valued Function

After temporarily abandoning the recursive CTE approach, I thought I'd try another "data-free" strategy, namely a table function (known in Oracle-speak as a pipelined function). I'll start with defining the table function, as shown in Listing 6, and then explain the rationale.

CREATE FUNCTION table_function ( @i_in AS INTEGER )

RETURNS @table_result TABLE ( id INT )

AS

    BEGIN

        DECLARE @loop INT = 1 ;

        WHILE @loop <= @i_in

            BEGIN

                INSERT  INTO @table_result

                VALUES  ( @loop ) ;

                SET @loop = @loop + 1 ;

            END ;

        RETURN

    END

Go

Listing 6: The TVF approach

The table function has a return type which, as its name suggests, is a table and I have, in effect, declared a variable of type table and then written a simple loop that inserts rows into this table (in a variant of this test, I used the recursive CTE mechanism to generate the result set for the table function). The table name is @table_result, which indicates, through the @ symbol, that the table is a form of temporary table.

A note on temporary tables
One of the comments on the first article in this series pointed out that there are three different types of temporary table: #table_name for local scope, ##table_name for global scope, and @table_name for inner-local scope, so I've got plenty of work to do on checking how they work,what they cost , and whether I've used the right type here.

To use this function, I simply need to tweak the definition of my CTE, as shown in Listing 7.

WITH    generator

          AS ( SELECT   id

               FROM     table_function(@driver)

             )

    INSERT  INTO big_table

            SELECT  id ,

                    FLOOR(( id - 1 ) / @div) ,

                    ( id - 1 ) % @mod ,

                    STR(id, 10, 0) ,

                    REPLICATE('x', 100)

            FROM    ( SELECT TOP ( @limit )

                                @driver * ( g1.id - 1 ) + g2.id id

                      FROM      generator g1

                                CROSS JOIN generator g2

                    ) iv

    OPTION  ( FORCE ORDER ) ;

Listing 7: Using the TVF with the generator CTE

This code took 15 seconds to complete, which is 2 seconds longer than the recursive CTE. I had assumed that the difference would be virtually non-existent, but perhaps the 2 seconds is the impact of the temporary table, or a difference due to the memory calls made for a table function. If this wasn't just for a quick and dirty data builder I'd need to find out why the difference was as large as it was. In the meantime, though, I'll just put this anomaly on my to-do list.

Again it's worth a quick check to see what the impact would be of changing the cross-join into a single call to the table function for 1,000,000 rows: the answer is similar to the example with the recursive CTE; the run time changed from 15 seconds to 48 seconds. Again, we see that building a large object in memory can be very CPU intensive, while the SQL approach, in effect generating and consuming in a continuous stream, is much more efficient.

Strategy 5: Using a Temp Table with the CTE

At this point, two things occurred to me. First, I really needed to know how SQL Server was generating the data, so an investigation into generating and reading execution plans had become very important. Secondly, it struck me that I had been thinking too much in terms of Oracle-style solutions.

I had to try to think in terms of what might be a natural "SQL Server way" of attacking this problem, and probably there were a lot of SQL Server-style things I should have done before I got to this point (I'm sure you'll let me know!).

Perhaps if I stopped playing around with recursive CTEs and table functions I could do something more efficient. The code in Listing 8 looks very similar to Listing 4 but there is an important intermediate step that I thought might make a difference.

WITH    generator

          AS ( SELECT   1 AS id

               UNION ALL

               SELECT   id + 1

               FROM     generator

               WHERE    id < @driver

             )

    SELECT  *

    INTO    #generator

    FROM    generator

OPTION  ( MAXRECURSION 0 ) ;

 

INSERT  INTO big_table

        SELECT  id ,

                FLOOR(( id - 1 ) / @div) ,

                ( id - 1 ) % @mod ,

                STR(id, 10, 0) ,

                REPLICATE('x', 100)

        FROM    ( SELECT TOP ( @limit )

                            @driver * ( g1.id - 1 ) + g2.id id

                  FROM      #generator g1

                            CROSS JOIN #generator g2

                ) iv

OPTION  ( FORCE ORDER ) ;

 

go

Listing 8: Using a temp table with the CTE

The difference is that I've created a temporary table to hold my driving data set. One of the mantras of the Oracle enthusiast is that "you don't need temporary tables" (it's not true of course, but we usually hide them behind CTEs and non-mergeable views). In contrast, I'd noticed that in SQL Server they are quite openly and commonly used. With this temporary table in place I can be confident (I think!) that I won't be re-generating a recursive CTE 1,000 times, or calling a table function 1,000 times. So, if any of my strategies so far was going to be noticeably quicker than the rest, my money was on this one. I was wrong. In fact, this version of the code took slightly longer than all the other versions. I have no idea why, of course, but maybe it's just something to do with a slight difference in the way the temporary table is used in this case; time to start learning about XEvents, perhaps?

Concluding notes

After a few hours of playing around, I've got a way of generating useful volumes of data. I haven't found a mechanism that is quite as fast as Oracle, but the timing is in the same ballpark, and when I got around to adding the PRIMARY KEY constraint and other indexes, the timing advantage swung the other way.

An interesting point here is that in Oracle, I added the PRIMARY KEY constraint with a single command but in SQL Server I had to add a NOT NULL constraint first. Is this a significant difference? When you know where to look (and I do when I'm using Oracle) you can uncover all sorts of interesting details. In fact, behind the scenes, adding a PRIMARY KEY constraint in Oracle requires two tablescans, one to add the NOT NULL constraint and one to create the index. So the database goes through the same amount of work as SQL Server, but Oracle is hiding two actions behind one command.

Of the various mechanisms I've tried, I'm going to use the recursive CTE of strategy 3 in the future, simply because it's the most self-contained of the options I've considered.

I still need to sort out a reasonable method of generating pseudo-random data though, and I am curious about the join method that SQL Server has selected for the cross join that I've been using in all my non-procedural examples, so it's nearly time to start looking closely at execution plans.

I'll finish off with a couple of points about differences and similarities between the million row tables built by Oracle and SQL Server:

  • In Oracle my table spanned 18,261 blocks (pages) of 8KB while SQL Server spanned only 16,950 pages
    Note to self: by default Oracle leaves 10% free space in a block for row lengths to increase on update, does SQL Server do something similar by default? (It's got a fillfactor for indexes)
  • In Oracle, the primary key index spanned 2,087 blocks (again with 10% free space; the equivalent of "fillfactor 90 pad_index off" in SQL Server) while SQL Server spanned 2,228 pages. Note to self: is this because of a different storage strategy, or because of a different free space allocation?

In the next couple of articles I will be looking more closely at where my data is, and how SQL Server can choose to access it - because without this type of information it's not possible to design an efficient database.


If all you want to do is to generate test data in a table incredibly quickly .... 
“I generated better data in only seconds…” Michael Gaertner of Quintech was impressed with SQL Data Generator, a tool for populating databases with realistic test data. Download your free trial of SQL Data Generator.



This article has been viewed 7302 times.
Jonathan Lewis

Author profile: Jonathan Lewis

Jonathan Lewis is well-known in the Oracle world as a freelance consultant with 22 years of experience with the Oracle RDBMS engine. His specialist skills are in the area of physical database design, and solving performance issues. Despite the differences in the software, he finds that the fundamental principles of solving performance issues don't really seem to change as you move from Oracle to SQL Server.

Jonathan is the author of 'Cost Based Oracle – Fundamentals' published by Apress, and 'Practical Oracle 8i – Designing Efficient Databases' published by Addison-Wesley, and has contributed to three other books about Oracle. His blog is at http://jonathanlewis.wordpress.com, where his first note on SQL Server 2008 appears at: http://jonathanlewis.wordpress.com/2010/02/04/sql-server/

Search for other articles by Jonathan Lewis

Rate this article:   Avg rating: from a total of 32 votes.


Poor

OK

Good

Great

Must read
 
Have Your Say
Do you have an opinion on this article? Then add your comment below:
You must be logged in to post to this forum

Click here to log in.


Subject: Re: Crossing The Great Divide
Posted by: Jeff Moden (view profile)
Posted on: Saturday, May 29, 2010 at 11:04 PM
Message: I have a real live appreciation for what your going through and it's proof positive that just because you know one dialect of SQL doesn't mean that you can jump right in on another dialect of SQL. I went through much the same agony trying to make the switch from SQL Server to Oracle several years ago.

On the subject of sys.Objects or sys.All_Objects not having enough rows; that would be true for most new databases. However, even on a brand new installation, Master.sys.all_columns will have at least 4000 rows in it and a simple Cross Join will easily render up to 16 million rows. The addition of a second Cross Join will render 64 billion rows which is more than most people will be willing to wait for.

Another example is that although SQL Server doesn't have a dedicated programable random number generator built in, the code to produce constrained random numbers is quite easy using a bit of math.

Combining the two methods above will allow you to easily render a million row test table in scant seconds.

The lesson is (or at least what I'm taking away from your article) is that "SQL is not SQL" and that to be really good at your craft, you need to know more than just what books can offer on the subject.

As a side bar, recursive CTE's (known as "Subquery refactoring" in Oracle) are just another form of RBAR and I strongly recommend staying away from them for data generation and just about everything else for that matter.

Finally, the following code contains parts of my standard test table generator... you may find it useful in crossing the great divide. ;-)

  
--===== Create and populate a 1000000 row test table.
  -- Column "RowNum" has a range of 1 to 100,000 unique numbers
  -- Column "SomeInt" has a range of 1 to 50,000 non-unique numbers
  -- Column "SomeLetters2" has a range of "AA" to "ZZ" non-unique 2 character strings
  -- Column "SomeMoney has a range of 0.0000 to 99.9999 non-unique numbers
  -- Column "SomeDate" has a range of >=01/01/2000 and <01/01/2010 non-unique date/times
SELECT TOP 1000000 --Takes about 12 seconds on an 8 year old single p4 1.8 GHZ
      
SomeID = IDENTITY(INT,1,1),
      
SomeInt = ABS(CHECKSUM(NEWID()))%50000+1,
      
SomeLetters2 = CHAR(ABS(CHECKSUM(NEWID()))%26+65)
      +
CHAR(ABS(CHECKSUM(NEWID()))%26+65),
      
SomeMoney = CAST(ABS(CHECKSUM(NEWID()))%10000 /100.0 AS MONEY),
      
SomeDate = CAST(RAND(CHECKSUM(NEWID()))*3653.0+36524.0 AS DATETIME)
      
INTO dbo.JBMTest
  
FROM MASTER.dbo.SysColumns t1,
      
MASTER.dbo.SysColumns t2 --Lack of join criteria makes this a CROSS-JOIN
  --===== A table is not properly formed unless a Primary Key has been assigned
  -- Takes about 6 seconds to execute.
ALTER TABLE dbo.JBMTest
      
ADD PRIMARY KEY CLUSTERED (SomeID)
      

Subject: .
Posted by: Jeff Moden (view profile)
Posted on: Saturday, May 29, 2010 at 11:05 PM
Message: And I apologize for the mess this forum made of my otherwise nicely indented code.

Subject: Hints
Posted by: Emtucifor (view profile)
Posted on: Wednesday, June 02, 2010 at 12:59 PM
Message: --Hint 1:
IF OBJECT_ID('dbo.big_table', 'U') IS NOT NULL DROP TABLE big_table;

--Hint 2 (likely surprises about varchar and char):
PRINT CASE WHEN '' = ' ' THEN 1 ELSE 0 END
PRINT
CASE WHEN '' IS NULL THEN 1 ELSE 0 END
PRINT
CASE WHEN CONVERT(CHAR(3), 'a  ') = CONVERT(VARCHAR(3), 'a  ') THEN 1 ELSE 0 END
SELECT
  
qa = '"' + a + '"',
  
la = LEN(a),
  
dla = DATALENGTH(a),
  
dlta = DATALENGTH(RTRIM(a)),
  
qb = '"' + b + '"',
  
lb = LEN(b),
  
dlb = DATALENGTH(b),
  
dltb = DATALENGTH(RTRIM(b))
FROM (
  
SELECT CONVERT(VARCHAR(3), 'a  '), CONVERT(CHAR(3), 'b  ')
  
UNION ALL SELECT CONVERT(VARCHAR(3), 'a'), CONVERT(CHAR(3), 'b')
)
X (a, b)
-- notice alternate alias syntax of "alias = "

--Hint 3:
DECLARE
  
@div INT = 50,
  
@mod INT = 20000,
  
@limit INT = @div * @mod,
  
@driver INT = 1000;
/* Note: The next two queries take the same amount of time on my box. CTEs are not always bad, and I do not think a blanket statement that it's best to stay away from CTEs can be well supported. */

--Hint 4:
WITH
  
L0   AS(SELECT 1 AS c UNION ALL SELECT 1),
  
L1   AS(SELECT 1 AS c FROM L0 AS A, L0 AS B),
  
L2   AS(SELECT 1 AS c FROM L1 AS A, L1 AS B),
  
L3   AS(SELECT 1 AS c FROM L2 AS A, L2 AS B),
  
L4   AS(SELECT 1 AS c FROM L3 AS A, L3 AS B),
  
L5   AS(SELECT 1 AS c FROM L4 AS A, L4 AS B),
  
generator AS(SELECT ROW_NUMBER() OVER(ORDER BY c) AS id FROM L5)
INSERT big_table -- INTO not required
SELECT
  
id,
   (
id - 1) / @div, -- integer division produces integers without floor()
  
(id - 1) % @mod,
  
STR(id, 10), -- conversion to varchar from int can be implicit if you don't need padding
  
REPLICATE('x', 100)
FROM generator
WHERE id <= 1000000;

--Hint 5:
WITH generator AS (
  
SELECT TOP 1000000
      id
= Row_Number() OVER (ORDER BY a)
  
FROM
      
(SELECT a = 1 FROM MASTER.dbo.syscolumns) C1
      
CROSS JOIN MASTER.dbo.syscolumns C2
)
INSERT big_table
SELECT
  
id,
   (
id - 1) / @div,
   (
id - 1) % @mod,
  
STR(id, 10),
  
REPLICATE('x', 100)
FROM generator;

Subject: P.S.
Posted by: Emtucifor (view profile)
Posted on: Wednesday, June 02, 2010 at 1:00 PM
Message: I stole the generator code from Itzik Ben-Gan http://sqlservercode.blogspot.com/2007/04/interview-with-itzik-ben-gan-author-of.html

Subject: Comments
Posted by: Jonathan Lewis (view profile)
Posted on: Saturday, June 12, 2010 at 5:23 AM
Message:
Jeff,

Thank you for your comments - the newid() as random number generator is particularly useful. I was interested to see your use of the checksum() function since I couldn't find it in my local copy of BOL (which I had thought was the most recent version). With the right input, though, a quick Google search found the right reference page. At some stage I may do some experimentation to discover how random the result really is - but at first sight it certainly seems "random enough" to be a reasonable model of real data.

<i>"recursive CTE's (known as "Subquery refactoring" in Oracle) are just another form of RBAR and I strongly recommend staying away from them for data generation"</i>

I find it a little strange that you say this, when my first example with a recursive CTE was MUCH faster than the "row by agonising row" approach (often called "slow by slow" in the Oracle world) and slightly faster than the cross join on a subset of sys.all_columns.

It would be interesting to see how recursive CTEs compare (in complexity of code as well as performance) with strategies that use the HierarchyId type.


Emtucifor,

Thanks for your collection of hints - I wonder how many other ways there are of dropping a possibly non-existent table without raising an error. It might be quite entertaining to set a challange on that.

NULLs are a major source of errors in the Oracle world - so it's nice to have the warning that nulls and empty strings are not the same thing: and to check carefully exactly how trailing blanks behave in char/varchar/constant comparisons.

There are a number of interesting details in the examples you've given to generate a large data set ; but I have to say that one of the things I was trying to avoid was building (and sorting) a large result set in memory - and both your examples do this with the "rownumber() over" analytic function.


Anyone else:
I left a note to myself asking about a "fillfactor" for tables: <i>"by default Oracle leaves 10% free space in a block for row lengths to increase on update, does SQL Server do something similar by default? (It's got a fillfactor for indexes)"</i>

If anyone can give a definitive answer I'd love to hear it. (And if the answer is NO, I'd love to know why not.)



 










Phil Factor
Automated Script-generation with Powershell and SMO
 In the first of a series of articles on automating the process of building, modifying and copying SQL Server... Read more...



 View the blog
SQL Source Control: The Development Story
 Often, there is a huge difference between software being easy to use, and easy to develop. When your... Read more...

How to Import Data from HTML pages
 It turns out that there are plenty of ways to get data into SQL Server from websites, whether the data... Read more...

SQL Scripts Manager: An Appreciation
 SQL Scripts Manager is Simple-Talk's present to its readers. William Brewer was an enthusiastic... Read more...

Hosted Team Foundation Server 2010 Review
 Team Foundation Server (TFS) has expanded its remit to support the whole software development process,... Read more...

The Parodist: A SQL Server Application
 Every year, we ask Phil Factor to celebrate the holiday season with an article on SQL Server... Read more...

Beginning SQL Server 2005 Reporting Services Part 1
 Steve Joubert begins an in-depth tour of SQL Server 2005 Reporting Services with a step-by-step guide... Read more...

Ten Common Database Design Mistakes
 Database design and implementation is the cornerstone of any data centric project (read 99.9% of... Read more...

Reading and Writing Files in SQL Server using T-SQL
 SQL Server provides several "standard" techniques by which to read and write to files but, just... Read more...

Beginning SQL Server 2005 Reporting Services Part 2
 Continuing his in-depth tour of SQL Server 2005 Reporting Services, Steve Joubert demonstrates the most... Read more...

Creating CSV Files Using BCP and Stored Procedures
 Nigel Rivett demonstrates some core techniques for extracting SQL Server data into CSV files, focussing... Read more...

Over 400,000 Microsoft professionals subscribe to the Simple-Talk technical journal. Join today, it's fast, simple, free and secure.

Join Simple Talk