Av rating:
Total votes: 16
Total comments: 19


Arthur Fuller
The Database From Hell
19 February 2008

The Database From Hell

In which Arthur Fuller comes to the aid of a friend who is doing his best to administer The Database From Hell, and manages to set up a reasonable test database with which to solve some of the headaches that only a 65 million row, 600 column, table can cause.

This article is about two subjects: a database administered by a friend and colleague that we affectionately call the Database From Hell, and Red Gate’s forthcoming SQL Data Generator, currently in beta release. Before we get to talking about The Database From Hell, a brief word about SQL Data Generator and the database:

Data Generator, as its name implies, is all about manufacturing data to test performance, scalability and validity. Beta release 1 already boasts some wonderful features, and more are promised for the next round.

The 65 million row, 600 column, table

The Database From Hell is a marketing database whose real-world implementation comprises a single table with 600+ columns and 65 million rows. Its data derives from dozens of questionnaires. This single table stores respondent information (name, address, zip code, etc.) along with the answers to various questionnaires. Most of the 600+ columns are char(1) or char(2); some are dates recorded as character. The original data comes from a mainframe, gathered from dozens of sources.

Typically, a client wants a list of people who match some combination of attributes. For example, the client might want to launch a campaign whose target is the group of black or Hispanic Americans who are married, own a BMW, have purchased a n HDTV, have more than two children, and make more than $90,000 per year. The variations are almost endless. All these columns contain codes. All are nullable.

My own version differs somewhat from the real-world database. For one thing, I broke the single table into three, called Respondents, Questionnaires and RespondentAnswers. The latter has foreign keys pointing to the former two tables. I also cut down the number of columns, and changed all the char(1) columns to bit columns, primarily to save disk space.

Even in my smaller version, indexing poses a problem. My friend’s approach has been to create a covering index for each new query posed by his client. (A covering index includes all the columns of interest, plus the primary key.) This potentially raises the problem that a very detailed query might use too many columns to be indexed. In addition, every query demands the time taken to create the covering index.

The Test version with only 20 million rows

I don’t have sufficient disk space to create the table on my development machine, so I created a smaller version, with only 40-something columns and only 20 million rows. But I had an idea for a unique approach to this problem, and I wanted proof of concept.

Data Generator to the rescue

Data Generator to the rescue! It can build the required rows very quickly, and then you’re in a position to answer some important questions.

I should point out that Data Generator is actually a collection of data generators, each tailored toward some specific kind of data. For each column in your table(s), you assign a generator. As shipped, the product includes the following generators, each grouped according to their domain and purpose:

Group

Name

Description / Examples

Personal

 

 

 

Color

AliceBlue, AntiqueWhite

 

Contact Title

Marketing Manager, Owner, Partner

 

Email

Generated email address

 

Email (internet-safe)

A1.example.com

 

Extension

1234

 

First Name

Arthur. Alice, Brian, Betty

 

First Name (female)

Ava, Bette, Charlene

 

First Name (male)

Allan, Boris, Charles

 

Initials

F. ,B., HR

 

Last Name

Smith, Browne, Clarke, Davis

 

Name (full)

Leonard di Capreo

 

Nick Name

Chris, Tommy, JLo

 

Phone Number

(310) 555-1212, 212-4365

 

Social Security Number

950-95-8060

 

Title

Mr., Mrs., Ms., Dr.

Geographical

 

 

 

Address Line (Street #)

123 W. Second Ave.

 

Address Line 1

Suite 208, Seventh Floor, Bldg. A.

 

Country

Long name of country (Canada)

 

Country Code

3-letter ISO code (USA)

 

Country Code Short

2-letter ISO code (US)

 

P.O. Box

P.O. Box 139, PO Box 140

 

Region

IK-MU, NC-JY

 

US City

New York, Los Angeles, Chicago

 

US State (long name)

California, New York

 

US State code

CA, NY, KT

 

ZIP Code

83243, 96112

 

ZIP+4 Code

96212-1234

Generic

 

 

 

File List

Generates list from a given file

 

Regex Generator

Used for formatted data

 

SQLColumn

Reads data from a SELECT statement

 

String Generator

Generates strings

Shopping

 

 

 

Product Name

Deluxe Frommelgrammet

Payment

 

 

 

Credit Card Date

09/08

 

Credit Card Issue Number

3, 1, 5, 7, 6

 

Credit Card Number

Generated 16-digit number

 

Currency Code

USD, CAN, EUR

Business

 

 

 

5-digit IDs

67815, 98765

 

Account Number

RET36662

 

Company Name

Sipzapon Corporation, ABC Inc.

 

Department

Sales, Service, Finance

 

Folder (DOS)

C:\Program Files\Red Gate

 

Folder (Web)

http://pen.net/index.html

 

Sector

Aerospace, Horticulture, Transport

 

Source of Leads

Sales, Word of Mouth, Referral

 

URL

http:www.artfulsoftware.com/default.aspx

 

Despite its variety, if you’re anything like me, you’ll immediately spot holes in the list. For example, Canadian Social Insurance Numbers are formatted differently than US Social Security Numbers. Fortunately, you can create your own custom generators and add them to the list. For more information, see David Connell’s article and C# project (how to write a generator for sql data generator 1.0 beta).

Upon loading the program, the first thing you do is establish your connection to a server and a database. Then Data Generator creates a new Project called Project1. (In my experience, this leads to unnecessary keystrokes. If you mean to use an existing project, you must select File | Open, then respond to the dialog that asks if you want to save the new project, then you can open your old project. I would prefer the program to run without opening a default project, leaving me to create a new one or select from the list of recently used projects.)

With this information, Data Generator reads the database and presents a list of its tables. You select a table and specify table-level settings, notably the data source (generate the data, or use an existing data source such as a table or query) and the number of rows to generate.

Next, you expand the table’s column list, and specify various attributes to apply to the selected column. The attributes you see depend on the nature of the column. You can choose among the data generators relevant to the column type. Depending on how thoroughly you specified the column values and table relationships in your database, some generation-choices may be pre-defined (as in the case of identity keys and also foreign keys). In the case of nullable columns, you may also specify the percentage of nulls you want in the generated data.

When you have specified all the column properties as thoroughly as you wish, you save the project. Next, you can specify project settings, including scripts to run before and after data generation. In the case of the Database From Hell, I couldn’t figure out any way to populate a particular column with either a trigger or a calculated column formula, so I added a post-generation script that called a user-defined function and updated the column.

Wrestling with the Test Database From Hell

My version of the database from Hell was not identical to the real thing. For design reasons, I made some changes. I also used only 50 “answer” columns, due primarily to lack of disk space There are three tables in my version:

  • Respondents: a list of people, 10 million rows.
  • Questionnaires: a list of questionnaires, 1000 rows.
  • RespondentAnswers: one row per respondent-questionnaire combination, containing foreign keys to the other two tables and all the bit and char columns; 20 million rows.

On my hardware, generating this many rows took half an hour, including the time it took to run the post-generation script to update the “special” column. The manufactured data was flawless. All my constraints were respected. For variety’s sake, I supplied a different percentage of nulls for all the bit and char(1) columns.

We could take several approaches in querying this table:

Build up a WHERE clause that mentions the values of interest in any combination of the columns.

Create a covering index just before running each new query, including in its key the columns of interest. There’s a problem, here, of course – we can’t index a bit column. But we could translate them into char(1) columns whose possible values are Y, N and null. This lets us an index that covers the RespondentID and all the columns of interest, but does not escape the need to recreate the index for each new query. But at least the index would quickly isolate the rows of interest.

After trying the first two approaches, I remained dissatisfied with the results, so I decided to try something else, and came up with the idea of the “special” column, an amalgam of the values in the bit and char(1) and char(2) columns. To achieve this, I created a function that examines a bit column and returns one of three character values – Y, N or space. Actually, the UDF I created accepts 40 parameters, all the bit columns in the table. It then examines each value and assembles a 40-character string consisting of some combination of Ys and Ns and space characters. The “special” column is then updated with this result string.

There are three important gains in this approach. First, only one index needs to be created, no matter what the query specification is. Second, it nicely avoids the maximum number of columns that can be included in an index. Finally, all the required values are already present in the index, which eliminates the need to perform a table scan. How? By using the LIKE operator.

An example query:

SELECT  RespondentAnswerID,

        RespondentID,

        QuestionnaireID,

        Answer0

FROM    dbo.RespondentAnswers

WHERE   Answer0 LIKE 'Y_NN_?N%'

-- no rows returned; 13 seconds

And another:

SELECT  RespondentAnswerID,

        RespondentID,

        QuestionnaireID,

        Answer0

FROM    dbo.RespondentAnswers

WHERE   Answer0 LIKE 'NNN NYYY%'

-- 3913 rows returned in 8 seconds

For brevity, these examples address only the first several “columns”, but the idea holds for the entire actual column.

In terms of performance, it’s not entirely surprising that the fastest queries specify values of interest for the starting positions within the column. The worst-case queries would not care about the values contained in the beginning columns, and interrogate values near the end of the column. However, even these queries do not need to visit the actual table.

Listing 1 shows the function that I wrote to populate the “special” column.

Data Generator offers the ability to run a pre- and post-generation script. I ran the following update statement afterwards as a post-generation script.

UPDATE  dbo.RespondentAnswers

SET    

Answer0 = dbo.BitToChar_ufn(

                   Answer1, Answer2, Answer3, Answer4, Answer5

                  ,Answer6, Answer7, Answer8, Answer9, Answer10

                  ,Answer11, Answer12, Answer13, Answer14, Answer15

                  ,Answer16, Answer17, Answer18, Answer19, Answer20

                  ,Answer21, Answer22, Answer23, Answer24, Answer25

                  ,Answer26, Answer27, Answer28, Answer29, Answer30

                  ,Answer31, Answer32, Answer33, Answer34, Answer35

                  ,Answer36, Answer37, Answer38, Answer39, Answer40

                  )

It took slightly over an hour and twelve minutes to populate the column with 20 million values, but as mentioned, I only needed to do this once. Thereafter any of my queries on any combination of the columns could exploit the single index.

The index is a compound index, using the Answer0 column plus the FK pointing to the Respondents table. Thus there is never a need to visit the actual table: all the information we need resides in the index.

Not surprisingly, retrieval times varied with the distance from the left end of the columns of interest. Conversely, values toward the left end of the string resulted in much quicker retrievals. Even the slowest queries, however, were substantially faster than a full table scan.

so?

Obviously, not every database can take advantage of the approach outlined here, but this sort of requirement comes up quite often in databases geared toward marketing applications. If you have such a database, you may find this technique a significant performance booster.

The illustration for this article was derived from some of the illustrations from Eric Liberge's excellent 'Tonnerre Rampant' from Soleil and are used with permission. Many thanks to Eric for his generosity

 

 

 



This article has been viewed 4824 times.
Arthur Fuller

Author profile: Arthur Fuller

Arthur Fuller has been developing database applications for more than 20 years. He frequently works with Access ADPs, Microsoft SQL 2000 and 2005, MySQL, and .NET.

Search for other articles by Arthur Fuller

Rate this article:   Avg rating: from a total of 16 votes.


Poor

OK

Good

Great

Must read
 
Have Your Say
Do you have an opinion on this article? Then add your comment below:
You must be logged in to post to this forum

Click here to log in.


Subject: a couple more ideas
Posted by: Matt H (view profile)
Posted on: Thursday, February 21, 2008 at 8:43 AM
Message: Interesting article! Got me thinking about a couple more things you could try.

1. To query the questions toward the end more quickly, you could also create a column that reverses the string you created. Then, reverse the string in the where clause, and it's as fast as searching questions at the beginning.

2. Even with the above suggestion, it's only efficient to query the beginning or the end. What about creating another string, this one much longer, of the form '1Y 2N 3N 4_ 5Y 6Y'? Then, use full-text indexing and treat each response you're interested in as a search term. I haven't tested the performance of this approach myself, but my suspicion is that it could come close to the covering index approach, without the headaches of creating a new index for each query.

Subject: it is possible to index a bit column
Posted by: Rohit (not signed in)
Posted on: Tuesday, February 26, 2008 at 6:59 AM
Message: it is possible to index a bit column though its pointless to do so unless your data is heavily skewed in favour of one of the values

check this out --
http://msmvps.com/blogs/greglow/archive/2004/12/20/26432.aspx

Subject: Building on Matt's idea
Posted by: Arun Philip (not signed in)
Posted on: Wednesday, February 27, 2008 at 8:22 AM
Message: Building on Matt's idea, I was thinking of using computed columns that start from midway through the text.

So, if the answer column has data 'ABCDEFG...', we could have computed column 1 as CDEF... and cc 2 as 'EFG...'

But yes, I like the whole idea of combining data in this manner (data purists may not like it though!)

Subject: Lots of columns
Posted by: Anthony (Abev) (not signed in)
Posted on: Wednesday, February 27, 2008 at 9:12 AM
Message: I get scared if I have a table > 20 columns :) But 600!? I have a lot to learn...

Subject: No!
Posted by: Flavio (not signed in)
Posted on: Wednesday, February 27, 2008 at 11:01 AM
Message: Anthony, you don't have a lot to learn!! This is completly bad practices... this table should be vertically partitioned. probably by demographics.All forms have sections and each section could be represented as a different table in the database. If someone really needed the "all-together-crepy-table" just create a view :)

Subject: bitwise
Posted by: Roel (not signed in)
Posted on: Thursday, February 28, 2008 at 4:05 AM
Message: Why not convert the answer to an (big)int column, representing the bitvalue, eg '0110' = 5

querying can now be done like :
select 1
where (answer & 4) = 4

I am curious what that would do with performance!

Subject: about bitwise
Posted by: Anonymous (not signed in)
Posted on: Thursday, February 28, 2008 at 4:18 AM
Message: bigint supports up to 64 bits-answers - so, if you have more answers you should use several columns to keep the date
also, to preserve the logic you need 2 operation - to check whether needed bits are set and to check whether unneeded bits are unset, eg (answer&a) = a and (answer&b)=0
but by the way - bit representation simplifiers task "if any of answer is true/false" - (answer&4) <> 0 - for any answer true and (answer&4) <> 4 - for any answer is false.

Subject: about bitwise
Posted by: Anonymous (not signed in)
Posted on: Thursday, February 28, 2008 at 4:26 AM
Message: bigint supports up to 64 bits-answers - so, if you have more answers you should use several columns to keep the date
also, to preserve the logic you need 2 operation - to check whether needed bits are set and to check whether unneeded bits are unset, eg (answer&a) = a and (answer&b)=0
but by the way - bit representation simplifiers task "if any of answer is true/false" - (answer&4) <> 0 - for any answer true and (answer&4) <> 4 - for any answer is false.

Subject: about bitwise
Posted by: Roel (not signed in)
Posted on: Friday, February 29, 2008 at 4:33 AM
Message: It was just a quick thought. Indeed max. 64 bit.

I am curious about a reply from Arthur, his comment on trying to store/query bitwise.

Subject: re: about bitwise
Posted by: Andrew Clarke (view profile)
Posted on: Saturday, March 01, 2008 at 5:40 AM
Message: Roel,
Arthur is struggling with his login to Simple-talk, which we are trying to fix. He is very much wanting to reply though! It may have to be 'anonymous' though!
Andrew

Subject: MDDB?
Posted by: Tad Richard (not signed in)
Posted on: Saturday, March 01, 2008 at 8:37 AM
Message: Am I missing something?

Isn't this "Database From Hell" a poster child for using a Multidimensional Database (i.e. using SQL Server Analysis Servises)?

Why keep it in a standard relational model?

Subject: Bitwise approach
Posted by: Anonymous (not signed in)
Posted on: Saturday, March 01, 2008 at 6:31 PM
Message: This is from Arthur, the author of said piece. First off, apologies for my tardiness. This has been an insane week, first with the launch of Server 2008, VS.Net 2008 and the February CTP of SQL 2008; all this followed by CodeCamp. I've been a busy beaver.

I will definitely investigate the bitwise approach. In the real-world example, I can see that at best it's only going to be a partial solution. My smaller example addressed only bit columns, but the real db contains numerous char(1) and char(2) columns as well; they could be aggregated along the lines in the article, and all the bit columns deal with as you suggest. I will definitely check this out and report on results. This will be even more fun now that I finally have a 64-bit OS and 64-bit SQL 2008 to play with!

Arthur

Subject: Indexing a bit column
Posted by: Anonymous (not signed in)
Posted on: Saturday, March 01, 2008 at 6:34 PM
Message: As rohit points out, it is possible to index a bit column. I actually posted a correction to the article but too late, and the sentence went in uncorrected. What I intended to say was exactly as rohit phrased it: while it's possible to index a bit column, unless its values are widely skewed in one direction or the other, it's pointless.

Arthur

Subject: Bitwise approach
Posted by: Anonymous (not signed in)
Posted on: Sunday, March 02, 2008 at 1:28 PM
Message: Upon thinking this over, I cannot see how it escapes the core problem that the bit columns are nullable, and need to be. Consider a question such as "Do you own a dual-layer DVD burner?" and suppose the respondent does own a burner but doesn't know whether it can handle dual-layer media. She can't answer Yes or No; she skips the question. The null value in said column is meaningful -- it is neither yes nor no. This poses a significant problem to the bitwise approach; as far as I can see, it's a show-stopper. It does raise the question of whether a base 3 system might work. I'm giving that one some thought but so far haven't an answer.

Arthur

Subject: To Anthony and others who questioned 600 columns
Posted by: Anonymous (not signed in)
Posted on: Sunday, March 02, 2008 at 1:38 PM
Message: No, Anthony, the only thing to learn here is that sometimes you don't get to design the database -- you simply inherit it and must make do.

As I indicated in the opening, I didn't create this database, and in fact neither did my friend and colleague. Rather, he inherits a dump from a mainframe once a month and installs that, overwriting last month's version. It arrives as a single flat file. In my version I broke out the respondent data and the questionnaire data, leaving the core table with only an FK to both of these tables.

For what it's worth, I also looked at changing the model more dramatically, and creating a row for every not-null bit-column per respondent. It's extremely rare that any given respondent would fill in even 30% of the columns, let alone half or more, so the number of rows in this table wouldn't get inordinately huge. But I soon gave up this avenue of inquiry, since it inescapably leads to HAVING clauses.

Arthur

Subject: Vertical DB instead?
Posted by: Joe Celko (not signed in)
Posted on: Monday, March 03, 2008 at 8:14 AM
Message: Look at Vertica, Sand, WX2 and Sybase offering for columnar SQL products, They are great for sparse data like,

Subject: Bitwise approach
Posted by: Roel (not signed in)
Posted on: Monday, March 17, 2008 at 4:22 PM
Message: Hi Arthur,

I am still curious if the bitwise approach did help out... hope i'll see a post someday. I'll keep on checking this article every now and than.

Regards,

roel

Subject: spell check the table
Posted by: Anonymous (not signed in)
Posted on: Wednesday, May 28, 2008 at 1:06 PM
Message: build a word of bits for each row by checking the "finite" values (lt 5) then for each word do a hash coding to a number and turn on that bit in a long string. The "pre" query can create the hash-code and answer in the negative very quickly

Subject: very good
Posted by: satishallamsetty (view profile)
Posted on: Wednesday, August 27, 2008 at 2:27 AM
Message: message undersating is deficulty, so u can give more eamples