Click here to monitor SSC
  • Av rating:
  • Total votes: 20
  • Total comments: 4
Steven Sanderson

SQL Data Generator: A Case Study

07 January 2008

Steve Sanderson, a Red Gate developer, wonders why he no longer fliches at the thought of the laborious task of stocking his test databases with lashings of convincing test data. Its because he's managed to get a preview copy of the Beta SQL Data Generator.

Most developers are far too busy with code to pay much attention to minor details like testing their applications with realistic volumes of data. Having no effective way to obtain realistic test data, it's easier to dodge the task than to tackle it properly.

Historically, the common trick has been to use old copies of live data for testing, but there are three key problems with this approach:

  1. Company policy and privacy legislation can prohibit this inappropriate use of customers' personal data. It also exposes a risk of mistakenly issuing real emails to customers or accidentally invoking live credit card transactions.
  2. You're testing for last year's data volume instead of next year's.
  3. No old data exists if you're writing a brand new product or feature.

So using real data isn't a good option, but the obvious alternative of generating data programmatically isn't that much more attractive. The task of writing data generation scripts manually has always been too time-consuming, and is sometimes difficult – you need to account for foreign key dependencies and figure out how to write T-SQL to create randomised dates or person names. And, of course, manual scripts break immediately whenever your data schema changes.

Nevertheless, you do have to find an effective solution to testing your application with sufficient test data. If an application goes live and is not able to cope with real-world data volumes, then you put the customer's business at risk, as well as the reputation of the software vendor.

In this article, Steven Sanderson, a web developer at Red Gate Software, explains how he has already made good use of Red Gate's new SQL Data Generator tool to generate some realistic test data for a "lead nurturing" facility for a sales intranet.

SQL Data Generator is currently in beta. If you feel that the tool could be useful to you in a similar (or even completely different) scenario to that described here, then we encourage you to download the free trial, test it out, and let us know what works for you and what doesn't!

The Case Study

I was tasked with adding a "lead nurturing" facility to my company's sales intranet. The goal was to provide a configurable system by which potential customers could be sent a series of automated emails to encourage them to buy.

We have a table of Download records, one row for each time a potential customer downloads a trial version of one of our products.

Figure 1

We process the Download table at regular intervals, creating a new Lead record for each new Download that matches certain criteria. A Lead is assigned (via some application logic) to one of a set of predefined LeadGroups. Each LeadGroup has a series of scheduled, automated emails called LeadActions. These emails are triggered to occur at different time intervals after the point when the lead was created.

Therefore, there are two processes to be tested for scalability:

  • The batch-processing of Download records (i.e. conversion to Leads and assignment to LeadGroups)
  • The regular task which determines which LeadActions (i.e. emails) are now due to be sent, considering all active Leads

Scalability Testing

My goal was to set up a one-click process for emptying the Lead table, adding a lot of realistic new Download records (e.g. 5000 new records), and clearing and repopulating the LeadAction table with, let's say, an average of 8 actions per LeadGroup.

I could then test the batch processing code for scalability, easily adjusting the data volumes as required. Each time I change my application code, I can reset and retest with a fresh new set of data.

Configuring the Tables:

I started by creating a new SQL Data Generator project, and attaching it to my development database, as shown in Figure 2

 

Figure 2

Next, I located and enabled data generation for the Downloads table by checking the box, as shown in Figure 3:

Figure 3

Now, looking at the preview window shown in Figure 4, you can see that the software had already inferred the correct type of data to go into each column. Realistic email addresses, names, and valid dates:

Figure 4

I could change the type of data generated for each column, choosing from a menu that includes common categories such as personal data (names, phone numbers etc.), geography (cities, countries, zip codes etc.), payment (credit card numbers, currency codes), business (company/product names, industry sectors etc.) and many others (random filenames, website URLs etc). I could also provide custom data patterns, either as rows in a text file, or more simply with a Regular Expression that the data must match (car registration numbers? Easy!). I could also pull in random data samples via SQL queries or CSV files.

In this case, I was satisfied with the automatic choices, except the CreationDates were too widely dispersed. I wanted them all to be in the last year, so I clicked on the CreationDate column and amended the settings, as shown in Figure 5:

Figure 5

Finally, for the Downloads table, I specified the number of rows to be generated:

Figure 6

Next, emptying the Leads table was trivial. After enabling data generation for Leads, in the generation settings for that table, I asked for zero rows, with pre-deletion enabled:

Figure 7

The last table to configure was LeadActions. I enabled data generation for this table, but this time, as you can see in Figure 8, the default data types were less useful:

Figure 8

I obtained more appropriate Name values by configuring a custom generator, in the form of a regular expression, as shown in Figure 9:

Figure 9

It also made more sense to set the OccursAfterMinutes range to be from 0 to 86400 (since in this application, 60 days is the maximum time an email can be sent after the Lead is generated, and 60 days is 86,400 minutes).

Figure 10

The preview window now looked as shown in Figure 11:

Figure 11

Finally, I said I wanted an average of 8 actions per LeadGroup, which I configured as shown in Figure 12:

Figure 12

Generating the Data

With the configuration complete, I was just one button click away from a shiny new set of data:

Figure 13

I Clicked "Generate Data" and, one second later, my database was populated and I could put my application to the test!

If I was feeling ambitious and wanted to test my SQL queries or batch processes against ten million Download records, I could just change the "Number of rows" setting and regenerate.

Of course, one of the most powerful features of the tool, and the one that makes this so much more efficient than generating data via manual scripting, is that whenever I change the target database schema, SQL Data Generator will automatically pick up the changed schema without losing any generation settings that are still usable.

SQL Data Generator and Source Control

SQL Data Generator project files can be stored in source control. This makes it simple to build up my data generation settings over time, adding in new tables as I create them. The project can then be shared with colleagues, via source control, and they can collaborate by adding in their new tables or other changes.

Conclusion

SQL Data Generator provides a far simpler method to produce arbitrary volumes of realistic test data, and this radically changes the whole prospect of scalability testing. Data generation is no longer a tedious task to be evaded with excuses about deadlines. And I feel more confident, knowing that my application will cope with ten times more data than it's ever likely to encounter!

As noted at the start, SQL Data Generator is in beta. We encourage anyone who feels that this tool might be useful to them to download it, test it out and provide us your feedback on the Red Gate forums.

Steven Sanderson

Author profile:

Steven is a web developer at Red Gate Software.

Search for other articles by Steven Sanderson

Rate this article:   Avg rating: from a total of 20 votes.


Poor

OK

Good

Great

Must read
Have Your Say
Do you have an opinion on this article? Then add your comment below:
You must be logged in to post to this forum

Click here to log in.


Subject: Interesting
Posted by: SQLWayne (view profile)
Posted on: Thursday, January 10, 2008 at 11:59 AM
Message: I recently learned of a site, http://www.fakenamegenerator.com/, that provides name and address information similar to SQL Data Generator. Obviously SDG is a lot more flexible since you can specify columns and ranges.

I'm curious how the credit card numbers are generated. In the case of the web site, they use a public PHP algorithm that generates a CC number with two characteristics: it passes the credit card checksum, and it is based on a non-used bank prefix, so if it does go through your system, it'll be accepted past the CRC, but will bounce since it's a dead account series. I'm curious if SDG uses a similar algorithm.

Also, are the sample name data stored in a table? It would be nice to be able to influence that data. Where I'm at, we have a huge Hispanic population, so it'd be nice to get a list of Hispanic names from whatever source and upload them into SDG in order to provide what would seem to be a better mix of names for location-specific feel.

I will download and play with it, sounds like it'll be a lot of fun.

Subject: Command Line Execution
Posted by: Dru (not signed in)
Posted on: Saturday, March 01, 2008 at 6:10 AM
Message: Can I execute this program from the command line so it can play nice with my automated testing?

Subject: Question
Posted by: ichr@mm (not signed in)
Posted on: Thursday, June 12, 2008 at 11:11 AM
Message: Hi, i am making new generator using c# 2.0 and i can not find a way to get the "generated" data. I mean, Datetime generator has a property that is the "offset", and that takes the value from another column and makes his job. I want to do that to make a column to be de difference between 2 others columns

Thanks and good work

Subject: Controlling the distribution of generated data
Posted by: ranjeet (view profile)
Posted on: Tuesday, August 16, 2011 at 11:42 PM
Message: HI
Is there any means by which we can control the distribution of generated test data.

i.e I have a 'VISIT' table with a column name 'VISIT TYPE' [with any one of three types of input data 'Inpatient','Outpatient' or 'Emergency'], so i would like to generate the data in such a manner that 60% of the generated data should be 'Inpatient', 20% should be 'Outpatient' and 20% should be 'Emergency'.[If we are generating 100 rows for visit table then 'VisitType' column should contain 60-Inpatient , 20-Outpatient,20-Emergency data rows]

Regards,
Ranjeet

 

Phil Factor
Searching for Strings in SQL Server Databases

Sometimes, you just want to do a search in a SQL Server database as if you were using a search engine like Google.... Read more...

 View the blog

Top Rated

SQL Server XML Questions You Were Too Shy To Ask
 Sometimes, XML seems a bewildering convention that offers solutions to problems that the average... Read more...

Continuous Delivery and the Database
 Continuous Delivery is fairly generally understood to be an effective way of tackling the problems of... Read more...

The SQL Server Sqlio Utility
 If, before deployment, you need to push the limits of your disk subsystem in order to determine whether... Read more...

The PoSh DBA - Reading and Filtering Errors
 DBAs regularly need to keep an eye on the error logs of all their SQL Servers, and the event logs of... Read more...

MySQL Compare: The Manual That Time Forgot, Part 1
 Although SQL Compare, for SQL Server, is one of Red Gate's best-known products, there are also 'sister'... Read more...

Most Viewed

Beginning SQL Server 2005 Reporting Services Part 1
 Steve Joubert begins an in-depth tour of SQL Server 2005 Reporting Services with a step-by-step guide... Read more...

Ten Common Database Design Mistakes
 If database design is done right, then the development, deployment and subsequent performance in... Read more...

SQL Server Index Basics
 Given the fundamental importance of indexes in databases, it always comes as a surprise how often the... Read more...

Reading and Writing Files in SQL Server using T-SQL
 SQL Server provides several "standard" techniques by which to read and write to files but, just... Read more...

Concatenating Row Values in Transact-SQL
 It is an interesting problem in Transact SQL, for which there are a number of solutions and... Read more...

Why Join

Over 400,000 Microsoft professionals subscribe to the Simple-Talk technical journal. Join today, it's fast, simple, free and secure.