Most developers are far too busy with code to pay much attention to minor details like testing their applications with realistic volumes of data. With no effective way to obtain realistic test data, it’s easier to dodge the task than to tackle it properly.
Historically, the common trick has been to use old copies of live data for testing, but there are three key problems with this approach:
- Company policy and privacy legislation can prohibit this inappropriate use of customers’ personal data. It also poses the risk of mistakenly sending real emails to customers or accidentally triggering live credit card transactions.
- You’re testing for last year’s data volume instead of next year’s.
- No old data exists if you’re writing a brand new product or feature.
So using real data isn’t a good option, but the obvious alternative of generating data programmatically isn’t that much more attractive. The task of writing data generation scripts manually has always been too time-consuming, and is sometimes difficult – you need to account for foreign key dependencies and figure out how to write T-SQL to create randomised dates or person names. And, of course, manual scripts break immediately whenever your data schema changes.
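To make the pain concrete, here is a minimal sketch of the kind of manual script the paragraph above describes, written in Python against an in-memory SQLite database. The two-table schema, names, and row counts are invented for illustration; the point is that parent rows must be inserted before child rows to satisfy the foreign key, and any schema change breaks the script.

```python
import random
import sqlite3
from datetime import datetime, timedelta

# Hypothetical schema for illustration: Customers must exist before
# Orders, because Orders.CustomerId is a foreign key.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Customers (Id INTEGER PRIMARY KEY, Name TEXT);
    CREATE TABLE Orders (
        Id INTEGER PRIMARY KEY,
        CustomerId INTEGER REFERENCES Customers(Id),
        PlacedOn TEXT
    );
""")

FIRST = ["Alice", "Bob", "Carol"]
LAST = ["Smith", "Jones", "Taylor"]

def random_date(days_back=365):
    # A randomised date somewhere in the last year
    return (datetime.now() - timedelta(days=random.uniform(0, days_back))).isoformat()

# Parents first, then children -- the foreign key ordering the text warns about
customer_ids = []
for _ in range(100):
    cur = conn.execute("INSERT INTO Customers (Name) VALUES (?)",
                       (f"{random.choice(FIRST)} {random.choice(LAST)}",))
    customer_ids.append(cur.lastrowid)

for _ in range(500):
    conn.execute("INSERT INTO Orders (CustomerId, PlacedOn) VALUES (?, ?)",
                 (random.choice(customer_ids), random_date()))

print(conn.execute("SELECT COUNT(*) FROM Orders").fetchone()[0])  # → 500
```

Even at this toy scale, the script hard-codes table names, column lists, and insertion order, which is exactly why it rots the moment the schema changes.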
Nevertheless, you do have to find an effective solution to testing your application with sufficient test data. If an application goes live and is not able to cope with real-world data volumes, then you put the customer’s business at risk, as well as the reputation of the software vendor.
In this article, Steven Sanderson, a web developer at Red Gate Software, explains how he has already made good use of Red Gate’s new SQL Data Generator tool to generate some realistic test data for a “lead nurturing” facility for a sales intranet.
SQL Data Generator is currently in beta. If you feel that the tool could be useful to you in a similar (or even completely different) scenario to that described here, then we encourage you to download the free trial, test it out, and let us know what works for you and what doesn’t!
The Case Study
I was tasked with adding a “lead nurturing” facility to my company’s sales intranet. The goal was to provide a configurable system by which potential customers could be sent a series of automated emails to encourage them to buy.
We have a table of Download records, one row for each time a potential customer downloads a trial version of one of our products.
We process the Download table at regular intervals, creating a new Lead record for each new Download that matches certain criteria. A Lead is assigned (via some application logic) to one of a set of predefined LeadGroups. Each LeadGroup has a series of scheduled, automated emails called LeadActions. These emails are triggered to occur at different time intervals after the point when the lead was created.
Therefore, there are two processes to be tested for scalability:
- The batch-processing of Download records (i.e. conversion to Leads and assignment to LeadGroups)
- The regular task which determines which LeadActions (i.e. emails) are now due to be sent, considering all active Leads
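In outline, the two processes look something like the following sketch. This is plain Python over dicts rather than the real intranet code; the column names follow the article, but the “matches certain criteria” test and the group-assignment rule are invented placeholders.

```python
from datetime import datetime, timedelta

def process_downloads(downloads, existing_lead_ids, assign_group):
    """Convert each new Download that matches the criteria into a Lead."""
    leads = []
    for d in downloads:
        if d["Id"] in existing_lead_ids:
            continue  # already converted in an earlier batch
        if not d["Email"]:
            continue  # stand-in for the article's "certain criteria"
        leads.append({
            "DownloadId": d["Id"],
            "LeadGroup": assign_group(d),  # placeholder for the real logic
            "CreatedOn": d["CreationDate"],
        })
    return leads

def due_actions(leads, lead_actions, now):
    """Find every (lead, action) pair whose scheduled send time has passed."""
    due = []
    for lead in leads:
        for action in lead_actions.get(lead["LeadGroup"], []):
            send_at = lead["CreatedOn"] + timedelta(minutes=action["OccursAfterMinutes"])
            if send_at <= now:
                due.append((lead["DownloadId"], action["Name"]))
    return due

# Tiny worked example: one download, one group with two scheduled emails
now = datetime(2008, 1, 1, 12, 0)
downloads = [{"Id": 1, "Email": "a@example.com",
              "CreationDate": now - timedelta(days=2)}]
leads = process_downloads(downloads, existing_lead_ids=set(),
                          assign_group=lambda d: "TrialUsers")
actions = {"TrialUsers": [{"Name": "Welcome", "OccursAfterMinutes": 60},
                          {"Name": "FollowUp", "OccursAfterMinutes": 86400}]}
print(due_actions(leads, actions, now))  # → [(1, 'Welcome')]
```

The scalability question is how both loops behave when `downloads` and `leads` contain thousands of rows, which is exactly what the generated test data lets you measure.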
My goal was to set up a one-click process for emptying the Lead table, adding a lot of realistic new Download records (e.g. 5000 new records), and clearing and repopulating the LeadAction table with, let’s say, an average of 8 actions per LeadGroup.
I could then test the batch processing code for scalability, easily adjusting the data volumes as required. Each time I change my application code, I can reset and retest with a fresh new set of data.
Configuring the Tables
I started by creating a new SQL Data Generator project and attaching it to my development database, as shown in Figure 2:
Next, I located and enabled data generation for the Downloads table by checking the box, as shown in Figure 3:
Now, looking at the preview window shown in Figure 4, you can see that the software had already inferred the correct type of data for each column – realistic email addresses, names, and valid dates:
I could change the type of data generated for each column, choosing from a menu that includes common categories such as personal data (names, phone numbers etc.), geography (cities, countries, zip codes etc.), payment (credit card numbers, currency codes), business (company/product names, industry sectors etc.) and many others (random filenames, website URLs etc). I could also provide custom data patterns, either as rows in a text file, or more simply with a Regular Expression that the data must match (car registration numbers? Easy!). I could also pull in random data samples via SQL queries or CSV files.
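As a flavour of what a regular-expression generator produces, here is a hand-rolled stand-in in Python. The pattern (a UK-style car registration such as “AB12 CDE”) and the function name are invented for illustration; the generated values are checked against the pattern they must match.

```python
import random
import re
import string

# Pattern chosen for illustration: two letters, two digits, a space,
# then three letters -- e.g. "AB12 CDE"
PATTERN = re.compile(r"^[A-Z]{2}\d{2} [A-Z]{3}$")

def random_registration():
    letters = string.ascii_uppercase
    return (random.choice(letters) + random.choice(letters)
            + f"{random.randint(0, 99):02d} "
            + "".join(random.choice(letters) for _ in range(3)))

sample = [random_registration() for _ in range(5)]
assert all(PATTERN.match(r) for r in sample)
```

A tool-based regex generator works in the other direction, of course – you supply only the pattern and it derives conforming values – but the output is the same shape.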
In this case, I was satisfied with the automatic choices, except the CreationDates were too widely dispersed. I wanted them all to be in the last year, so I clicked on the CreationDate column and amended the settings, as shown in Figure 5:
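The equivalent of that CreationDate setting, expressed as a quick Python sketch (the function name is invented), is simply to pick a uniformly random offset into the past 365 days:

```python
import random
from datetime import datetime, timedelta

def creation_date_last_year(now=None):
    # A uniformly random instant within the last 365 days
    now = now or datetime.now()
    return now - timedelta(seconds=random.uniform(0, 365 * 24 * 3600))

now = datetime(2008, 3, 1)
dates = [creation_date_last_year(now) for _ in range(1000)]
assert all(now - timedelta(days=365) <= d <= now for d in dates)
```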
Finally, for the Downloads table, I specified the number of rows to be generated:
Next, emptying the Leads table was trivial. After enabling data generation for Leads, in the generation settings for that table, I asked for zero rows, with pre-deletion enabled:
The last table to configure was LeadActions. I enabled data generation for this table, but this time, as you can see in Figure 8, the default data types were less useful:
I obtained more appropriate Name values by configuring a custom generator, in the form of a regular expression, as shown in Figure 9:
It also made more sense to set the OccursAfterMinutes range to run from 0 to 86,400: in this application, 60 days is the maximum delay between a Lead being created and an email being sent, and 60 days works out to 86,400 minutes.
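The arithmetic behind that upper bound, as a one-line sanity check:

```python
# 60 days expressed in minutes
minutes_per_day = 24 * 60          # 1440
max_offset = 60 * minutes_per_day  # 86,400 minutes = 60 days
assert max_offset == 86400
```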
The preview window now looked as shown in Figure 11:
Finally, I said I wanted an average of 8 actions per LeadGroup, which I configured as shown in Figure 12:
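One simple way to realise “an average of 8 actions per LeadGroup” in a hand-written script (the group names and the 4–12 spread here are invented for illustration) is to draw each group’s row count from a range centred on 8:

```python
import random

# Each group gets a uniformly random count between 4 and 12, so the
# expected number of actions per group is 8. Group names are placeholders.
lead_groups = [f"Group{i}" for i in range(50)]
actions_per_group = {g: random.randint(4, 12) for g in lead_groups}

total = sum(actions_per_group.values())
print(total / len(lead_groups))  # close to 8 on average
```

The tool handles this distribution for you; the sketch just shows what “average of 8 per parent row” means for the child table’s total size.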
Generating the Data
With the configuration complete, I was just one button click away from a shiny new set of data:
I clicked “Generate Data” and, one second later, my database was populated and I could put my application to the test!
If I was feeling ambitious and wanted to test my SQL queries or batch processes against ten million Download records, I could just change the “Number of rows” setting and regenerate.
Of course, one of the most powerful features of the tool, and the one that makes this so much more efficient than generating data via manual scripting, is that whenever I change the target database schema, SQL Data Generator will automatically pick up the changed schema without losing any generation settings that are still usable.
SQL Data Generator and Source Control
SQL Data Generator project files can be stored in source control. This makes it simple to build up my data generation settings over time, adding in new tables as I create them. The project can then be shared with colleagues, via source control, and they can collaborate by adding in their new tables or other changes.
SQL Data Generator provides a far simpler method to produce arbitrary volumes of realistic test data, and this radically changes the whole prospect of scalability testing. Data generation is no longer a tedious task to be evaded with excuses about deadlines. And I feel more confident, knowing that my application will cope with ten times more data than it’s ever likely to encounter!
As noted at the start, SQL Data Generator is in beta. We encourage anyone who feels that this tool might be useful to download it, test it out, and share feedback on the Red Gate forums.