Robert Chipperfield

Random names and the Birthday Paradox

Published Friday, September 07, 2007 10:08 AM

I'm currently in the middle of giving Red Gate's sample "Widgets" database a bit of a face lift, in preparation for the new version of SQL Data Compare 6 that's coming soon. As part of this, I wanted to populate a pretty generic "Contacts" table with some sample names.

So, off I go to Mr Google, and find some lists of the Top {n} Baby Names, Top {n} Surnames and so on, and make myself a spreadsheet with both of those lists. Then combine that with a few RAND() calls and VLOOKUP()s, and I have my list of names. All good.

However, we have testers here, and they're rather good. Out of my 60 odd names that I generated, there were quite a few duplicates - not just duplicated surnames, but duplicated firstname AND surname combinations. This surprised me quite a bit - I had about 50 surnames, and 25 first names, which I figured should give me 1,250 possible combinations. So why so many duplicates in a set of only 60 results?

Being rather early on a Friday morning, I opted for the brute force method of attacking the problem, and just added some more names - now I've got 100 surnames and 70 first names. But still, the duplicates were there. At this point I added a COUNTIF() function to the end of each result row so I could see the scale of the problem at a glance, and it was quite scary.

Upping the number of generated names to 199, I got the following distribution of appearances:
  • Once: 26
  • Twice: 66
  • Three times: 63
  • Four times: 24
  • Five times: 20
To put it another way, I only had 90 unique results out the 199 I'd generated. Ouch.

I've been well and truly bitten by the Birthday Paradox here (the thing that states that if you have more than 23 people in a room, the chances of two of them sharing the same birthday are more than 50%). A quick back-of-the envelope calculation suggests that with the size of my data sets, I'll reach the 50% probability point after about 25 results.

So, what to do?

If I was coding this in C#, my first instinct would be to generate a combination, check if it was already in the set of results, and throw it out if it was, but the results I've got above should demonstrate that this is probably a lot less efficient than you might think.

A more interesting option is this: you have n possible combinations, and want m results. For result i, you generate a random number between (n/m)*i and (n/m)*(i+1). This guarantees you won't get duplicates, and still gives you a pretty good random distribution, albeit not perfect.You could probably improve this with more complex distributions if you want truly random data, but that's beyond my knowledge of statistics.

Comments

 

Steve S said:

This is really a problem about finding permutations of a set (since we can forget about the firstname/lastname issue and just consider the n permutations as you mention).

There's a nice algorithm at http://en.wikipedia.org/wiki/Permutation#Algorithm_to_generate_permutations. Since you will only be inspecting the first m results, you can terminate the algorithm early (when j = m).
September 7, 2007 7:28 AM
 

Steve S said:

Oh, sorry, I mean n! permutations.
September 7, 2007 7:29 AM
 

RobertChipperfield said:

If I understand that code correctly, it degenerates to firstname = i / n and surname = i mod n in the situation where you only have two things you're generating permutations of.

The problem isn't finding all the permutations, it's about randomly selecting them in a way that guarantees no duplicates. If I just selected the first 10 names, I'd end up with Joe Bloggs, Fred Bloggs, Chris Bloggs, and so on, if you catch my drift?

Of course, you could first generate a 1:1 mapping to randomise the order of all the permutations, then select the first n items from that randomised list. But then I think you'd hit the same problem as I did - that generating the mapping has the same problem of de-duplicating it, unless you use a pseudo-random generator that guarantees not to repeat anything until you've got a certain number of values out. These do exist.
September 7, 2007 10:01 AM
 

Steve S said:

Sorry - I didn't do a good job of explaining myself.

Let's start by defining a mapping from integers in the range 1 to f*s (where f & s are the number of first and surnames you have) to the set of combinations of firstnames and surnames. This is trivial - for item i, combine the (i/f)th firstname with the (i % f)th surname.

The problem is now reduced to choosing m of those integers in the range 1 to f*s, without repeating the same choice. This is equivalent to picking a random permutation of the integers 1...(f*s) and then mapping the first m of them back to the combination of names they represent. There aren't going to be any duplicates, because that's the definition of a permutation.

The algorithm from Wikipedia does this pretty quickly, though it does require O(f*s) memory. It gives you "uniformly" random permutations, based on you choosing a uniformly random value for k.
September 7, 2007 11:03 AM
 

RobertChipperfield said:

Ahh... got you. Sorry, I was being a bit dense there. Yes, that'd do it entirely, and be more satisfactory than my solution in the general case.

Thanks for the explanation! :-)
September 7, 2007 11:06 AM
 

drsql said:

I would use a relational database to generate this kind of data any you can eliminate probability from the equation.  (of course, I am a data architect that is probably not surprising.  I would use a relational database to solve most any programming problem.)

For example, in SQL  Server, just do this:

create table firstName
(
   name    varchar(30) primary key
)
create table lastName
(
   name    varchar(30) primary key
)
go
insert into firstName
select 'Barney'
union all
select 'Fred'
union all
select 'Wilma'
union all
select 'Betty'
go
insert into lastName
select 'Flintstone'
union all
select 'Rubble'
go

select top 4 firstName.name, lastName.name
from   firstName
         cross  join lastName
order by  newid()

You will always get 4 unique names (since you have two sets of unique values that you are working with.)  

And if you need to add more names:

select top 4 firstName.name, lastName.name
from   firstName
         cross  join lastName
where not exists (select *
                         from   destination
                         where destination.firstName = firstName.name
                             and destination.lastName = lastName.name)
order by  newid()
September 12, 2007 1:59 PM
 

RobertChipperfield said:

Hi drsql,

Thanks for the comment - I suppose given the final destination was a database anyway, it would've made sense to use T-SQL, but alas mine isn't that great!

I like the idea of the "order by newid()" - a neat way of randomising the output :-).

Cheers,
Rob
September 12, 2007 2:54 PM
 

KevinHHood said:

When I have had to select a random sub-set of an easily creatable set of items, I have used the following algorithm:

(1) Create an array of all n items.  Lets call the array Item[] and assume that it's zero-based.  (2)  Use a random number generator to randomly pick one of the n items with index between - say the item at index 'i' where 0 <= i <= n-1.  (3) Take Item[n-1] and copy it to Item[i] and consider the array to now have n-1 items, all of which have not been picked.  (4) Repeat step 2 generating a random number between n and n-2 this time (one of the remaining n-1 items).  (5) Copy Item[n-2] to the selected array element.  (6...) Repeat until you have pulled m elements from the array at which point the array's lower n-m elements will contain all the unselected items.
September 26, 2007 11:06 AM
 

KevinHHood said:

I didn't proof that very well.

In step (2) the words 'with index between' should have been deleted and in step (4) it should read 'between 0 and n-2'.
September 26, 2007 11:17 AM
 

eliteab » Blog Archive » Random names and the Birthday Paradox said:

October 24, 2007 12:02 AM
 

http://www.simple-talk.com/community/blogs/robertchipperfield/archive/2007/09/07/36985.aspx said:

March 25, 2008 3:24 AM
You need to sign in to comment on this blog

About RobertChipperfield

I'm a software engineer at Red Gate, where I've worked since September 2006. I've worked on a wide range of products, including SQL Doc, SQL Data Compare, SQL Log Rescue, SQL Multi Script, and ANTS Profiler.

I'm currently working on the new Exchange Server Archiver tool, which should be heading towards release at the start of 2009.

Outside of work, I enjoy amateur radio, electronics, and of course the usual assortment of computer-related technologies, from hardware all the way through to high-level software.


















<September 2007>
SuMoTuWeThFrSa
2627282930311
2345678
9101112131415
16171819202122
23242526272829
30123456
Go With the Flow
 Knowing enough about the routes that messages take is vital to being an effective Exchange admin,... Read more...

When Email Collaboration Could Have Changed History
 In our mission to make history relevant to the busy IT executive, we speculate how Email might have... Read more...

Bunnikins!
 When an IT manager is selected as a victim of office politics of a large corporate, it is time for him... Read more...

Exchange Database Technologies
 One of the most misunderstood technologies in Exchange Server, regardless of its version, is the... Read more...

Top Tips for Exchange Admins
 Michael Francis hands out imaginary Olympic medals to the winner of the August 'Top Tips for Exchange... Read more...