Data Is Crazier than You Think

As a society, we have an unrealistic respect for data, especially if it has a decimal point somewhere and uses metric units. We who are in the business of data need to cultivate a renewed interest in the sceptical and rigorous science of statistics: it is too important to leave to 'Data Scientists'. If the data is wrong, or the way we analyse or report it is misleading, much of what we do is pointless.

We deal with databases in this trade, but so many of us do not entirely understand data analysis. I guess this is why we recently invented the term “Data Science” and attached a large paycheck to it. No longer do we need to be so aware of the details of data before getting started in IT. The pioneers in Information Technology had to have a good grounding in math, elementary statistics, and the way that data was encoded. Nowadays, it is less central.

There is a Dilbert cartoon of the Pointy-Haired Boss called ‘The Wrong Data’:

‘Use the CRS database to size the market.’
‘That data is wrong.’
‘Then use the DIBS database.’
‘That data is also wrong.’
‘Can you average them?’
‘Sure. I can also multiply them, too.’

When I ask the kids what they know about floating point math and rounding error, I get the strange impression of talking to a display case in a fish market: open mouths and dead, glazed eyes staring back from cold meat. When we did statistical work in FORTRAN, you had to know that level of detail. Today, much of it is hidden in the tools, but the software cannot entirely protect you from needing to know the actual nature of data: you can still fail spectacularly.
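To make the point concrete, here is a minimal Python sketch of the classic rounding trap: a number as innocent-looking as 0.1 has no exact binary floating point representation, so repeated addition quietly drifts.

```python
from decimal import Decimal

# 0.1 has no exact binary floating point representation,
# so repeated addition accumulates rounding error.
total = 0.0
for _ in range(10):
    total += 0.1

print(total == 1.0)      # False: the sum is only close to 1.0
print(abs(total - 1.0))  # tiny, but not zero

# The decimal module trades speed for exact decimal arithmetic.
exact = sum(Decimal("0.1") for _ in range(10))
print(exact == 1)        # True
```

The drift here is tiny, but accumulate a few million rows of currency amounts in binary floats and it stops being tiny.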

Thinking about Data

Tee shirt slogan: “On a scale from 1 to 10, what color is your favorite letter of the alphabet?” I will bet you started to answer that question! That is the joke. But if you actually did answer the question, then you either need some help or you are an absurdist stand-up comedian. People do not understand scales and measurement and I have a whole book on this (Joe Celko’s Data, Measurements and Standards in SQL; 2009; Morgan Kaufmann; ISBN: 978-0-12-374722-8).

One of the barriers to understanding math is the way that the media increasingly use pseudo-mathematics to reinforce belief and opinion: it sets a bad example. I’m 110% certain of this. If you like reading math books, get a copy of “Street-Fighting Mathematics” by Sanjoy Mahajan (2010; ISBN: 978-0262514293). The first chapter is on dimensions, and it starts with false comparisons between the net worth of Exxon ($119 Billion after 125 years) and the GDP of Nigeria ($99 Billion per year). There is no way to legitimately compare these measurements, but people do! Perception is not math.

I have Mother Celko’s Law of Decimal Places: “A statistic or measurement impresses the reader to the square of the decimal places.” This means that telling your non-techie manager that the average weight of a floobsnizzle is 23 kilograms is not as impressive as telling him the average weight of a floobsnizzle is 23.12 kilograms.

Then I have Mother Celko’s Law of Units. When I tell my non-techie manager that the average weight of a floobsnizzle is ~50 pounds, it does not look as scientific as ~23 kilograms. To an American, especially, Metric units are scientific while US customary units are for the grocery store. This is more than Americans not knowing the SI system; they do not even know that US customary units are not Imperial units! It is a mindset that says “European stuff is cool” that they got from fashion magazines as much as technical books.

Central Tendency Measures

Given a set of data and one attribute in it, what is the best way to summarize the attribute as a single value? This is usually called a measure of central tendency. Informally, you do this all the time when you declare “Those kids are fat!” in the everyday world. The set of kids can be vaguely defined (kids in <insert name of country here>), and the term “fat” is certainly fuzzy. A fat ballet dancer is probably different from a thin Sumo Wrestler.

When you put a number to this, however, it looks better. If I tell you most statistics are fake, it is not as convincing as telling you 79% of all statistics are invented on the spot. But what number to use and how do we get it?

  • The average (arithmetic mean) is the easiest one for SQL programmers because it is built into the language as AVG(). The other two common measures are the median and the mode, but they are not built in. You can Google SQL implementations of them; they are not hard.
  • The mode is the most frequently occurring value in a set. If there are two such values in a set, statisticians call it a bimodal distribution; three such values make it trimodal, and so forth. Most SQL implementations do not have a mode function, since it is easy to calculate. But if the data is multi-modal, then there probably is no central tendency at all. Think about a third world country that is effectively without a middle class; you are either a starving peasant or a member of the rich royal family.
  • The median is defined as the value for which there are just as many cases with a value below it as above it. If such a value exists in the data set, this value is called the statistical median by some authors. If no such value exists in the data set, the usual method is to divide the data set into two halves of equal size such that all values in one half are lower than any value in the other half. The median is then the average of the highest value in the lower half and the lowest value in the upper half, and is called the financial median by some authors. Then we have the weighted median based on subsets around the middle; it is easier to explain with a small example {1,2,2,3,3,3}. The financial median is (2+3)/2 = 2.5, but the weighted median is computed as (2+2+3+3+3)/5 = 2.6. The weighted median shows that half of the set is 3’s which shifts the central measure toward that side.
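A quick sketch of all three measures in Python, using the {1,2,2,3,3,3} example above. The weighted_median function is mine, not a standard library routine: it implements the definition just given (average every value tied with one of the two middle values) and assumes an even-sized set, as in the example.

```python
from statistics import mean, median, mode

data = [1, 2, 2, 3, 3, 3]

print(mean(data))    # arithmetic mean: 14/6 ≈ 2.33
print(median(data))  # financial median: (2 + 3)/2 = 2.5
print(mode(data))    # most frequent value: 3

# Weighted median per the example in the text: average all values
# equal to either of the two middle values, so ties around the
# middle pull the measure toward the heavier side.
def weighted_median(values):
    s = sorted(values)
    lo, hi = s[len(s) // 2 - 1], s[len(s) // 2]
    middle = [v for v in s if v in (lo, hi)]
    return sum(middle) / len(middle)

print(weighted_median(data))  # (2 + 2 + 3 + 3 + 3) / 5 = 2.6
```

Notice how each measure tells a slightly different story about the same six numbers.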

Darrell Huff wrote a classic book, “How to Lie with Statistics”, that has been in print since 1954. Yes, it is that good, and you should read it. One of his examples is a list of salaries, which looks out of date in 2014.

The mode is $2,000.00. This makes the company look pretty cheap. The arithmetic mean is $5,700.00. That makes the pay look pretty good for 1954. The median is $2,500.00. This is a better measure of the central value of the set! People will, however, pick whichever number reinforces their political position.
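Huff’s actual salary list is not reproduced here, but a small hypothetical payroll, invented only to hit those same three figures, shows how the choice of measure spins the story:

```python
from statistics import mean, median, mode

# Hypothetical payroll (not Huff's list) chosen to match the
# three figures above: several low-paid workers, one rich owner.
salaries = [2000, 2000, 2000, 2500, 3000, 5000, 23400]

print(mode(salaries))    # 2000 -> "the company is cheap"
print(mean(salaries))    # 5700 -> "the pay is pretty good"
print(median(salaries))  # 2500 -> the fairest single summary
```

One large salary at the top drags the mean far above what most employees actually take home; the median barely moves.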

This is about as far as most people get with central tendency. But there are other measures!

The Geometric Mean is sometimes a better measure of central tendency than the simple arithmetic mean when you are analyzing change-over-time. The geometric mean is more appropriate than the arithmetic mean for describing proportional growth, both exponential growth (constant proportional growth) and varying growth. The geometric mean of growth over periods yields the equivalent constant growth rate that would yield the same final amount.

At this point, I am talking too much math, so I will steal from Wikipedia. Suppose an orange tree yields 100 oranges one year and then 180, 210 and 300 the following years, so the growth is 80%, 16.6666% and 42.8571% for each year respectively. Using the arithmetic mean gives a (linear) average growth of 46.5079% ((80% + 16.6666% + 42.8571%) divided by 3). However, if we start with 100 oranges and let the count grow by 46.5079% each year, the result is about 314 oranges, not 300, so the linear average over-states the year-on-year growth. Instead, we can use the geometric mean. Growing by 80% corresponds to multiplying by 1.80, so we take the geometric mean of 1.80, 1.166666 and 1.428571, i.e. (1.80 × 1.166666 × 1.428571)^(1/3) = 3^(1/3) ≈ 1.442249; thus the “average” growth per year is 44.2249%. If we start with 100 oranges and let the number grow by 44.2249% each year, the result is 300 oranges. This is a better measure of the trend.
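The orange-tree arithmetic is easy to verify in a few lines of Python:

```python
from math import prod

# Year-on-year growth factors: 100 -> 180 -> 210 -> 300 oranges.
factors = [180 / 100, 210 / 180, 300 / 210]  # 1.8, 1.1666..., 1.428571...

arithmetic = sum(factors) / len(factors)
geometric = prod(factors) ** (1 / len(factors))

print(arithmetic)  # ≈ 1.465079: compounding this overshoots
print(geometric)   # ≈ 1.442249: compounding this lands on 300

print(100 * arithmetic ** 3)  # ≈ 314, not 300
print(100 * geometric ** 3)   # ≈ 300
```

The product of the growth factors is exactly 3 (the tree tripled its yield), so the geometric mean is simply the cube root of 3.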

You really need to read Wikipedia’s account of the Pythagorean means: the arithmetic, geometric and harmonic means. They get this name because they are based on a planar-geometry model of distances.
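For completeness, the three Pythagorean means can be compared directly with Python’s statistics module; for any positive data the harmonic mean never exceeds the geometric mean, which never exceeds the arithmetic mean. A minimal sketch:

```python
from statistics import mean, geometric_mean, harmonic_mean

data = [2, 8]

am = mean(data)            # (2 + 8) / 2      = 5
gm = geometric_mean(data)  # sqrt(2 * 8)      = 4
hm = harmonic_mean(data)   # 2 / (1/2 + 1/8)  = 3.2

# For positive data the three always order the same way:
assert hm <= gm <= am
print(am, gm, hm)
```

The harmonic mean is the right one for averaging rates (the classic example is average speed over a fixed distance), just as the geometric mean is the right one for averaging growth factors.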

Simpson’s Paradox

Simpson’s paradox has nothing to do with Homer Simpson or O. J. Simpson; it is named after the British statistician Edward Hugh Simpson, who wrote about it in the 1950s. Simpson’s paradox happens when a trend that appears in separate groups of data disappears when these groups are combined, and the reverse trend appears for the aggregate data. It tends to happen when you partition your data into subsets of odd sizes, each a little skewed, and it illustrates well the dangerous consequences of errors in population sampling.

The real problem is that people believe that correlation is causality, and that you can prove this with simple frequency analysis.

 This result is often encountered in social-science and medical-science statistics, and is particularly confounding when frequency data are unduly given causal interpretations. Simpson’s Paradox disappears when causal relations are considered.

One of the best-known real-life examples is the University of California, Berkeley 1973 gender bias lawsuit. The charge was that the university favored men over women in graduate school admission rates in the aggregate.

         Applicant Count    Admission Rate
Men      8,442              44%
Women    4,321              35%
But when examining the individual departments, it appeared that no department was significantly biased against women. In fact, most departments had a small but statistically significant bias in favor of women. The data from the six largest departments are listed below.

Department Name    Men                                  Women
                   Applicant Count   Admission Rate    Applicant Count   Admission Rate
A                  825               62%               108               82%
B                  560               63%               25                68%
C                  325               37%               593               34%
D                  417               33%               375               35%
E                  191               28%               393               24%
F                  373               6%                341               7%
The research showed that women tended to apply to competitive departments with low rates of admission even among qualified applicants (e.g. the English department), while men tended to apply to departments with high rates of admission and a smaller pool of qualified applicants (e.g. engineering and chemistry).
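The reversal is easy to reproduce. The sketch below uses the admission counts commonly quoted from Bickel, Hammel and O’Connell’s 1975 analysis of the Berkeley data; treat the exact counts as approximate.

```python
# Commonly cited figures for the six largest 1973 Berkeley
# departments: (applicants, admitted) per gender.
depts = {
    "A": {"men": (825, 512), "women": (108, 89)},
    "B": {"men": (560, 353), "women": (25, 17)},
    "C": {"men": (325, 120), "women": (593, 202)},
    "D": {"men": (417, 138), "women": (375, 131)},
    "E": {"men": (191, 53), "women": (393, 94)},
    "F": {"men": (373, 22), "women": (341, 24)},
}

def rate(applied, admitted):
    return admitted / applied

# Per department, women are admitted at a higher rate in four of six.
for name, d in depts.items():
    print(name, round(rate(*d["men"]), 3), round(rate(*d["women"]), 3))

# But in aggregate, men come out ahead, because women applied
# mostly to the competitive departments.
men = [d["men"] for d in depts.values()]
women = [d["women"] for d in depts.values()]
men_rate = sum(a for _, a in men) / sum(n for n, _ in men)
women_rate = sum(a for _, a in women) / sum(n for n, _ in women)
print(men_rate, women_rate)  # men higher overall
```

Same data, opposite conclusions, depending on whether you aggregate first or partition first; that is the whole paradox.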

Bayesian Statistics

This school of statistics is named for Thomas Bayes (1702-1761), but a lot of other people have contributed. Bayes was caught up in the English Christian theology and the Humean philosophy of his day. The central idea of Bayesian probability is that you can improve your estimates with new information.

On the other side of the table we have frequentist statistics. In a frequentist model, the unknown parameters are treated as having fixed but unknown values that are not capable of being treated as random variates in any sense, and hence there is no way that probabilities can be associated with them.

Despite the success of Bayesian models, most undergraduate courses are based on frequentist statistics. The frequentist assumption is that each sample is pulled from a universe of possible outcomes that are equally likely and independent of each other. We need to stop doing this.

I am going to assume that all good geeks have heard of the Monty Hall problem. It was used in an episode of the television show NUMB3RS, made popular by Marilyn vos Savant in her newspaper column “Ask Marilyn” (vos Savant, Marilyn (Sept 1990) Parade Magazine: 16) and many other places.

The problem was originally posed in a letter by Steve Selvin to the American Statistician in 1975. The problem gets its name from the American television game show “Let’s Make a Deal” which was hosted by Monty Hall. Suppose you’re on a game show, and you’re given the choice of three doors: Behind one door is a Lamborghini; behind the others, wet goats. You pick a door, and the host, who knows what’s behind all the doors, opens another door. This second door always has a wet goat. He then says to you, “Do you want to stay with your pick or to switch?”

We need a little notation:

The probability of event A given event B is written Pr(A|B).
The probability of event B is written Pr(B).

Pr(A) is between zero (impossibility) and one (absolute certainty).

Pr(A|¬A) = 0
Pr(A|A) = 1

Bayes’ Theorem can be written: 

Pr(A|B) = (Pr(B|A) × Pr(A)) / Pr(B)

Assume we pick Door #1 and then Monty shows us a wet goat behind Door #2. Now let event A mean the car is behind Door #1 and event B mean that Monty shows us a wet goat behind Door #2.

Then plug in the probabilities:

Pr(A|B) = (Pr(B|A) × Pr(A)) / Pr(B) = (1/2 × 1/3) / (1/3 × 1/2 + 1/3 × 0 + 1/3 × 1) = (1/6) / (1/2) = 1/3.

The tricky calculation is Pr(B). Remember, we are assuming we initially chose Door #1. We now know there is a wet goat behind Door #2, so we know the car is either behind Door #1 or Door #3. Since we started with a probability of 1/3 that the car is behind Door #1, and the sum of the two probabilities must equal 1, the probability the car is behind Door #3 is 1 − 1/3 = 2/3. You could also apply Bayes’ Theorem directly, but this is simpler.
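If the algebra does not convince you, brute force will. Here is a small Monte Carlo sketch of the game; the play function is a hypothetical helper of mine, and the seed is fixed only so the run is repeatable.

```python
import random

def play(switch, rng):
    """Play one round of Monty Hall; return True if we win the car."""
    doors = [0, 1, 2]
    car = rng.choice(doors)
    pick = rng.choice(doors)
    # Monty opens a door that is neither our pick nor the car.
    opened = rng.choice([d for d in doors if d != pick and d != car])
    if switch:
        # Switch to the one remaining unopened door.
        pick = next(d for d in doors if d != pick and d != opened)
    return pick == car

rng = random.Random(42)
trials = 100_000
stay = sum(play(False, rng) for _ in range(trials)) / trials
swap = sum(play(True, rng) for _ in range(trials)) / trials
print(stay)  # ≈ 1/3
print(swap)  # ≈ 2/3
```

Over a hundred thousand games the staying strategy wins about a third of the time and switching wins about two thirds, exactly as the theorem says.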

The real power is avoiding panic. Mammograms are pretty accurate these days. They can identify about 80% of the breast cancers in 40-year-old women and produce a false positive only about 10% of the time. We also have the prior knowledge from medical statistics that about 0.4% of women have breast cancer. Call this 40 out of 10,000. About 32 of those 40 will get a true positive, but roughly 1,000 of the 9,960 women without cancer will get a false positive. Plug in the numbers and turn the crank, and you get about a 3% chance that a positive result is true. But we also see that mammograms miss about 1 in 5 breast cancers, which might be more of a problem.
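Plugging those numbers into Bayes’ Theorem directly:

```python
# Prior and test characteristics from the text.
p_cancer = 0.004       # prevalence: 40 in 10,000
sensitivity = 0.80     # Pr(positive | cancer)
false_positive = 0.10  # Pr(positive | no cancer)

# Total probability of a positive result, over both populations.
p_pos = sensitivity * p_cancer + false_positive * (1 - p_cancer)

# Bayes' Theorem: Pr(cancer | positive).
p_cancer_given_pos = sensitivity * p_cancer / p_pos

print(p_cancer_given_pos)  # ≈ 0.031: about a 3% chance, not 80%
print(1 - sensitivity)     # ≈ 0.2: one cancer in five is missed
```

The counterintuitive part is that the false positives from the overwhelmingly healthy population swamp the true positives from the small sick one.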




  • Pete Danes

    Friend of mine came up with what may be an even simpler approach to the goat-Lamborghini problem: changing your answer changes the correctness of your answer. Since your original answer has 2 out of 3 chance of being wrong, changing it now means you have a 2 out of 3 chance of being right.

  • Nate

    This is a great post and we sorely need more folks talking about this in the database field – too often we delegate it to "the business" who often know even less math than many database professionals. They just want to see a chart showing the "forecast" that the computer came up with.

  • Robert young

    If Bayes were only used in such analytically solvable problems, then the controversy wouldn’t still exist. Bayes is just codified bias.

  • KHU

    Here is a new reference for the three door problem
    or Scientific American 1959

    “…which the American game theorist Martin Gardner had already written in 1959 as a contribution to Scientific American (the only magazine Marilyn vos Savant subscribes to).

    Gardner had simply dressed the brain-teaser up differently back then. His example of the ‘wonderfully confusing little problem’ involved three prisoners sentenced to death, one of whom is pardoned in the end.

    The warden may reveal the name before the execution only in a roundabout way, by naming to each prisoner one of the two condemned men – much as the door with the losing goat is opened in the TV studio.

    In Gardner’s prisoner game for mathematically minded readers, the loser ended up looking not at a harmless goat but at a hangman’s noose…”

  • Celko

    More on Bayes
    McGrayne’s “The Theory That Would Not Die” goes into the use of Bayes in cracking the Enigma machine, naval submarine searches and other very real-world problems. Today, naive Bayes classifiers are routinely used in text analysis and have a good track record.

    Fancier Bayes classifiers work quite well, when the data has some “feedback” mechanism in it. My guess is that this is where we get Tracy-Widom distribution phenomena.

  • Anonymous

    Very Nice!
    I’ve seen far too many "data analysts" who have no idea where the data came from or how it was measured, but are willing to stake the future of the company upon the conclusions they reached from using it.

    It’s so sad, but the "statistically informed" have historically used their understanding of mathematics to dupe the public into believing what they want them to believe rather than informing them of the truth. They’ve become today’s "snake oil salesmen."

    One of my favorite statistical examples of the Simpson Paradox was contained in the pamphlet passed around when AIDS first became widely known. The statistics showed that 78% of those with AIDS were homosexual men, 27% were IV drug users, but only 12% were both. It seems like if you’re in one of these groups, then you should do the other too so you’ll reduce your chances of catching it.

  • Robert young

    Conditional probability <> Bayes, although the algebra used in Bayes makes use of it. Bayes is all about "prior" evidence that manipulates current data; it’s a philosophy of existence, not mathematics. The point is that Bayes puts no boundaries on acceptably correct priors; anything goes. If all you’re doing is applying conditional probabilities analytically, that’s not Bayes as actually practiced.

    Here’s what Wasserman ("All of Statistics", pg.185) has to say (and I had to type this in from my dead trees copy; you’re welcome!!):
    "In parametric models, with large sample sizes, Bayesian and frequentist methods give approximately the same inferences. In general, they need not agree."

    And that’s because with small samples, priors can overwhelm what we concretely know: the sample data. That’s purposeful bias.

    Even the most zealous Bayesians admit the same thing: enough data means that the answer is the same however you do the arithmetic.

  • sterbalr

    Selling things
    The key time Nigeria’s GDP and Exxon would be relevant is upon sale. Neither of them are likely to be purchased soon.


  • Ray Herring

    Error in Data Measurements and Standards
    Hi Joe,
    Based on this article I immediately ordered a copy of the book you mentioned early on: "Joe Celko’s Data, Measurements and Standards in SQL; 2009; Morgan Kaufmann; ISBN: 978-0-12-374722-8". Spot-reading the book and index convinced me that it was a good investment, and I look forward to referencing it often in training sessions with my developer team.

    Imagine then my chagrin when I found a major error in the introduction! On page XV you state: "An "ISO automobile" is simply more precise and accurate than an "Imperial automobile" because of the measurements used." I am a fan of, and look forward to, the wide-spread adoption of the Metric system in the US.
    That system is more rational, convenient, and flexible. I think those traits also promote enhanced accuracy and precision. However, the system is not inherently more "accurate" or "precise". Whether one uses decimal inches or fractional inches, it is just as possible to produce high precision measurements using Imperial units. The system of units does not impose any inherent limits on accuracy.

    In the example you give, 1 mm ≈ 0.03937 inch, while 1/32 inch = 0.03125 inch, so in this case the "Imperial" measurement is a finer-grained choice for the least significant bit of a scaled measurement.

    The system of units does not make a measurement more precise. The quality of the measuring device, the skill of the operator, the demands of the designer/consumer are more important.

    There is a famous naval saying that sums up the precision problem quite well: "Measure it with a Micrometer, Mark it with Chalk, Chop it with an Axe".

    European engineering gained a reputation not because of the metric system of units but because of a fanatical pursuit of accuracy and precision. I completely agree that adopting the metric system rationalized a confusing babel and certainly assisted and promoted enhanced accuracy.

    As a poor analogy, the United Kingdom currency is not inherently more or less valuable because it is now on a decimal basis. Its value is simply measured using different, and more convenient units.

  • Anonymous

    I found this to be a truly outstanding article, and have printed it to keep next to my reference material !!!

  • Celko

    Engineering to ISO versus
    [quote] However, the system is not inherently more "accurate" or "precise". Whether one uses decimal inches or fractional inches, it is just as possible to produce high precision measurement. [/quote]

    I agree with your basic premise. Yes, the math works even if we used decimal furlongs. But ..

    [quote]There is a famous naval saying sums up the precision problem quite well, "Measure it with a Micrometer, Mark it with Chalk, Chop it with an Ax". [/quote]

    But how are the units used? Engineering drawings are kept to the millimeter for everything from locomotive engines to cigarette lighters. Drawings in US customary units can be kept to 1/16”, 1/32”, 1/64”, or sometimes decimal inches (or decimal feet in Civil Engineering — I worked for a State Highway department in my youth). So an “ISO engineer” will have a MUCH better chance of putting together items not made in the same shop.

    Many years ago, the NATIONAL LAMPOON humor magazine had a fake Volkswagen ad that showed a VW bug floating down a river. The caption was “If Ted Kennedy had been driving a Volkswagen, he would be president today!” It was based on the fact that US cars made with a mix of US customary and Metric units leaked and fell apart. I have an old turn-of-the-century NYT article which tells the reader that US domination of the auto industry is assured by our US customary units. How did that work out for us?

    an aside:
    I assume you know that US customary units are not Imperial units. We are still on Queen Anne’s wine standards for quarts (check me on that). Google the Imperial versus US versus Liters beer controversy during the UK’s metrication days. We serious drinkers were quite worried. 🙂