Joe Celko

Look-up Tables in SQL

01 February 2011

Lookup tables can be a force for good in a relational database. Whereas the 'One True Lookup Table' remains a classic of bad database design, an auxiliary table that holds static data, and is used to look up values, still has powerful magic. Joe Celko explains....

History

Tables, in a properly designed schema, model either an entity or a relationship, but not both. Slightly outside of the tables in the data model, we have other kinds of tables. Staging tables bring in "dirty data" so we can scrub it and then insert it into base tables. Auxiliary tables hold static data for use in the system, acting as the relational replacement for computations.

This is not a new idea. If you can find an old textbook (1960's or earlier), there is a good chance you will find look-up tables in the back. Finance books had compound interest, net present value (NPV) and internal rate of return (IRR). Trigonometry books had sines, cosines, tangents and maybe haversines and spherical trig functions.

There were no cheap calculators; slide rules were good to only three decimal places and required some skill to use. Look-up tables were easy for anyone to use and usually went to five decimal places.

I still remember my first Casio scientific calculator that I bought with my Playboy Club Key account in 12 monthly installments. The price of that model dropped to less than one payment before I paid it off. These machines marked the end of look-up tables in textbooks. Today, you can get a more powerful calculator on a spike card in the check-out line of a school bookstore.

Basic Look-up Table Design

The concept of pre-calculating a function and storing the outputs can be carried over to databases. Programmers do it all the time. Most of the time, these tables are used for display purposes rather than computations. That is, you are likely to see a table which translates an encoding into a description that a human being can understand.

The ISO-11179 data element-naming conventions have the format [<role>_]<attribute>_<property>. The example used in the base document was an attribute of "tree" with properties like "tree_diameter", "tree_species" and so forth. Some properties do not apply to some attributes -- "employee_diameter" is not something we need to model just yet and "employee_species" is a bit insulting.

The attribute properties that deal with encodings for scales are the candidates for look-up tables. Here is a list and definitions for some of the basic ones I introduced in my book SQL PROGRAMMING STYLE.

  • "_id" = Identifier, it is unique in the schema and refer to one entity anywhere it appears in the schema. A look-up table deals with attributes and their values, not entities, so by definition, this is not used in such tables. That is why things like "_category_id" or "_type_id" are garbage names.
    Never use "<table name>_id"; that is a name based on location and tell you this is probably not a real key at all. Just plain "id" is too vague to be useful to anyone and will screw up your data dictionary when you have to find a zillion of them, all different, but with the same data element name and perhaps the same oversized data type.
  • "_date" or "dt" = date, temporal dimension. It is the date of something -- employment, birth, termination, and so forth; there is no such column name as just a date by itself.
  • "_nbr" or "num" = tag number; this is a string of digits or even alphanumrics that names something. Do not use "_no" since it looks like the Boolean yes/no value. I prefer "nbr" to "num" since it is used as a common abbreviation in several European languages.
  • "_name" or "nm" = this is an alphabetic name and it explains itself. It is also called a nominal scale.
  • "_code" or "_cd"= A code is a standard maintained by a trusted source outside of the enterprise. For example the ZIP code is maintained by the United States Postal Service. It has some legal authority to it.
  • "_size" = an industry standard or company scale for a commodity, such as clothing, shoes, envelopes or machine screws. There is usually a prototype that defines the sizes kept with a trusted source.
  • "_seq" = sequence, ordinal numbering. This is not the same thing as a tag number, since it cannot have gaps. It also has a rule for successors in the sequence.
  • "_cat" = Category, an encoding that has an external source that has very distinct groups of entities. There should be strong formal criteria for establishing the category. The classification of Kingdom is in biology is an example.
  • "_class" = an internal encoding that does not have an external source that reflects a sub-classification of the entity. There should be strong formal criteria for the classification. The classification of plants in biology is an example.
  • "_type" = an encoding that has a common meaning both internally and externally. Types are usually less formal than a class and might overlap. For example a driver's license might be for multiple kinds of vehicles; motorcycle, automobile, taxi, truck and so forth.
    The differences among type, class, and category lie in the increasing strength of the algorithm for assigning the type, class, or category. A category is very distinct; you will not often have to guess whether something is "animal, vegetable or mineral" to put it in one of those categories.
    A class is a set of things that have some commonality; you have rules for classifying an animal as a mammal or a reptile. You may have some cases where it is harder to apply the rules, such as the egg-laying mammals in Australia, but the exceptions tend to become their own classification -- monotremes in this example.
    A type is the weakest of the three, and it might call for a judgment. For example, in some states a three-wheeled motorcycle is licensed as a motorcycle. In other states, it is licensed as an automobile. And in some states, it is licensed as an automobile only if it has a reverse gear.
    The three terms are often mixed in actual usage. For example, a blood_type has a laboratory procedure to obtain a value of {A, B, AB, O}, if you want to know for sure. Stick with the industry standard, even if it violates the definitions given above.
  • "_status" = an internal encoding that reflects a state of being which can be the result of many factors. For example, "credit_status" might be computed from several sources. The words "status" comes from "state" and we expect that there are certain allowed state changes. For example, your marital status can change to "Divorced" only if it is "Married" currently.

Here is where programmers start to mess up. Consider this table, taken from an actual posting:

CREATE TABLE Types
(type_id INTEGER,
 type_name VARCHAR(30));

Is this for blood, work visas or what? The table name could not be more vague. There is no key. The first column is absurd as well as vague. An attribute can be "<something>_type" or "<something>_id" but never both. Entities have identifiers; scalar values do not. Think about a table of mathematical constants and tell me the identifier of pi, e or phi. Type_id is stupid for the same reason. Hey, why not carry this silliness from "type_id" to "type_id_value" and beyond?

Another version of the same disaster, taken from actual postings, is to add a redundant, non-relational IDENTITY table property.

CREATE TABLE Product_Types
(product_type_id INTEGER IDENTITY NOT NULL, -- is this the key?
 product_type_code CHAR(5) NOT NULL, -- is this the key?
 type_generic_description VARCHAR(30) NOT NULL);

All these simple look-up tables need is a column for the <attribute>_<property> as the key and the description or name or both. If you don't get the difference between a name and a description, consider the name "Joe Celko" and "Creepy looking white guy", which is a description. A look-up table of three-letter airport codes will probably return a name. For example, the code "MAD" stands for "Barajas International Airport" in Madrid.
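A minimal sketch of that pattern (the table and column names here are mine, for illustration only):

CREATE TABLE Airport_Codes
(airport_code CHAR(3) NOT NULL PRIMARY KEY,
 airport_name VARCHAR(50) NOT NULL);

-- the look-up is then a trivial join or subquery:
SELECT airport_name
  FROM Airport_Codes
 WHERE airport_code = 'MAD';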

An encoding for, say, types of hotels might return a description, like this:

hotel_type  description
----------  -----------------------------------------------------------
R0          Basic Japanese Ryokan, no plumbing, no electricity, no food
R1          Japanese Ryokan, plumbing, electricity, Japanese food
R2          Japanese Ryokan, like R1 with internet and television
R3          Japanese Ryokan, like R2 with Western meal options

A product code will probably return both a name and a description. For example, the name might be "The Screaming Ear Smasher" and the description "50000 Watt electric guitar and amplifier" in the catalog.
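A hypothetical sketch of such a table (the names and sizes are my guesses, not from any actual catalog):

CREATE TABLE Product_Codes
(product_code CHAR(7) NOT NULL PRIMARY KEY,
 product_name VARCHAR(35) NOT NULL,
 product_description VARCHAR(100) NOT NULL);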

If you build an index on the <attribute>_<property> key, you can use the INCLUDE feature to carry the name and/or description in the index, and the table itself becomes redundant.
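Continuing the hypothetical Product_Codes sketch, in SQL Server syntax:

-- the index alone can answer the look-up, so the engine never touches the base table
CREATE UNIQUE INDEX Product_Codes_Covering
ON Product_Codes (product_code)
INCLUDE (product_name, product_description);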

One True Look-up Table

The One True look-up Table (OTLT) is a nightmare that keeps showing up. The idea is that you put ALL the encodings into one huge table rather than have one table for each one. I think that Paul Keister was the first person to coin the phrase "OTLT" (One True Look-up Table) and Don Peterson (www.SQLServerCentral.com) gave the same technique the name "Massively Unified Code-Key" or MUCK tables in one of his articles.

The rationale is that you will only need one procedure to maintain all of the encodings, and one generic function to invoke them. The "Automobiles, Squids and Lady GaGa" function! The technique crops up time and time again, but I'll give Keister credit as the first writer to give it a name. Simply put, the idea is to have one table to do all of the code look-ups in the schema. It usually looks like this:

CREATE TABLE OTLT -- Generic_Look_Ups?
(generic_code_type CHAR(10) NOT NULL, -- horrible names!
 generic_code_value VARCHAR(255) NOT NULL, -- notice size!
 generic_description VARCHAR(255) NOT NULL, -- notice size!
 PRIMARY KEY (generic_code_value, generic_code_type));

The data elements are meta-data now, so we wind up with horrible names for them. They are nothing in particular, but magical generics for anything in the universe of discourse.

So if we have Dewey Decimal Classification (library codes), ICD (International Classification of Diseases), and two-letter ISO-3166 country codes in the schema, we have them all in one, honking big table.

Let's start with the problems in the DDL and then look at the awful queries you have to write (or hide in VIEWs). First we need to go back to the original DDL and add a CHECK() constraint on the generic_code_type column. Otherwise, we might "invent" a new encoding system by typographical error.

The Dewey Decimal and ICD codes are digits and have the same format -- three digits, a decimal point and more digits (usually three); the ISO-3166 codes are alphabetic. Oops, we need another CHECK constraint that looks at the generic_code_type and makes sure that the string is in the right format. Now the table looks something like this, if anyone attempted to do it right, which is not usually the case:

CREATE TABLE OTLT
(generic_code_type CHAR(10) NOT NULL
   CHECK (generic_code_type IN ('DDC', 'ICD', 'ISO3166', ..)),
 generic_code_value VARCHAR(255) NOT NULL,
 CONSTRAINT Valid_Generic_Code_Type
 CHECK (CASE WHEN generic_code_type = 'DDC'
              AND generic_code_value
                  LIKE '[0-9][0-9][0-9].[0-9][0-9][0-9]'
             THEN 'T'
             WHEN generic_code_type = 'ICD'
              AND generic_code_value
                  LIKE '[0-9][0-9][0-9].[0-9][0-9][0-9]'
             THEN 'T'
             WHEN generic_code_type = 'ISO3166'
              AND generic_code_value LIKE '[A-Z][A-Z]'
             THEN 'T' ELSE 'F' END = 'T'),
 generic_description VARCHAR(255) NOT NULL,
 PRIMARY KEY (generic_code_value, generic_code_type));

Since the typical application database can have dozens and dozens of codes in it, just keep extending this pattern for as long as required. Not very pretty, is it? Before you think about some fancy re-write of the CASE expression, remember that SQL Server allows only ten levels of nesting.

Now let us consider adding new rows to the OTLT.

INSERT INTO OTLT (generic_code_type, generic_code_value, generic_description)
VALUES ('ICD', '259.0', 'Inadequate Genitalia after Puberty'),
       ('DDC', '259.0', 'Christian Pastoral Practices & Religious Orders');

If you make an error in the generic_code_type during an insert, update or delete, you have screwed up a totally unrelated value. If you make an error in the generic_code_type during a query, the results could be interesting. This can be really hard to find when one of the similarly structured schemes has unused codes in it.


The next thing you notice about this table is that the columns are pretty wide VARCHAR(n), or even worse, that they are NVARCHAR(n) which can store characters from a strange language. The value of (n) is most often the largest one allowed.

Since you have no idea what is going to be shoved into the table, there is no way to predict and design with a safe, reasonable maximum size. The size constraint has to be put into the WHEN clause of that second CHECK() constraint between generic_code_type and generic_code_value. Or you can live with fixed-length codes that are longer than they should be.

These large sizes tend to invite bad data. You give someone a VARCHAR(n) column, and you eventually get a string with a lot of white space and a small odd character sitting at the end of it. You give someone an NVARCHAR(255) column and eventually it will get a Buddhist sutra in Chinese Unicode.

Now let's consider the problems with actually using the OTLT in a query. It is always necessary to add the generic_code_type as well as the value which you are trying to look up.

SELECT P1.ssn, P1.lastname, .., L1.generic_description
  FROM OTLT AS L1, Personnel AS P1
 WHERE L1.generic_code_type = 'ICD'
   AND L1.generic_code_value = P1.disease_code
   AND ..;

In this sample query, you need to know the generic_code_type of the Personnel table's disease_code column, and of every other encoded column in the table. If you get a generic_code_type wrong, you can still get a result, just not the one you wanted.
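The VIEWs mentioned earlier are the usual way to hide this repetition; a hypothetical sketch, one VIEW per encoding, each one re-stating the generic_code_type filter:

CREATE VIEW ICD_Codes (icd_code, icd_description)
AS SELECT generic_code_value, generic_description
     FROM OTLT
    WHERE generic_code_type = 'ICD';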

You also need to allow for some overhead for data type conversions. It might be more natural to use numeric values instead of VARCHAR(n) for some encodings to ensure a proper sorting order. Padding a string of digits with leading zeros adds overhead and can be risky if programmers do not agree on how many zeros to use.

When you execute a query, the SQL engine has to pull in the entire look-up table, even if it only uses a few codes. If one code is at the start of the physical storage and another is at the end, the engine can do a lot of caching and paging. When I update the OTLT table, I have to lock out everyone until I am finished. It is like having to carry an encyclopedia set with you when all you needed was a magazine article.

Now consider the overhead with a two-part FOREIGN KEY in a table:

CREATE TABLE EmployeeAbsences
(..
 generic_code_type CHAR(3) -- min length needed
     DEFAULT 'ICD' NOT NULL
     CHECK (generic_code_type = 'ICD'),
 generic_code_value CHAR(7) NOT NULL, -- min length needed
 FOREIGN KEY (generic_code_type, generic_code_value)
     REFERENCES OTLT (generic_code_type, generic_code_value),
 ..);

Now I have to convert the character types for more overhead. Even worse, ICD has a natural DEFAULT value (000.000 means "undiagnosed"), while Dewey Decimal does not. Older encoding schemes often used all 9's for "miscellaneous" so they would sort to the end of the reports in COBOL programs. Just as there is no Magical Universal "id", there is no Magical Universal DEFAULT value. I just lost one of the most important features of SQL!

I am going to venture a guess that this idea came from OO programmers who think of it as some kind of polymorphism done in SQL. They say to themselves that a table is a class, which it isn't, and therefore it ought to have polymorphic behaviors, which it doesn't.

Look-up Tables with Multiple Parameters

A function can have more than one parameter, and often does in commercial situations. Such functions can be ideal candidates for a look-up table when the computation is complex. My usual example is the Student's t-distribution, since I used to be a statistician. It is used for small sample sizes that the normal distribution cannot handle. It takes two parameters: the sample size and the confidence interval (how sure you want to be about your prediction).

The probability density function, for \(\nu\) degrees of freedom, is:

f(t) = \frac{\Gamma\left(\frac{\nu+1}{2}\right)}{\sqrt{\nu\pi}\,\Gamma\left(\frac{\nu}{2}\right)}\left(1 + \frac{t^2}{\nu}\right)^{-\frac{\nu+1}{2}}

where \(\Gamma\) is the gamma function.

Got any idea just off the top of your head how to write this in T-SQL? How many of you can identify the Greek letters in this thing? Me neither. The nice part about using this in the real world is that you don't need all the possible values. You work with a set of three to ten confidence intervals, and since it is meant for small samples, you don't need a lot of population values. Here is a table cut and pasted from Wikipedia.

df    | One Sided: 75%   80%   85%   90%   95%   97.5% 99%   99.5% 99.75% 99.9% 99.95%
      | Two Sided: 50%   60%   70%   80%   90%   95%   98%   99%   99.5%  99.8% 99.9%
------+-----------------------------------------------------------------------------
1     | 1.000 1.376 1.963 3.078 6.314 12.71 31.82 63.66 127.3 318.3 636.6
2     | 0.816 1.061 1.386 1.886 2.920 4.303 6.965 9.925 14.09 22.33 31.60
3     | 0.765 0.978 1.250 1.638 2.353 3.182 4.541 5.841 7.453 10.21 12.92
4     | 0.741 0.941 1.190 1.533 2.132 2.776 3.747 4.604 5.598 7.173 8.610
5     | 0.727 0.920 1.156 1.476 2.015 2.571 3.365 4.032 4.773 5.893 6.869
6     | 0.718 0.906 1.134 1.440 1.943 2.447 3.143 3.707 4.317 5.208 5.959
7     | 0.711 0.896 1.119 1.415 1.895 2.365 2.998 3.499 4.029 4.785 5.408
8     | 0.706 0.889 1.108 1.397 1.860 2.306 2.896 3.355 3.833 4.501 5.041
9     | 0.703 0.883 1.100 1.383 1.833 2.262 2.821 3.250 3.690 4.297 4.781
10    | 0.700 0.879 1.093 1.372 1.812 2.228 2.764 3.169 3.581 4.144 4.587
11    | 0.697 0.876 1.088 1.363 1.796 2.201 2.718 3.106 3.497 4.025 4.437
12    | 0.695 0.873 1.083 1.356 1.782 2.179 2.681 3.055 3.428 3.930 4.318
13    | 0.694 0.870 1.079 1.350 1.771 2.160 2.650 3.012 3.372 3.852 4.221
14    | 0.692 0.868 1.076 1.345 1.761 2.145 2.624 2.977 3.326 3.787 4.140
15    | 0.691 0.866 1.074 1.341 1.753 2.131 2.602 2.947 3.286 3.733 4.073
16    | 0.690 0.865 1.071 1.337 1.746 2.120 2.583 2.921 3.252 3.686 4.015
17    | 0.689 0.863 1.069 1.333 1.740 2.110 2.567 2.898 3.222 3.646 3.965
18    | 0.688 0.862 1.067 1.330 1.734 2.101 2.552 2.878 3.197 3.610 3.922
19    | 0.688 0.861 1.066 1.328 1.729 2.093 2.539 2.861 3.174 3.579 3.883
20    | 0.687 0.860 1.064 1.325 1.725 2.086 2.528 2.845 3.153 3.552 3.850
21    | 0.686 0.859 1.063 1.323 1.721 2.080 2.518 2.831 3.135 3.527 3.819
22    | 0.686 0.858 1.061 1.321 1.717 2.074 2.508 2.819 3.119 3.505 3.792
23    | 0.685 0.858 1.060 1.319 1.714 2.069 2.500 2.807 3.104 3.485 3.767
24    | 0.685 0.857 1.059 1.318 1.711 2.064 2.492 2.797 3.091 3.467 3.745
25    | 0.684 0.856 1.058 1.316 1.708 2.060 2.485 2.787 3.078 3.450 3.725
26    | 0.684 0.856 1.058 1.315 1.706 2.056 2.479 2.779 3.067 3.435 3.707
27    | 0.684 0.855 1.057 1.314 1.703 2.052 2.473 2.771 3.057 3.421 3.690
28    | 0.683 0.855 1.056 1.313 1.701 2.048 2.467 2.763 3.047 3.408 3.674
29    | 0.683 0.854 1.055 1.311 1.699 2.045 2.462 2.756 3.038 3.396 3.659
30    | 0.683 0.854 1.055 1.310 1.697 2.042 2.457 2.750 3.030 3.385 3.646
40    | 0.681 0.851 1.050 1.303 1.684 2.021 2.423 2.704 2.971 3.307 3.551
50    | 0.679 0.849 1.047 1.299 1.676 2.009 2.403 2.678 2.937 3.261 3.496
60    | 0.679 0.848 1.045 1.296 1.671 2.000 2.390 2.660 2.915 3.232 3.460
80    | 0.678 0.846 1.043 1.292 1.664 1.990 2.374 2.639 2.887 3.195 3.416
100   | 0.677 0.845 1.042 1.290 1.660 1.984 2.364 2.626 2.871 3.174 3.390
120   | 0.677 0.845 1.041 1.289 1.658 1.980 2.358 2.617 2.860 3.160 3.373
∞     | 0.674 0.842 1.036 1.282 1.645 1.960 2.326 2.576 2.807 3.090 3.291

Unlike the calculus, nobody should have any trouble loading it into a look-up table.
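A minimal sketch of such a two-parameter look-up table (the names and data types are my own choices, not from the article):

CREATE TABLE T_Distribution
(degrees_freedom INTEGER NOT NULL
   CHECK (degrees_freedom > 0),
 confidence DECIMAL(6,4) NOT NULL, -- two-sided confidence level
 t_value DECIMAL(7,3) NOT NULL,
 PRIMARY KEY (degrees_freedom, confidence));

-- the whole "computation" is now a simple look-up; no gamma functions, no calculus:
SELECT t_value
  FROM T_Distribution
 WHERE degrees_freedom = 15
   AND confidence = 0.9500;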

In the January 2005 issue of The Data Administration Newsletter (www.TDAN.com) I published an article on a look-up table solution to a more difficult problem. If you watch the Food Channel on cable, or if you just like Memphis-style BBQ, you know the name "Corky's". The chain was started in Memphis in 1984 by Don Pelts and has grown by franchise at a steady rate ever since. They will never be a McDonald's, because all the meats are slow-cooked for up to 22 hours over hickory wood and charcoal, and then every pork shoulder is hand-pulled. No automation, no mass production.

They sell a small menu of items by mail order via a toll-free number or from their website (www.corkysbbq.com) and ship the merchandise in special boxes sometimes using dry ice. Most of the year, their staff can handle the orders. But at Christmas time, they have the problem of success.

Their packing operation consists of two lines. At the start of the line, someone pulls a box of the right size, and puts the pick list in it. As it goes down the line, packers put in the items, and when it gets to the end of the line, it is ready for shipment. This is a standard business operation in lots of industries. Their people know what boxes to use for the standard gift packs and can pretty accurately judge any odd sized orders.

At Christmas time, however, mail-order business is so good that they have to get outside temp help. The temporary help does not have the experience to judge the box sizes by looking at a pick list. If a box that is too small starts down the line, it will jam up things at some point. The supervisor has to get it off the line, and re-pack the order by hand. If a box that is too large goes down the line, it is a waste of money and creates extra shipping costs.

Mark Tutt (On The Mark Solutions, LLC) has been consulting with Corky's for years and set up a new order system for them on Sybase. One of the goals of the new system is to print the pick list and shipping labels with all of the calculations done, including what box size the order requires.

Following the rule that you do not re-invent the wheel, Mr. Tutt went to the newsgroups to find out if anyone had a solution already. The suggestions tended to be along the lines of getting the weights and shapes of the items and using a Tetris program to figure out the packing.

Programmers seem to love to face every new problem as if nobody has ever done it before and nobody will ever do it again. The "Code first, research later!" mentality is hard to overcome.

The answer was not in complicated 3-D math, but in the past 4 or 5 years of orders in the database. Human beings with years of experience had been packing orders and leaving a record of their work to be mined. Obviously the standard gift packs are easy to spot. But most of the orders tend to be something that has occurred before, too. The answers are there, if you will bother to dig them out.

First, Mr. Tutt found all of the unique configurations in the orders, how often they occurred, and the boxes used to pack them. If the same configuration had been packed in two or more box sizes, you go with the smaller size. As it turned out, there were about 4995 unique configurations in the custom orders, which covered about 99.5% of the cases.

Next, this table of configurations was put into a stored procedure that did a slightly modified exact relational division to obtain the box size required. A fancy look-up table with a variable parameter list!
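Exact relational division is worth a sketch here. Assuming hypothetical tables OrderItems (the order being packed), ConfigItems (the items in each known configuration) and Configs (each configuration's box size) -- none of these names come from Mr. Tutt's actual system -- the query finds the configurations whose items exactly match the order's items:

SELECT C.config_id, C.box_size
  FROM Configs AS C, ConfigItems AS CI, OrderItems AS OI
 WHERE CI.config_id = C.config_id
   AND OI.sku = CI.sku
   AND OI.item_qty = CI.item_qty
   AND OI.order_nbr = 'X123' -- hypothetical order number
 GROUP BY C.config_id, C.box_size
HAVING COUNT(*) = (SELECT COUNT(*) FROM ConfigItems AS CI2
                    WHERE CI2.config_id = C.config_id)
   AND COUNT(*) = (SELECT COUNT(*) FROM OrderItems AS OI2
                    WHERE OI2.order_nbr = 'X123');

-- the two HAVING counts force the matched set to cover both the whole
-- configuration and the whole order, i.e. an exact match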


Author profile:

Joe Celko is one of the most widely read of all writers about SQL, and was the winner of the DBMS Magazine Reader's Choice Award four consecutive years. He is an independent consultant living in Austin, TX. He has taught SQL in the US, UK, the Nordic countries, South America and Africa.
He served 10 years on the ANSI/ISO SQL Standards Committee and contributed to the SQL-89 and SQL-92 Standards.
He has written over 800 columns in the computer trade and academic press, mostly dealing with data and databases. He is the author of eight books on SQL for Morgan-Kaufmann, including the best-selling SQL FOR SMARTIES.
Joe is a well-known figure on newsgroups and forums, and he is famous for his dry wit. He is also interested in Science Fiction.



Subject: Style and look types
Posted by: Seven24 (view profile)
Posted on: Wednesday, February 02, 2011 at 9:51 PM
Message: First, a small pet peeve of style. I'm not sure about everyone else, but I hate writing underscores in column names (or table names or database names). They are a pain in the tokhes. I much prefer PascalCase (with the first letter capitalized) to using underscores. Easier to read and write IMO.

Second, unless your idea of a data dictionary is one that does not include the table name, how is having a column (especially a surrogate key) named just "id" going to mess up your data dictionary? What would you suggest: TablenameSurrogateKey? I agree that the *foreign key* needs to have a more specific name to its purpose but the parent key, especially if it is a surrogate, can just be named "id".

Third, I'm not a big fan of abbreviations. Back in the day, using abbreviations like "nbr" was important because of limitations on the length of column names. Most databases allow for at least 100 characters and therefore I prefer explicit names. Thus, "Number" instead of "Nbr" and "Category" instead of "Cat". You have the space and good IDEs have autocomplete. Make it easier on the next person so that they do not have to figure out your magic abbreviation system.

I definitely agree on the one lookup to rule them all. The problem almost universally stems from a lack of effort to determine what a given "lookup" really is. Often, it is categorized as a "drop list choice" but with no specifics about what it means. "Translation: I need a list of things." One of its biggest failings is expansion. You create a lookup item called "Foo" and "Bar". Six months later you discover that "Bar" needs an additional attribute. Either all lookup items get it or none of them get it. Another problem is that it creates a horrendous bottleneck. You end up with gazillions of foreign key relations to a single table. Every time you change one value in the lookup table, three dozen other tables have to be checked by the system. Lastly, it creates the problem of miscategorization. Suppose you have items Alpha and Beta which are of "type" Gamma in your lookup list. Someone then comes along and changes Alpha's "type" to Theta. Suddenly users cannot figure out why Alpha is showing in the wrong drop list but shows on reports correctly (because the report designer did not check the "type" in his joins). Definitely a headache I suggest designers avoid.

Subject: On OTLT
Posted by: Mikhail Opletayev (view profile)
Posted on: Thursday, February 10, 2011 at 4:06 PM
Message: While I agree with a lot of what you say about OTLT, there is also a practical aspect to it. A lot of companies use OTLT due to the very tangible practical benefits it provides:

1) Adding new lookups is a data change, rather than a schema change. This can be a major concern to a lot of organizations. Not only do data changes tend to happen quicker, it's very common to have multiple environments, such as DEV, QA, UAT, PROD, etc. The fact that you can promote your code from DEV to QA without having to promote a database schema first can be a major factor in the development process.

2) Lookups are generally used for visualizing data, translating codes into a human-readable form. Many visual applications don't resolve lookups in SQL due to the need to transfer a lot of redundant information over the network. Instead, an application will load all lookups at the start and do a client-side lookup. Needless to say, OTLT fits this profile quite well.

3) It's quite common for lookups to support effective dates. ICD would need it as well as DRG, CPT, cost centers, physician specialties, drug codes, and many other lookups in a medical application. Replicating this logic across multiple tables can be quite a headache.

4) In single-tenant systems, where a separate schema is created per set of data (e.g. per client) a large number of tables can be a maintenance nightmare. While we can argue the merits of the single-tenant design, there are many financial institutions that will simply not allow their data to coexist with someone else's. Many will not buy your product unless it can provide a completely isolated environment. Imagine maintaining 15-20 lookup tables over hundreds of environments.

While by no means a perfect solution, I think there are quite a few situations where OTLT is a superior practical solution.

Subject: Style and research
Posted by: Celko (view profile)
Posted on: Friday, February 11, 2011 at 1:18 PM
Message: >> I hate writing underscores in column names (or table names or database names). They are a pain in the tokhes. I much prefer PascalCase (with the first letter capitalized) to using underscores. Easier to read and write IMO. <<

That question has been researched, and you are wrong. Latin alphabet users read lowercase letters more easily and with fewer errors than uppercase. So punch cards and 3270 terminals really stink. But the Latin eye is trained to jerk to the uppercase letters and to seek spacing for words. You can physically measure it.

CamelCase is awful; the eye jumps to the middle and then back to the start of each word. PascalCase is only slightly better. A capitalized word signals the start of a sentence, paragraph or other semantic unit.

Centuries of lined paper make the ISO use of the underscore the best way to write code. If you want to get into it in painful detail, start with “Eye Tracking Methodology: Theory and Practice” by Andrew Duchowski. I did this kind of research for the DoD, but without the good tools we have today.

>> Second, unless your idea of a data dictionary is one that does not include the table name, how is having a column (especially a surrogate key) named just "id" going to mess up your data dictionary? <<

You miss the point of what a data element is. Given a VIN (Vehicle Identification Number), it is universal. It does not change names or meaning from place to place. You are confusing a table with a file. In a file system, the fields get meaning from the application program. All names are local, not schema level.

Of course the table name would never be part of the Data Dictionary entry for a column. I would never have “Motorpool.vin” and “AutoInsurance.vin”. In your world, a magical “id” is a field in a file and has nothing to do with a proper data model.

>> What would you suggest: TablenameSurrogateKey? I agree that the *foreign key* needs to have a more specific name to its purpose but the parent key, especially if it is a surrogate, can just be named "id". <<

Did you ever read the definition that Dr. Codd gave of a surrogate key? The system handles it and you never even see it, much less name it. I think you are confusing a surrogate key with an exposed physical locator that is used by bad SQL programmers to make “mock pointer chains” instead of relational references. The old Sybase IDENTITY and the magical word “id” somehow get into the schema. I always wanted to see “id” replaced with a correct name -- “physical_insertion_attempt_cnt” -- which tells the truth about the flaw.

>> Third, I'm not a big fan of abbreviations. Back in the day, using abbreviations like "nbr" was important because of limitations on the length of column names. Most databases allow for at least 100 characters and therefore I prefer explicit names. Thus, "Number" instead of "Nbr" and "Category" instead of "Cat". You have the space and good IDEs have auto-complete. Make it easier on the next person so that they do not figure out your magic abbreviation system. <<

The first FORTRAN I used had six-letter limits! Today, the limit in most SQLs is 128 characters, because of the FIPS standards. I am a bit softer on abbreviations, but agree that making a data element name into a cryptogram is a bad idea. I noticed that you did not mind using “IDE” instead of spelling it out. It is well-known in our trade, so it is not a problem.


Subject: RE: Style and research
Posted by: Seven24 (view profile)
Posted on: Saturday, February 12, 2011 at 12:22 AM
Message: >> That question has been researched and
>> you are wrong. Latin alphabet users read
>> lowercase letters much easier and with
>> fewer errors than uppercase. So punch
>> cards and 3270 terminals really stink.
>> But the Latin eye is trained to jerk to
>> the uppercase letters and to seek spacing for
>> words. You can physically measure it.

Nonsense. (Punch cards?! Really?) Show me a study conducted in the last 10 years. IMO, your evidence is outdated and it is clear you have not used a recent IDE (Integrated Development Environment. Better?). Your argument is akin to saying that everyone should drive 10 MPH because horses do not get spooked. You need to move into the 21st century and re-evaluate whether your assumptions about usability are still relevant using modern tools.

My IDE highlights the line on which I'm working, which makes seeing the underscore pretty much impossible. Highlighting an object name will block the text, also masking the underscore. Furthermore, it is harder to type. PascalCase (NOT camelCase) is significantly easier to read than underscores and significantly easier to type, and therefore more efficient. There just is no reason to use underscores unless you are forced into a single case for object identifiers (which would be an exceptional case in today's age).

>> Centuries of lined paper make
>> the ISO use of the underscore
>> the best way to write code.

A. Best way to write code using what and when? When was this analysis conducted? Anyone that has used an IDE in the past five years would argue against this conclusion. Extraordinary claims require extraordinary evidence, and in this case that evidence must be recent.
B. Can anyone name the last time a developer wrote or printed copious amounts of code on lined paper? Anyone? In fact, in my 20 years of development I cannot ever remember printing to lined paper, and that goes back far enough to when I wrote entirely using a line printer...without lines on the paper.

Again, if you are going to make this claim, show me a study conducted within the last five to ten years using modern IDEs.

RE: Data dictionary

No Joe, you missed the point. I specifically referred to a ***surrogate key***, not a natural key. A surrogate key, by definition, has no meaning.

>> Did you ever read the definition
>> that Dr. Codd gave of a Surrogate Key?
>> The system handles it and you never
>> even see it, much less name it.

That is a non sequitur. How can you create a surrogate key without ever seeing or naming it? By this logic, no database can ever have a surrogate key. If I can never see it nor name it, how could I ever use it in a query? How would we ever know it exists?

Surrogate keys are a fundamental part of the implementation of a database and thus their naming is part of design.

>>The old Sybase IDENTITY and the
>>magical word “id” somehow get into the
>> schema. I always wanted to see “id” replaced
>> with a correct name
>> -- “physical_insertion_attempt_cnt” ,
>> which tells the truth about the flaw.

Again, you dodged the question. If you have dozens of tables with surrogate keys (an anathema to your mind but that is an orthogonal discussion), what do you name them? You must provide a name, so what is it? "tablename_id"? Horrible and repetitive. I suppose you could name it "physical_insertion_attempt_cnt" but suppose you have 20 tables with surrogate keys? Using your logic that a column represents a fundamental piece of information and given that a surrogate key is inherently arbitrary, naming them the same thing such as "id" seems logical. They all represent the same form of arbitrary data.

>> I noticed that ny9u did not mind
>> using “IDE” instead
>> of spelling it out. It is well-known in our
>> trade, so it is not a problem.

That's because I'm not designing a database that will need to be managed by other developers in the future; I'm posting an informal response to an article. Just as I wouldn't refer to the President of the United States as "Pres" in a formal document, I might use the abbreviation in the context of an informal discussion.

Subject: Typography, and surrogates
Posted by: Anonymous (not signed in)
Posted on: Friday, February 18, 2011 at 1:19 PM
Message: The Readability of Print Unit at the Royal College of Art, under Professor Herbert Spencer with Brian Coe and Linda Reynolds (1988), did important work in this area. It was one of the centres that revealed the importance of the saccadic rhythm of eye movement for readability -- in particular, the ability to take in (i.e., recognise the meaning of) groups of around three words at once, and the physiognomy of the eye, which means the eye tires if the line requires more than 3 or 4 of these saccadic jumps. More than this is found to introduce strain and errors in reading (e.g. doubling).
====================
The Legible Typeface at The Danish Design School (2010 May 11) also had some interesting papers.

===================
>> Nonsense. (Punch cards?! Really?) Show me a study conducted in the last 10 years. IMO, your evidence is outdated and it is clear have not used a recent IDE (Integrated Development Environment. Better?). Your argument is akin to saying that everyone should drive 10 MPH because horses do not get spooked. You need to move into the 21st century and re-evaluate whether your assumptions about usability are still relevant using modern tools. <<

Hit Google Scholar and look at IEEE and ACM research. The really basic research on typography was done in the late 19th century for newspapers and book publishers; then it was picked up by the US Army for World War II. We knew the biometrics, but did not have the tools we do today.

The horse/car is a false analogy. Human brains have not changed in the last ten years, 100 years or even 1000 years. Visual centers are still where they have been since we evolved; color vision is still the same, etc. The later research builds on the earlier work and adds more accuracy to the measurements.

>> My IDE highlights the line on which I'm working which makes seeing the underscore pretty much impossible. Highlighting an object name will block the text also masking the underscore. Furthermore, it is harder to type than without. PascalCase (NOT camelCase) is significantly easier to read than underscores and significantly easier to type and therefore more efficient. There just is no reason to use underscores unless you are forced into a single case for object identifiers (which would an exceptional case in today's age). <<

Might want to read: “Improving the legibility of digital type: a comparison of some current videotex fonts with a new design” (2003, Purdue University). Their reading tests revealed that a new font (Hillman 1) was read faster than existing videotex fonts on low-resolution displays.

And ISO disagrees with you in their Standards.

I collected research for AIRMICS. A lot of it was Ben Shneiderman at the University of Maryland, but the SEI at Carnegie-Mellon collected tons of stuff.

>>Can someone name the last time a developer wrote or printed to lined paper, copious amounts of code? Anyone? In fact, in my 20 years of development I cannot ever remember printing to lined paper and that goes back far enough to when I wrote entirely using a line printer...without lines on the paper.<<

I think you missed the point about “lined paper”; Latin alphabet users made their own horizontal lines with lead before they picked up a quill. The reason we had “green bar” paper back in my day was that line printers were not very good -- type bounced up and down from the baseline. When the printers got better, the eye could create the needed baseline itself.

When we tested screens in the 1980's, we bounced the letters to make them look like green-bar printouts without the baselines. It did not work.

>> No Joe, you missed the point. I specifically referred to a ***surrogate key*** not a natural key. A surrogate key, by definition has no meaning. That is a non sequitur. How can you create a surrogate key without ever seeing or naming it? <<

Do you see the values of the indexes, the hashing values, etc.? Do you use them in a query? The SQL engine may or may not use them; the engine handles the updates, inserts and deletes. All you have is ADMIN access if you are at that user level.

Codd also wrote the following:

"There are three difficulties in employing user-controlled keys as permanent surrogates for entities.

(1) The actual values of user-controlled keys are determined by users and must therefore be subject to change by them (e.g. if two companies merge, the two employee databases might be combined with the result that some or all of the serial numbers might be changed.).

(2) Two relations may have user-controlled keys defined on distinct domains (e.g. one uses social security, while the other uses employee serial numbers) and yet the entities denoted are the same.

(3) It may be necessary to carry information about an entity either before it has been assigned a user-controlled key value or after it has ceased to have one (e.g. an applicant for a job and a retiree).

These difficulties have the important consequence that an equi-join on common key values may not yield the same result as a join on common entities. A solution - proposed in part [4] and more fully in [14] - is to introduce entity domains which contain system-assigned surrogates. Database users may cause the system to generate or delete a surrogate, but they have no control over its value, nor is its value ever displayed to them....." (Codd in ACM TODS, pp 409-410).

References

Codd, E. F. (1979). Extending the database relational model to capture more meaning. ACM Transactions on Database Systems, 4(4), pp. 397-434.

>> Surrogate keys are a fundamental part of the implementation of a database and thus their naming is part of design. <<

Fundamental? A database works just fine without true surrogates or the usual exposed physical locators. Real keys are fundamental.

>> Using your logic that a column represents a fundamental piece of information and given that a surrogate key is inherently arbitrary, naming them the same thing such as "id" seems logical. They all represent the same form of arbitrary data. <<

Ever hear the old joke about cats?
“We had three cats when I was a kid”
“What were their names?”
“cat, cat and cat”
“How did you tell them apart?”
“Who cares? They don't come when you call them anyway.”

You need a schema object name to CREATE, ALTER and DROP them, but that is at the ADMIN level and not USER level. A true surrogate situation would probably follow the X/Open syntax:

CREATE SURROGATE <schema object name> ON <base table>(<column list>);

Other products like INFORMIX have a syntax for “pre-joining” multiple tables, which is also another surrogate.



Subject: RE: Typography, and surrogates
Posted by: Seven24 (view profile)
Posted on: Friday, February 18, 2011 at 3:02 PM
Message: RE: Readability

First, please dispense with invoking standards we are not discussing. At no time should references to camelCase enter into this discussion, since it was never suggested and we both agree it is an inferior approach to object identifiers.

Second, show me a study conducted in the last 10 years that **specifically and directly** compares the use of PascalCase with underscores and concludes the latter is superior. That is the point of this discussion: you claim underscores are more readable, I claim that PascalCase is better. Bonus points if said study was done using modern IDEs. In short, you have yet to provide any evidence that underscores are superior to PascalCase, and I have provided examples where underscores are worse. You keep talking about theoretical ideals about typography, while I'm referring to the actual tools used by developers. For example, just recently, I sent an email to someone with an underscore in their email address and my email program converted it into a mailto tag and underlined the address....which completely and totally masked the underscore. Reading it later, it's 50-50 whether I'd remember whether the gap was due to an underscore or a space.

> The horse/car is a false analogy.
> Human brains have not changed in
> the last ten years; 100 years
> or even 1000 years.

The analogy is apt because while the operator of the vehicle has not changed, the vehicles themselves HAVE changed and THAT is the point of the discussion. It is not whether humans can read X better than Y in a vacuum; it is whether they can do so using the actual tools in which they are doing their work.

> Might want to read: “Improving the
> legibility of digital type: a comparison
> of some current videotex fonts with a
> new design” (2003, Purdue University).
> They did reading tests that revealed
> that a new font (Hillman 1) was read
> fastest, than existing video text on
> low-resolution digital typography.

How is this even remotely relevant to a discussion about the merits of PascalCase vs. underscores?

> And ISO disagrees with you in their Standards.

Why do you suppose this is the standard? I bet there are database systems which only allow a single case for identifiers. If that were the situation, then PascalCase is obviously not an option. ISO is appealing to all possibilities which means they must accommodate the lowest common denominator in terms of features.

RE: Surrogate Keys

Let's step back, as you went off in left field about the merits of using surrogate keys. Forget our differences of opinion about whether Codd provided for such a concept. Forget whether you think they belong in the database model. Do you acknowledge that many (if not all at some point in time) developers use surrogate keys in their *actual* implementations for whatever reason? What do you suggest for the naming of those columns?

RE: Cat joke

Here, let me help you with that:

“We had three cats when I was a kid”
“What were their names?”
“cat1, cat2 and cat3”
“How did you tell them apart?”
“Who cares? They don't come when you call them anyway.”

When I refer to surrogate keys, I'm referring to the arbitrary identifier that millions of developers add to their databases for a host of reasons. I'm not defending nor disparaging the practice, nor am I asking you to. I'm not asking about the merits of adding surrogate keys to your schema. I'm simply trying to get from you what you would suggest as the naming for such columns, which are used in every database I have ever encountered. If I have three tables named Alpha, Beta and Gamma (which in an actual implementation would represent three distinct entities) and each has a system-generated surrogate key, accessible to developers for use in queries and indexes, what would you recommend as the names for these three columns? (Using your preference for underscores, and to hopefully get you past that point) alpha_id, beta_id and gamma_id? Wouldn't that violate your principle of not using the table name in the column name?

Subject: great example of storing pre-calculated data
Posted by: AlexK (view profile)
Posted on: Sunday, February 20, 2011 at 10:28 AM
Message: Hi Joe,

I really liked this article, especially the last example, where storing pre-calculated data delivers predictable performance. This is great advice: storing pre-calculated data may lead to great performance gains and much simpler selects. I frequently use this approach in my work.

The only thing I am a little bit skeptical about is your take on camelCase and PascalCase. Somehow they seem to be very popular: they are a de-facto standard these days. Developers under 35 seem to have accepted them as a worldwide standard - we can all visit stackoverflow.com and see for ourselves which code styles are actually in use right now.

Somehow I don't feel that research done several decades ago on a group of DoD employees is very relevant to today's Google employees, who come from different countries all over the world and have very different attitudes, programming styles and so on.

For several years I worked with a team where literally every team member came from a different country. So we had some people whose native languages are written from top to bottom, and some whose are written from right to left. Besides, some did not have uppercase letters in their native languages. This was a great learning experience which taught all of us a lot about writing clear code. There were some misunderstandings, that's for sure, yet somehow nobody had any problems with camelCase and PascalCase.

Because many people in IT today are not native Latin alphabet users, maybe the research you are referring to should be re-evaluated against today's typical IT teams.

What do you think?

Subject: How did we get here?
Posted by: BobT (not signed in)
Posted on: Monday, February 21, 2011 at 7:46 AM
Message: To All,

How did this degrade into a spitting match about naming conventions? Nuff said.

Joe,

I have been an avid admirer of your writings. We agree 80% of the time. That said, I agree with natural keys in principle, but I have found through my years that when getting the best out of query performance, surrogate keys of integers/longs are easier for the query optimizers to work with. Yes, it makes merging two companies' data a nightmare when they both use surrogates, but it is not a doomsday scenario. I tend to look at what is best for the user. They will always want the best performance (or the appearance of best performance). There are times when I do things that are not the easiest for me as a developer, because the easy way is not best for the user experience.

Subject: Looking for a good example
Posted by: sgray128 (view profile)
Posted on: Monday, February 21, 2011 at 8:32 AM
Message: Joe:
Thanks for the hard work. I'd like to ask a favor... I read all your articles with great interest, hoping to improve my craft. In the article above, you give lots of 'do nots'; it would help if you followed it with an example of your recommended technique. I'm clear that you don't like a particular technique, but not clear on what the right one is.

steve@4penny.net
www.4Penny.net

Subject: storing pre-calculated data
Posted by: Anonymous (not signed in)
Posted on: Monday, February 21, 2011 at 9:11 AM
Message: I never understood why people re-calculate the same thing for each row. After all, SQL is a data base language.

>> I don't feel that a research done several decades ago against a group of DoD employees is very relevant to today's Google's employees that come from different countries all over the world, have very different attitudes and programming styles and such. <<

How do you feel about ALL of the human factors research in use today? I keep a copy of MIT's MEASURE OF MAN and LE MODULOR at hand when I design furniture.

DoD paid for the research, so it was a mix of college students who needed extra credit, soldiers who were available doing ADP work, and contractors who had an obligation to provide employees for research. I think most human factors research is the study of college freshmen who have bad grades and free time :)

>> For several years I myself worked with a team where literally every team member comes from a different country. So we had some people who in their native languages write from top to bottom, and some who write from right to left. Besides, some did not have uppercase letters in their native languages. This was a great learning experience which taught all of us a lot about writing clear code. <<

I taught college with that same kind of environment. In fact, I did an article for the JOURNAL OF STRUCTURED PROGRAMMING on this topic. I could look at their code and spot the Chinese and Arabic speakers. Jerry Weinberg was the first researcher to look at this stuff in his classic PSYCHOLOGY OF COMPUTER PROGRAMMING.

>> Because currently many in IT are not native Latin alphabet users, maybe the research you are referring to should be reevaluated against today's typical IT teams. <<

Typographical research is back in vogue, thanks to e-book readers. Programming text research faded because we kept getting the same results over and over. How many times can you discover that indentation makes code readable, that all uppercase stinks, that eyes bounce to uppercase letters, that alphabets use a baseline, how rivers work, etc.?

What I need to look up is the effect of color on code. There is a test for concussions where you are shown a flash card with a word printed in a colored ink (the word RED in green ink, etc.) and you are asked to reply with either the word or the color.

Your rate is pretty constant over your life, until you have brain damage. It has to do with switching hemispheres.

I am sooo left-brained, I tilt to one side. I cannot read colored code with any comfort or speed. This also has to do with eye damage from measles as a child, etc.

Other people I know cannot work without it. Why anyone thinks red letters on a black background with a blinking smurf for a cursor is readable, is beyond me.


Subject: Not deleting rows
Posted by: DavidB64IL (view profile)
Posted on: Monday, February 21, 2011 at 10:24 AM
Message: Nice article. One thing I would mention is adding an effective date or delete flag, because once a lookup code is used by a row in another table you should not delete it from your lookup table, or else you lose its description.

Subject: Surrogates
Posted by: ltca (view profile)
Posted on: Monday, February 21, 2011 at 11:05 AM
Message: Joe's articles (along with the comments that follow) always make me think.

I'm always fascinated by the discussion of surrogate vs. natural keys. Yes, I agree that every system I have ever worked on has used surrogates, sometimes exclusively. The one entity I have the most trouble defining a natural key for is a person. What attributes uniquely identify a person in a system?
1. Name? Hardly.
2. SSN? Several problems here, depending on the system. How many of you will give me your SSN to register on my web site? I work with state agencies and, although rare, SSNs are not always unique.
3. Email address? Required to register on a web site but can change. Not ideal for a PK.
4. Some sort of user defined code or number (CustomerNbr)?

I've always wondered how other systems handled this. As for me, I've always used an integer, usually an identity column as a surrogate PK.

Subject: Transaction
Posted by: DavidK (not signed in)
Posted on: Monday, February 21, 2011 at 11:43 AM
Message: First thing I noticed:
"When I update the OTLT table, I have to lock out everyone until I am finished."
Maybe that was true with SQL 2000, but in SQL 2005 and above you just need to start and end a transaction. Is that so hard? Maybe it is. Maybe that's why Oracle makes everything a transaction, so you hardly ever have to worry about locks. Yes, it slows things down a tiny bit with transactions. But it speeds other things up, as in this situation.

Subject: @davidk - Transaction
Posted by: JohnA (not signed in)
Posted on: Monday, February 21, 2011 at 7:30 PM
Message: I am pretty sure starting a transaction and ending a transaction locks everyone out of at least that row (possibly that page or perhaps the table depending on other factors) until you are done. By default SQL Server will make it an implicit transaction as well.
I agree that the length of the transaction is so short as to be relatively insignificant, though with an OTLT scenario you'd have to be somewhat careful to avoid deadlock situations (but even that can be fairly easily managed), and it is not drastically different from the situation with many smaller tables.
From my view the single biggest problem with OTLT is the lack of an ability to easily set up true relationships to protect data integrity. I have designed them before, because the overhead of maintaining 50-odd lookup tables was more than the small project team could tolerate. But I felt dirty afterward. :-)

Subject: OTLT
Posted by: JamesT (not signed in)
Posted on: Tuesday, February 22, 2011 at 2:57 AM
Message: I am a healthcare IT programmer and a medical doctor. This article expresses an opinion on OTLT which may be appropriate for systems used in business, but would be counter-productive in healthcare.
Healthcare systems employing OTLTs developed in the 1970s are still in use now, and we lack credible replacements. Hospitals may have IT systems, but very few, if any, are fit to be used to guide and document clinical care. This is unfortunate, because the process of providing medical care is almost entirely information management.
Various national and international organizations have developed standard terminologies for medicine (e.g. the WHO International Classification of Diseases, ICD-10, and the Systematized Nomenclature of Medicine, SNOMED CT) which can only be implemented as an OTLT. SNOMED, for example, consists of over a million lookup values which cover all aspects of medicine and its treatment. A clinical record database designed to document clinical care would have many thousands of fields. HL7 has developed a standard reference information model for healthcare which identifies thousands of potential entities. SNOMED CT defines a method for fuzzily and hierarchically grouping the terms to assist browsing; the hierarchy allows terms to be allocated to multiple parent groups. It would not necessarily be practical or efficient to divide the SNOMED OTLT into thousands of separate small lookup tables and to 'hard code' the relationships between those lookup tables and patient record value fields within the database physical schema.
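A sketch of why such a terminology stays in one table (illustrative names, not SNOMED's actual schema): the poly-hierarchy is data in a relationship table, not thousands of separate lookup tables.

CREATE TABLE Concept
(concept_id BIGINT NOT NULL PRIMARY KEY,
 preferred_term VARCHAR(255) NOT NULL);

CREATE TABLE Concept_Hierarchy
(child_concept_id BIGINT NOT NULL REFERENCES Concept (concept_id),
 parent_concept_id BIGINT NOT NULL REFERENCES Concept (concept_id),
 PRIMARY KEY (child_concept_id, parent_concept_id));
-- A concept may sit under many parents, which a one-table-per-code-type
-- design cannot express without restructuring the schema.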

Subject: Lookup names and ID's
Posted by: DirkS (not signed in)
Posted on: Tuesday, February 22, 2011 at 8:50 AM
Message: I enjoyed reading Joe's article, and agree with his points. I am a former microbiologist and now work for an environmental consulting company in Canada designing and building environmental (ecological monitoring-based) information systems for clients. I began using lookup tables right from the start. I stayed away from OTLTs, as they are a nightmare to maintain and to query. They don't make sense. OTLTs are a sign, in my opinion, that one has to go back and review the entities and the relationships between them.

I name columns and tables using a system that borrows from Hungarian notation ("family name precedes given name"). I use underscores "_" in column names to improve readability, but use PascalCase in table names instead (underscores in table names are rarely used, only as a very last resort). Table names are prefixed by role: I use "tbl" to indicate a base table, while look-up table names are prefixed "tlkp". The next word after the prefix is the module name, followed by the actual table name. This way I know the type and role of the entity and what part of the database it is referencing. I use abbreviations sparingly, but they are necessary when building geodatabases using GIS software, since the column names have a character size limit.

I'm still not clear on what Joe is talking about in regard to the use of "_id". In my database, all primary keys in base tables and look-up tables use the format <attribute>_id.

tree_id --> primary key
tree_size
tree_diameter
tree_species
tree_cmts

We track fish captured in nets. Often, fish are not assigned a fish number in the field, or in a report, and the numbers tend to be a mish-mash of alphanumeric and numeric (try as I might, the field guys just won't settle on an all-numeric system, and there's legacy data to contend with). So each fish is uniquely identified by a fish_id number. This id column is the primary key. So I'm still not clear: is Joe saying not to use <attribute>_id as a primary key? I reference the fish_id in other entities where one has to document fish diet, and associate the diet items back to a fish_id, which becomes the foreign key in the fish diet items table. I need the fish_id, since I cannot rely on the field and report fish numbers; they are not always unique, or not always available.
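A sketch of the fish example as described, with hypothetical names in DirkS's convention:

CREATE TABLE tblFishCapture
(fish_id INTEGER NOT NULL PRIMARY KEY,  -- surrogate; field numbers are unreliable
 field_fish_nbr VARCHAR(20),            -- mixed alphanumeric tag, sometimes missing
 species_code CHAR(4) NOT NULL);

CREATE TABLE tblFishDietItem
(fish_id INTEGER NOT NULL REFERENCES tblFishCapture (fish_id),
 diet_item_code CHAR(4) NOT NULL,
 PRIMARY KEY (fish_id, diet_item_code));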


Subject: Reading Studies are NOT Code Reading Studies
Posted by: Andy Dent (not signed in)
Posted on: Tuesday, February 22, 2011 at 6:18 PM
Message: I'm going to go out on a limb and say that any studies of generalized readers have only a moderate relationship to the task of reading code by a person habituated to reading code in a browser.

They are about as relevant as the "This is your brain on Tetris" http://www.wired.com/wired/archive/2.05/tetris.html studies.

In both cases there is some relevance but a true scientific study needs to be done evaluating the eyes of professional programmers in their usual environment.

Subject: Darrin Morgan
Posted by: Anonymous (not signed in)
Posted on: Wednesday, February 23, 2011 at 1:23 AM
Message: You know, whenever you put two or more "data" people together in a room (even a chat room), you will inevitably start an argument about naming standards (otherwise known as metadata).

I agree with everything you said. It's a sad world where few people understand (or care about) the history of databases and data modelling - we ARE likely to repeat the mistakes of the past (and I remember them well from my IBM/370 days). So few companies have data modelling tools, and fewer still insist on those artefacts being part of source control and the audit trail.

I should add that camel casing works only in those RDBMSs which store metadata in a case-sensitive fashion, which Oracle does NOT; so getting meaningful metadata out of the catalog/dictionary is damn near impossible unless *you* make it clear. Much easier to use the underscore for practical reasons. Professionally, I use multiple RDBMSs, and swapping naming conventions makes no sense at all.
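The Oracle behaviour in question, sketched: unquoted identifiers are folded to upper case in the data dictionary, so any casing convention simply disappears.

CREATE TABLE CustomerOrder (OrderDate DATE);

SELECT table_name FROM user_tables;
-- returns CUSTOMERORDER; the PascalCase word boundary is gone,
-- whereas CUSTOMER_ORDER would have kept it visible.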

Subject: Syntax
Posted by: BuggyFunBunny (view profile)
Posted on: Wednesday, February 23, 2011 at 6:38 AM
Message: Early in my initiation into OO programming, I found that Allen Holub made the most sense, with regard to OO design. He's written a few books, some on writing compilers; he's no slouch. In the late 90's his writing used this_is_a_field syntax.

Here: http://www.javaworld.com/javaworld/jw-07-1999/jw-07-toolbox.html?page=7

is a page from his Bank of Allen series. Some years later (I don't recall when it happened) he gave up and succumbed to the Borg. It's too bad.

With regard to those who claim that comprehension research/testing is poppycock, there's a reason road signs aren't camelCase, PascalCase or ALL CAPS.

Subject: Road signs
Posted by: Seven24 (view profile)
Posted on: Wednesday, February 23, 2011 at 3:26 PM
Message: @BuggyFunBunny - Funny, I do not remember ever seeing road signs that used underscores or were all lower case. Further, last I checked, the environment in which you read road signs is entirely different from the one in which you read code.

Subject: RE: PascalCase
Posted by: Seven24 (view profile)
Posted on: Wednesday, February 23, 2011 at 3:30 PM
Message: @Anonymous
> I should add that Camel
> casing works only in those
> RDBMS which store meta
> data in a case sensitive
> fashion - which Oracle does NOT

A. No one is suggesting camelCase. Only PascalCase is being suggested.

B. Suppose you spent 99% of your time working with databases that provide for upper- and lower-case letters in identifiers. Why on earth would you use all lowercase and underscores when it is harder to write and read, and isn't required? Do you also code as if using a monochrome editor, since you *might* have to switch to one some day?

Subject: Mixed case
Posted by: BuggyFunBunny (view profile)
Posted on: Wednesday, February 23, 2011 at 5:48 PM
Message: is what Road Signs have, and_underscores_in_code_whether_programs_or_DDL are what make spacing possible. Neither I nor Holub advocated all lowercase, although that's what's comfortable, after years of *nix/C. There's not much reason to have to muck with the shift key (although the _ does require it, dang).

Subject: Antiquated Insistence
Posted by: BuggyFunBunny (view profile)
Posted on: Wednesday, February 23, 2011 at 8:41 PM
Message: -- Do you also code as if using a monochrome editor since you *might* have to switch to one some day?

Not directed at anyone (that I know of) on this thread, but if I had a nickel for every "modern" coder who insists on only 80 columns in code, I'd be a rich man. We have ever-wider monitors, with gobs of horizontal real estate (and diminishing vertical), yet the young'uns insist on using 1960's COBOL/FORTRAN mannerisms. Why?

Subject: OTLT is top!
Posted by: Teun Spaans (not signed in)
Posted on: Friday, March 04, 2011 at 12:11 AM
Message: Over the years I have advocated OTLT several times.

There are several advantages to them:
1) Less coding is needed: you no longer need a separate table, query, and stored procedure for every type of code table. The extra parameter the author objects to is a small price.
2) The application also needs just one or two maintenance functions. I say two because you may wish to have Two True Lookup Tables: one for numeric and one for alphabetical codes.

The author does have a point when it comes to validations. From an architectural point of view, the author seems to favour having all validations in the database. That is a very sound principle, though there are also schools of thought which favour having these validations in the business layer.
Like everything, OTLT and TTLT should be used with care. Personally, I favour them for very simple lookups: those where we have a simple code-description pair.

Usually I prefer to add date-valid-from and date-valid-till attributes; implementing this for every code table separately increases the extra programming involved.
In situations where special validations are required, such as in the author's example, we can reconsider whether OTLT/TTLT is the ideal solution.
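Teun's validity-date idea, sketched against a hypothetical OTLT-style table:

CREATE TABLE Code_Lookup
(code_type VARCHAR(30) NOT NULL,
 code_value VARCHAR(10) NOT NULL,
 code_description VARCHAR(100) NOT NULL,
 date_valid_from DATE NOT NULL,
 date_valid_till DATE,  -- NULL = still current
 PRIMARY KEY (code_type, code_value, date_valid_from));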

And then I haven't mentioned systems where databases are replicated over many installations, or for separate clients.

Subject: Nobody's mentioned this...
Posted by: bienv (view profile)
Posted on: Friday, March 04, 2011 at 8:03 AM
Message: The Microsoft ecosystem, particularly the .NET Framework, tends to favor PascalCase. T-SQL uses underscore word separation inconsistently.

I'm the lead developer on a data warehouse and a SharePoint-based intranet implementation. The main store of data is Oracle (with lots of all-caps and underline-spaces), but my development takes place in SQL Server and other Microsoft technologies. I find Visual Basic to be easier to read than C#, and VB code is practically all PascalCase.

So what do you think I'm using in the data warehouse? PascalCase. Occasionally I'll use an underscore to denote a secondary table (e.g. "Sales_Backup").

I would like to see a study that compares the readability of PascalCase versus underscore_case in the Microsoft ecosystem in which I do the vast majority of my programming, including that which provides my income.

Subject: Disagree about lookup tables not being entities
Posted by: Wolf (view profile)
Posted on: Friday, March 04, 2011 at 8:11 AM
Message: You made this statement in the article:

"A look-up table deals with attributes and their values, not entities, so by definition, this is not used in such tables. That is why things like "_category_id" or "_type_id" are garbage names."

My opinion is that lookup tables *are* entities. They may consist only of an integer (surrogate key) and a string (the human readable item) but that is enough to qualify them as an entity.

For example I have a table called WordList which is used to auto-correct case in fields. It looks like:

CREATE TABLE [dbo].[WordList](
[WordList RID] [int] IDENTITY(1,1) NOT NULL,
[Name] [varchar](50) NOT NULL,
CONSTRAINT [WordList_PK] PRIMARY KEY CLUSTERED
(
[WordList RID] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY],
CONSTRAINT [WordList_AK] UNIQUE NONCLUSTERED
(
[Name] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]

(And yes, I use *gasp* spaces and [] in field names too. :))

WordList is a full-blown entity. There's a unique index on the word itself (to prevent duplicate entries), it's clustered on the RID, and it participates in audit logging so we can track who's doing what to the table. But in spite of all that, it's just a lookup table...

Now let me touch briefly on One True Lookup Tables. They are indeed tempting from certain POVs. In fact, I use a limited form in my Company table: the entity is "Company", but there are bit fields to flag the type of company (such as first-line vendor, second-line vendor, weapons manufacturer, tax authority, and so on).

The point being that all the sub-entities in the table (all first-line vendors, for example) are still *companies*, and share identical attributes with a tax authority (i.e. a name, an address, a phone number, a primary contact, etc.).

The advantage to this approach is that if, for instance, a first-line vendor should become a customer, it takes only changing one bit field rather than duplicating a (significant) amount of data. Not to mention that keeping the same company in multiple tables is hard to keep synchronized...
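Wolf's limited form, sketched with a hypothetical column list:

CREATE TABLE Company
(company_rid INTEGER IDENTITY(1,1) NOT NULL PRIMARY KEY,
 company_name VARCHAR(100) NOT NULL,
 is_first_line_vendor BIT NOT NULL DEFAULT 0,
 is_customer BIT NOT NULL DEFAULT 0,
 is_tax_authority BIT NOT NULL DEFAULT 0);

-- A vendor becoming a customer is one flag flipped, no data copied:
UPDATE Company SET is_customer = 1 WHERE company_rid = 42;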

Finally I'd like to touch on the argument about underscore vs PascalCase. I think everyone is overlooking a very obvious fact.

Underscores are visually distinct from PascalCase. Further, [] quoted identifiers with spaces are a third visually distinct element, along with use of ALL_CAPS with underscores.

*Each style has its own purpose*

I don't view underscore usage as mutually exclusive with PascalCase and/or ALL_CAPS. They are all visually distinct and should be used together to create a visual language.

Personally, I use underscores when dealing with indexes. For example, [Account_AK] instantly tells me that (a) this is a user-defined element, (b) it's related to the table Account, and (c) it's an alternate key. I also use _FK, _PK, etc. in a similar manner.

Anything inside [] is a user-defined SQL element (Table, index, column). If there's an underscore it's either an index, a relationship, or some other table-scale structure. If there's no underscore it's either a table or a column, and context will always disambiguate.

If it's in PascalCase inside square brackets it's a two-word phrase naming a distinct entity, for example [PageGroup] is a table for the entity "Page Group", while [PageGroup Member] is the entity for page group members. The second word is always related to the first. Thus [Cash Order] would be a detail table for cash, while [CashOrder] is an entity to itself.

If it's PascalCase without square brackets it's a variable. This is true in both T-SQL (the @ notwithstanding) and in .Net.

Oh, and with the exception of a couple of one- or two-byte integer keys (for space conservation in huge tables), all keys are RIDs, named [<table> RID] as the primary key and usually [<table> RID] as the foreign key unless there are multiple references (a manager and a tech in the same table, for instance). Even then, the [* RID] format is retained ([Manager RID], [Tech RID]).

My point is all these styles are used *together* to provide unambiguous information about the code and the database structure.

Arguing about eye-jerks overlooks one important fact. In my particular case using [] exclusively for SQL elements means the eye can easily redact detail when quickly determining structure in a SQL statement (especially with elegant white space usage).

So use whatever you like. Mix and match. But be rigorously consistent!


Subject: workflow question
Posted by: huambo (view profile)
Posted on: Tuesday, September 18, 2012 at 8:51 AM
Message: Greetings,
I am working on a model to support workflows and came up with the following table design:

Workflow (Workflow_Code) -- PK

Values: Visit, Treatment, Validation, Billing.

WorkFlow_Status (
WorkFlow_Status, -- PK
Description
)

values:

1 Pending
2 Arrived
3 Cancelled
4 Ready for Treatment
5 Treated Billable
...
7 Too Sick to Treat
...
10 Validated
11 Validation Error
12 Ready for Billing
...
16 Billed
17 Not Billed

Workflow_State (
DepartmentCode, -- PK
WorkFlow_Code, -- PK
Workflow_Status -- PK
)

Now, I can define the workflow statuses per department like so:

Workflow_State
DeptA, Visit, 1
DeptA, Visit, 2
DeptA, Visit, 3

etc...

by combining all the statuses that I need in each workflow per department.

My question is: is this good design? How do I guarantee that each visit record is in a correct workflow status? I can check to make sure that the right status exists in the status table, but to enforce the correct state, I need (department, workflow, status) to be checked against the Workflow_State table?

I am afraid this is looking a lot like the MUCK table nightmare you discussed earlier.

Thanks,
Huambo
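One way to get the guarantee huambo asks about is a composite FOREIGN KEY against Workflow_State, sketched here with a hypothetical Visit table:

CREATE TABLE Visit
(visit_nbr INTEGER NOT NULL PRIMARY KEY,
 DepartmentCode CHAR(5) NOT NULL,
 WorkFlow_Code VARCHAR(10) NOT NULL,
 Workflow_Status INTEGER NOT NULL,
 FOREIGN KEY (DepartmentCode, WorkFlow_Code, Workflow_Status)
     REFERENCES Workflow_State (DepartmentCode, WorkFlow_Code, Workflow_Status));
-- Any (department, workflow, status) combination not listed in
-- Workflow_State is rejected, so a visit can never carry a status
-- its department's workflow does not allow.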
