A Tale of Identifiers

Identifiers aren't locators, and they aren't pointers or links either. They are a logical concept in a relational database, and, unlike the more traditional methods of accessing data, don't derive from the way that data gets stored. Identifiers uniquely identify members of the set, and it should be possible to validate and verify them. Celko somehow involves watches and taxi cabs to illustrate the point.

1208-New_Punched%20Card2.jpg

Programmers still have problems with the concept of an identifier. Perhaps it is true that the more things change, the more they stay the same. So let me start with a bit of history and start with punch cards and magnetic tapes. These were the first way we had to store data and I am pretty sure that 95%+ my readership has never used them. You might want to look at this page to get an idea of what you missed. Or see an old science fiction movie.

Magnetic tapes followed punch cards. Tape was a big improvement. Tapes were denser, so you got more data on a reel of tape than in a deck of cards. If you ever dropped a deck of punch cards and dropped a reel of tape, you know which one you prefer. Tapes can rewind; punch cards do not. However the first tape drives did not read backwards; even if they had, the software was based on punch cards, so it assumed forward reading. Today, in T-SQL, cursors can be declared to read forward, too. The old mindset is hard to leave.

Tapes and punch cards have very different physical properties. But they have the same logical properties. They physically allow only sequential access. To get to a record, you read the media from left to right. There was a first record, a last record and the guys in the middle were located by counting records. The count was either absolute (n-th record) or relative (next, current, prior, etc). The fields in the records are also read from left to right. There is nothing set-oriented about this data model. Obviously, the record number depended on the order of insertion. We spent a lot of time sorting and merging the decks of punch cards and tapes on various fields to get them ready for processing. We worried about stable versus unstable sorts and a lot of other physical issues that modern programmers don’t have to consider.

1208-New_MagTapeHeaven2.jpg

Here is where we got the idea of a sort key versus an identifier. For example, a billing application would have account numbers to identify the customers, but the tape would be sorted on ZIP codes for printing the statements for mailing. But we needed to have the Master tape kept in account order for merging updates and inserts into it. It was also how we looked up account data. The result was that we did extractions from Master tapes for reporting.

When a record was deleted from a tape, it was marked with a flag at the start of the record. So there was a logical count of active records and a physical count of actual records for locating data on the tape. Eventually, the old Master tape would be merged with a Transaction tape and the deleted records never moved over.

This is where the idea of a PRIMARY KEY started. Using the sort order was the only practical way to get to the data. Think about random access on hundred a meters of magnetic tape; not a good idea.

Indexes

Along came spinning magnetic media in various forms and the world changed again. Disks, drum and other spinning magnetic media were originally known as random access storage devices. You could put a read-write head anywhere without doing a rewind! But we did not take advantage of this at first.

Sequential reads that let spinning media spin and held the read-write head still were faster and that was the way the software was written. But random access made sorting faster, so we were happy. Eventually, we invented ISAM (Indexed Sequential Access Method) and its descendants. The file is sorted, but we have an index that lets us place the read-write head directly on or near a particular record. The usual analogy is an unabridged dictionary with a thumb or notch index. The notch lets you open the book to any letter of the alphabet in one operation.

The index would have the track and sector where the data was stored so it could move the read-write head there. The track and sector were a physical locator, but the user never had to see them. But back in those days, we did have to maintain them and run housekeeping routines on them. You also had to explicitly invoke an index in your code; there were no optimizers to do your thinking for you.

Pointers and Linked Lists

A bit later we realized that the track and sector values, written as a “row id” could be used inside a record to point to the next record in a path. The “row id” is not an identifier any more than the record count on a magnetic tape was. It is a physical locator.

This lead to navigational databases. As Dr. Codd was to RDBMS, Charles Bachman was to navigational databases (he received the ACM Turing Award in 1973 and was elected a Distinguished Fellow of the British Computer Society in 1977). He and Codd had a famous “shoot out” in 1974 which featured a simple machine shop scheduling application. SQL did not exist, so the experimental Alpha language was used for the relational version and a CODASYL/COBOL product for the network version. The network version was complicated and had two bugs in it. This ended the navigational model in the popular mind. Bachman is largely forgotten by DB programmers these days, thanks to the domination of SQL and RDBMS, but his pioneering work in database systems is still the basis for many large scale commercial databases.

The navigational model lead to various access structures which all had different names in different products. The records located by pointers are usually called nodes, using terms from graph theory. The simplest connection between nodes is a single pointer that can be followed in one direction only. Remember a deck of punch cards? The terms most often used are (Parent, Child) or (Master, Detail). See how many times you can find a newbie on a forum who uses those terms for tables in SQL. This navigational view of data has no place in SQL, but it is hard to escape the old mindset.

A link was a two-way pointer chain. Remember the later tapes drives that could re-wind and read backwards? Ever hear a newbie invent the term “link table” because they don’t know how to model a relationship? Yep, the more things change, the more things stay the same.

One of the major problems with the navigational model is that you have to navigate it. That is, you need to have an explicit path, not a relationship. To use Chris Date’s classic “Parts & Suppliers” example, consider a DB where each supplier is the parent of a chain of parts. This design is fine if I always want to get a supplier then look at his parts. But if I want to find all the suppliers who stock #5 machine screws, I am in trouble. I have to scan the catalog of each supplier, one at a time. In SQL, I would have a relationship table for “Part_Sources” and query it.

Newbies use IDENTITY to build “mock-pointers” in SQL. Like pointers, IDENTITY is a physical property of a node, er, I mean table. The term IDENTITY column is technically wrong. A column has a data type, all data types in SQL are NULL-able, can be assigned values, have operators, etc. IDENTITY has none of this; it is like a pointer. The hardware picks a value, do you do not assign it. It is not NULL-able. It makes no sense to do math on it, even tho it looks like a numeric. And like a pointer or link, it can be attached to any kind of node, er, table. In one record, we point to an automobile, then to a squid and so forth. I like to think of IDENTITY as a kabbalah number. This is the mystic Hebrew number that God puts on everything in the universe. If you know that number, you have all kinds of magic powers over the object. You can change automobiles into squids!

Identifiers

Identifiers are a logical concept in RDBMS. They are a subset of attributes of an entity or relationship that identify individual member of the set. A VIN identifies an automobile, an ISBN identifies a book, a DUNS identifies a company and so forth. An identifier is specific to one kind of thing; you do not order a book with a VIN.

A common error is thinking that if something is unique (or declared as an IDENTITY property), then it is an identifier or key. Yes, an identifier has to be unique, but not the reverse. The number pi is a unique constant, but does not identify any entity. This error shows up in data element names in look-up tables with monstrosities like “postal_code_id” and worse. It is just a “postal_code”, like pi is a constant. Codes and identifiers are totally different kinds of attribute properties.

Identifiers can be multi-column and long. A three dimensional (x, y, z) coordinate system is multi-column. The International Bank Account number (IBAN) is long.

So what? We have 64 bit hardware and terabytes of fast cheap storage today. I have row constructors in SQL and good text editors. If I have an industry standard key, then I have to use it to get two things I want for data integrity.

Validation & Verification

 

Validation means I can look at a data element value and see that it has the correct form. For example, I know a ZIP code is a string of exactly five digits; a string with less or more digit is not valid; string with alphas or special characters is not valid. I also know the lowest ZIP code is in Holtsville, New York (00501) and the highest is Ketchikan, Alaska (99950).

Verification means there is a test for an attribute like weight (put it on a scale) or trusted source to verify the code or identifier. In the case of the ZIP code, the trusted source is the United States Postal Service (USPS). ZIP codes change, some of them in the proper range are not currently issued, and so forth. This is vital for data integrity and audits.

Many industry standards have check digits and regular expressions that also validate them at data entry time.

IDENTITY has neither validation or verification. It can be any positive or negative integer of any size. The same data put on two different machines will be assigned different IDENTITY values based on the hardware Not a lot of help here. Go to any SQL Server forum and you will find a posting about someone who used BCP to move data to a new machine, but forgot to protect the IDENTITY mock-pointers so he got the tables re-numbered as they were inserted. At least pointer chains were not exposed and we had utility programs in case the chain got broken.

Another problem is that the IDENTITY values are non-deterministic. Given this table:

These three insertions are logically identical,

But it does not work that way. Likewise, deleting and re-inserting a row should not change a proper attribute’s value, but it will not work with an IDENTITY column.

A Cautionary Tale

“A man with a watch knows what time it is. A man with two watches is never sure.” – Segal’s Law

1208-New_taxi_medallion%20small.jpg

Does this mean that proper identifiers will guarantee data integrity? Nope, no system is perfect. Let me tell you a tale about a taxi company and their motor pool. The taxi company had four identifiers for its motor pool. The industry standard one was the VIN (Vehicle Identification Number, ISO Standard 3779) since that is required by law for titles, tags, insurance and other things. It is 17 alphanumeric characters long. It is broken into fixed length fields that tell you the manufacturer, the plant, make, model and year of the vehicle. In the old days, the VIN was on the engine block where only a mechanic could see it. Today, it appears on a metal tag on the dash board and, for some cars, etched into all the window glass.

While having the VIN where it can be seen is good for theft prevention and detection, the truth is that most people have no idea what the VINs for their vehicles are. When a passenger steps into a taxi, he is not going to read and remember the VIN. But it is permanent and universal.

The license tag was another identifier. In particular, tags for taxis had a special format, so you could tell that they were a commercial vehicle. While not as permanent as the VIN, the tag would stay with the vehicle while it was used commercially and renewed with a sticker every year. If the vehicle was retired from commercial service, it needed a new tag.

The passenger sees a large three-digit fleet number on the doors of taxi, next to the company name. It is a lot easier to say “Metro Cab #025” instead of trying to rattle off all 17 characters of a VIN or the 6 to 8 characters of a commercial auto tag. The fleet numbers are pretty much sequential, perhaps with a gap when a vehicle is taken out of service. Gaps were not desirable. The sequence also made assigning routine maintenance easier — look at the calender date and the last two digits of your fleet number. When they match, get an oil change, rotate the tires and vacuum the back seat.

The fourth identifier is a city-issued medallion number. This method of raising revenue actually goes back to England and ferry boats. The city mints a fixed number of the medallions and then auctions them. In those days, they were literally a metal disk. The taxi driver has to display the medallion to lawfully operate in the city. Since there are only a fixed number of medallions, they become extremely valuable. In 2006, for example, a New York City medallion was worth about $300, 000.00 when it went for auction.

For the youngsters in my readership, Checker Cabs were an American company that made one model vehicle for decades — the American version of the famous London taxis. Besides having big rear seating, a huge trunk, a flat rear foot well and other features for it specialized work, the fenders and doors were designed to swap out with a few screws. The idea was that the motor pool cloud handle simple accidents and get the vehicle back in service.

Supposedly, each of these identifiers (VIN, auto tag, fleet number and medallion) should map to one and only one vehicle. Now here is where the story gets interesting. The motor pool supervisor realized that a certain percentage of the fleet is in the shop at all times. If you take the medallion off of a taxi in a service bay, you can loan it to your partner in crime, instead of putting in the safe.

The scam was to rotate auto tags and doors to give the appearance of a properly run fleet of taxis. It was done manually because this was before personal computers were common. It is a good programming problem since the maintenance logs have to balance and every vehicle had to look as if it were in service most of the time.

The scam fell apart when a taxi was pulled over by a traffic cop on a routine traffic violation. The officer noticed that the fleet numbers were different on the driver and passenger side doors. He then checked the medallion, auto tags and VIN against the city records. Nothing matched.

Conclusion

There are lots of morals here:

  1. Learn what an identifier is
  2. Un-learn what a locator/link/pointer is
  3. Think in sets not sequences
  4. Use industry standards to get validation and verification
  5. Don’t wear two watches unless you can keep them in synch with a radio time signal.
Tags: , , ,

  • 12576 views

  • Rate
    [Total: 0    Average: 0/5]
  • Seven24

    Identifiers
    > It makes no sense to do math on it,

    Funny that, wouldn’t a host of numeric values fit that bill? US postal codes are numeric and yet it makes no sense to do math on them. Item sequences for line items in an order are numeric and yet it makes no sense to do math on those. Phone numbers, social security numbers, street numbers…

    > it can be attached to any
    > kind of node, er, table. In one
    > record, we point to an automobile,
    > then to a squid and so forth

    Straw man. If you improperly create a table of arbitrary “things”, then the manner in which you create a key to identify them is the least of your problems.

    > An identifier is specific to
    > one kind of thing; you do not
    > order a book with a VIN.

    What is the number 2125551212? Is it a SSN? Is it a phone number? Is it a part number? No? How about another example. “BOB”. Is that a name in upper case? A three letter ISO for a currency? An acronym? A floatation device? No? One more: 12345. Is that a zip code? A password (hope not)? A sequence in some list? The code on your luggage?

    You must have CONTEXT to know to what the identifier applies. Having just the value with no context as to what it represents is meaningless. Thus, while it makes sense to talk of an identifier in conjunction with the type of thing it identifies, that same value might be used as an identifier for many different things. E.g., an accounting system might have 90210 as an accounting code even though it happens to also be a valid postal code.

    > Verification means there is a
    > test for an attribute like
    > weight (put it on a scale)
    > or **trusted source** to
    > verify the code or identifier.

    That’s the crucial element: a trusted or authoritative source. While it is true that some identifiers have checksums that can be used to determine if it *might* be valid, the only means to truly determine if an identifier is valid is to query an authoritative source.

    > IDENTITY has neither
    > validation or verification.

    Sure it does based on the previous comment: querying an authoritative source. Is 12345 valid in the Foobar system? Query the Foobar system and find out. Is “BOB” a valid currency code? Go to a source that lists all the ISO approved currencies and find out. Asking something akin to “Is 5 a valid identifier?” with no context is equivalent to asking questions with nothing but pronouns: “Did you get that thing that was forwarded from that person about that issue?”

    RE: logically identical

    You quietly switched arguments here. There are two distinct use cases here: Identity columns as user-visible values and Identity columns used as surrogate keys. If you are using an Identity column as a surrogate key, then by definition, you shouldn’t be showing it to end users *and* another candidate key that uniquely identifies the data is needed. On this last point, I absolutely agree that it is far too common for new developers to overlook the need for a unique constraint on something *other* than the surrogate key.

    >> Use industry standards to get
    > validation and verification

    Suppose there is no industry standard *or* the industry standard value isn’t retrievable at transaction time? There is no universal identifier for any human being on the planet and even if there were, there may be legal or realistic restrictions in your ability to get it. Industry standards are absolutely the right way to go *if* they are *universally* used (unlike say ISAN you mentioned in a previous post) and are available at time of entry.

  • Anonymous

    Context versus domain
    One of my articles here has a section on tag numbers, which are used for encoding but not calculations.

    Identifiers and all data elements have to have a domain, which is not the same as a context. A domain is stricter than a mere data type, but relate3d in the sense that a value cannot exist without it.

    The ISBN 978-1-60620-906-6 is drawn from the ISBN domain and it always identifies “A Complete Grammar of Esperanto”; calling out the name “Bob” will return a different person depending where I am when I call it out. Bob has a data type of Alphas, but not a specific domain.

    Your mindset is still back in files. The alphanumeric punch card columns were read into the context of a COBOL or FORTRAN application program and thus got meaning. There are a lot of classic horror stories about hanging the wrong and having it produce output from a program.

    My personal experience was with hospital data decades ago. Magnetic tape records were 80 columns because of 80 column punch cards. We keyed a lot files on SSN at the start of the tape. The billing program cheerfully cut the lab test data into fields and printed it out after confirming a valid SSN in the front of the tape.

    This is why IDENTITY is not an identifier; it has only context local to one instance of one table on one machine in one database product. There is no external authority or verification method. It is an exposed physical locator.

    IDENTITY is not a surrogate key according to Codd because it is exposed to the user. Say I have an (x,y,z) coordinate system as my natural key and an IDENTITY as the surrogate key. According to Dr. Codd if I try to put in a duplicate (x,y,z), the surrogate key mechanism will detect the error and disallow it. But unless I did UNIQUE (x,y,z) on my table (i.e. a second key), I can get different IDENTITY values for the same entity. I have to hand-code all of the surrogacy constraints myself!

    After 40+ years in this business, I think that there is an industry standard for just about everything.

    The idea that you need to identify EVERY human being on the planet is a Red Herring. You do business with a subset that has a role in your data model.

    Having said that, actually, there is a universal human being identifier; it is a ghod-awful numeric string that you get from a full DNA work-up using a portable machine from Fujitsu. Medical researcher use it for global population studies, but we are not that far away from an ATM machine telling you to stick your finger in a slot for a tissue scan.

    I am quite happy to stick with the mere 160 million DUNS numbers when I need to identify companies or people with whom I am doing business.

    The few times I have needed to invent an encoding scheme, I have a set of rules and methods for doing it. You can Google up my articles on the design of data.

  • Seven24

    RE: Context v. Domain
    > This is why IDENTITY is not an identifier;
    > it has only context local to one instance
    > of one table on one machine in one database
    > product. There is no external authority
    > or verification method. It is an
    > exposed physical locator.

    First, are we discussing an identity as an value available to non-technical users like developers and DBAs or a value only visible to non-technical users (i.e, the end users). It makes all the difference in the world. I agree that surrogate keys should never be displayed to end users.

    Second, at the SS administration, which generates SSN, is the SSN an identifier? In that scenario, there is no external authority as they are the authority generating the values. What about postal codes at the USPS? Are those an identifier in their system since there is no external authority?

    I can think of many scenarios in my experience in building actual systems (vs. theoretical ones) where there was no external authority to issue a value or the value wasn’t readily available at time of entry. Films are good example. ISAN numbers are barely used (especially in the US), there are multiple competing identifier authorities out there and many studios do not faithfully register ISAN numbers for their films. What about IMDB? Know what their identifier is? Yep, you guess it; an identity column.

    My mindset is in my experience building real world systems. A significant portion of the data elements that are modeled do not have an external authority nor verification method beyond existence in the proper table in the proper system. Customers, abstract concepts such as geographical sales regions which do not map to countries, legal entities which have no DUNS value and so on. Another example in the film world: scripts. A submitted script often does not even have a title or the title changes frequently and there is no requirement of uniqueness. Even the authors can change frequently. There is no external authority to issue identifier.

    > But unless I did UNIQUE (x,y,z)
    > on my table (i.e. a second key),
    > I can get different IDENTITY values
    > for the same entity.

    Your first sensible argument against Identity columns. Yep. This is THE most common mistake made when using surrogate keys (however they are generated or whatever their data type): failure to include a unique constraint on something else in the table. No question that is the biggest danger with their use for the very reason you set out: you can create logical duplicates that are physically claimed to be different. Still, there are many times when generating an arbitrary, non-changing value that is hidden from end users is useful.

    Think of identity columns like drugs. In small quantities and used properly, they can beneficial; used improperly and they’ll kill you.

  • Michael Beeby

    Card Decks
    Get a marker pen, Hold the deck sideways, draw a diagonal line. Then if (when) the deck gets dropped, you can get it back into order faster. This worked in the 1970s and 1980s.

  • TheOldBishop

    A trip down memory lane …
    What a great article. I have been in the business since the old punch card days and wired many a board for a collator, card reproducer and IBM EAM processing machines (the nearest thing to a computer in those days .. it could actually do arithmetic). While working my way thru Georgia Tech, I operated and programmed the first commercially available IBM 360-30 .. which IBM “gave us”. We could run 3 programs at a time in 64K. Kids these days dont know what to do with gigabytes of memory and more storage on a 1 pound hard drive than we could imagine.

    All that being said, principles well learned and well thought-out almost always carry forward and I cant imagine any of the guys from the old days being foolish enough to use an “identity field” to impart either meaning or navigation to a set of data. I think the whole thing started comming apart when the guys in the picture took off their suits and white shirts and switched to golf shirts and kaki slacks … data processing and society lost its much needed discipline when that happened. …. Ahh for the old days of DP before it was IT.

    I remember once when I was just a computer operator, a programmer gave me an object deck to run his program. THere were several cards whose notches were turned the wrong way. I flipped the cards so that they would all line up and loaded the deck into the computer. 3 debugging sessions later I learned those were the “patch cards” and he purposely punhed them upside down so he could locate them again … he was using his identifier to navigate .. and he had the same identifier assigned to multiple records … I’m not sure who was the dumb ass in that deal, me or him 🙂

    Thanks for the trip down memory lane.

    Del

  • Jimbo

    another Joe Celko anti-identity rant
    but the rest of the article is very good 🙂

  • Phil Factor

    Re: Identity columns
    Words of wisdom from Seven24 ‘Think of identity columns like drugs. In small quantities and used properly, they can beneficial; used improperly and they’ll kill you.’
    The trouble is that, without articles like Joe’s, people are unaware that, when you use identity columns indiscriminately as surrogate keys, you don’t even realize that it is a drug. It is fine for performance, but a number of operations become more difficult because of it. Simple example from real life: you have a list of countries of the world. You use an identity column and don’t bother with the international (ISO)country code. You come to the routine that updates the table (yes, countries change their names, coalesce or split: Czechoslovakia? Yugoslavia?) What good is your surrogate key now?

  • David

    Real World Application
    Besides pontificating for pontification sake, I didn’t see any real world application for this article.

    Because I didn’t have to learn how to use punch cards, doesn’t make me any less cognizant of data identification, processing, and storage concepts than somebody else.

    This is just another self serving Joe Celko article which does nothing to educate the SQL Server community.

  • Anonymous

    Misc replies
    TheOldBishop — I was at Georgia Tech in the late 1060’s. I sitll miss the B-5500 and Algol.

    >> . ISAN numbers are barely used (especially in the US), there are multiple competing identifier authorities out there and many studios do not faithfully register ISAN numbers for their films. What about IMDB? Know what their identifier is? Yep, you guess it; an identity column <<

    The EU is now required them by law. And I will have to check, but isn’t the motion picture industy in the US switched to ISAN by 2015? I ran into this prtoblem decades ago in Belgium with their Natiional Radio and TV network (BRTN). There was no standard identifier for TV shows back then. As I said, if you look and wait, you will find an industry standard.

    I just wish we would agree on electrical wall outlet plugs and voltage:)

    >> Second, at the SS administration, which generates SSN, is the SSN an identifier? In that scenario, there is no external authority as they are the authority generating the values. What about postal codes at the USPS? Are those an identifier in their system since there is no external authority? <<

    Unh? They are the external authority!

  • Seven24

    RE: Misc
    RE: ISAN

    As far as I know, they are not in fact required by the EU by law. However, many EU countries are (the only ones) using them. I know for a fact that the motion picture industry ( read studios ) are not guaranteed to adopt them by 2015 because there is a discussion right now involving the major studios about what identifier to use and there are competing standards. ISAN is being considered, but it has its own problems with respect to digital media and validation(i.e., ensuring all media is stamped with the proper ISAN value and determining what the proper ISAN value ought to be and validating that the ISAN value you got is in fact the one it should be. Not at all easy to do with clips for example). In fact, even in the EU it is not universally adopted unfortunately.

    RE: Unh? They are the external authority!

    That’s my point. You said only use identifiers managed by an *external* authority. How is the SSA supposed to do that? Where I’m going with this is that at some point, there *is no* external authority or said another way, your system *is* the external authority as far as you are aware.

  • –Jeff Moden

    Well done
    @Seven24,

    Very well done. You should write a counter-article. Seriously.

  • BuggyFunBunny

    My Surrogate Motha
    — The trouble is that, without articles like Joe’s, people are unaware that, when you use identity columns indiscriminately as surrogate keys, you don’t even realize that it is a drug.

    The real trouble, alas, is that kiddie koders think they know everything, and make the same mistakes their grandfathers made in the 1960’s. When these mistakes are pointed out to them, they get incensed, as if the pointer-outer is picking on them. Well, yeah. You’re a dummy.

    While it’s historically true that maths is done best by the young and foolish, comp sci isn’t much math, and being young and foolish isn’t asset.

  • Anonymous

    External vs Internal
    >> You said only use identifiers managed by an *external* authority. <<

    I think I said that it is the best way and that if you run into a situation without such an authority, then take the time to design your encoding scheme. Then I have a ton of rules for creating those encodings.

    Most of the time, we are dealing with a pre-existing environment. Someone else has done it and set up standards. This is why software professionals say “buy, don’t build” and amateurs want to re-invent the wheel in programming situation. Inventing your own encoding schemes is the database version of the same mindset.

    >> How is the SSA supposed to do that? Where I’m going with this is that at some point, there *is no* external authority or said another way, your system *is* the external authority as far as you are aware.<<

    Infinite recursion: see Infinite recursion 🙂

    Consider a new project. Your team gets to pick the name, the cartoon mascot and fight song. But you go to accounting to get the account numbers for charges. Accounting is internal to the company, but external to your team.

  • Celko

    More on ISAN and failures to use Standards
    The ISAN number has been made mandatory by EGEDA, the Spanish Audiovisual Producers’ Rights Management Association in November 2009.

    This year, Film Foundation of India has submitted an Expression of Interest for implementing an ISAN Registration Agency in the coming months to serve the audiovisual market in the Indian territory. In August 2010, the Beijing Institute of Coding for Resources (CCR) has submitted an Expression of Interest for implementing an ISAN Registration Agency in the China Mainland.

    Before UPC codes were pre-printed packages might remember what it took to apply your own bar code stickers on inventory or using simple price stickers.

    The now bankrupt Borders Books does 🙂 They were the second largest chain in the US. They put their own bar code labels on their merchandise right up to the last minute. That is a lot of needless overhead.

  • Seven24

    RE: ISAN
    Spain and India != the world or even the EU. As I said, there are competing standards (I think one of them is called EIDR). In the US, some of the studios have set ISAN values for some of their films (note the operative word “some”). However, the studios are only part of the industry. There are many other industries which work with films (e.g. distribution, production, credit industries etc) which hardly ever use them or at best simply provide storage for them (i.e., do not actually use them to identify the film). As I said, some European countries are using ISAN but it has not gotten traction in the US. In fact, the IMDB number has more traction than ISAN sad as that is.

  • AlexK

    ISBNs are not unique
    Hi Joe,

    Unfortunately, ISBNs are not always unique. I learned that from my personal experience. A few years ago we borrowed some sheet music, liked it, and ordered it on Amazon. When our order arrived, the book we got had the same ISBN, but different contents. The Amazon support people fixed it for us right away, and we did not get the feeling that they were surprised, not at all.

    I am not an expert on books and ISBNs, but maybe we should not always design all our databases around the assumption that someone else is always doing a perfect job? What do you think?

    Suppose that we have a huge database of books, with ISBN used as PK, and then we discover that ISBN is no longer completely unique? We are in trouble, we are in for a big and expensive redesign, are we not?

  • Dave Convery

    Re: the Borders thing.
    Sadly for me, I can speak directly to the Borders ‘additional barcode’ situation, having served in the trenches there.

    There may have been some additional overhead from the Borders stickers, but it massively reduced the amount of time it took to sort, shelve (and reshelve, and reshelve, and reshelve) the books, because they contained far more information than simply a barcode. So while it took time to create and apply the stickers, it reduced the time taken in all other aspects of inventory management, from shipping to shelving. When even the small stores have stock counts in 6 digits, it’s a Godsend.

    Even if you could reliably use an ISBN as a unique identifier, surely it’s better to have a quick, cheap sticker that means each person handling that item doesn’t have to look it up each time?

    Borders failed for a whole host of reasons, but stickering their books wasn’t one of them. It was one of the few organised and efficient things about them.

  • Celko

    Replies
    The DOI has been standardised as ISO/DIS 26324, Information and Documentation — Digital Object Identifier System back in November 15, 2010. It is used by the EIDR. I think it is reasponable to predict that a standard will settle it in the next few years, just UPC/EAN did. My point is that industry standards exist and there is no need to invent your own.

    AlexK: I used to own two bookstores. There were some duplicate ISBNs in the early days, but I never saw them and they have been controlled since then. What you do find is a rare mis-print on the packages from smaller publishers. I’d like to know how well the conversion from ISBN-10 to ISBN-13 went. Time to ask Bowker’s ISBN department.

    Dave: I had friends who consulted at Borders many years ago on their inventory database. I agree that they had other problems. But besides the physical sticking and labor, they were constantly converting from their internal codes to industry standards.

  • Celko

    Dave again
    I did a Google and got an essay by Mark Evans, former Director of Merchandise Planning & Analysis at Borders. His point #5 was:

    Failure to build efficient systems and processes – While Borders’ legendary “expert system” was considered cutting edge and an advantage early on, the company failed to successfully build upon this foundation and create new, better assortment, replenishment, and supply chain systems and processes to keep pace with the changing state of technology and efficient retail operations. B&N invested considerable time/energy/money through the 90’s in systems and processes. To provide one example, a lower ranked title that sells out in a B&N will be replenished from a central warehouse within 2-3 days. The same process could take up to 16 weeks for Borders. Borders sought to upgrade systems with two large efforts in the 00’s: first one was a home grown effort called Common Systems. Second was a “buy and integrate” project to implement Retek and E3. Both failed spectacularly. The Retek effort dramatically hurt the Walden chain, the only business unit that was managed by the system. With both of these efforts, large sums of money and, perhaps more importantly, human resources and time were squandered.

  • BuggyFunBunny

    B&N ain’t all that smart
    My local B&N (in a “city” not much bigger than a town) stocks single copies of obscure titles (“Thinking in Sets”, for one) I’ve special ordered. Not really very smart.

    OTOH, they really downsized the computer section over the last couple of years, so it’s either special order or Amazon.

  • Dave Convery

    More on Borders
    Undoubtedly Borders sank huge amounts of money into replacing their systems, abandoning them each time (they finally replaced the instore systems after I left – with a new system that couldn’t even look up authors from the frontend!) but it’s not quite the same point as the one I was making. The fact that elements of the system were utterly lousy doesn’t mean that the stickering was. It saved a lot of time in distribution and moving stock around a store.

    But yes, in not using ISBNs on these stickers they weren’t doing themselves any favours (at least until ISBN-13s were introduced, when it became very useful for booksellers – the systems they had at the time couldn’t handle the new codes). So, while the overall system was a horrible, convoluted, outdated mess, there were elements in place that were genuinely useful for the practical business of running a bookshop.

    I never thought I’d want to defend Borders!

  • tonyrogerson

    Codd on Surrogate Keys
    Hi Celko,

    >
    IDENTITY is not a surrogate key according to Codd because it is exposed to the user. Say I have an (x,y,z) coordinate system as my natural key and an IDENTITY as the surrogate key. According to Dr. Codd if I try to put in a duplicate (x,y,z), the surrogate key mechanism will detect the error and disallow it. But unless I did UNIQUE (x,y,z) on my table (i.e. a second key), I can get different IDENTITY values for the same entity. I have to hand-code all of the surrogacy constraints myself!

    Let’s be clear, that is only “your” interpretation of Codd’s paper, others including Chris Date who is well regarded as the authority on the relational model and Codd’s colleage do not share your interpretation.

    A key is not a key at all unless it is unique, the relational model has no constraints, that is an implementation issue, SQL has key (unique/primary) to enforce uniqueness.

    In my blog post http://sqlblogcasts.com/blogs/tonyrogerson/archive/2010/03/13/database-design-surrogate-keys-part-1-of-many-rules-for-surrogate-keys-e-f-codd-and-c-j-date-research-and-problems-they-solve.aspx I document precisely what a surrogate key is by using (and reading the entire) paper that Codd wrote and also looking at the further work others including Chris Date have done.

    Surrogate keys are fine, I do agree they should not be shown to USERS, but internally they are fine for plumbing. Note – a USER is defined as the external recipient of the data, data that has left the confines of the internal consistency of the database AND application.

    Tony Rogerson, SQL Server MVP