Phil Factor

Primary Key Primer for SQL Server

02 December 2013

Every database developer uses keys, but without always understanding all the ramifications. They come with few hard and fast rules, but if you get them right from the start with a database design, the whole process of database development is simpler, and the result is likely to perform better. We asked Phil for advice, little knowing that the explanation might take a while.

Primary keys can be rather a blind-spot for database developers. I can understand and sympathize, because there is some ambiguity in the advice given about their design and use, and any attempt to offer hard-and-fast rules tends to add to the confusion. In writing this article, I felt the urge to try to provide a clear guide for the busy developer. This turned out to be an ambitious aim.

Primary keys cut across other topics such as indexes and foreign keys. To explain some of the issues, I have to discuss other topics, such as clustered indexes, which deserve their own article. I’ll give as concise an explanation as I can, and provide references for further reading. (There is some code to accompany this article here.)

What is a key, and why is one of them a primary key?

For a table to be relational, any scalar value must be addressable by specifying the table, column and the key value(s) of the containing row. This key, often referred to as a candidate key, defines a set of one or more columns that together can distinguish individual rows in a table.

Originally, relational theory allowed only one key for a table. After a while, this was considered an unnecessary restriction and so now there is no limit on the number of keys you can have for a table (though it can be argued that this is a violation of Boyce-Codd Normal Form, BCNF). With several candidate keys, the question then was: which one should be used to reference individual rows if no particular key is specified? To meet this requirement, the idea of the table having a primary key was born, where a single key amongst the candidate keys was selected to be the primary one. In short, a primary key is merely the default key for a table. Any type of base table, even a temporary table or table variable, can have keys.

Keys can comprise several columns, but should only contain sufficient columns to ensure uniqueness. A key constraint is satisfied if, and only if, no two rows in the table have the same non-null values in the unique columns. The SQL Standard enshrines the subtle difference that null values are allowed, with some restrictions, in a key’s participating columns, whereas they are forbidden in the primary key. In practice, I’ve never found a use for a key with nulls in any of its participating columns. My advice is to make them not null.

Although relational theory now gives no weight to the importance of the primary key over the others, SQL Server implements them in such a way that they have more particular significance. A primary key is important for SQL Server, not only to allow for processes such as replication, Change Data Capture, FileTables, audit trails, and data synchronization, but also to simplify the creation of foreign key constraints. It is also used as a logical property of the table for modeling the database.

How are keys defined for a table?

Keys are defined and enforced by key constraints. There are just two types, a primary key constraint and a unique key constraint (a foreign key constraint is actually a referential constraint). A key constraint defines one or more columns to be a key. A primary key constraint indicates that the key is the default key, and that all its contributing columns are NOT NULL.

For any key constraint that you define in a CREATE TABLE script, SQL Server will ensure that an error is raised if there is any attempt to store, in the column(s) that comprise the key, any data that is not unique. If you define a key on an existing table, then SQL Server will check for duplicates and, if it finds any, it will raise an error and roll back the table alteration.
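
A minimal sketch of both behaviours (the table and constraint names here are illustrative, not taken from the article’s accompanying code):

CREATE TABLE dbo.Customer
  (
  CustomerID int NOT NULL,
  Email varchar(254) NOT NULL,
  CustomerName nvarchar(100) NOT NULL,
  CONSTRAINT PK_Customer PRIMARY KEY (CustomerID),
  CONSTRAINT UQ_Customer_Email UNIQUE (Email)
  );

INSERT INTO dbo.Customer (CustomerID, Email, CustomerName)
  VALUES (1, 'phil@example.com', N'Phil Factor');

-- Fails: violation of PRIMARY KEY constraint 'PK_Customer'
INSERT INTO dbo.Customer (CustomerID, Email, CustomerName)
  VALUES (1, 'mole@example.com', N'Database Mole');

-- Adding a key to an existing table is checked against the data already there:
-- if duplicate CustomerName values existed, this ALTER TABLE would fail and be rolled back
ALTER TABLE dbo.Customer
  ADD CONSTRAINT UQ_Customer_Name UNIQUE (CustomerName);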

A primary key constraint requires the key columns to be declared as NOT NULL, and SQL Server implies NOT NULL if the constraint is declared in the CREATE TABLE statement without specifying the nullability of the column. If you attempt to add a primary key constraint on a column that allows NULL values, the statement will immediately fail, even if the column contains no NULL values.

As noted earlier, you can opt to have a candidate key that does allow NULLs, but such keys are liable to do strange, unexpected things. As well as being used to enforce key constraints, unique indexes in SQL are also used to enforce non-key rules such as ‘must be unique where the value is known’, so they ought to allow this where the column is NULL (SQL Server implements this feature incorrectly, since it allows only one NULL value in the column).
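
The usual workaround for ‘unique where the value is known’ is a filtered unique index rather than a unique constraint. A sketch, with illustrative names (filtered indexes need SQL Server 2008 or later):

CREATE TABLE dbo.Employee
  (
  EmployeeID int NOT NULL CONSTRAINT PK_Employee PRIMARY KEY,
  NationalInsuranceNo char(9) NULL   -- not known for every employee
  );

-- A UNIQUE constraint here would allow only a single NULL row, so instead
-- enforce uniqueness only where the value is known
CREATE UNIQUE NONCLUSTERED INDEX UQX_Employee_NINo
  ON dbo.Employee (NationalInsuranceNo)
  WHERE NationalInsuranceNo IS NOT NULL;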

SQL Server will create an index on a key in order to enforce the uniqueness of the key efficiently (i.e. to support fast validation of the key constraint). These indexes also support use of the key in joins, and the selection of single rows or ranges. SQL Server assigns this index the same name as the key constraint.
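
You can see this in the catalog views. Assuming the dbo.Customer table sketched earlier, something like:

SELECT i.name AS IndexName, i.type_desc, i.is_unique, i.is_primary_key, i.is_unique_constraint
  FROM sys.indexes AS i
  WHERE i.object_id = OBJECT_ID(N'dbo.Customer');
-- Returns one index named PK_Customer and one named UQ_Customer_Email,
-- matching the constraint names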

For both unique key and primary key constraints, the columns chosen for the key must be of a datatype that is allowed for an index. This precludes columns that can’t be compared, such as ntext, text, image, varchar(max), nvarchar(max), varbinary(max), XML, geography, geometry, and CLR user-defined data types that do not support binary ordering.

SQL Server cannot adapt an existing suitable index to support a key constraint: it will always create a new one even if there is already an index with exactly the same definition. To create a key on a single column, you can use either a table or column constraint in the table-creation script. To create a key on multiple columns, you must use a table constraint.

Primary keys in table or column constraints can be anonymous, although I would advise naming them explicitly in permanent tables.
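
A sketch of the two forms, with explicit (illustrative) names; the multi-column key has to be declared as a table constraint:

-- Column constraint: the key is a single column
CREATE TABLE dbo.Currency
  (
  CurrencyCode char(3) NOT NULL CONSTRAINT PK_Currency PRIMARY KEY,
  CurrencyName nvarchar(50) NOT NULL
  );

-- Table constraint: required when the key spans more than one column
CREATE TABLE dbo.OrderLine
  (
  OrderID int NOT NULL,
  LineNumber smallint NOT NULL,
  ProductCode varchar(20) NOT NULL,
  Quantity int NOT NULL,
  CONSTRAINT PK_OrderLine PRIMARY KEY (OrderID, LineNumber)
  );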

Should I use a natural or surrogate key?

A key is just a combination of column (attribute) values that provide an obvious way of distinguishing rows, and a natural, or ‘domain’, key is one that has meaning outside the database environment, such as Longitude/Latitude. Many people argue for a general rule that keys must be natural (or perversely that they shouldn't be) or that they must be immutable. I've even heard it said that all joins must be on key values. As always, it depends. On this topic, there are very few hard and fast rules, because sometimes there are conflicting requirements between the knowledge level and the requirements of implementation.

Primary keys for business entities such as invoices, deliveries or products are generally best if they relate directly to the current language of business, and are part of the group culture. For example, financial entities tend to be date-related, since this is generally the way that commerce orders things. Therefore a date-related column is likely to be part of a natural key for such a table, because it aligns well with the natural way that the business sorts and groups its data. A natural key helps to bridge between the real world and the database.

Businesses often have arcane rules for describing individual items, apparently passed down from distant tribal ancestors, but often imposing a common-sense sequence to the data that the business uses to group and report. They resent having to change their system to incorporate a surrogate key, which is usually a meaningless incrementing number. The cost of retraining, or changing the corporate culture can be astronomical. As Codd says in his paper RM/T Section 4,

'Users [...] are no longer compelled to invent a user-controlled key if they do not wish to'

It is their choice and we can play it either way. Besides which, when things go wrong, it is much easier to sort things out if the keys are meaningful, and errors are so much easier to spot too before damage is done.

Surrogate keys, usually implemented via an IDENTITY column, are the normal way of getting round the complexities of trying to handle natural keys that are ungainly or don’t quite conform to your business rules. These are fine if they are kept private within the database: otherwise, they are a form of technical debt. They make the coding of databases easier but are disliked by book-keepers, accountants, retailers or anyone else who has to handle them. They aren't human-friendly. Sometimes, surrogate keys ‘escape’ into the world if exposed as ‘reference numbers’ and take on a permanent and intended meaning that prevents any refactoring or renumbering within the database.

Although a natural key is good to have, there is always a place for surrogate keys such as those provided by an integer or numeric IDENTITY column. A natural key, such as an internationally-agreed code or name, makes ETL and data warehousing a lot easier to maintain, but you should never cling to the idea, with the whites of your knuckles showing, when it is easier for everyone to use a simple surrogate. A simple incrementing integer is a compromise, but rarely one that returns to haunt you, unlike a GUID, which is always horrible.
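
A common compromise, sketched here with illustrative names, is an IDENTITY surrogate as the primary key alongside a unique constraint on the natural key, so that the surrogate never becomes the only guarantee of uniqueness:

CREATE TABLE dbo.Product
  (
  ProductID int IDENTITY(1,1) NOT NULL CONSTRAINT PK_Product PRIMARY KEY,    -- surrogate, kept private to the database
  ProductCode varchar(20) NOT NULL CONSTRAINT UQ_Product_ProductCode UNIQUE, -- natural key the business actually uses
  ProductName nvarchar(100) NOT NULL
  );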

For my primary key, should I choose a clustered index or not?

When you use key constraints, you are defining the logical rules for ensuring that the data is correct. You determine the way that tables are accessed, how they interrelate, and ensure that entity integrity is enforced (for referential integrity we use the foreign key constraint). In effect, you leave it to the RDBMS to decide how it all gets implemented, and expect it to do so.

SQL Server chooses to create indexes for each key, which is sensible. However, SQL Server has little idea about your database and data, and so needs help to choose the right sort of index. Unless you specify otherwise, it will choose a clustered index for a primary key. However, if you previously specified a clustered index for a unique constraint and then specify a primary key constraint, the primary key will get a non-clustered index. If you are adding a primary key, or moving it, by using an ALTER TABLE statement, then the clustered index may already be in use and, in this case, SQL Server will likewise create a non-clustered unique index instead. A non-clustered unique index will work reasonably well in most circumstances to support a primary key if the existing clustered index has been well-chosen.
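
If you decide the clustered index is better spent elsewhere, you can say so explicitly. A sketch with illustrative names, clustering on a date-based candidate key and leaving the surrogate primary key non-clustered:

CREATE TABLE dbo.Invoice
  (
  InvoiceID int IDENTITY(1,1) NOT NULL,
  InvoiceNumber varchar(20) NOT NULL,
  InvoiceDate date NOT NULL,
  CustomerID int NOT NULL,
  CONSTRAINT PK_Invoice PRIMARY KEY NONCLUSTERED (InvoiceID),
  CONSTRAINT UQ_Invoice_InvoiceNumber UNIQUE NONCLUSTERED (InvoiceNumber)
  );

-- The single clustered index supports date-range queries and ordered reporting
CREATE CLUSTERED INDEX CIX_Invoice_InvoiceDate
  ON dbo.Invoice (InvoiceDate, InvoiceID);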

It is seldom wrong to assign a clustered index to a primary key. It is just a problem if your choice of primary key is a ‘fat key’ without natural order that doesn’t work well as a clustered index, or if there is a much better use in the table for that clustered index, such as supporting range scans or avoiding sorts on a frequently-queried candidate key.

Primary Keys and clustered indexes are very different. A primary key is a logical construct and a clustered index is an index with a special physical implementation. By specifying a clustered index for a key, you determine the way that a key is implemented. Although your choice of Primary key may well prove to be most effectively implemented with a clustered index, it isn’t necessarily so.

The choice of the clustered index can have a huge performance impact. The candidate key that makes logical sense as a primary key does not necessarily have the characteristics that are required for a well-performing clustered index. A good clustered index is lean, incrementing and easy to do comparisons with. A good primary key is not always like that.

There are certainly some characteristics of the clustered index that often make it a good choice for a primary key. For example, a good clustered index needs columns that rarely change in value, as does a key column that is referenced by a foreign key.

A clustered index allows you to access any value in the row efficiently because it is automatically ‘covering’, since by reaching the leaf, you’ve reached the row itself. If you've chosen a natural primary key that is commonly used in searches, and implemented it as a clustered index, you wouldn't need 'lookups' for the column values or predicates.

If using a non-clustered index for the primary key (or any key), it is more awkward to fetch any columns other than those participating in either the primary key or clustered index key. If you specify columns in your SELECT statements that can’t be covered then a separate read of the data row will be needed to get that data after a row is located by the non-clustered primary key index. You can’t prevent this by specifying the non-key columns to be added to the leaf level of the non-clustered index since there is no way of specifying any key with INCLUDE columns.
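
If a covering index on the key is what you need, the usual workaround is to enforce the key with a unique non-clustered index, which can carry INCLUDE columns, rather than declaring a constraint. A sketch against the dbo.Customer table used earlier:

-- Enforces the same uniqueness rule as a UNIQUE constraint on Email,
-- but also covers queries that only need CustomerName
CREATE UNIQUE NONCLUSTERED INDEX UQX_Customer_Email_Covering
  ON dbo.Customer (Email)
  INCLUDE (CustomerName);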

However, you have lost your greatest weapon in the fight for performance if you throw away your one clustered index per table on a primary key that is meaningless to the business and so never used in searches, aggregated on, or used for the ordering of data. Your table may have a number of alternative candidate keys, and you could find that the usage of one of these is better suited to being implemented via a clustered index. You might, in some circumstances, find that it is even better to use non-clustered indexes for all candidate keys, and create a non-unique clustered index.

Do not be afraid to use the clustered index for another key if it fits better with the way you query the data, and specifically how the table participates in frequently-used queries. Are you likely to select predominantly single values, unsorted or sorted ranges? Are you predominantly using one particular index other than your primary key? Is the table experiencing many more reads than writes, with queries that make reference to columns that aren't part of the primary key? Are you typically selecting ranges within a particular category? Are your WHERE clauses returning many rows? These ways that the table participates in frequently-used queries are likely to be better accommodated by a clustered index, which shouldn’t be frittered away on a surrogate primary key based on some meaningless ever-increasing integer.

For your clustered index, you are likely to choose a ‘narrow’ key which is stored economically, because this value has to be held in every non-clustered index row as the row locator. This can be an interesting trade-off because, since the clustered index key is automatically included in all non-clustered indexes, a non-clustered index will cover queries that need only the non-clustered index key and the clustered index key.
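
A sketch against the dbo.Invoice table above: because the clustering key travels with every non-clustered index, a query that touches only the non-clustered key and the clustering key is covered without any INCLUDE columns.

-- Covered by UQ_Invoice_InvoiceNumber alone: InvoiceNumber is the index key, and
-- InvoiceDate and InvoiceID are carried in the index as the row locator (the clustering key)
SELECT InvoiceNumber, InvoiceDate, InvoiceID
  FROM dbo.Invoice
  WHERE InvoiceNumber = 'INV-2013-00042';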

For a table that has to support a high volume of inserts, a continuously increasing primary key can help performance because page splits are eliminated. But if the insert volume increases even more, latch contention on the 'end' pages of the index can undo this effect and a more random distribution is likely to then perform better. In some circumstances, therefore, a primary key that is implemented via a non-clustered index will perform better when used in combination with a separate clustered index.

Should a table always have a primary key?

It is wrong to say that every table requires a primary key, but any relational table must, by definition, have at least one candidate key. You may find a primary key is compulsory if you have a SQL Server process that requires it: for example, if the table participates in transactional replication or if you’re using SQL Azure. You may, however, deliberately choose to use a SQL Server table in a non-relational way, and it is perfectly legitimate, though odd, to have a table without any key in SQL Server.

A bigger issue is whether a table must have a clustered index, the index often chosen, rightly or wrongly, to implement a primary key. A 'table' without a clustered index is actually a heap, which is a particularly bad idea when its data is usually returned in an aggregated form, or in a sorted order. Paradoxically, though, it can be rather good for implementing a log or a ‘staging’ table used for bulk inserts, since it is read very infrequently, and there is less overhead in writing to it. A table with a non-clustered index, but without a clustered index, can sometimes perform well even though the index has to reference individual rows via a Row Identifier rather than a more meaningful clustered index key. The arrangement can be effective for a table that isn’t often updated if the table is always accessed by a non-clustered index and there is no good candidate for a clustered index.

Is a Smart Key a good idea?

A Smart Key, or concatenated key, is a key into which more than one value is encoded. A common example of a smart key is one that uses internationally-accepted codes such as the IBAN, ISBN, email address or postcode. Whereas there is no problem with using Smart Keys that already exist within the user domain, it will come back to haunt you and your successors if you create a new Smart Key in your database for use as a primary key.

Despite the support in SQL Server for composite keys composed of more than one column, it has been an occasional practice to pack the values of more than one column into the primary key to make a ‘smart’ or concatenated key. In the past, this type of key had to be maintained by hand but nowadays you can use a computed column as part or all of any primary key or unique constraint, as long as the computed column value is defined by a deterministic expression and the data type of the result is allowed in index columns. You can also update foreign keys, using cascaded updates, if the smart key changes its value.
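
A sketch of a computed column participating in a primary key (the names are illustrative; the ISNULL wrapper and PERSISTED are there so that SQL Server treats the deterministic expression as non-nullable and indexable):

CREATE TABLE dbo.Shipment
  (
  RegionCode char(2) NOT NULL,
  ShipmentNo int NOT NULL,
  -- 'Smart' reference such as 'EU-000123', maintained automatically by the computed column
  ShipmentRef AS ISNULL(RegionCode + '-' + RIGHT('000000' + CAST(ShipmentNo AS varchar(10)), 6), '') PERSISTED,
  CONSTRAINT PK_Shipment PRIMARY KEY (ShipmentRef)
  );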

Despite it now being easy to implement a Smart Key, it is hard to recommend that you create one of your own that isn't already a natural key. Whatever their advantages, they tend eventually to run into trouble: a Smart Key makes the database harder to refactor, imposes an order which is difficult to change and may not be optimal for your queries, requires a string comparison if it includes non-numeric characters, and is less effective than a composite key in helping range-based aggregations. It also violates the basic relational guideline that every column should store atomic values (First Normal Form; the Information Rule, Codd’s first rule).

Smart Keys also tend to outgrow their original coding constraints: US Zip codes, for example, have become extraordinarily difficult to decode after years of expediency. The initial zero can mean one of seven states, Puerto Rico, or military addresses in Europe.

Can I use keys in Temporary Tables?

You can and should use key constraints for temporary tables and table variables, for the same reasons you should use them for permanent tables, with the exception that you can’t use them in combination with foreign key constraints to maintain referential integrity. You can create key constraints in local or global temporary tables, but you can’t create foreign key constraints on them. If a foreign key constraint is specified in a temporary table, the statement returns a warning and ignores the constraint.

With temporary tables, it is best to make key constraints anonymous. If a temporary table is created within the scope of a user-defined transaction and has a named constraint, only one user at a time can execute the statement that creates the temp table.

Both temporary tables and table variables can have keys defined in the table definition, but only temporary tables can have keys added later (with ALTER TABLE) and indexes added later (with CREATE INDEX).

You can define keys as either column or table constraints. For a table variable, this is the only way to create indexes, and only at the time the table is defined. It can make a great difference to the performance of a table variable, just as it does for a permanent table.
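
A sketch of the difference (illustrative names): keys on a temporary table can be declared inline, and left anonymous as advised above, or added later with ALTER TABLE, whereas a table variable can only declare them inline.

CREATE TABLE #Staging
  (
  OrderID int NOT NULL PRIMARY KEY,   -- anonymous constraint: avoids name clashes between sessions
  OrderDate date NOT NULL
  );

ALTER TABLE #Staging ADD UNIQUE (OrderDate, OrderID);   -- allowed for temporary tables, not table variables

DECLARE @CountryLookup table
  (
  CountryCode char(2) NOT NULL PRIMARY KEY,   -- keys (and so indexes) must be declared here...
  CountryName nvarchar(60) NOT NULL UNIQUE    -- ...they cannot be added afterwards
  );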

Conclusions

Although key constraints are a logical device to ensure data integrity and guarantee access to individual rows, there is no easier way to increase the chances of a SQL Server database performing well than to provide effective keys. If this can be done from the very start, a few hours spent in refining your choices, and assigning the optimum attributes to them can prevent countless hours later on in creating, maintaining and pruning extra indexes.

A misjudged key can have surprising ramifications if the error is not corrected in time. Although many experts have attempted to define hard-and-fast rules or ‘best practices’ for keys, experience invariably seems to prove them guilty of over-generalization. This work cannot be distilled into a few rules. As usual with databases, everything comes down to measured tests with your data and your likely pattern of queries.

Further reading

Glossary

Key
A key enforces the entity integrity of a table. To be usable (in other words, to be able to identify a database entity unambiguously), a key must be unique and known (i.e. NOT NULL). It can consist of one or more attributes (columns). A key is no more than a 'candidate' until you use it, and you can have as many candidate keys in a table as you like.
Key constraint
A constraint to the effect that a subset of the table’s columns are a candidate key for the table.
Fat key
A key that contains 16 bytes or more, typically a GUID
Primary key
The default candidate key for a table. A primary key is no more than an indexed key whose value, taken together across all component columns, is unique and known.
Compound key
A key consisting of several columns, in which all values are known (not null) and, in combination, are unique.
Foreign key
A reference to a key in another table that uniquely identifies a row of that table. Despite its name, it is enforced by a referential integrity constraint rather than a key constraint. Foreign key constraints enforce the rule that every non-null foreign key in the referencing table corresponds to a row of the referenced table.
Candidate key
Any key for which only one row will hold the combination of values held in the constituent columns and for which the same does not apply to any subset of these columns. An alternate key generally refers to any key that isn’t a primary key, whereas a candidate key includes the primary key.
Natural key
A key whose attributes comprise information used by people. Primary keys for business entities such as invoices, deliveries or products are generally best if they are composed of something the business already uses, such as a product code, shipment reference, customer reference, or invoice number, and that is part of its culture. Use natural keys when they fit; otherwise, use surrogates.
Surrogate key
A computer-generated candidate key. Surrogate keys, such as those provided by an IDENTITY column, are the normal way of getting round the complexities of trying to handle natural keys. The use of an object-relational mapping layer will often lead to additional restrictions on the design of the database, including the rule that primary keys should be immutable, that is, never changed in value, and that they should be anonymous integer or numeric identifiers. Neither of these restrictions appears in the relational model or any SQL standard.
Smart Key
(or concatenated key)
A key containing a single value, usually composed of the values of several columns.
Composite key
A key that uses several non-null columns.
Alternate key
A candidate key that is not assigned to be the primary key.
Immutable key
This is a key whose value cannot be changed. This is the normal assumption, though it is seldom enforced and isn’t relevant to the relational model
Heap
A heap is any data structure where the records themselves are not ordered (i.e. not linked to one another). In SQL Server, a heap is a table without a clustered index.
Clustered index
The clustered index defines the order of the data in the table based on the clustering key value, so there can be only one per table. The data of the table becomes the leaf level of the clustered B-tree index.
Non-clustered index
The non-clustered index tree contains the index keys in sorted ‘logical’ order, with the leaf level of the index containing a row locator (the clustering key if there is one, otherwise a RID), as well as any non-key columns specified in the index definition.

Fig 3: Particularly nasty bugs can be prevented by well-placed key constraints.

Phil Factor

Author profile:

Phil Factor (real name withheld to protect the guilty), aka Database Mole, has 30 years of experience with database-intensive applications. Despite having once been shouted at by a furious Bill Gates at an exhibition in the early 1980s, he has remained resolutely anonymous throughout his career. See also :


Subject: Surrogate keys
Posted by: snetzky (view profile)
Posted on: Thursday, December 05, 2013 at 9:40 AM
Message: I've been a long time advocate of surrogate keys merely because they limit the database's exposure to risk that something will change. The rule of thumb is that the key shall describe the row, the whole row and nothing but the row. The problem is that if a key value has meaning anywhere outside the row, it's highly likely that that value will change at some point, causing effects that will ripple through the database.

One example I was given was of a booking system for a hotel. Room number seems an ideal natural key, but I could see the hotel being remodeled and rooms given new numbers as part of the remodel.

A surrogate key has the added advantage in that it has no meaning other than to refer to the row, which means that it will never change its meaning.

I do use composite indexes to enforce uniqueness of field combinations, but that to me is a different animal than identifying the row.

Subject: Surrogate keys
Posted by: callcopse (view profile)
Posted on: Monday, December 09, 2013 at 9:52 AM
Message: I'm with you Snetzky.

Additional advantages are that ORMs and such work really easily with a surrogate key, and that if you have a uniform approach to structuring your data then any team members can switch over and see what is occurring. I would avoid clustering on the identity for large transactional tables of course where a date field is a more natural choice.

Of course there is a time to discard such principles - mainly around import tables and other sloshing around type bits and bobs.

Subject: Surrogate keys
Posted by: Anonymous (not signed in)
Posted on: Monday, December 09, 2013 at 2:29 PM
Message: FQR-Comment
I agree with the advantages of Surrogate keys and I use them whenever it’s possible. But surrogate keys impose some artificial confidence on uniqueness.
For example:
In a table with 2 columns ID and Full Name; these 3 rows are different on virtue of the surrogated key.
ID:1, Full Name Phil Factor
ID:2, Full Name Phil Factor
ID:3, Full Name Phil Factor
But are they really different?
I could’ve eliminated this situation by making the Full Name column unique, which, on the other hand, would introduce other problems




Subject: Surrogate keys
Posted by: Anonymous (not signed in)
Posted on: Tuesday, December 10, 2013 at 10:22 AM
Message: > Therefore a date-related column is likely to be part of a natural key for such a [business entities such as invoices, deliveries or products] table

Therein lies the trap. Yes, if the company never acquires another company or merges with another entity within the same company or always does business the same way such that the natural key remains the same, then yes it can be a logical *candidate* key. However, experience should tell you that this is often not the case.

> Sometimes, surrogate keys ‘escape’ into the world if exposed as ‘reference numbers’ and take on a permanent and intended meaning that prevents any refactoring or renumbering within the database.
...
> A simple incrementing integer is a compromise, but rarely one that returns to haunt you, unlike a GUID which is always horrible.

Your first paragraph is one of the core reasons a sequential GUID (no one should use a random GUID for a key) is better than an integer. It simply isn't a choice because of its user unfriendly nature, by accident or on purpose, for a developer to expose the GUID to the user for consumption. However, I see it happen all the time with integers.

> ...smart key is one that uses internationally-accepted codes such as the IBAN, ISBN, email address or postcodes

ISBN is a great example of a key that *should* be universally unique but isn't. Talk to someone that has had to manage a library's database system about how often they encounter duplicate ISBNs. It represents one of the dangers of keys where the final uniqueness is not set by a single authoritative source. E.g., currency codes all come from a single source: ISO. However, ISBN values are only partly built on values set by a central source and partly set by the vendors. That means there is the potential for those vendors to create non-unique values.

Your comment about the dangers of devising your own smart key is spot on. Invariably, the business will outgrow the assumed constraints in the key.

Finally, on surrogate keys, there is a false assumption that using a surrogate key obviates the need for imposing some other logical key. It does not. There has to be some other combination of unique columns that makes sense to the business or else you will get two identical rows only differentiated by their surrogate key.

Subject: Surrogate Mother...
Posted by: Anonymous (not signed in)
Posted on: Tuesday, December 10, 2013 at 1:15 PM
Message: -- There has to be some other combination of unique columns that makes sense to the business or else you will get two identical rows only differentiated by their surrogate key.

Which is exactly why they should never be used. All such do, outside of corner cases such as order/invoice, is give one a false sense of security. If anything changes in that "other combination" one has to deal with the issue. May as well make it the primary key. Moreover, clustered indexes are useful to those who want/need to process RBAR style over some dominant order. If the primary key/index differs from the clustered index, performance surprises are sure to occur.

Subject: RE: Surrogate Mother...
Posted by: Anonymous (not signed in)
Posted on: Tuesday, December 10, 2013 at 1:51 PM
Message: > Which is exactly why they should never be used. All such do, outside of corner cases such as order/invoice, is give one a false sense of security.

That's throwing the baby out with the bath water. The list of benefits that surrogate keys provide is long. ORM's are just one example. *However*, that doesn't mean that they can be used without any precautions or considerations.

> Moreover, clustered indexes are useful to those who want/need to process RBAR style over some dominant order. If the primary key/index differs from the clustered index, performance surprises are sure to occur.

Whether a developer uses a RBAR solution is entirely orthogonal to the indexing strategy. I've never heard of developers going out of their way to specifically make an index clustered for the express reason of cycling over that index using a cursor or some other RBAR solution. One is simply not a factor of the other.

Subject: GUIDs bad?
Posted by: Anonymous (not signed in)
Posted on: Thursday, December 19, 2013 at 5:30 AM
Message: Hmmm. GUIDs are always bad? My current system uses them on just about all tables to enforce uniqueness across 282 satellite databases (not always connected to the world) that are then merged into the master database. I'd say this is a good use of GUIDs. And replication requires a GUID, although it does not require it to be the primary key...

Subject: GUIDs - more
Posted by: Anonymous (not signed in)
Posted on: Thursday, December 19, 2013 at 6:17 AM
Message: mr phil please could you please explain a little more the issue "unlike a GUID which is always horrible" ? like why ? cheers

Subject: Re: GUIDs Bad? and GUIDs more.
Posted by: Phil Factor (view profile)
Posted on: Thursday, December 19, 2013 at 7:31 AM
Message: The best place to start is in the reference I put at the end of the article by Kimberley Tripp, 'GUIDs as PRIMARY KEYs and/or the clustering key', but there are a number of good posts on this topic, by Kim, Paul Randal and Michelle Ufford, that will explain this in great detail far better than I could. I may have been too dogmatic: perhaps I should have put an 'almost' into that sentence, since whenever anyone ever makes a hard and fast rule, there is always a valid and optimal use that pops out of the woodwork to prove it wrong.

Subject: Surrogate Keys
Posted by: azJim (view profile)
Posted on: Monday, December 23, 2013 at 2:58 PM
Message: I have found that for SQL Server (and to a lesser extent, Unix-based DBMSes), surrogate keys are hard to avoid. If you are willing to be honest, natural keys make more sense -- both to the business as well as internally to IT (consider if the natural primary key is used as a foreign key). But the problem with natural keys is the length. Windows based systems choke once natural keys extend beyond 16 bytes (not my opinion only, our Microsoft rep said the same thing). In a logical data model that has a lot of hierarchical levels, you will be concatenating each parent key to the logical child key. If this goes on for many levels, you might have a natural key of over 100 bytes with the non-key data of only a handful. If you have a shallow hierarchy in your model and you have a good natural key, I would go with the natural key.

Now for the question you probably want to ask: why did I qualify the need more with SQL Server? First, SQL Server integrates the clustering index (usually what you use for your primary key) and the row. In Unix and mainframe, the clustering index is a separate file structure. Secondly, the mainframe doesn't seem to mind long keys (most of the time). Most of the time for mainframe (e.g., DB2) physical designs I would use a surrogate key if the natural key was too sensitive to display -- such as is the case with financial systems partitioning on a credit card number or SSN. But if there wasn't these types of restrictions, like I said, you would find in the long term a benefit to using natural keys.

Subject: Surrogate Keys
Posted by: Bryant (view profile)
Posted on: Monday, February 03, 2014 at 9:39 AM
Message: >
The rule of thumb is that the key shall describe the row, the whole row and nothing but the row.
<

So, given a table of customer data, how does the numeral 7 have any meaning when related to the other columns contained in a given customer row? It does not DESCRIBE the row. A customer number may be a derived, smart piece of data (which is itself a trap) but there is a 1:1 logical relationship between that key and the contained columns that does not exist with some manufactured identity value.

Subject: Surrogate Keys
Posted by: azJim (view profile)
Posted on: Monday, February 03, 2014 at 11:25 AM
Message: A numeral 7 doesn't have any meaning and that is why natural keys are desired. But on some platforms (Windows in particular), a long natural key is a performance killer. For instance, you want to SELECT on a column named DELAY_DESC that has a datatype of CHAR(32). You want to find the column value that matches the variable which contains the value of "Out To Lunch Today will at 3 p.m." This comparison will take 32 bytes of data and compare each and every bit to each and every column (or index on the column). Now if the column had a surrogate key with a datatype of INT (four byte binary), you will be comparing only four bytes for each column. Multiply this over thousands of transactions over tens of thousands of records, and there is an unnecessary overhead that will kill you on Windows (and to a certain extent, Unix; mainframe DB2 doesn't seem to care that much).

If this data model had a deep hierarchy with declared referential integrity, and you were inserting a new row which contains this column of DELAY_DESC, you will be checking for the existence of the parentage for each and every up-line parent in the hierarchy. If you have natural keys, your insert performance will suck on Windows.

Again, my desire is to use natural keys. But a DBA has to be flexible. You really shouldn't be wedded to either approach.

Subject: Surrogate Keys and DELAY_DESC
Posted by: Bryant (view profile)
Posted on: Tuesday, February 04, 2014 at 6:07 AM
Message: @azJim It appears to me that what you describe is not a difference between natural and surrogate keys but evidence that the software does not scale. I find it hard to believe that a mainframe DB2 implementation can handle that kind of scaling but a distributed system implementation cannot.

Practically speaking the use of surrogates in situations you describe are necessary because we must treat the symptoms. Having database performance limited by chosen data type is a limitation of the implementation, not the fault of the developers/DBAs using it.

Subject: Surrogate Keys
Posted by: azJim (view profile)
Posted on: Tuesday, February 04, 2014 at 10:41 AM
Message: I guess you can describe it that way. Unfortunately, the DBMS vendors had different starting points when they created their respective database engines – and the internals cannot change very easily. For instance, SQL Server clustering indexes are not separate data structures as they are on Unix and the mainframe. This might explain the reason for avoiding natural keys with SQL Server.

I have been working in database for over 33 years now (mainframe, Unix, and Windows). I have designed for each and enjoy it. There are trade-offs for each platform (despite the marketing hype from the vendors).

Personally, I wouldn't choose Windows for heavy data-crunching, high availability application. With enough money and perseverance, sure, you could configure SQL Server to run the New York Stock Exchange. Would there be a better technical architecture for such a solution? Of course. For smaller applications which require a lot of flexibility and connectivity with desktop apps (particularly Excel), it would be foolish to use Unix or mainframe, although both have the means by which to process Excel. If I was designing a phone company logging database, my first choice would be Unix. Windows and mainframe platforms would work, but the best-in-class economics would direct you to Unix.

Going back to our discussion of surrogates, it comes back to efficiency/performance. If the business doesn’t mind paying for a redesign after a SQL Server database goes live in production, I would try to use natural keys as my first choice. The trouble is, I haven’t found the business very amenable to redesign. So I explain to them the choice: if you want natural keys as the primary key, I can do it. But I would require them to sign off on the physical design document that would state performance might not be optimum.

Subject: Natural Key
Posted by: Duke_Ganote (view profile)
Posted on: Saturday, February 22, 2014 at 7:36 AM
Message: There are no "natural keys", only "business-oriented keys". Nothing in business is natural... :)

Subject: Clustered indices in Unix
Posted by: Jasonl (view profile)
Posted on: Tuesday, March 25, 2014 at 5:57 PM
Message: azJim "First, SQL Server integrates the clustering index (usually what you use for your primary key) and the row. In Unix and mainframe, the clustering index is a separate file structure. "

azJim might want to check out Index Organised Tables in Oracle, the data is stored at the leaf node of the index. It is true that DB2 doesn't do this, but they do attempt to let the index determine the storage by placing data in adjacent blocks, if possible, for their version of clustered indexes.

Subject: Clustered Indices in Unix
Posted by: azJim (view profile)
Posted on: Tuesday, March 25, 2014 at 6:19 PM
Message: Jason ... I have to admit, I don't have recent experience on Oracle. From what you describe, it might be similar to the "index-only" type structures in SQL Server and DB2. They are very fast, but you would not want a long row in one.

What I was referring to in SQL Server is that the clustering index actually is part of the table. It works well for direct look up, such as with a customer number. But for non-clustering index look ups, it first looks up the non-clustering key, and then using the internal information it traverses the clustering index. This isn't an optimum solution, but as I mentioned, for the types of designs I have seen, it is more than adequate. Having a separate structure that can go directly to the offset page and record ID in the main data file is much faster, particularly if the database engine sorts them first to avoid the "ping-pong" effect of random lookup (this practice is called List Prefetch in DB2).

Thanks for letting me know.

azJim


 
