Every database developer uses keys, but without always understanding all the ramifications. They come with few hard and
fast rules, but if you get them right from the start with a database design, the whole process of database development
is simpler, and the result is likely to perform better. We asked Phil for advice, little knowing that the explanation
might take a while.
Primary keys can be rather a blind-spot for database developers. I can understand and sympathize, because there is some ambiguity in the advice given about their design and use, and any attempt to offer hard-and-fast rules tends to add to the confusion. In writing this article, I felt the urge to try to provide a clear guide for the busy developer. This turned out to be an ambitious aim.
Primary keys cut across other topics such as indexes and
foreign keys. To explain some of the issues, I have to discuss other topics, such as
clustered indexes, which
deserve their own article. I’ll give as concise a meaning that I can, and give references for further reading.
(There is some code
to accompany this article here)
What is a key, and why is one of them a primary key?
For a table to be relational, any scalar value must be addressable by specifying the table, column and the key value(s) of the containing row.
This key, often referred to as a candidate key, defines a set of one or more columns that together can distinguish individual rows in a table.
Originally, relational theory allowed only one key for a table. After a while, this was considered an unnecessary restriction and so now there is no limit on the number of keys you can have for a table (though it can be argued that this is a violation of Boyce-Codd Normal Form. BCNF). With several
candidate keys, the question then was ‘which one should be used to reference individual rows if no particular key was specified’. To meet this requirement, the idea of the table having a primary key was born, where a single key amongst the candidate keys was selected to be the primary one. In short, a primary key is merely the default key for a table. Any type of base table, even temporary tables or table variables, can have keys.
Keys can comprise several columns, but should only contain sufficient columns to ensure uniqueness. A key constraint is satisfied if, and only if, no two rows in the table have the same non-null values in the unique columns. The SQL Standard enshrines the subtle difference that null values are allowed, with some restrictions, in a key’s participating columns, whereas they are forbidden in the primary key. In practice, I’ve never found a use for a key with nulls in any of its participating columns. My advice is to make them not null.
Although relational theory now gives no weight to the importance of the primary key over the others, SQL Server implements them in such a way that they have more particular significance. A primary key is important for SQL server, not only to allow for processes such as replication,
Change Data Capture, filetables, audit trail, and data synchronization, but also to simplify the creation of foreign key constraints. It is also used as a logical property of the table for modeling the database.
How are keys defined for a table?
Keys are defined and enforced by key constraints. There are just two types, a
primary key constraint and a unique key constraint (a foreign key constraint is actually a referential constraint). A key constraint defines one or more columns to be a key. A primary key constraint indicates that the key is the default key, and that all its contributing columns are NOT NULL.
For any key constraint that you define in a CREATE TABLE script, SQL server will ensure that there will be an error if there is any attempt to store, in the column(s) that comprise the key, any data that is not unique If you define a key on an existing table, then SQL server will check for duplicates and if it finds any of them, it will trigger an error roll back the table alteration.
A primary key constraint requires the key columns to be declared as NOT NULL, and SQL Server implies NOT NULL if the constraint is declared in the CREATE TABLE statement without specifying the nullability of the column. If you attempt to add a
primary key constraint on a column that allows NULL values, the statement will immediately fail, even if the column contains no NULL values.
As noted earlier, you can opt to have a
candidate key that does allows NULLs but they are liable to do strange, unexpected things. As well as being used to enforce key constraints, unique indexes in SQL are also used to enforce such non-key rules as ‘must be unique where the value is known’, so should be able to do this where the column is NULL (SQL Server implements this feature wrongly since it only allows one NULL value in the column).
SQL Server will create an index on a key to enforce efficiently the uniqueness of the key (i.e. to support fast validation of the key constraint). These indexes also support use of the key in joins, and selection of single rows or ranges. SQL Server assigns this index the same name as the key constraint.
For both unique key and
primary key constraints, the columns chosen for the key must be of a datatype that is allowed for an index. This precludes columns that are can’t be compared, such as
ntext, text, image, varchar(max), nvarchar(max), varbinary(max), XML, geography, geometry, and CLR user-defined data types that do not support binary ordering.
SQL Server cannot adapt an existing suitable index to support a key constraint: it will always create a new one even if there is already an index with exactly the same definition. To create a key on a single column, you can use either a table or column constraint in the table-creation script. To create a key on multiple columns, you must use a table constraint.
Primary keys in table or column constraints can be anonymous, although I would advise naming them explicitly
in permanent tables.
Should I use a natural or surrogate key?
A key is just a combination of column (attribute) values that provide an obvious way of distinguishing rows,
and a natural, or ‘domain’ key is one that has meaning outside the database environment, such a Longitude/Latitude. Many people argue for a general rule that keys must be natural (or perversely that they shouldn't be) or that they must be immutable. I've even heard it said that all joins must be on key values. As always, it depends. On this topic, there are very few hard and fast rules, because sometimes there are conflicting requirements between the knowledge level and the requirements of implementation.
Primary keys for business entities such as invoices, deliveries or products are generally best if they are relate directly to the current language of business, and are part of the group culture. For example, financial entities tend to be date-related since this is generally the way that commerce orders things. Therefore a date-related column is likely to be part of a natural key for such a table, because it aligns well with the natural way that the business sorts and groups its data. A natural key is helps to bridge between the real-world and the database
Businesses often have arcane rules for describing individual items, apparently passed down from distant tribal ancestors, but often imposing a common-sense sequence to the data that the business uses to group and report. They resent having to change their system to incorporate a surrogate key, which is usually a meaningless incrementing number. The cost of retraining, or changing the corporate culture can be astronomical. As Codd says in his paper RM/T Section 4,
'Users [...] are no longer compelled to invent a user-controlled key if they do not wish to'
It is their choice and we can play it either way. Besides which, when things go wrong, it is much easier to sort things out if the keys are meaningful, and errors are so much easier to spot too before damage is done.
Surrogate keys, usually implemented via an IDENTITY column, are the normal way of getting round the complexities of trying to handle natural keys that are ungainly or don’t quite conform to your business rules. These are fine if they are kept private within the database: otherwise, they are a form of technical debt. They make the coding of databases easier but are disliked by book-keepers, accountants, retailers or anyone else who has to handle them. They aren't human-friendly. Sometimes, surrogate keys ‘escape’ into the world if exposed as ‘reference numbers’ and take on a permanent and intended meaning that prevents any refactoring or renumbering within the database.
Although a natural key is good to have, there is always a place for surrogate keys such as are provided by an integer or numeric IDENTITY column. A natural key, such as an internationally-agreed code or name makes ETL and data warehousing a lot easier to maintain, but one should never keep a grip on the idea, with the whites of your knuckles showing, when it is easier for everyone to use a simple surrogate. A simple incrementing integer is a compromise, but rarely one that returns to haunt you, unlike a
GUID which is always horrible.
For my primary key, should I choose a clustered index or not?
When you use
key constraints, you are defining the logical rules for ensuring that the data is correct. You determine the way that tables are accessed, how they interrelate, and ensure that entity integrity is enforced (for referential integrity we use the
foreign key constraint). In effect, you leave it to the RDBMS to decide how it all gets implemented, and expect it to do so.
SQL Server chooses to create indexes for each key, which is sensible. However, SQL Server has little idea about your database and data and so needs help to choose the right sort of index. Unless you specify otherwise, it will choose a clustered index for a primary key. However, if you previously specified a clustered index for a unique constraint and then specify a
primary key constraint, it will be a non-clustered index. If you are adding a primary key, or moving it, by using an ALTER TABLE statement, then the clustered index may already be used and, in this case, SQL Server will also choose a unique index instead. A unique index will work reasonably well in most circumstances to support a primary key if the existing clustered index has been well-chosen.
It is seldom wrong to assign a clustered index to a primary key. It is just a problem if your choice of primary key is a ‘fat key’ without natural order that doesn’t work well as a clustered index, or if there is a much better use in the table for that clustered index, such as supporting range scans or avoiding sorts on a frequently-queried candidate key.
Primary Keys and clustered indexes are very different. A primary key is a logical construct and a clustered index is an index with a special physical implementation. By specifying a clustered index for a key, you determine the way that a key is implemented. Although your choice of Primary key may well prove to be most effectively implemented with a clustered index, it isn’t necessarily so.
The choice of the clustered index can have a huge performance impact. The candidate key that makes logical sense as a primary key does not necessarily have the characteristics that are required for a well-performing clustered index. A good clustered index is lean, incrementing and easy to do comparisons with. A good primary key is not always like that.
There are certainly some characteristics of the clustered index that often make it a good choice for a primary key. For example, a good clustered index needs columns that rarely change in value, as does a key column that is referenced by a foreign key.
A clustered index allows you to access any value in the row efficiently because it is automatically ‘covering’, since by reaching the leaf, you’ve reached the row itself. If you've chosen a natural primary key that is commonly used in searches, and implemented it as a clustered index, you wouldn't need 'lookups' for the column values or predicates.
If using a non-clustered index for the primary key (or any key), it is more awkward to fetch any columns other than those participating in either the primary key or clustered index key. If you specify columns in your SELECT statements that can’t be covered then a separate read of the data row will be needed to get that data after a row is located by the non-clustered primary key index. You can’t prevent this by specifying the non-key columns to be added to the leaf level of the non-clustered index since there is no way of specifying any key with
However, you have lost your greatest weapon in the fight for performance if you throw away your one clustered index per table on a primary key that is meaningless to the business and so never used in searches, aggregated on, or used for the ordering of data. Your table may have a number of alternative candidate keys and you could find that the usage of one of these is better suited to be implemented via a clustered index. You might, in some circumstances, find that it is even better to use non-clustered index es for all candidate keys, and create a non-unique clustered index.
Do not be afraid to use the clustered index for another key if it fits better with the way you query the data, and specifically how the table participates in frequently-used queries. Are you likely to select predominately single values, unsorted or sorted ranges? Are you predominately using one particular index other than your primary key? Is the table experiencing many more reads than writes, with queries that make reference to columns that aren't part of the primary key? Are you typically selecting ranges within a particular category? Are your WHERE clauses returning many rows? These ways that the table participates in frequently-used queries are likely to be better accommodated by a clustered index, which shouldn’t be frittered away on a surrogate primary key, based on some meaningless ever-increasing integer.
For your clustered index, you are likely to choose a ‘narrow’ index which is stored economically because this value has to be held in every index leaf-level pointer. This can be an interesting trade-off because the clustered index key is automatically included in all non-clustered indexes as the row locator so non-clustered indexes will cover queries that need only the non-clustered index key and the clustered index key.
For a table that has to support a high volume of inserts, a continuously increasing primary key can help performance because page splits are eliminated. But if the insert volume increases even more, latch contention on the 'end' pages of the index can undo this effect and a more random distribution is likely to then perform better. In some circumstances, therefore, a primary key that is implemented via a non-clustered index will perform better when used in combination with a separate clustered index.
Should a table always have a primary key?
It is wrong to say that every table requires a primary key, but any relational table must, by definition have at least one
candidate key. You may find a primary key is compulsory if you have a SQL Server process that requires it: for example, if participating in transactional replication or if you’re using SQL Azure. You may, however, deliberately choose to use a SQL Server table in a non-relational way, and it is perfectly legitimate, though odd, to have a table without any key in SQL Server.
A bigger issue is whether a table must have a clustered index, the index often chosen, rightly or wrongly, to implement a primary key. A 'table' without a clustered index is actually a heap, which is a
particularly bad idea when its data is usually returned in an aggregated form, or in a sorted order. Paradoxically, though, it can be rather good for implementing a log or a ‘staging’ table used for bulk inserts, since it is read very infrequently, and there is less overhead in writing to it. A table with a non-clustered index , but without a clustered index can sometimes perform well even though the index has to reference individual rows via a Row Identifier rather than a more meaningful clustered index. The arrangement can be effective for a table that isn’t often updated if the table is always accessed by a non-clustered index and there is no good candidate for a clustered index.
Is a Smart Key a good idea?
A Smart Key, or Concatenated key is a key into which is encoded more than one value. A common example of a smart key is one that uses internationally-accepted codes such as the IBAN, ISBN, email address or postcodes. Whereas there is no problem with using Smart Keys that already exist within the user domain, it will come to haunt you and your successors if you create a new smartkey in your database for use as a primary key
Despite the support in SQL Server for composite keys composed of more than one column, it has been an occasional practice to pack the values of more than one column into the primary key to make a ‘smart’ or concatenated key. In the past, this type of key had to be maintained by hand but nowadays you can use a computed column as part or all of any
primary key or unique constraint, as long as the computed column value is defined by a deterministic expression and the data type of the result is allowed in index columns. You can also update foreign keys, using cascaded updates, if the smart key changes its value.
Despite it now being easy to implement a Smart Key, it is hard to recommend
that you create one of your own that isn't a natural key, because they tend to eventually run into trouble, whatever their advantages, because it makes the databases harder to refactor, imposes an order which is difficult to change and may not be optimal for your queries, requires a string comparison if the Smart Key includes non-numeric characters, and is less effective than a composite key in helping range-based aggregations. It also violates the basic relational guideline that every column should store atomic values (First Normal Form; Information Rule [Codd’s First Rule]).
Smart Keys also tend to outgrow their original coding constraints: US Zip codes, for example, have become extraordinarily difficult to decode after years of expediency. The initial zero can mean one of seven states, Puerto Rico, or military addresses in Europe.
Can I use keys in Temporary Tables?
You can and should use keys constraints for temporary tables and table variables, for the same reasons you should use them for permanent tables, with the exception that you can’t use them in combination with
foreign key constraints to maintain referential integrity. This is because you can create key constraints in local or global temporary tables, but you can’t create the
foreign key constraints. If a foreign key constraint is specified in a temporary table, the statement returns a warning and ignores the constraint.
With temporary tables, it is best to make key constraints anonymous. If a temporary table is created within the scope of a user-defined transaction and has a named constraint, only one user at a time can execute the statement that creates the temp table.
Both temporary tables and table variables can have keys defined in the table definition, but only temporary tables can have keys added later (with
ALTER TABLE) and indexes added later (with CREATE INDEX).
You can define keys as either a column or table constraint. This is the only way that you can create indexes on your table variable, but only at the time the table is defined. It can make a great difference in the performance of table variables just as it would for a table.
Although key constraints are a logical device to ensure data integrity and guarantee access to individual rows, there is no easier way to increase the chances of a SQL Server database performing well than to provide effective keys. If this can be done from the very start, a few hours spent in refining your choices, and assigning the optimum attributes to them can prevent countless hours later on in creating, maintaining and pruning extra indexes.
A misjudged key can have surprising ramifications if the error is not corrected in time. Although many experts have attempted to define hard-and-fast rules or ‘best practices’ for keys, experience invariably seems to prove them guilty of over-generalization. This work cannot be distilled into a few rules. As usual with databases, everything comes down to measured tests with your data and your likely pattern of queries.
- A key enforces the entity integrity of a table. To be usable, in other words to be able to
unambiguously identify a database entity, a key must be unique and known (i.e. NOT NULL). It can consist of one or more attributes (columns). A key is no more than a 'candidate' until you use it and you can have as many candidate keys in a table as you like.
- Key constraint
- A constraint to the effect that a subset of the table’s columns are a candidate key for the table.
- Fat key
- A key that contains 16 bytes or more, typically a GUID
- Primary key
- The default candidate key for a table. Primary keys are simply the default keys for a table. A Primary key is no more than an indexed key whose value, taken together across all component columns, is unique and known.
- Compound key
- A key consisting of several columns, in which all values are known (not null) and, in combination, are unique.
- Foreign key
- A reference to a key in another table that uniquely identifies a row of another table. Despite its name, it is enforced by a referential integrity constraint rather than a key constraint. Foreign key constraints enforce the rule that every non-null foreign key in the referencing table corresponds to a row of the referenced table
- Candidate key
- Any key for which only one row will hold the combination of values held in the constituent columns and for which the same does not apply to any subset of these columns. An alternate key generally refers to any key that isn’t a primary key, whereas a candidate key includes the primary key.
- Natural key
- A key containing attributes that comprises information used by people. Primary keys for business entities such as invoices, deliveries or products are generally best if they are composed of something the business already uses, such as a product code, shipment reference, customer reference, or invoice number. and is part of their culture. Use natural keys when they fit, otherwise use surrogates
- Surrogate key
- A computer-generated candidate key. Surrogate keys such as the 'Identity field key' are the normal way of getting round the complexities of trying to handle natural keys. The use of an object-relational mapping layer will often lead to additional restriction on the design of the database design including the rule that primary keys should be immutable, that is, never changed in value, and that they should be anonymous integer or numeric identifiers. Neither of these restrictions appear in the relational model or any SQL standard.
(or contatenated key)
- A key containing a single value usually composed of the values of several columns.
- Composite key
- A key that uses several non-null columns.
- Alternate key
- A candidate key that is not assigned to be the primary key.
- Key Constraint
- This is a constraint to the effect that one or more columns in a table are to be a candidate key. A primary Key constraint merely defines the default candidate key.
- Immutable key
- This is a key whose value cannot be changed. This is the normal assumption, though it is seldom enforced and isn’t relevant to the relational model
- A heap is any data structure where the records themselves are not ordered (i. e. not linked to one another). In SQL Server a heap is a table without a clustered index
- Clustered index
- The clustered index defines the order of the data in the table based on their key value, so there can be only one per table. The data of the table becomes the leaf level of the clustered b-Tree index.
- Non-clustered index
- The non-clustered index tree contains the index keys in sorted ‘logical’ order, with the leaf level of the index containing a row locator: the clustering key if there is one otherwise a RID, as well as any non-key columns specified in the index definition.
Fig 3: Particularly nasty bugs can be prevented by well-placed key constraints.