Questions About SQL Server Collations You Were Too Shy to Ask

Of course we all like our colleagues to think that we know everything there is to know about SQL Server Collations. However, the truth is that it is a rather complicated topic to fully understand and the cost of getting collation wrong can be great. If only one could ask certain questions on forums or at conferences without blushing. Help is at hand, because Robert Sheldon once again makes the complicated seem simple by answering those questions that you were too shy to ask

  1. What is a SQL Server collation?
  2. How do I find the collations supported on a SQL Server instance?
  3. How do I identify Unicode-only collations?
  4. What are the differences between Windows collations and SQL Server collations?
  5. How do I set the collation at the server level?
  6. How do I configure a database’s default collation?
  7. How do I assign a collation to a column?
  8. How do I specify a collation when querying data?
  9. How do I query columns with different collations?
  10. How do I create a foreign key on columns with different collations?
  11. How do collations affect temporary tables?
  12. How do I know what collations to use?

1. What is a SQL Server collation?

A collation is a configuration setting that determines how the database engine should treat character data at the server, database, or column level. SQL Server includes a large set of collations for handling the language and regional differences that come with supporting users and applications in different parts of the world.

Each collation serves two purposes. The first is to provide a character set that defines the bit pattern associated with each character.

SQL Server stores character data using either one byte or two bytes per character, depending on the column’s data type and assigned collation. For example, European languages require only a single-byte character set, which supports up to 256 bit patterns. On the other hand, many Asian languages include thousands of characters and require a double-byte character set, which supports up to 65,536 bit patterns.

Collations are intricately tied to the data types assigned to character columns. In SQL Server, you can configure a character column with a Unicode data type (nchar, nvarchar, or ntext) or non-Unicode data type (char, varchar, or text). For Unicode types, the character bit patterns conform to international standards that define a double-byte encoding scheme for mapping most of the world’s written languages, ensuring that the same bit pattern is always associated with the same character, regardless of the underlying environment.

Most collations support both Unicode and non-Unicode character data, although some support only Unicode data. Each collation that supports non-Unicode data is associated with a code page that defines the bit patterns for the non-Unicode characters. The collation must point to a code page supported by the underlying operating system; however, multiple collations can point to the same code page if they share the same character set. Unicode-only collations do not require code pages because they conform to the universal encoding model.

To view the code page associated with a collation, you can use the COLLATIONPROPERTY system function, as shown in the following example:

The function takes two arguments. The first is the name of the collation, in this case, Modern_Spanish_100_CS_AS, and the second is the CodePage property. The SELECT statement returns the value 1252, which is the code page identifier. On my system, 894 collations share this code page. (I’m running SQL Server 2016 Developer Edition on Windows 10).

The second purpose of the SQL Server collation is to specify the rules that govern how character data is sorted and compared, taking into account not only language and regional conventions, but also the options specific to a collation, such as case-sensitivity and accent-sensitivity.

Each collation enforces its own sorting and comparison rules, whether or not it points to the same code page as other collations. For this reason, query results can differ when accessing identical data in two columns configured with different collations, even if those collations are based on the same language.

In SQL Server, you can specify a collation at the server level, database level, or column level, and you can specify collations within your queries. Keep in mind, however, that you can run into query and performance issues when working with multiple collations at the same time.

2. How do I find the collations supported on a SQL Server instance?

To view a list of the available collations, you can use the sys.fn_helpcollations system table function to retrieve the collation names and their descriptions, as shown in the following example:

On my system, the statement returns 3,887 collations that support a wide range of languages, with numerous collations for each language family. For example, the following table shows a number of the collations available to the Tibetan family (but not all):

Name

description

Tibetan_100_CI_AI_SC

Tibetan-100, case-insensitive, accent-insensitive, kanatype-insensitive, width-insensitive, supplementary characters

Tibetan_100_CI_AI_WS_SC

Tibetan-100, case-insensitive, accent-insensitive, kanatype-insensitive, width-sensitive, supplementary characters

Tibetan_100_CI_AI_KS_SC

Tibetan-100, case-insensitive, accent-insensitive, kanatype-sensitive, width-insensitive, supplementary characters

Tibetan_100_CI_AI_KS_WS_SC

Tibetan-100, case-insensitive, accent-insensitive, kanatype-sensitive, width-sensitive, supplementary characters

Tibetan_100_CI_AS_SC

Tibetan-100, case-insensitive, accent-sensitive, kanatype-insensitive, width-insensitive, supplementary characters

Tibetan_100_CI_AS_WS_SC

Tibetan-100, case-insensitive, accent-sensitive, kanatype-insensitive, width-sensitive, supplementary characters

Tibetan_100_CI_AS_KS_SC

Tibetan-100, case-insensitive, accent-sensitive, kanatype-sensitive, width-insensitive, supplementary characters

Tibetan_100_CI_AS_KS_WS_SC

Tibetan-100, case-insensitive, accent-sensitive, kanatype-sensitive, width-sensitive, supplementary characters

Tibetan_100_CS_AI_SC

Tibetan-100, case-sensitive, accent-insensitive, kanatype-insensitive, width-insensitive, supplementary characters

Tibetan_100_CS_AI_WS_SC

Tibetan-100, case-sensitive, accent-insensitive, kanatype-insensitive, width-sensitive, supplementary characters

Most language families include numerous collations, each configured with specific options, such as width-sensitivity and accent-sensitivity. When specifying a collation in your T-SQL script, you must use the name exactly as it is returned by the sys.fn_helpcollations function. The collation name includes the supported language and, in some cases, the collation version. For example, the collations in the table above are all specific to the Tibetan language and are all version 100.

The remaining codes in each collation name (such as _CS_AI_WS_SC) indicate the options preconfigured on the collation. For example, CI means case-insensitive and AS means accent-sensitive. The description column shows what the option codes mean. For more detailed information, see the Microsoft article “Collation and Unicode Support.”

3. How do I identify Unicode-only collations?

Some collations support only Unicode characters and consequently can be used only with the nchar, nvarchar, or ntext data types, either within column definitions or when querying data from those columns. Unfortunately, you cannot tell from the collation name whether or not it is Unicode-only, but you can use a combination of the sys.fn_helpcollations and COLLATIONPROPERTY system functions to retrieve a list of Unicode-only collations:

On my system, the SELECT statement returns 510 rows. You can, of course, narrow down the search to specific languages, but you get the idea. You use the COLLATIONPROPERTY system function to find the collations with a CodePage value of 0. Because a Unicode-only collation does not point to a code page, the database engine simply returns a 0, rather than a NULL value or some obscure number, or worse still, an error.

4. What are the differences between Windows collations and SQL Server collations?

SQL Server supports two collation types: Windows collations and SQL Server collations. You can tell them apart by their names. All SQL Server collation names begin with SQL_.

Each collation type supports both character-based collations and binary-based. Once again, you can distinguish them by their names. Binary-based collations end with either _BIN or _BIN2.

Character-based Windows collations represent the bulk of the collations available to SQL Server. The collations are based on the Windows locales available to the host operating system where SQL Server is installed. A locale is a collection of information specific to a location or culture. Windows uses locales to customize the interface for regional groups of users.

Windows collations in SQL Server use the same code pages as the Windows locales, as well as the same sorting and comparison rules. However, SQL Server provides collations for only a subset of Windows locales, and each collation supports only a subset of the attributes available to the Windows locale. The Windows collations in SQL Server are concerned with locales only in terms of data management.

Windows collations use the same algorithm for sorting and comparing both Unicode and non-Unicode data, making it easier to work with character data stored with different data types. You can retrieve a list of the character-based Windows collations available to your system by running the following SELECT statement:

All we’re doing here is taking advantage of how collations are named by eliminating any names that start with sql_ and end with _bin or _bin2. On my system, I came up with 3,552 character-based Windows collations.

Binary-based collations represent the other type of Windows collations. They are similar to the character-based collations except in how the data is sorted. For non-Unicode data, sorting is based on the coded numerical values (code points) defined by the Windows locale. For Unicode data, sorting is based on the code points defined by the Unicode standard, so locale is not a factor. In either case, the sort order is simpler to enforce, helping to improve application performance.

SQL Server supports the two types of binary-based collations:

  • BIN: The database engine compares the first character of a data value by code point and the remaining characters by their bit patterns. This type represents the original approach to binary-based collations in SQL Server.
  • BIN2: The database engine compares all characters based on code points. This type represents the newer generation of binary-based collations.

To view a list of binary-based Windows collations, we can use the following SELECT statement, which returns 258 collations on my system:

In case you’re wondering how many of the binary-based Windows collations are Unicode only, we can run the following SELECT statement, which gives me 30 collations:

Now we move into the SQL Server collations, which were created before SQL Server supported windows collations. Microsoft keeps these collations on-hand for backward compatibility with older versions of SQL Server and legacy applications. You should use not use SQL Server collations for new development work.

That said, let’s take a closer look at SQL Server collations, should you need to use them. Like Windows collations, SQL Server collations use a code page for non-Unicode characters and follow the Unicode standard for the rest.

But here’s where things get tricky with SQL Server collations. The database engine applies non-Unicode sorting rules to the non-Unicode data, and Unicode sorting rules to the Unicode data, which can give you inconsistent results when working with character data stored in columns with different data types. Again, you should be using SQL Server collations only to support those legacy databases and applications.

To learn what character-based SQL Server collations an instance supports, we can run the following query, which gives me 73 collations:

Our final collation type is the binary-based SQL Server collation, of which there are few. These collations work much like the binary-based Windows collations, except with the limitations that come with the character-based SQL Server collations. To view a list of the binary-based SQL Server collations available to an environment, we can run the following SELECT statement, which returns only four rows on my system:

It should also be noted that SQL Server does not support Unicode-only SQL Server collations, either character-based or binary-based. For example, if we were to run the following SELECT statement:

We would end up with an empty result set.

5. How do I set the collation at the server level?

You set the collation at the server level when you initially install SQL Server. The collation serves as the default collation for all system databases and user databases, unless a different collation is specified at the database or column level. You can assign any collation at the server instance, except Unicode-only collations.

If you’re uncertain what collation has been assigned to a SQL Server instance, you can use the SERVERPROPERTY system function to find out:

On my system, the SELECT statement returns the collation SQL_Latin1_General_CP1_CI_AS, which was the default when I installed SQL Server. The SQL Server installation wizard used this collation because I live in the US and the underlying Windows operating system uses the English (United States) locale. Microsoft selects this collation as the default to maintain compatibility with earlier versions of SQL Server and with applications developed against those versions.

If you want to use a collation other than the default when installing SQL Server, be sure to change the collation on the Collation tab of the wizard’s Server Configuration screen.

You should try your best to get the server collation right when you install SQL Server because changing it after the fact is no small feat. You must take such steps as backing up the data, rebuilding the master database, recreating the user databases and all the objects in them, and importing the data into the newly created tables.

Fortunately, instead of changing the server collation, you can assign a different default collation to your user databases, and you can assign a specific collation to a character column.

6. How do I configure a database’s default collation?

When you define a user database, you can set the default collation as part of the database definition. If you don’t specify the collation, SQL Server uses the collation assigned to the server instance.

You cannot assign a collation to a system database, nor can you directly modify its collation. The only way you can change the collation on a system database is to change the server-level collation.

To specify a collation when creating a user database, you add a COLLATE clause to the database definition. The clause must point to one of the collations supported by your SQL Server instance. As with server-level collations, you cannot specify a Unicode-only collation. To test this out, let’s try to create a database using the Indic_General_100_CS_AS collation:

Because Indic_General_100_CS_AS is a Unicode-only collation, the database engine returns the following error message:

The issue is easily remedied by specifying a collation that is not Unicode-only:

This time, the CREATE DATABASE statement will run and the specified collation will be assigned to the database. SQL Server will apply the collation to all metadata within that database. The collation also serves as the default for character columns, variable names, and other character data within the database.

We can confirm that the database was created using the specified collation by running the following query:

As expected, the statement returns Latin1_General_100_CS_AS, which means that the database engine will enforce both case-sensitivity and accent-sensitivity when sorting and comparing character data.

Another way we can verify the collation is to use the DATABASEPROPERTYEX system function:

The SELECT statement again returns the returns the Latin1_General_100_CS_AS collation.

We can also use the COLLATE clause in an ALTER DATABASE statement if we want to change the default collation:

When we change the database collation, the database engine updates the collation for all the character columns in the system tables, as well as existing character parameters and scalar values returned by stored procedures and user-defined functions. The database engine also applies the collation to new columns configured with character data types. However, SQL Server does not change the collations of existing character columns in user-defined tables.

7. How do I assign a collation to a column?

As with database definitions, you can add the COLLATE clause when defining a character column. In this way, you can apply a specific collation to the column’s data, without impacting the rest of the database. You can use the COLLATE clause in either a CREATE TABLE statement or ALTER TABLE statement. When you specify a collation on a column, you override the database’s collation, which is applied by default to all character columns.

For columns configured with the nchar, nvarchar, or ntext data type, you can assign any type of collation, including Unicode-only collations. For char, varchar, and text columns, you must use a collation that supports non-Unicode characters, which rules out Unicode-only collations.

To assign a collation to a column, you include the COLLATE clause within the column’s definition, as shown in the following example:

In this case, we’re configuring the LastName column with the Traditional_Spanish_CI_AS collation, which is case-insensitive, but accent-sensitive. Now let’s query the character data:

The SELECT statement returns the following results:

NameID

LastName

2

Carrion

4

carrion

3

carrión

1

Carrión

8

Cuneo

9

cúneo

7

Cúneo

5

Chaves

6

Chavira

Because of the assigned collation, the database engine treats uppercase and lowercase letters as the same, and treats accented and unaccented letters as different. But something else is going on here. The database engine also treats “c” and “ch” as separate letters, which is why Cúneo is listed before Chaves.

You can change a column’s collation by updating the table definition, unless the column is being referenced by an index, computed column, check constraint, foreign key constraint, or distribution statistics. With that in mind, let’s change the collation of the LastName column to Modern_Spanish_CI_AI, which is case-insensitive and accent-insensitive:

If we rerun the preceding SELECT statement, Chaves will now come before Cúneo, as shown in the following results:

NameID

LastName

1

Carrión

2

Carrion

3

carrión

4

carrion

5

Chaves

6

Chavira

7

Cúneo

8

Cuneo

9

cúneo

Notice that, even though the collation is case-insensitive and accent-insensitive, the database engine still lists the names with an initial capital letter first for each sorted group. You’ll find that the case-insensitive collations are not consistent in how they treat the first letter. Some will prioritize values with an initial capital letter, some will prioritize values with an initial lowercase letter, and some will not prioritize either.

At some point, you might want to change a column’s collation back to the database’s default collation. To do so, you can use the database_default keyword instead of the actual collation name (which is currently Latin1_General_100_CI_AS):

The database_default keyword saves you the trouble of having to look up the database’s collation, should you not know it off the top of your head. And even if you do, the keyword is a lot easier to type. If we rerun our SELECT statement, the database engine now returns the following results:

NameID

LastName

4

carrion

2

Carrion

3

carrión

1

Carrión

5

Chaves

6

Chavira

8

Cuneo

9

cúneo

7

Cúneo

For the Latin1_General_100_CI_AS collation, the database engine lists the name that starts with a lowercase letter first in each sorted group (e.g., carrion then Carrion), which is the opposite of the Modern_Spanish_CI_AI collation. The Traditional_Spanish_CI_AS collation is inconsistent in this regard (e.g., Carrion then carrion vs. cúneo then Cúneo).

Despite the fact that the collations Latin1_General_100_CI_AS, Traditional_Spanish_CI_AS, and Modern_Spanish_CI_AI sort data differently, they are all based on code page 1252, which means that no matter which collation is specified, the characters all map to the same set of bit patterns.

If you want to view a column’s collation, you can query the sys.columns catalog view:

Because we had configured the column with the database’s default collation, the statement returns the Latin1_General_100_CI_AS collation.

8. How do I specify a collation when querying data?

You can also use the COLLATE clause directly in your queries. For example, suppose we’ve assigned the Latin1_General_100_CI_AS collation to our database and its columns, but we want to return the data sorted according to the rules of the Traditional_Spanish_CI_AS collation. We need only add the COLLATE clause to the SELECT statement’s ORDER BY clause:

As the following results show, the names are now sorted according to the Traditional_Spanish_CI_AS collation, which is case-insensitive and accent-sensitive, with the “c” and “ch” treated as separate letters:

NameID

LastName

2

Carrion

4

carrion

3

carrión

1

Carrión

8

Cuneo

9

cúneo

7

Cúneo

5

Chaves

6

Chavira

Now suppose we want to use the following SELECT statement to retrieve a specific name from the table:

Because the collation is case-insensitive, but accent-sensitive, the results include only the following two names:

NameID

LastName

2

Carrion

4

carrion

This time, let’s specify the Traditional_Spanish_CS_AI collation when we query the table:

Because the new collation is case-sensitive, but accent-insensitive, we now get these results:

NameID

LastName

3

carrión

4

carrion

By specifying the collation, we can better control how the data is compared and the results sorted. But be aware that using collations in your queries can impact performance, especially when they involve large data sets. For example, a query might require additional table scans, depending on the nature of the data.

9. How do I query columns with different collations?

You might run into situations in which you need to query data from multiple columns that use different collations. The columns might be in the same table, different tables in the same database or different databases. In such cases, you can use the COLLATE clause in your queries to resolve possible incompatibilities.

Let’s look at a few examples to demonstrate how this works. We’ll start by creating two tables in the TestCollations database (currently configured with the Latin1_General_100_CI_AS collation):

Notice that we’ve assigned the SQL_Latin1_General_CP1_CI_AS collation to the AltID column in the Emp2 table, but use the default database collation for everything else. Now let’s try to join the tables based on the two AltID columns:

Because the AltID columns are configured with different collations, the database engine returns the following error message:

To get around this, we can add the COLLATE clause to the join condition when we reference the AltID column in the Emp2 table:

Now the SELECT statement returns these results:

EmpID

AltID

JobTitle

LoginID

1

ab101

sales rep

jchaves

2

cd102

CEO

jdoe

3

ef103

test engineer

rschmidt

4

gh104

HR specialist

llee

Because the AltID column in the Emp1 table uses the database’s default collation, we can instead specify the database_default keyword in our COLLATE clause, rather than the name of the collation:

The SELECT statement returns the same results as the preceding statement, as does the next SELECT statement:

This time, we’ve added the COLLATE clause to the first AltID reference, which achieves the same results, that is, to join the tables based on the AltID columns.

Now let’s look at what happens when we try to concatenate the LoginID and AltID columns in the Emp2 table:

As with our join, we’re working with two character columns configured with different collations. Not surprisingly, the database engine returns the following error message:

Again, the COLLATE clause comes to the rescue:

The SELECT statement now returns the results we’re looking for:

EmpID

NewEmpID

1

jchavesab101

2

jdoecd102

3

rschmidtef103

4

lleegh104

Again, being able to use the COLLATE clause in your queries is a handy tool, but keep in mind that the database engine has to work harder to make this happen.

10. How do I create a foreign key on columns with different collations?

Just like you cannot change the collation on a column that is being referenced by a foreign key, you cannot create a foreign key between two columns with different collations. To test this out, let’s update the Emp1 table we created above by adding a unique constraint to the AltID column so it can be referenced by a foreign key:

Now let’s try to add a foreign key constraint to the AltID column in the Emp2 table, referencing the AltID column in Emp1 table:

Although both AltID columns are configured with the char(5) data type, when we try to run the ALTER TABLE statement, the database engine returns the following error message:

One way to get around this is to alter the AltID column in Emp2 table by changing the default collation to the same as the AltID column in Emp1 table:

In some cases, however, you will not be able to change the referencing column’s collation. Another approach you might consider is to add a persisted computed column to the referenced table. For example, the following ALTER TABLE statement adds the AltID2 computed column to the Emp1 table:

The computed column contains the same data as AltID but specifies the SQL_Latin1_General_CP1_CI_AS collation. Next, we define a unique constraint on the AltID2 column so it can be referenced by a foreign key:

Finally, we add a foreign key constraint to the AltID column in the Emp2 table, pointing to the AltID2 column in the Emp1 table:

Although this will do the trick, adding a persisted computed column with a unique index adds overhead, so be sure this alternative won’t have too serious an impact on your transactional workloads.

11. How do collations affect temporary tables?

If your SQL Server environment uses the same collation for the server, databases, and columns, then temporary tables present no problem in terms of collations. All the character columns throughout the database adhere to the same rules when performing sorts, comparisons, joins, concatenations, or other operations.

However, if you use different collations for your databases and columns, you might run into unexpected results. For example, suppose we create and populate the following temporary table within a database configured with the Latin1_General_100_CI_AS collation:

We might expect that the AltID column will use the same collation as the database, given that no collation has been specified, but that’s not the case. Because temporary tables are stored in the tempdb database, they are assigned the same collation as the server, no matter what the database collation might be.

We can test this scenario by running the following query:

In this case, we’re joining the Emp1 and #Emp3 tables based on their AltID columns. When we run this statement, the database engine returns the following error message:

Fortunately, we can get around this issue by including a COLLATE clause in our query, as we saw in earlier examples:

Now the SELECT statement runs with no problem and returns the results shown in the following table:

EmpID

AltID

JobTitle

LoginID

1

ab101

sales rep

jchaves

2

cd102

CEO

jdoe

3

ef103

test engineer

rschmidt

4

gh104

HR specialist

llee

Another approach we can take is to create our temporary table using the same collation as that of the database:

As with other table definitions, we simply include the COLLATE clause in the column definition, allowing us to create a join without having to deal with collation differences:

When working with different collations, any objects stored in the tempdb database can cause issues. Not only temporary tables, but also variables, temporary stored procedures, and GOTO labels.

12. How do I know what collations to use?

The ideal solution is to choose a collation when setting up SQL Server that can be used for all your user databases and character columns. Using one collation removes any issues you might encounter when querying the data in different ways. It can also be the best approach in terms of performance if multiple collations impact your queries. However, this approach works only if the same language and collation settings are appropriate for all your users and applications—or at least a good majority of them.

If you support multi-cultural environments, you’ll need to take into account a number of considerations. To begin with, you should pick collations that support the most users, and you should use Unicode data types where possible because they can help avoid code page conversion issues. Just keep in mind the storage requirements that come with Unicode’s two bytes per character.

When you implement Unicode data types, you have the option of using Unicode-only collations for your columns and queries. This type of collation can be useful when you’re supporting client applications installed on older operating systems or when the applications or client drivers don’t recognize the new Windows collations. Because Unicode-only collations don’t use code pages, the applications won’t try to map to incorrect bit patterns or adhere to conflicting rules. Instead, both the client and server treat all character data according to the Unicode standards.

However, using Unicode data types with Unicode-only collations is not always an option, in which case, you might need to select your collations based on your client applications or their environments. For example, you might choose SQL Server collations because they can support the legacy systems. Depending on the circumstances, a good strategy might be to assign a SQL Server collation to the server or at least to the user databases, as applicable.

Of course, you should upgrade the client operating system, application, or provider, when possible, but that might not always be a practical solution.

If you’re developing an entirely new data-driven application, you should stick with Windows collations, implementing Unicode data types and Unicode-only collations if and when they make sense. But be sure to keep performance and maintenance in mind, especially when you’re mixing Unicode and non-Unicode data (or single-byte and double-byte data) or when your queries result in code page conversions between the client and server.

The bottom line is that you must keep your client applications and users at the forefront of your thinking when choosing collations. You should also strive to keep things as simple as possible. The more collations you assign to a database, the harder the database engine might have to work. For large data sets, you should test various scenarios to determine their impact on performance, maintenance, and storage, and be sure to look at where code page conversions might be occurring and where conflicts could arise.

  • 5228 views

  • Rate
    [Total: 9    Average: 5/5]
  • Adán Bucio

    This is the most clear explanation about collations I’ve ever read. Examples with “ch”, “c”, “l”, “ll” and “ñ” are always useful and very easy to grasp, at least here in México.
    Thank you.