Understanding Full-Text Indexing in SQL Server

Microsoft has quietly been improving full-text indexing in SQL Server. It is time to take a good look at what it offers. Who better to give us that look than Robert Sheldon, in the first of a series.

The most commonly used indexes in a SQL Server database are clustered and nonclustered indexes that are organized in a B-tree structure. You can create these types of indexes on most columns in a table or a view, except those columns configured with large object (LOB) data types, such as text and varchar(max). Although this limitation is not a problem in many cases, there will be times when you’ll want to query such column types. However, without indexes defined on the columns, the query engine must perform a full table scan to locate the necessary data. But there is a solution-full-text search.

Full-text search refers to the functionality in SQL Server that supports full-text queries against character-based data. These types of queries can include words and phrases as well as multiple forms of a word or phrase. To support full-text queries, full-text indexes must be implemented on the columns referenced in the query. The columns can be configured with character data types (such as char and varchar) or with binary data types (such as varbinary and image). A full-text index is made up of word tokens that are derived from the text being indexed. For example, if the indexed text contains the phrase “tables can include indexes,” the full-text index would contain four tokens: “tables,” “can,” “include,” and “indexes.” Because the list of tokens can be easily searched, full-text queries can quickly locate the necessary records.

In this article, I explain how to implement full-text indexing in your SQL Server 2005 or 2008 database, and I provide a number of examples to demonstrate how this is done. For these examples, I used the following script to create and populate the ProductDocs table in the AdventureWorks2008 sample database in SQL Server 2008:

If you’re using the AdventureWorks sample database in SQL Server 2005, simply change the name of the database in the USE statement.

To implement full-text indexing in SQL Server, you should take the following steps:

  1. Create a full-text catalog, if necessary.
  2. Create the full-text index.
  3. Modify the list of noise words (SQL Server 2005) or stop words (SQL Server 2008), if necessary.
  4. Modify the thesaurus for the language being used, if necessary.

These steps are the same in both SQL Server 2005 and 2008, although the specific features supported in each version differ. Part of that difference results from the fact that full-text search is now integrated in the SQL Server 2008 database engine. Prior to 2008, full-text search was based on the MSSearch engine, which sits outside SQL Server and is part of the Microsoft Office group.

Creating the Full-Text Catalog

A full-text catalog provides a mechanism for organizing full-text indexes. Each catalog can contain zero or more indexes, but each index can be associated with only one catalog. Catalogs are implemented differently in SQL Server 2005 and 2008:

  • SQL Server 2005: A full-text catalog is a physical structure that must reside on the local hard drive associated with the SQL Server instance. Each catalog is part of a specific filegroup. If no filegroup is specified when the catalog is created, the default filegroup is used.
  • SQL Server 2008: A full-text catalog is a logical concept that refers to a group of full-text indexes. The catalog is not associated with a filegroup.

To create a full-text catalog in either version of SQL Server, you can use the CREATE FULLTEXT CATALOG statement, as shown in the following example:

Full-text catalogs are associated with specific databases. In this case, I added the catalog to the AdventureWorks2008 database, but you can create the catalog on any user-defined database.

The only required clause in a CREATE FULLTEXT CATALOG statement is the CREATE FULLTEXT CATALOG clause, which requires that you provide a name for the catalog (in this case, ProductFTS). You can also specify whether this catalog is the default catalog, the catalog owner, or, as in the example above, whether the indexes in the catalog are accent sensitive or insensitive.

If the WITH ACCENT_SENSITIVITY clause is not specified, accent sensitivity is based on the default database collation. You can determine a database’s collation by querying the sys.databases catalog view, as shown in the following example:

The follow table shows the statement’s results:

Name

collation_name

AdventureWorks2008

SQL_Latin1_General_CP1_CI_AS

Note: The results shown above, along with the results shown for other examples in this article, are based on queries issued in SQL Server 2008 against the AdventureWorks2008 database. You results may be different depending on the SQL Server version and the database used.

As mentioned above, SQL Server 2005 full-text catalogs are part of a filegroup. If you’re running the CREATE FULLTEXT CATALOG statement against SQL Server 2005, you can specify the name of the filegroup by using the ON FILEGROUP option. You can also specify the directory on which to store the catalog by using the IN PATH option. Both of these options are ignored in SQL Server 2008. For details about the ON FILEGROUP and IN PATH options, see the topic “CREATE FULLTEXT CATALOG (Transact-SQL)” in SQL Server 2005 Books Online.

After you create a full-text catalog, you can use the sys.fulltext_catalogs catalog view to verify that the catalog has been created:

In this case, the statement returns the results shown in the following table:

fulltext_catalog_id

name

5

AW2008FullTextCatalog

10

ProductFTS

Notice that the ProductFTS full-text catalog is included in the results. Also notice that in this case the fulltext_catalog_id value is 10. The ID is a good number to note because you can use it in other statements that are related to the full-text catalog.

Creating the Full-Text Index

After you create your full-text catalog, you’re ready to create your full-text index. You can then associate the index with the new catalog. If you don’t specify a catalog when you create the index, the index is associated with the database’s default catalog, whether it is the system catalog or a user-defined catalog that has been configured as the default.

A full-text index is defined at the table level, and only one full-text index can be defined per table. For a table to support a full-text index, a unique index must be defined on that table. In addition, the index must be defined on a single column and be non-nullable. This column is referred to as the key index in the full-text index definition. For best performance, the key index should be defined on a column configured with an integer data type. Often, the primary key is a good candidate for a key index.

You can define a full-text index by using the CREATE FULLTEXT INDEX statement, as shown in the following example.

The first line of the statement includes the ON clause, which specifies the table name (in this case, ProductDocs). The statement’s next line is a list of the columns that should be indexed (DocSummary and DocContent).

Notice that the DocContent column definition includes the TYPE COLUMN clause and the LANGUAGE clause. The TYPE COLUMN clause is necessary when the indexed column stores binary data. The TYPE COLUMN clause specifies the name of a different column in the table that stores the file extension for the binary data. For example, the binary data might be a .doc file or .xls file. SQL Server uses the column specified in the TYPE COLUMN clause to associate the binary data with the program, such as Word or Excel.

You can use the sys.fulltext_document_types catalog view to return a list of the document types supported by full-text search, as shown in the following statement:

The following table provides a sample of some of the document types (indicated by their file extension) supported by full-text search:

document_type

Version

.ascx

12.0.6828.0

.asm

12.0.6828.0

.asp

12.0.6828.0

.aspx

12.0.6828.0

.bat

12.0.6828.0

.c

12.0.6828.0

.cmd

12.0.6828.0

.cpp

12.0.6828.0

.cxx

12.0.6828.0

.def

12.0.6828.0

.dic

12.0.6828.0

.doc

2006.0.6001.16503

The second clause in the DocContent column definition is LANGUAGE. For each column that you include in your index, you can specify the language of the documents in that column. You reference the language by using the locale identifier (LCID). You can view a list of the identifiers and their associated languages by using the sys.fulltext_languages catalog view:

The following table shows a part of the result set. As you can see, 1033, the ID I use in the LANGUAGE clause, is English.

lcid

Name

0

Neutral

1025

Arabic

1026

Bulgarian

1027

Catalan

1028

Traditional Chinese

1031

German

1033

English

1036

French

1037

Hebrew

1039

Icelandic

1040

Italian

The next line of the CREATE FULLTEXT INDEX statement in the preceding example is the KEY INDEX clause. This is the name of the unique key index (in this case, PK_ProductDocs_DocID) that is defined on the ProductDocs table. Be sure to specify the index name, and not the column name, when defining your full-text index.

Following the KEY INDEX clause in the full-text index definition is the ON clause, which specifies the name of the full-text catalog (ProductFTS) that the index will join. In SQL Server 2008, you can also specify a filegroup where the index will be stored. However, this option isn’t available in SQL Server 2005 because filegroup association is at the catalog level.

The final clause in the example CREATE FULLTEXT INDEX statement is WITH STOPLIST. This option, available only in SQL Server 2008, lets you specify the name of the stoplist that will be used for this index. In this case, the system stoplist is used, but you can instead specify a user-defined stoplist. (Stoplists are covered in more detail later in the article.)

After you create a full-text index on a table, you can use several catalog views to verify the index has been created and is associated with the correct catalog. The following SELECT statement joins the sys.tables, sys.fulltext_indexes, and sys.fulltext_catalogs views:

As you can see in the following results, the ProductDocs table is associated with the ProductFTS catalog. Only tables with full-text indexes are listed in the results.

TableName

FTCatalogName

ProductReview

AW2008FullTextCatalog

Document

AW2008FullTextCatalog

JobCandidate

AW2008FullTextCatalog

ProductDocs

ProductFTS

One of the advantages of SQL Server 2008 is that the index is now stored within the database. That means you can issue a query that lists the contents of the index, something you cannot do in SQL Server 2005. The following SELECT statement uses the sys.dm_fts_index_keywords dynamic management function to return the list of terms stored in the full-text index created on the ProductDocs table:

The results returned by the statement include the indexed terms, along with the column ID and document count (number of rows) that contain the term. The column ID is based on the order the columns are defined in the table definition. In the ProductDocs table, the two indexed columns are DocSummary (the fifth column defined) and DocContent (the sixth column defined). The following table shows the first 25 terms stored in the ProductDocs full-text index.

display_term

column_id

document_count

100

5

1

100

6

1

1000

6

1

150

6

1

16

6

2

20

6

1

2000

6

1

20w

6

1

23ft

6

1

248

6

1

250

6

1

3000

6

1

44

6

1

48

6

1

500

5

1

500

6

1

618

6

1

619

6

1

620

6

1

Above

6

2

Absorbing

6

1

Acceptable

6

1

Accessories

6

1

Add

6

1

Adding

6

1

Modifying the List of Noise Words or Stop Words

When implementing full-text indexing in SQL Server, the area in which you will probably see the greatest differences between SQL Server 2005 and 2008 is in the way each version handles noise words or stop word.

Noise Words in SQL Server 2005

Noise words are those words that are automatically removed from a full-text index when that index is created. For example, in the phrase “an apple and an orange,” the words “an” and “and” are considered noise words and are not be included in the index. Only “apple” and “orange” are tokenized and added to the index.

SQL Server 2005 defines noise words in a set of text files, which by default, are stored in the folder $SQL_Server_Install_Path\Microsoft SQL Server\MSSQL.1\MSSQL\FTData. Each text file is associated with a specific language and contains the noise words associated in that language. For example, the English noise words are stored in the noiseENG.txt file. The following list shows a sample of some of the words in the file:

You can edit a noise word file in any text editor. Simply add or remove words as necessary, and then save your changes. (Be sure to save a copy of the original file before changing it.)

Note: After you modify a noise word file, you must repopulate the applicable full-text indexes for the changes to take effect.

Stop Words in SQL Server 2008

SQL Server 2008 uses stop words, not noise words. Stop words are saved to stoplists, which are stored within SQL Server. You can create a stoplist by using a CREATE FULLTEXT STOPLIST statement, as shown in the following example:

In the first line, the statement creates a stoplist named ProductSL. The second line retrieves the stop words from the system stoplist and uses those stop words to populate the new stoplist. You can verify that the stoplist has been created by querying the sys.fulltext_stoplists catalog view:

As you can see in the following results, the ProductSL stoplist has been added to the database, and the stoplist ID is 5.

stoplist_id

Name

5

ProductSL

After you create a stoplist, you can view its contents by querying the sys.fulltext_stopwords catalog view, as shown in the following statement:

Notice that I specified the stoplist ID and language ID to retrieve only the necessary content. The following table shows a partial list of the stop words saved to the ProductSL stoplist.

Stopword

About

After

All

Also

An

And

another

Any

Are

As

At

Be

because

been

before

being

between

both

But

By

came

Can

come

could

Did

Do

does

each

else

For

from

Get

Got

Had

Has

have

He

Her

here

Him

Himself

His

How

In SQL Server 2008, you can add words to or remove words from a stoplist by using the ALTER FULLTEXT STOPLIST statement. The following statement adds the word “nuts” to the stoplist:

When you generate a full-text index based on this stoplist, any occurrence of “nuts” will be treated as a stop word and be removed from the index.

SQL Server 2008 also includes the sys.dm_fts_parser dynamic management function. The function lets you test how SQL Server will tokenize a string based on a specific language and stoplist. In the following SELECT statement, the function parses the phrase “testing for fruit and nuts, any type of nut.”

The first argument in the function is the string that will be parsed, the second argument is the language ID, the third argument is the stoplist ID, and the fourth argument specifies whether the parsing should be accent insensitive (0) or accent sensitive (1). The following table shows the query results. Notice that “nuts” is considered a noise word.

special_term

display_term

Exact Match

Testing

Noise Word

For

Exact Match

Fruit

Noise Word

And

Noise Word

Nuts

Noise Word

Any

Exact Match

Type

Noise Word

Of

Exact Match

Nut

If you want to drop “nuts” from the stoplist, you can use the following statement:

Now when you use the sys.dm_fts_parser function to view the stoplist, you will receive the following results, which show that “nuts” is now considered an exact match.

special_term

display_term

Exact Match

Testing

Noise Word

For

Exact Match

Fruit

Noise Word

And

Exact Match

Nuts

Noise Word

Any

Exact Match

Type

Noise Word

Of

Exact Match

Nut

Earlier in the article, when I created the full-text index on the ProductDocs table, I specified that the index should use the system stoplist. I can now modify the index definition to instead use the ProductSL index. In the following ALTER FULLTEXT INDEX statement, I set the stoplist to ProductSL:

Note: After you modify a stoplist or full-text index definition, you must repopulate the applicable full-text indexes for the changes to take effect.

Modifying the Full-Text Thesaurus

Both SQL Server 2005 and SQL Server 2008 provide a set of XML thesaurus files that let you define synonyms to support full-text queries. For example, you can define a set of synonyms for “song,” “tune,” and “music.” That way, whenever a query is issued against any one of these terms, the results include every other term defined in the set.

SQL Server includes a thesaurus file for each language supported by full-text search. The files are named according to the language they support. For instance, the thesaurus file that supports English synonyms is named tseng.xml.

In a default installation of SQL Server 2005, the thesaurus files are stored in the folder $SQL_Server_Install_Path\Microsoft SQL Server\MSSQL.1\MSSQL\FTData. In a default installation of SQL Server 2008, the files are stored in the folder $SQL_Server_Install_Path\Microsoft SQL Server\MSSQL10.MSSQLSERVER\MSSQL\FTData.

Each thesaurus file comes preconfigured with the following XML code. For example, the following code is included in the tseng.xml file:

As you can see, most of the code is commented out, but you can modify the file as necessary. The first element to note is <diacritics_sensitive>. This element determines whether the thesaurus file is accent sensitive (1) or insensitive (0). By default, a thesaurus file is accent insensitive.

The other two elements worth noting are <expansion> and <replacement>. The first lets you define expansion sets, and the second lets you define replacement sets:

  • Expansion set: A set of terms that can be substituted for each other. For example, in the sample <expansion> element in the tseng.xml file, three terms are included: “Internet Explorer,” “IE,” and “IE5.” As a result, if a full-text query includes the term “Internet Explorer,” the query results can include content that contains any of the three terms.
  • Replacement set: A set of terms in which one term is substituted for another. For example, in the example <replacement> element in the tseng.xml file, the terms “NT5” and “W2K” are replaced by “Windows 2000.” As a result, if a full-text query includes the term “NT5” or “W2K,” the search results include only content that contains “Windows 2000.”

You can edit a thesaurus file in any XML or text editor. After you save your changes, you must take steps to ensure that the thesaurus is applied to your full-text searches. In SQL Server 2005, this means you must restart the full-text search engine. However, in SQL Server 2008, you need only run the sys.sp_fulltext_load_thesaurus_file system stored procedure, as shown in the following example:

As you can see, when you run the stored procedure, you must specify the LCID. The stored procedure then parses and loads the data from the applicable thesaurus file, without having to restart the full-text engine.

Looking Ahead

After you’ve implemented a full-text index on a table, you can then query the indexed columns by using special predicates or functions. In my next article, I will explain how to create those queries and demonstrate the syntax that you use. Until then, you should now have the information you need to create your full-text indexes. Remember that you can create only one full-text index on a table or view and that you can create an index only on a table or view that is configured with a non-nullable unique index (with only one key). Otherwise, the process of creating a full-text index is very straightforward. For more details about full-text indexing and full-text search in general, be sure to check out SQL Server Books Online.

Tags: , , , ,

  • 348137 views

  • Rate
    [Total: 23    Average: 4.4/5]
  • SneWs

    Good article.
    This is a nice introduction to full text indexing with SQL Server.

    Tanks and have a good one!

  • Ginters

    Excellent stuff
    Full text indexing a very important thing that I know I should do more of.

  • Keith

    Next Article
    I look forward to the next article and hope it provides more information then just how to use the “where blank like blank%” articles you find all over the internet. I am looking for some advanced queries of a full text index.

  • Ramesh

    Demistified
    Thanks alot Rob, for the demistifying the full text index. Appreciate response to some more questions.

    1. Why and where do we need full text index?
    2. What is the difference between noise words and stop words? when should be we noise words vs stop words?

  • woakesd

    Email addresses
    I had a go at using full text indexes but the language sensitivity aspect seems a bit of a show stopper because it wouldn’t match an email address.

    If there are features that allow for matching of things like this please cover in another article.

    Great article though and I look forward to reading them.

  • MR.TAEJAAY

    Good Article
    This is good article for full-text index and catelog

  • Sam

    We can understand the basics.
    This is really a good article to have the basic understanding of the Full text indexing.
    I am really looking forward for the next article.

    Cheers,
    Sam.

  • kalyansundar

    understand basics of Full Text
    Must Read article

  • Houmed

    Thanking to Mr. Robert Sheldon

    before anything Alhamdouli-allah…. I am thanking you so much and this is one the kind article…. cover all the essential basics of Full-Text Searching… so clean and so easy to understand… Thank you God bless you wa yahdik’allah incha’allah 🙂

  • patrick2525

    Clarification for ambiguous ‘index’
    Thanks for the great write-up. I have just one question – and hope that its not too late to ask since this post is over 2 years old!

    In the second paragraph of the Creating a Full-Text Index section, you state “In addition, the index must be defined on a single column and be non-nullable.”

    Was that statement referring to the full-text index, or the unique index for the table (as mentioned in the previous sentence)?

    Thanks!

  • patrick2525

    Clarification for ambiguous ‘index’
    Thanks for the great write-up. I have just one question – and hope that its not too late to ask since this post is over 2 years old!

    In the second paragraph of the Creating a Full-Text Index section, you state “In addition, the index must be defined on a single column and be non-nullable.”

    Was that statement referring to the full-text index, or the unique index for the table (as mentioned in the previous sentence)?

    Thanks!

  • TheHappyDBA

    Fulltext Indexes
    Awesome article! Answered a few questions and let me better understand fulltext indexes.

    Great job! Thanks a bunch!

  • badal.kumar@gmail.com

    Same contains statement fail on server
    Hi Robert,

    Your articles are really very nice. It’s really a good piece of work.

    I’ve one quick query:

    SQL Statement:

    Select [Sample Product] from ProductList
    where contains([Sample Product],’"Simple Text-Lock (D)"’)

    Same above statement working fine (It means it’s returning four result) on our Test environment but fail on UAT environment (Note: its contains same data in database), same full text catalogue etc. no difference but fail to return any record. If we are removing “( and )” from above statement, then it’s returning expected result on both environment.

    Few more details which may help to move in right direction.

    Collation_name: SQL_Latin1_General_CP1_CI_AS

    Result after executing – SELECT * FROM sys.dm_fts_parser (‘"Ariel Dirt-Lock (D)"’, 1033, NULL, 0) on both server, result is same as mentioned:

    Keyword group_id phrase_id occurance special_term display_term expension_type source_term
    0x0061007200690065006C 1 0 1 Exact Match Simple 0 Simple Text-Lock (D)
    0x0064006900720074 1 1 2 Exact Match Text 0 Simple Text-Lock (D)
    0x006C006F0063006B 1 1 3 Exact Match Lock 0 Simple Text-Lock (D)
    0x0064006900720074006C006F0063006B 1 2 2 Exact Match Textlock 0 Simple Text-Lock (D)
    0x0064 1 0 4 Exact Match d 0 Simple Text-Lock (D)

    Value of statement – Select * from sys.configurations where name like ‘%noise%’ is 0 on both server:

    Any kind of help or suggestion will be highly appreciated.

    Thanks

  • AndreiPetrut

    Indexing .msg files
    Hi Robert,

    The article was great.

    I tried to apply a Full Text Index in SQL Server 2008 on a varbinary(max) column that contains documents uploaded using FileStream. Some documents are .msg files that come from Outlook and for some reason the content of these .msg files is not indexed 100%. Out of 2000 words from the email (title, body and signature) only 400 words were being indexed.

    Do you know why this might happen ? We rebuilt the catalog and the index and still no change occured. There is no error in the trace files.

    Any kind of help is really appreciated.

    Thanks,
    Andrei

  • Gyozo Kudor

    How do you search?