Av rating:
Total votes: 94
Total comments: 7


Robert Sheldon
Understanding Full-Text Indexing in SQL Server
29 December 2008

Microsoft has quietly been improving full-text indexing in SQL Server. It is time to take a good look at what it offers. Who better to give us that look than Robert Sheldon, in the first of a series.

The most commonly used indexes in a SQL Server database are clustered and nonclustered indexes that are organized in a B-tree structure. You can create these types of indexes on most columns in a table or a view, except those columns configured with large object (LOB) data types, such as text and varchar(max). Although this limitation is not a problem in many cases, there will be times when you’ll want to query such column types. However, without indexes defined on the columns, the query engine must perform a full table scan to locate the necessary data. But there is a solution—full-text search.

Full-text search refers to the functionality in SQL Server that supports full-text queries against character-based data. These types of queries can include words and phrases as well as multiple forms of a word or phrase. To support full-text queries, full-text indexes must be implemented on the columns referenced in the query. The columns can be configured with character data types (such as char and varchar) or with binary data types (such as varbinary and image). A full-text index is made up of word tokens that are derived from the text being indexed. For example, if the indexed text contains the phrase “tables can include indexes,” the full-text index would contain four tokens: “tables,” “can,” “include,” and “indexes.” Because the list of tokens can be easily searched, full-text queries can quickly locate the necessary records.

In this article, I explain how to implement full-text indexing in your SQL Server 2005 or 2008 database, and I provide a number of examples to demonstrate how this is done. For these examples, I used the following script to create and populate the ProductDocs table in the AdventureWorks2008 sample database in SQL Server 2008:

USE AdventureWorks2008

GO

IF OBJECT_ID (N'ProductDocs', N'U') IS NOT NULL

DROP TABLE ProductDocs

GO

CREATE TABLE ProductDocs (

  DocID INT NOT NULL IDENTITY,

  DocTitle NVARCHAR(50) NOT NULL,

  DocFilename NVARCHAR(400) NOT NULL,

  FileExtension NVARCHAR(8) NOT NULL,

  DocSummary NVARCHAR(MAX) NULL,

  DocContent VARBINARY(MAX) NULL,

  CONSTRAINT [PK_ProductDocs_DocID] PRIMARY KEY CLUSTERED (DocID ASC)

)

GO

INSERT INTO ProductDocs

(DocTitle, DocFilename, FileExtension, DocSummary, DocContent)

SELECT Title, FileName, FileExtension, DocumentSummary, Document

FROM Production.Document

GO

If you’re using the AdventureWorks sample database in SQL Server 2005, simply change the name of the database in the USE statement.

To implement full-text indexing in SQL Server, you should take the following steps:

  1. Create a full-text catalog, if necessary.
  2. Create the full-text index.
  3. Modify the list of noise words (SQL Server 2005) or stop words (SQL Server 2008), if necessary.
  4. Modify the thesaurus for the language being used, if necessary.

These steps are the same in both SQL Server 2005 and 2008, although the specific features supported in each version differ. Part of that difference results from the fact that full-text search is now integrated in the SQL Server 2008 database engine. Prior to 2008, full-text search was based on the MSSearch engine, which sits outside SQL Server and is part of the Microsoft Office group.

Creating the Full-Text Catalog

A full-text catalog provides a mechanism for organizing full-text indexes. Each catalog can contain zero or more indexes, but each index can be associated with only one catalog. Catalogs are implemented differently in SQL Server 2005 and 2008:

  • SQL Server 2005: A full-text catalog is a physical structure that must reside on the local hard drive associated with the SQL Server instance. Each catalog is part of a specific filegroup. If no filegroup is specified when the catalog is created, the default filegroup is used.
  • SQL Server 2008: A full-text catalog is a logical concept that refers to a group of full-text indexes. The catalog is not associated with a filegroup.

To create a full-text catalog in either version of SQL Server, you can use the CREATE FULLTEXT CATALOG statement, as shown in the following example:

USE AdventureWorks2008

GO

CREATE FULLTEXT CATALOG ProductFTS

WITH ACCENT_SENSITIVITY = OFF

Full-text catalogs are associated with specific databases. In this case, I added the catalog to the AdventureWorks2008 database, but you can create the catalog on any user-defined database.

The only required clause in a CREATE FULLTEXT CATALOG statement is the CREATE FULLTEXT CATALOG clause, which requires that you provide a name for the catalog (in this case, ProductFTS). You can also specify whether this catalog is the default catalog, the catalog owner, or, as in the example above, whether the indexes in the catalog are accent sensitive or insensitive.

If the WITH ACCENT_SENSITIVITY clause is not specified, accent sensitivity is based on the default database collation. You can determine a database’s collation by querying the sys.databases catalog view, as shown in the following example:

SELECT name, collation_name FROM sys.databases

WHERE name = 'AdventureWorks2008'

The follow table shows the statement’s results:

Name

collation_name

AdventureWorks2008

SQL_Latin1_General_CP1_CI_AS

Note: The results shown above, along with the results shown for other examples in this article, are based on queries issued in SQL Server 2008 against the AdventureWorks2008 database. You results may be different depending on the SQL Server version and the database used.

As mentioned above, SQL Server 2005 full-text catalogs are part of a filegroup. If you’re running the CREATE FULLTEXT CATALOG statement against SQL Server 2005, you can specify the name of the filegroup by using the ON FILEGROUP option. You can also specify the directory on which to store the catalog by using the IN PATH option. Both of these options are ignored in SQL Server 2008. For details about the ON FILEGROUP and IN PATH options, see the topic “CREATE FULLTEXT CATALOG (Transact-SQL)” in SQL Server 2005 Books Online.

After you create a full-text catalog, you can use the sys.fulltext_catalogs catalog view to verify that the catalog has been created:

SELECT fulltext_catalog_id, name FROM sys.fulltext_catalogs

In this case, the statement returns the results shown in the following table:

fulltext_catalog_id

name

5

AW2008FullTextCatalog

10

ProductFTS

Notice that the ProductFTS full-text catalog is included in the results. Also notice that in this case the fulltext_catalog_id value is 10. The ID is a good number to note because you can use it in other statements that are related to the full-text catalog.

Creating the Full-Text Index

After you create your full-text catalog, you’re ready to create your full-text index. You can then associate the index with the new catalog. If you don’t specify a catalog when you create the index, the index is associated with the database’s default catalog, whether it is the system catalog or a user-defined catalog that has been configured as the default.

A full-text index is defined at the table level, and only one full-text index can be defined per table. For a table to support a full-text index, a unique index must be defined on that table. In addition, the index must be defined on a single column and be non-nullable. This column is referred to as the key index in the full-text index definition. For best performance, the key index should be defined on a column configured with an integer data type. Often, the primary key is a good candidate for a key index.

You can define a full-text index by using the CREATE FULLTEXT INDEX statement, as shown in the following example.

CREATE FULLTEXT INDEX ON ProductDocs

(DocSummary, DocContent TYPE COLUMN FileExtension LANGUAGE 1033)

KEY INDEX PK_ProductDocs_DocID

ON ProductFTS

WITH STOPLIST = SYSTEM

The first line of the statement includes the ON clause, which specifies the table name (in this case, ProductDocs). The statement’s next line is a list of the columns that should be indexed (DocSummary and DocContent).

Notice that the DocContent column definition includes the TYPE COLUMN clause and the LANGUAGE clause. The TYPE COLUMN clause is necessary when the indexed column stores binary data. The TYPE COLUMN clause specifies the name of a different column in the table that stores the file extension for the binary data. For example, the binary data might be a .doc file or .xls file. SQL Server uses the column specified in the TYPE COLUMN clause to associate the binary data with the program, such as Word or Excel.

You can use the sys.fulltext_document_types catalog view to return a list of the document types supported by full-text search, as shown in the following statement:

SELECT document_type, version, manufacturer

FROM sys.fulltext_document_types

The following table provides a sample of some of the document types (indicated by their file extension) supported by full-text search:

document_type

Version

.ascx

12.0.6828.0

.asm

12.0.6828.0

.asp

12.0.6828.0

.aspx

12.0.6828.0

.bat

12.0.6828.0

.c

12.0.6828.0

.cmd

12.0.6828.0

.cpp

12.0.6828.0

.cxx

12.0.6828.0

.def

12.0.6828.0

.dic

12.0.6828.0

.doc

2006.0.6001.16503

The second clause in the DocContent column definition is LANGUAGE. For each column that you include in your index, you can specify the language of the documents in that column. You reference the language by using the locale identifier (LCID). You can view a list of the identifiers and their associated languages by using the sys.fulltext_languages catalog view:

SELECT * FROM sys.fulltext_languages

ORDER BY lcid

The following table shows a part of the result set. As you can see, 1033, the ID I use in the LANGUAGE clause, is English.

lcid

Name

0

Neutral

1025

Arabic

1026

Bulgarian

1027

Catalan

1028

Traditional Chinese

1031

German

1033

English

1036

French

1037

Hebrew

1039

Icelandic

1040

Italian

The next line of the CREATE FULLTEXT INDEX statement in the preceding example is the KEY INDEX clause. This is the name of the unique key index (in this case, PK_ProductDocs_DocID) that is defined on the ProductDocs table. Be sure to specify the index name, and not the column name, when defining your full-text index.

Following the KEY INDEX clause in the full-text index definition is the ON clause, which specifies the name of the full-text catalog (ProductFTS) that the index will join. In SQL Server 2008, you can also specify a filegroup where the index will be stored. However, this option isn’t available in SQL Server 2005 because filegroup association is at the catalog level.

The final clause in the example CREATE FULLTEXT INDEX statement is WITH STOPLIST. This option, available only in SQL Server 2008, lets you specify the name of the stoplist that will be used for this index. In this case, the system stoplist is used, but you can instead specify a user-defined stoplist. (Stoplists are covered in more detail later in the article.)

After you create a full-text index on a table, you can use several catalog views to verify the index has been created and is associated with the correct catalog. The following SELECT statement joins the sys.tables, sys.fulltext_indexes, and sys.fulltext_catalogs views:

SELECT t.name AS TableName, c.name AS FTCatalogName

FROM sys.tables t JOIN sys.fulltext_indexes i

  ON t.object_id = i.object_id

JOIN sys.fulltext_catalogs c

  ON i.fulltext_catalog_id = c.fulltext_catalog_id

As you can see in the following results, the ProductDocs table is associated with the ProductFTS catalog. Only tables with full-text indexes are listed in the results.

TableName

FTCatalogName

ProductReview

AW2008FullTextCatalog

Document

AW2008FullTextCatalog

JobCandidate

AW2008FullTextCatalog

ProductDocs

ProductFTS

One of the advantages of SQL Server 2008 is that the index is now stored within the database. That means you can issue a query that lists the contents of the index, something you cannot do in SQL Server 2005. The following SELECT statement uses the sys.dm_fts_index_keywords dynamic management function to return the list of terms stored in the full-text index created on the ProductDocs table:

SELECT display_term, column_id, document_count

FROM sys.dm_fts_index_keywords

  (DB_ID('AdventureWorks2008'), OBJECT_ID('ProductDocs'))

The results returned by the statement include the indexed terms, along with the column ID and document count (number of rows) that contain the term. The column ID is based on the order the columns are defined in the table definition. In the ProductDocs table, the two indexed columns are DocSummary (the fifth column defined) and DocContent (the sixth column defined). The following table shows the first 25 terms stored in the ProductDocs full-text index.

display_term

column_id

document_count

100

5

1

100

6

1

1000

6

1

150

6

1

16

6

2

20

6

1

2000

6

1

20w

6

1

23ft

6

1

248

6

1

250

6

1

3000

6

1

44

6

1

48

6

1

500

5

1

500

6

1

618

6

1

619

6

1

620

6

1

Above

6

2

Absorbing

6

1

Acceptable

6

1

Accessories

6

1

Add

6

1

Adding

6

1

Modifying the List of Noise Words or Stop Words

When implementing full-text indexing in SQL Server, the area in which you will probably see the greatest differences between SQL Server 2005 and 2008 is in the way each version handles noise words or stop word.

Noise Words in SQL Server 2005

Noise words are those words that are automatically removed from a full-text index when that index is created. For example, in the phrase “an apple and an orange,” the words “an” and “and” are considered noise words and are not be included in the index. Only “apple” and “orange” are tokenized and added to the index.

SQL Server 2005 defines noise words in a set of text files, which by default, are stored in the folder $SQL_Server_Install_Path\Microsoft SQL Server\MSSQL.1\MSSQL\FTData. Each text file is associated with a specific language and contains the noise words associated in that language. For example, the English noise words are stored in the noiseENG.txt file. The following list shows a sample of some of the words in the file:

about

1

after

2

all

also

3

an

4

and

5

another

6

any

7

are

8

as

9

at

0

be

$

because

been

before

being

between

both

but

by

came

can

come

could

did

do

You can edit a noise word file in any text editor. Simply add or remove words as necessary, and then save your changes. (Be sure to save a copy of the original file before changing it.)

Note: After you modify a noise word file, you must repopulate the applicable full-text indexes for the changes to take effect.

Stop Words in SQL Server 2008

SQL Server 2008 uses stop words, not noise words. Stop words are saved to stoplists, which are stored within SQL Server. You can create a stoplist by using a CREATE FULLTEXT STOPLIST statement, as shown in the following example:

CREATE FULLTEXT STOPLIST ProductSL

FROM SYSTEM STOPLIST;

In the first line, the statement creates a stoplist named ProductSL. The second line retrieves the stop words from the system stoplist and uses those stop words to populate the new stoplist. You can verify that the stoplist has been created by querying the sys.fulltext_stoplists catalog view:

SELECT stoplist_id, name FROM sys.fulltext_stoplists

As you can see in the following results, the ProductSL stoplist has been added to the database, and the stoplist ID is 5.

stoplist_id

Name

5

ProductSL

After you create a stoplist, you can view its contents by querying the sys.fulltext_stopwords catalog view, as shown in the following statement:

SELECT stopword FROM sys.fulltext_stopwords

WHERE stoplist_id = 5 AND language_id = 1033

Notice that I specified the stoplist ID and language ID to retrieve only the necessary content. The following table shows a partial list of the stop words saved to the ProductSL stoplist.

Stopword

About

After

All

Also

An

And

another

Any

Are

As

At

Be

because

been

before

being

between

both

But

By

came

Can

come

could

Did

Do

does

each

else

For

from

Get

Got

Had

Has

have

He

Her

here

Him

Himself

His

How

In SQL Server 2008, you can add words to or remove words from a stoplist by using the ALTER FULLTEXT STOPLIST statement. The following statement adds the word “nuts” to the stoplist:

ALTER FULLTEXT STOPLIST ProductSL

ADD 'nuts' LANGUAGE 1033;

When you generate a full-text index based on this stoplist, any occurrence of “nuts” will be treated as a stop word and be removed from the index.

SQL Server 2008 also includes the sys.dm_fts_parser dynamic management function. The function lets you test how SQL Server will tokenize a string based on a specific language and stoplist. In the following SELECT statement, the function parses the phrase “testing for fruit and nuts, any type of nut.”

SELECT special_term, display_term

FROM sys.dm_fts_parser

  (' "testing for fruit and nuts, any type of nut" ', 1033, 5, 0)

The first argument in the function is the string that will be parsed, the second argument is the language ID, the third argument is the stoplist ID, and the fourth argument specifies whether the parsing should be accent insensitive (0) or accent sensitive (1). The following table shows the query results. Notice that “nuts” is considered a noise word.

special_term

display_term

Exact Match

Testing

Noise Word

For

Exact Match

Fruit

Noise Word

And

Noise Word

Nuts

Noise Word

Any

Exact Match

Type

Noise Word

Of

Exact Match

Nut

If you want to drop “nuts” from the stoplist, you can use the following statement:

ALTER FULLTEXT STOPLIST ProductSL

DROP 'nuts' LANGUAGE 1033;

Now when you use the sys.dm_fts_parser function to view the stoplist, you will receive the following results, which show that “nuts” is now considered an exact match.

special_term

display_term

Exact Match

Testing

Noise Word

For

Exact Match

Fruit

Noise Word

And

Exact Match

Nuts

Noise Word

Any

Exact Match

Type

Noise Word

Of

Exact Match

Nut

Earlier in the article, when I created the full-text index on the ProductDocs table, I specified that the index should use the system stoplist. I can now modify the index definition to instead use the ProductSL index. In the following ALTER FULLTEXT INDEX statement, I set the stoplist to ProductSL:

ALTER FULLTEXT INDEX ON ProductDocs

SET STOPLIST ProductSL

Note: After you modify a stoplist or full-text index definition, you must repopulate the applicable full-text indexes for the changes to take effect.

Modifying the Full-Text Thesaurus

Both SQL Server 2005 and SQL Server 2008 provide a set of XML thesaurus files that let you define synonyms to support full-text queries. For example, you can define a set of synonyms for “song,” “tune,” and “music.” That way, whenever a query is issued against any one of these terms, the results include every other term defined in the set.

SQL Server includes a thesaurus file for each language supported by full-text search. The files are named according to the language they support. For instance, the thesaurus file that supports English synonyms is named tseng.xml.

In a default installation of SQL Server 2005, the thesaurus files are stored in the folder $SQL_Server_Install_Path\Microsoft SQL Server\MSSQL.1\MSSQL\FTData. In a default installation of SQL Server 2008, the files are stored in the folder $SQL_Server_Install_Path\Microsoft SQL Server\MSSQL10.MSSQLSERVER\MSSQL\FTData.

Each thesaurus file comes preconfigured with the following XML code. For example, the following code is included in the tseng.xml file:

<XML ID="Microsoft Search Thesaurus">

 

<!--  Commented out (SQL Server 2008)

 

    <thesaurus xmlns="x-schema:tsSchema.xml">

   <diacritics_sensitive>0</diacritics_sensitive>

        <expansion>

            <sub>Internet Explorer</sub>

            <sub>IE</sub>

            <sub>IE5</sub>

        </expansion>

        <replacement>

            <pat>NT5</pat>

            <pat>W2K</pat>

            <sub>Windows 2000</sub>

        </replacement>

        <expansion>

            <sub>run</sub>

            <sub>jog</sub>

        </expansion>

    </thesaurus>

-->

</XML>

As you can see, most of the code is commented out, but you can modify the file as necessary. The first element to note is <diacritics_sensitive>. This element determines whether the thesaurus file is accent sensitive (1) or insensitive (0). By default, a thesaurus file is accent insensitive.

The other two elements worth noting are <expansion> and <replacement>. The first lets you define expansion sets, and the second lets you define replacement sets:

  • Expansion set: A set of terms that can be substituted for each other. For example, in the sample <expansion> element in the tseng.xml file, three terms are included: “Internet Explorer,” “IE,” and “IE5.” As a result, if a full-text query includes the term “Internet Explorer,” the query results can include content that contains any of the three terms.
  • Replacement set: A set of terms in which one term is substituted for another. For example, in the example <replacement> element in the tseng.xml file, the terms “NT5” and “W2K” are replaced by “Windows 2000.” As a result, if a full-text query includes the term “NT5” or “W2K,” the search results include only content that contains “Windows 2000.”

You can edit a thesaurus file in any XML or text editor. After you save your changes, you must take steps to ensure that the thesaurus is applied to your full-text searches. In SQL Server 2005, this means you must restart the full-text search engine. However, in SQL Server 2008, you need only run the sys.sp_fulltext_load_thesaurus_file system stored procedure, as shown in the following example:

EXEC sys.sp_fulltext_load_thesaurus_file 1033

As you can see, when you run the stored procedure, you must specify the LCID. The stored procedure then parses and loads the data from the applicable thesaurus file, without having to restart the full-text engine.

Looking Ahead

After you’ve implemented a full-text index on a table, you can then query the indexed columns by using special predicates or functions. In my next article, I will explain how to create those queries and demonstrate the syntax that you use. Until then, you should now have the information you need to create your full-text indexes. Remember that you can create only one full-text index on a table or view and that you can create an index only on a table or view that is configured with a non-nullable unique index (with only one key). Otherwise, the process of creating a full-text index is very straightforward. For more details about full-text indexing and full-text search in general, be sure to check out SQL Server Books Online.



This article has been viewed 20133 times.
Robert Sheldon

Author profile: Robert Sheldon

After being dropped 35 feet from a helicopter and spending the next year recovering, Robert Sheldon left the Colorado Rockies and emergency rescue work to pursue safer and less painful interests—thus his entry into the world of technology. He is now a technical consultant and the author of numerous books, articles, and training material related to Microsoft Windows, various relational database management systems, and business intelligence design and implementation. He has also written news stories, feature articles, restaurant reviews, legal summaries, and the novel 'Dancing the River Lightly'. You can find more information at http://www.rhsheldon.com.

Search for other articles by Robert Sheldon

Rate this article:   Avg rating: from a total of 94 votes.


Poor

OK

Good

Great

Must read
 
Have Your Say
Do you have an opinion on this article? Then add your comment below:
You must be logged in to post to this forum

Click here to log in.


Subject: Good article.
Posted by: SneWs (not signed in)
Posted on: Tuesday, December 30, 2008 at 5:09 PM
Message: This is a nice introduction to full text indexing with SQL Server.

Tanks and have a good one!

Subject: Excellent stuff
Posted by: Ginters (view profile)
Posted on: Wednesday, January 14, 2009 at 3:40 AM
Message: Full text indexing a very important thing that I know I should do more of.

Subject: Next Article
Posted by: Keith (not signed in)
Posted on: Wednesday, January 14, 2009 at 9:01 AM
Message: I look forward to the next article and hope it provides more information then just how to use the "where blank like blank%" articles you find all over the internet. I am looking for some advanced queries of a full text index.

Subject: Demistified
Posted by: Ramesh (view profile)
Posted on: Wednesday, January 14, 2009 at 5:37 PM
Message: Thanks alot Rob, for the demistifying the full text index. Appreciate response to some more questions.

1. Why and where do we need full text index?
2. What is the difference between noise words and stop words? when should be we noise words vs stop words?

Subject: Email addresses
Posted by: woakesd (view profile)
Posted on: Thursday, January 15, 2009 at 6:00 AM
Message: I had a go at using full text indexes but the language sensitivity aspect seems a bit of a show stopper because it wouldn't match an email address.

If there are features that allow for matching of things like this please cover in another article.

Great article though and I look forward to reading them.

Subject: Good Article
Posted by: MR.TAEJAAY (not signed in)
Posted on: Monday, January 19, 2009 at 8:18 PM
Message: This is good article for full-text index and catelog



Subject: We can understand the basics.
Posted by: Sam (view profile)
Posted on: Tuesday, January 20, 2009 at 5:05 AM
Message: This is really a good article to have the basic understanding of the Full text indexing.
I am really looking forward for the next article.

Cheers,
Sam.

 










Phil Factor
Finding Stuff in SQL Server Database DDL
 You'd have thought that nothing would be easier than using SQL Server Management Studio (SSMS) for searching... Read more...



 View the blog
Implementing User-Defined Hierarchies in SQL Server Analysis Services
 To be able to drill into multidimensional cube data at several levels, you must implement all of the... Read more...

Using the Filtering API with the SQL Comparison SDK
 Red Gate's SQL Comparison SDK provides a means to compare and synchronize database schemas and data... Read more...

SQL Toolbelt 2008: Predominantly an Engineering Task
 The conversion of the Red-Gate tools to be compatible with SQL Server 2008 might not seem, on first... Read more...

SQL Response: The dim sum interview
 Richard Morris met David and Nigel of the SQL Response team, in a dim sum Restaurant in Cambridge. They... Read more...

Writing Efficient SQL: Set-Based Speed Phreakery
 Phil Factor's SQL Speed Phreak challenge is an event where coders battle to produce the fastest code to... Read more...

Beginning SQL Server 2005 Reporting Services Part 1
 Steve Joubert begins an in-depth tour of SQL Server 2005 Reporting Services with a step-by-step guide... Read more...

Ten Common Database Design Mistakes
 Database design and implementation is the cornerstone of any data centric project (read 99.9% of... Read more...

Beginning SQL Server 2005 Reporting Services Part 2
 Continuing his in-depth tour of SQL Server 2005 Reporting Services, Steve Joubert demonstrates the most... Read more...

SQL Server Full Text Search Language Features
 SQL Full-text Search (SQL FTS) is an optional component of SQL Server 7 and later, which allows fast... Read more...

Creating CSV Files Using BCP and Stored Procedures
 Nigel Rivett demonstrates some core techniques for extracting SQL Server data into CSV files, focussing... Read more...

Over 150,000 Microsoft professionals subscribe to the Simple-Talk technical journal. Join today, it's fast, simple, free and secure.

Join Simple Talk