SQL Server Full Text Search Language Features, Part 2

Extensive coverage of the query-time behaviour of SQL Server Full-text search, demonstrating how searches are expanded by the parser before hitting the full-text index.

Query-time behavior

This is the second of a two-part article that explores the language features of SQL Full-text Search (SQL FTS), an component of SQL Server 7 and above that allows fast and efficient querying of large amounts of unstructured textual data. Part I dealt with index time language options, covering how words or tokens are broken from the text stream emitted from the iFilters and stored in the index.

Here, we move on to query time behavior. This refers to the manner in which search phrases supplied in a query are expanded by the parser before hitting the full-text index. This expansion process is performed by a language-specific component called a stemmer and the exact expansions that occur will depend on the language-specific rules that are in place. However, for example, expansion can be used to accommodate the following types of searches:

  • Searches for plural forms of a word – a search on apple would also return apples
  • Thesaurus searches for synonymous forms of word – a search on IE might return hits to Internet Explorer and IE (thesaurus based expansion); a search on Bombay might also return hits to Mumbai (thesaurus based replacement)
  • Searches that will return all different linguistic forms of a word (called generations) – a search on bank would return hits to banking, banked, banks, banks’ and bank’s, etc. (all declensions and/or conjugations of the search term bank)
  • Accent insensitive searches – a search on café would return hits to cafe and café

Before we investigate stemming in more detail, it’s worth discussing the sort of expansions that will be performed depending on whether you issue a strict query (using the CONTAINS predicate) or a fuzzy query (using the FREETEXT predicate).

Strict querying using CONTAINS

By default, use of the CONTAINS predicate will entail minimal language specific query time expansion. For example, consider a search on the term book.

The asterisk (*) is used to search all full-text indexed columns. In SQL 2000 you can search all full-text indexed columns or one named full-text index column:

In SQL 2005 you can search all columns, a named column, or a subset of all full-text indexed columns:

There may be text in the table where the term “book” is used both as a noun (John threw the book at Sam) and a verb (John was booked for assault and battery when he threw the book at Sam). However, a query using the basic CONTAINS predicate will only match with book (and alternative word forms for book – although in English there are none).

Fortunately, you can choose to have your CONTAINS search expanded to capture different generations of a word by using the FORMSOF term. This term accepts two arguments – INFLECTIONAL or THESAURUS. The INFLECTIONAL argument will expand the search phrase to search on all conjugations and declensions or each word in the search phrase, and the THESAURUS argument will enable a thesaurus expansion on the search phrase. Here are two examples of what this would look like:

And

Furthermore, with some languages, additional breaking will occur by default at query time. For example, consider a search on häuser (German for houses) on a full-text indexed table containing the term häuser in one row and haeuser, the alternative word form, in another row. If the default full-text language on the Server was US English then the query would not be expanded and you would only get a hit on the row containing häuser:

However, if your server had a default full-text language of German, you would get hits on both rows because your search phrase would be automatically expanded to the following:

Note that the search phrases are now wrapped in double quotes. When using CONTAINS, you must use double quotes to wrap a search phrase containing more than one word.

Fuzzy querying using FREETEXT

The term “fuzzy” is something of a misnomer as it implies use of a technology such as Levenstein Edit Distance which is tolerant of minor spelling variations or mistakes, which is not the case. In fact, use of FREETEXT means that, by default, the search is expanded to encompass all generations of the searched word, for the default full-text language setting of your server. So, to continue the previous “book” example, a FREETEXT search on book:

would be expanded to the equivalent of:

The search would also return any relevant thesaurus expansions.

Note that when using FREETEXT in SQL 2000 you have to wrap your search phase in double quotes if a search phrase contains more than one word or token (as when using CONTAINS). However, in SQL 2005 you no longer need to do this when using the FREETEXT predicate. In fact if you wrap your search phrase in double quotes stemming and thesaurus expansion and replacements are disabled and your search phrase is treated as a strict match – in other words the functional equivalent of a CONTAINS predicate (although the ranking algorithm is different).

 The LANGUAGE argument

In SQL 2005, you can override the default full-text language settings for your server by using the LANGUAGE argument with the CONTAINS and FREETEXT predicates and have your searches conducted in different languages. For example, on a SQL 2005 Server with a default full-text language setting of US_ENGLISH, you could search on German phrases by using the LANGUAGE argument, and have your German search terms stemmed in a FREETEXT predicate according to German language rules.

Script 3 provides a full example of this (see the “Scripts” link at the top right of this page). In this script we are loading HTML documents into the varbinary data type column of a table called German. In each case we specify that word boundaries are identified using the US English Word breaker, as denoted by the LANGUAGE 1033 clause:

However, when we load some of these documents into the database, we include a German language MetaTag:

And when we load others we use the US English language MetaTag:

So, a query on wanderlust returns two documents – the English and German language documents contain the word wanderlust.

However, a search on lust returns three documents – two exact matches for lust and one further match with wanderlust in the German language document. The reason for this is that the German word breaker was launched by the HTML language tag and this overrides the word breaker setting in our full-text index creation statement. As a result, wanderlust is broken by the German word breaker into wanderlust, wandern, and lust. Hence a search on lust returns the German HTML document with wanderlust in it.

If we override the default full-text language settings for our server (US English) and set the language of our search to German:

Then we get six rows returned, since now our search phrase is automatically expanded into wanderlust, wandern and lust, and so we get matches for all three of these in both German and English documents.

Stemming

By stemming, Microsoft means:

  • Generating declensions of a word – possessive and gender (although gender is not supported in SQL FTS)
  • Generating conjugations of a word

Stemming reduces a search term to its linguistic root (for example careers would be reduced to career), and then expands from that root to encompass all linguistic forms of that root (for example career, career’s, careers’, careers’, careered, and careering).

Most stemmers implement a form of the Porter Stemming Algorithm with dictionaries to handle irregularly stemmed words such as goose/geese, mouse/mice, fit/fat, and so on. Some of the more sophisticated stemmers implement an inference algorithm which when encountering a freshly minted word (like blog for instance) can conjugate, or generate declensions of, this new word based on its similarity to other older words.

Note: It is interesting to search on terms such as “careers” on various web sites to see what matches are returned. It is frequently incorrectly stemmed as careful.

Declensions

Generating declensions is best illustrated through a few examples:

  • A FREETEXT or CONTAINS search on the noun apple would return hits to apple and apples
  • A FREETEXT or CONTAINS inflectional search on the advjective pretty would match to pretty, prettier and prettiest.

Please see script 4 for examples of declination in action.

Conjugation

Again, this is best illustrated by example. If you search on the verb book the stemmer will generate all conjugations of this term and row matches will be returned for book, booked, booking, and books. Script 4 has an example of this.

Thesaurus-based searches

The thesaurus option allows expansion of your search term to include preconfigured terms, and replacement of your search term with another search term. To configure your thesaurus options, edit the files found in C:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\FTData. The thesaurus file for your language will follow the naming convention tsXXX.xml where XXX represents your language code, i.e. ENU for US-English, ENG for UK-English, and so on.

If you edit tsXXX.xml in a text editor you will find that there are two sections, or nodes, to the thesaurus file: an expansion node, and a replacement node. The expansion node is used to extend your search argument from one term to another – known as thesaurus expansion. For example in the thesaurus file you will find the following expansion:

This will convert any searches on IE to search on IE, IE5, and Internet Explorer.

Note that you will need to remove the comment lines from your thesaurus file in order to activate this.

The replacement node is used to replace one search argument with another – known as thesaurus replacement. For example, you may want a search on sex to be replaced by a search on gender. By configuring your thesaurus file as follows, any search on sex will only get hits to rows containing the term gender and will miss rows that do not contain gender, but contain the word sex:

The pat element (sex) indicates the pattern you want substituted by the sub element (gender).

A FREETEXT query will automatically use the thesaurus file for the language type, and a CONTAINS query will do so when you use FORMSOF with the thesaurus option:

This would return matches to rows containing IE, IE5, and Internet Explorer.

Diacritics and accents

A diacritical mark, or diacritic (sometimes called an accent mark), is a mark added to a letter to alter a word’s pronunciation or to distinguish it from other similar words. Previous versions of SQL Server were unable to handle accents so a search on café would not match with rows containing cafe – even when using the French word breaker or stemmer.

SQL Server 2005 has the option of creating accent insensitive catalogs. This means that a search on café will match with café, cafe, cafè, cafê, and cafë. In other words it does a very unintelligent match, but one that will solve many of the problems that users have when handling diacritics. Script 5 illustrates this functionality.

False friends or false conjugates

When you are searching content that is written in a variety of languages, you must be careful to account for a linguistic phenomenon called false friends or false conjugates.

A false friend is a word that has the same spelling in different languages, but different meanings. An example is the word gift which means a present in English, but in German means poison. You need to be careful to avoid a search on the English word returning rows with German content.False friends are not to be confused with wanderworts – words which occur with the same spelling in different languages and also have the same meaning.

The best approach to problems like this is to segregate your content into language specific columns.

UK vs. US English at query time

There are very slight variations between US and UK English stemmers, mostly arising from spelling variations between American and International English. For example in America we spell color and center, analog, apologize, flavor; whereas in Britain these words are spelt colour, centre, analogue, apologise, and flavour. There is no conformity in the rest of the English speaking world, however most of the English speaking countries use the International version of English; although it may vary from one region to another – for example in Western Canada – the spelling center predominates, in Eastern Canada centre predominates.

If you are querying in International English color would be stemmed as color , whereas colour would be stemmed as colour, colour’s, colours, colours’, coloured and colouring. Likewise if you searched on colour using American English you would not get hits to rows containing colour, colour’s, colours, colours’, coloured and colouring.

The key to avoiding idiomatic spelling variations is to scrupulously clean your content or use the thesaurus option to expand your search to alternate spellings.

Summary

This two-part article covered language features of SQL full text search. Key to keep in mind is that the language features occur at index time and query time. Different content types may control which language resources are used and, depending on the language resources used, the content may be broken at intervals other than white space and punctuation. A word or token may be indexed as more than one word, so when you query on a word a match may be made to one of the alternate word forms of that word. At query time the search is expanded according to language rules to provide a more natural way of searching that incorporates language features.

These language features make SQL full-text search a very valuable tool for the developer and DBA.

Downloads

Tags: , , , , ,

  • 94765 views

  • Rate
    [Total: 93    Average: 4.2/5]
  • Anonymous

    FTS
    Is great to have all the anglo cultures in sight, there are Spanish correlation between terms too.
    Languajes are a beauty of evolution and they can become tricky to undersatnd or evaluate.

  • Anonymous

    Your article
    This was a helpful article.

    One thing that would have made it a home run for my project: examples of multi-word searches. So someone might search for

    beans and rice
    “dog crate” or chain
    “dog crate” or “dog chain” not cat

    (I know these are crazy examples)

    And how the engine interprets those.

    Thanks
    Scott

  • Anonymous

    Need contact details
    Where can I find the contact details of Hilary Cotter.

    Regards,
    Naveed Ahmed

  • Anonymous

    language settings???
    SELECT CAST(’31/12/1999 23:59:59′);

    produces a SQL conversion error on one SQL machine, while not on another.

    Both users on both machines, and the user accounts running the sql service have the same ‘locale’ settings, and the language settings for both DB’s are English (US) 1033, as is the language on the connections.

    so… the question is just what DOES determine the date format in SQL SERVER when casting from a varchar because it isn’t anything obvious?
    (i know i can avoid the issue altogether by storing the date as a date or unambiguous format – bt lets assume for the sake of argument i can’t).

    Cheers
    Dr. Evil.

  • Anonymous

    How I can to do it?
    Hy,

    I have a condition that i don’t know what I can to do in SQL Server Full Text Search Language.

    About this example:

    This is SQL:

    select
    *
    from
    sum_net_dominio
    where
    ‘http://www.google.com.br/search?q=FULL+text+index+SQL+Server+2000&hl=pt-BR&safe=off&start=30&sa=N’
    like ‘%’+nome_dominio+’%’

    and this the result

    google.com.br/

    If somebody can help, I will be welcome.

    Thanks
    Pedro Usinskas.

  • Anonymous

    Hy, I have a problem

    I have a condition that i don’t know how I can to do in SQL Server Full Text Search Language.

    About this example:

    This is SQL:

    select
    *
    from
    sum_net_dominio
    where
    ‘http://www.google.com.br/search?q=FULL+text+index+SQL+Server+2000&hl=pt-BR&safe=off&start=30&sa=N’
    like ‘%’+nome_dominio+’%’

    and this the result

    google.com.br/

    If somebody can help, I will be welcome.

    Thanks
    Pedro Usinskas.

  • Anonymous

    How I can to do it?
    Hy,

    I have a condition that i don’t know what I can to do in SQL Server Full Text Search Language.

    About this example:

    This is SQL:

    select
    *
    from
    sum_net_dominio
    where
    ‘http://www.google.com.br/search?q=FULL+text+index+SQL+Server+2000&hl=pt-BR&safe=off&start=30&sa=N’
    like ‘%’+nome_dominio+’%’

    and this the result

    google.com.br/

    If somebody can help, I will be welcome.

    Thanks
    Pedro Usinskas.

  • Charneet

    REGARDING FULL TEXT SEARCH
    i am trying to use full text to search on .doc file stored in database but when i am tryin to output the “Comments” field result is coming in Hexadecimal or encrypted values. can u help?

    create fulltext catalog cat4;

    CREATE UNIQUE INDEX Ui_doc on Commentsdata(id)

    CREATE FULLTEXT INDEX ON Commentsdata
    (
    Comments
    TYPE COLUMN FileExtension
    Language 0X0

    )
    KEY INDEX Ui_doc ON cat4
    WITH CHANGE_TRACKING AUTO

    Select Comments from Commentsdata where contains(Comments,'”class”‘);

  • Anonymous

    REGARDING FULL TEXT SEARCH WITH WILD CARD

    SELECT A
    FROM B
    WHERE CONTAINS (A_B,'”*abc0111*”‘)

    Fulltext index is configured on column A_B

    It gives me results when I search using following text:
    ‘”*abc0111*”‘ OR ””*abc*”‘ OR ””*a*”‘

    But if i just use ‘”*0111*”‘ it produces no result. Is it a limitation of Full Text search in SQL server 2005

  • Anonymous

    Accent InSensitive Search
    Hello,
    I need to return return results wheresearching for cafe would return cafe and café.

    I don’t know about Index. If index are required, can you point me to a good resource.

    Thanks
    Peeyush

  • sam

    diacritics
    Hello anonymous,

    He states the cafe search would work in 2005, not 2000.

  • Anonymous

    More functionalities
    Hello,

    Thank you for the detailed article.

    I believe FTS engine has info to give more functionalities than the one we are getting now. Just return the Row and Rank is not enough. It seems the engine knows where exactly the indexed text/string or whatever was indexed located in the document. Most probability, it might not need a lot of effort to know the count of this indexed text in the document. These two functionalities, I found them very important, most especially if the searched document is type XML. E.g. XML element/s can be included into the results in order to show where exactly the text was found in the XML document.

    Thanks

  • rsadams

    Location of search terms in text
    Hi,

    Thanks for the article. But does anyone know if it’s possible to get the location of the search terms inside the text. I’d like to do the standard text highlight of searched terms when I display my results..

    Thanks for any help,

    Richard Adams

  • xd125

    about LANGUAGE argument
    i try to execute a sql like this:
    select * from tablename freetext(tablename .*,’中文’, language 2052), i use this script to search the chinese characters,but it seems not wroking(the default full-text language settings is english, ),but if i set the default full-text language to simplified chinse, the script works,what should i do?

  • ebegoli

    Ranking matches
    Hilary:

    Do you know of any SQL full text function that would allow for ranking or counting of the matches in the text.

    Example:

    I have a text:
    “Lorem ipsum dolor sit amet, consectetur adipisicing magna elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.”

    and I want to check how many times ‘magna’ occurs in the text.

    Do you know of any SQL server function that would be useful for this.

    Thank you,
    Edmon

  • prakash.gupta83@hotmail.com

    Retrieve documents from their containts
    I am uploading word,pdf files and storing these file in a drive(C) within a folder and store path of this folder into database.
    I want to search any word within document(word,pdf files) and it should display all the files(word,pdf files) containing that word.

    any help

    Thanks
    Prakash Gupta
    prakash.gupta83@hotmail.com

  • Zehra Nasif

    Thank you
    This post was a tremendous help during our SQL 2000 upgrade to SQL 2008. We were not using CONTAINS() and running LIKE ‘%xx%’ query and it was timing out. As soon as we used FT index and CONTAINS(‘”*xx*”‘), it was returning records in a few seconds.

    I loved the way how you explained cleanly with examples. Thanks again!

  • bhandari220

    Wrong behavior : sometimes not at the top of the list
    1. Why are the most relevant results sometimes not at the top of the list? For example, a title search for “The Road” should have the exact match as the top hit and then titles that contain both words such as “Ministries of Mercy: The Call of Jericho Road (Unabridged)”. The exact match “The Road” by Cormac McCarthy isn’t anywhere to be found in the first 250 search results (I stopped looking at that point).Is there a setting in MS SQL to change this behavior?

  • asadhkhan

    Thesaurus
    I have used Oracle text all my life and now running an application where I would like to use FullText.

    I have a good understanding of the query and operators.

    Where i am having difficulty is the use of FreeText.

    FreeText is used to run against thesaurus.

    But reading into the Thesaurus, it looks like there is no default thesaurus built in like Oracle Text.

    How do I load a get a base line, instead of modifying the text file to load.

    Example:, if search for ‘laptop’, I would think I would get computer as a possible thesaurus equivalent.

    But it is not happening.

    If you can provide some insight to this would be great.

    — Asad