William Brewer

Quantifying Text differences in TSQL

20 September 2007
Last revised: Feb 3rd 2014

In TSQL there is a limit to the way you can compare text strings. They're either equal or not. Sooner or later, usually when cleaning data, something more subtle is required!

When you compare two pieces of text, or strings, in an expression in SQL Server, you will just get a value of 'true' returned if the two are the same, 'false' if they are not, or 'null' if either of the pieces of text is null.

A simple matter of an extra space character is often enough to tip the balance. This is quite unlike real life: when we look at two pieces of text, we judge them to be the same, almost the same, quite similar, or nothing like each other, with many shades in-between. Surely we need to quantify differences?
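If you want to see just how blunt the comparison is, try this (note that the '=' operator pads the shorter string with spaces, so a trailing space is forgiven even though an embedded one is not):

SELECT CASE WHEN 'Simple-Talk' = 'Simple-Talk '
            THEN 'same' ELSE 'different' END  -- 'same': trailing spaces are padded away
SELECT CASE WHEN 'Simple-Talk' = 'Simple Talk'
            THEN 'same' ELSE 'different' END  -- 'different': one embedded space tips the balance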

“Text difference algorithms are as
  old as the hills - but not in TSQL”

In IT applications, one often needs a measure of the difference between two pieces of text, rather than a simple 'yes they're the same / no they're not'. A typical problem is finding duplicates in database entries, where the understanding of 'duplicate' allows for minor differences. Finding a reliable algorithm for quantifying similarity in text is quite hard, especially one that is set-based: TSQL has no native support for regular expressions or the other tools that make this sort of work easier.

I find this problem quite intriguing. There is a general consensus that the Levenshtein string distance algorithm is the best for giving you the difference on a character-by-character basis, and I provide some code for doing this below. The algorithm was developed by Vladimir Levenshtein in 1965. It tells you the number of edits required to turn one string into another by breaking down string transformation into three basic operations: adding, deleting, and replacing a character. Each operation is assigned a cost of 1; leaving a character unchanged has a cost of 0. There are other algorithms, but I'm not at all convinced by 'soundex'-style algorithms: they don't seem to help much.
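To see how the costs add up, here is the textbook example worked through by hand (this is just the algorithm on paper, not anything specific to the code below):

-- Turning 'kitten' into 'sitting' takes three unit-cost edits:
--   kitten -> sitten   (replace 'k' with 's')
--   sitten -> sittin   (replace 'e' with 'i')
--   sittin -> sitting  (add 'g')
-- so the Levenshtein distance between 'kitten' and 'sitting' is 3.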

I decided that what I wanted was a difference based on words rather than characters. I find the solution, the difference counter that I give below, pretty handy, though it sometimes gives a score for differences that I don't like. Try it yourself with a variety of strings and you'll see it makes a pretty good attempt. It is, of course, slow, because of the tedium of breaking text down into words and white-space. In normal use, this parsing is only done once, when importing text into the database, at which point the tokens are placed in an 'inversion table'. One can then use this data to test the similarity of the original text, which is much faster. Just so as to include those stuck on SQL Server 2000, I've made the function use an NTEXT parameter rather than VARCHAR(MAX), though the latter would have made for a simpler routine.
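As a rough illustration of what such an 'inversion table' might look like (the table and column names here are only a sketch, not part of the routines below):

CREATE TABLE dbo.DocumentToken
  (
    Document_ID INT NOT NULL,    -- identifies the source text
    SeqNo INT NOT NULL,          -- the position of the token within it
    Item VARCHAR(255) NOT NULL,  -- the token itself
    TokenType INT NOT NULL,      -- 1 = word, 2 = white-space/punctuation
    PRIMARY KEY (Document_ID, SeqNo)
  )
-- populated once at import time from the tokenising function below, so that
-- similarity checks can join on Item without re-parsing the text every time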

“Cleaning data is not
  an exact science”

In reality, every time one comes across a requirement to check for differences, there are subtle requirements that are never quite the same. Cleaning data is not an exact science. I generally prefer to ignore 'white-space', including new-lines and punctuation, when checking for differences. My approach is to break text down into words and 'not-words', or white-space; I refer to these as different types of token. The table function I give below allows you to define a word in terms of the characters that can make one up, since this differs between languages. The routine is generally, though not always, much faster if one uses a 'number table', but I decided that creating one for this article would be a distraction for the reader.
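To give a flavour of the output before we get to the function itself, a call like this (using the default word definitions) returns one row per token, words and white-space interleaved:

SELECT SeqNo, '[' + Item + ']' AS Item, TokenType
FROM   dbo.uftWordTokens('on-site  visits, twice-yearly', DEFAULT, DEFAULT)
-- words (TokenType 1) keep their embedded hyphens, while the runs of
-- spaces and the comma come back as 'not-word' tokens (TokenType 2)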

With the use of the 'parsing' table-function, I then do a simple full outer join between the two collections of words, and count the number of times that the minimum 'best-fit' skew between matching words changes across the sequence. This is, of course, an approximation: I should really be using sliders and other devices that use iteration. At some point one has to hand over to the computer scientists; I tend to stop at the point where the routine does the job I want.
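If you would like to see the principle in isolation before reading the full function, this self-contained sketch (it uses row constructors, so it needs SQL Server 2008 or later) applies the same outer-join-and-skew idea to two hand-tokenised word lists:

DECLARE @a TABLE (seq INT, word VARCHAR(20))
DECLARE @b TABLE (seq INT, word VARCHAR(20))
-- 'I am a piece of text' versus 'I am not a piece of text', with the words
-- at the odd sequence numbers that the tokenising function would give them
INSERT INTO @a VALUES (1, 'I'), (3, 'am'), (5, 'a'), (7, 'piece'), (9, 'of'), (11, 'text')
INSERT INTO @b VALUES (1, 'I'), (3, 'am'), (5, 'not'), (7, 'a'), (9, 'piece'), (11, 'of'), (13, 'text')
SELECT COALESCE(a.seq, b.seq) AS seq,
       COALESCE(MIN(ABS(COALESCE(b.seq, 1000) - COALESCE(a.seq, 1000))), -1) AS skew
FROM   @a a FULL OUTER JOIN @b b ON a.word = b.word
GROUP BY COALESCE(a.seq, b.seq)
ORDER BY seq
-- the skew is 0 for 'I' and 'am', then settles at 2 once the extra word
-- 'not' has pushed everything along: one change of skew, one difference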

As a flourish, I've provided, at the end, a variation of the function that provides a single-row table giving the first point at which the two samples of text diverge. It is really just a by-product of the first routine but I slipped it in to give a suggestion of the many ways the routine can be adapted for particular purposes. It is surprisingly handy for applications such as summary reports of the latest changes made to stored procedures!

First, the classic Levenshtein string distance in TSQL (using strings instead of arrays):

CREATE FUNCTION Levenshtein_Distance(@Source nvarchar(4000), @Target nvarchar(4000))
RETURNS int
AS
/*
The Levenshtein string distance algorithm was developed by Vladimir Levenshtein in 1965. It tells you the number of edits required to turn one string into another by breaking down string transformation into three basic operations: adding, deleting, and replacing a character. Each operation is assigned a cost of 1. Leaving a character unchanged has a cost of 0.

This is a translation of 'Fast, memory efficient Levenshtein algorithm' by Sten Hjelmqvist, originally converted to SQL by Arnold Fribble
http://www.codeproject.com/Articles/13525/Fast-memory-efficient-Levenshtein-algorithm
*/
BEGIN
  DECLARE @MaxDistance int
  DECLARE @SourceStringLength int, @TargetStringLength int, @ii int, @jj int,
    @SourceCharacter nchar, @Cost int, @Cost1 int,
    -- two work vectors of integer distances: each row of the distance
    -- matrix is packed into a string, one character's code point per cell
    @Current_Row nvarchar(200), @Previous_Row nvarchar(200), @Min_Cost int
  SELECT @SourceStringLength = LEN(@Source),
         @TargetStringLength = LEN(@Target),
         @Previous_Row = '',
         @jj = 1, @ii = 1,
         @Cost = 0, @MaxDistance = 200

  -- do the degenerate cases
  IF @Source = @Target RETURN (@Cost);
  IF @SourceStringLength = 0 RETURN @TargetStringLength;
  IF @TargetStringLength = 0 RETURN @SourceStringLength;

  -- initialize the previous row of distances:
  -- this row is the edit distance for an empty source string, which
  -- is just the number of characters to delete from the target
  WHILE @jj <= @TargetStringLength
    SELECT @Previous_Row = @Previous_Row + NCHAR(@jj), @jj = @jj + 1

  WHILE @ii <= @SourceStringLength
    BEGIN
      SELECT @SourceCharacter = SUBSTRING(@Source, @ii, 1),
             @Cost1 = @ii,
             @Cost = @ii,
             @Current_Row = '',
             @jj = 1,
             @Min_Cost = 4000
      WHILE @jj <= @TargetStringLength
        BEGIN  -- use the formula to fill in the rest of the row
          SET @Cost = @Cost + 1
          --v1[j + 1] = Minimum(v1[j] + 1, v0[j + 1] + 1, v0[j] + cost);
          SET @Cost1 = @Cost1 - CASE WHEN @SourceCharacter = SUBSTRING(@Target, @jj, 1)
                                     THEN 1 ELSE 0 END
          IF @Cost > @Cost1 SET @Cost = @Cost1
          SET @Cost1 = UNICODE(SUBSTRING(@Previous_Row, @jj, 1)) + 1
          IF @Cost > @Cost1 SET @Cost = @Cost1
          IF @Cost < @Min_Cost SET @Min_Cost = @Cost
          SELECT @Current_Row = @Current_Row + NCHAR(@Cost), @jj = @jj + 1
        END
      -- give up early if even the best cell in this row is over the limit
      IF @Min_Cost > @MaxDistance RETURN -1
      -- copy the current row to the previous row for the next iteration
      SELECT @Previous_Row = @Current_Row, @ii = @ii + 1
    END
  RETURN @Cost
END
GO
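A quick sanity check before moving on (these calls assume the function was created in the default dbo schema; the expected values follow from the algorithm):

SELECT dbo.Levenshtein_Distance(N'kitten', N'sitting')   -- 3
SELECT dbo.Levenshtein_Distance(N'Sunday', N'Saturday')  -- 3
SELECT dbo.Levenshtein_Distance(N'same', N'same')        -- 0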

...and now the equivalent system for detecting word differences

IF OBJECT_ID(N'dbo.uftWordTokens') IS NOT NULL
  DROP FUNCTION dbo.uftWordTokens
GO

/*------------------------------------------------------------*/
CREATE FUNCTION [dbo].[uftWordTokens]
  (
    @String NTEXT,
    @WordStartCharacters VARCHAR(255) = 'a-z',
    @WordCharacters VARCHAR(255) = '-a-z'''
  )
RETURNS @Results TABLE
  (
    SeqNo INT IDENTITY(1, 1),
    Item VARCHAR(255),
    TokenType INT
  )
AS /*
This table function produces a table which divides up the words and
the spaces between the words in some text and produces a table of the
two types of token in the sequence in which they were found
*/
  BEGIN
    DECLARE @Pos INT,    --index of current search
      @WhereWeAre INT,   --index into the string so far
      @ii INT,           --the number of words found so far
      @next INT,         --where the next search starts
      @size INT          --the total size of the text, in characters

    SELECT @ii = 0, @WhereWeAre = 1,
           @size = DATALENGTH(@String) / 2 -- DATALENGTH counts bytes, and NTEXT is two bytes a character

    WHILE @size >= @WhereWeAre
      BEGIN
        -- find the start of the next word within the current 4000-character window
        SELECT @Pos = PATINDEX('%[' + @WordStartCharacters + ']%',
                               SUBSTRING(@String, @WhereWeAre, 4000))
        IF @Pos > 0
          BEGIN
            -- anything before the word is a 'not-word' (white-space) token
            IF @Pos > 1
              INSERT INTO @Results
                      ( Item, TokenType )
                      SELECT SUBSTRING(@String, @WhereWeAre, @Pos - 1), 2
            SELECT @next = @WhereWeAre + @Pos, @ii = @ii + 1
            -- find where the word ends: the first character that isn't a
            -- legitimate word character (the trailing space guarantees a hit)
            SELECT @Pos = PATINDEX('%[^' + @WordCharacters + ']%',
                                   SUBSTRING(@String, @next, 4000) + ' ')
            INSERT INTO @Results
                    ( Item, TokenType )
                    SELECT SUBSTRING(@String, @next - 1, @Pos), 1
            SELECT @WhereWeAre = @next + @Pos - 1
          END
        ELSE
          BEGIN
            -- no more words in this window: save any remaining white-space.
            -- Spaces are swapped for '!' because LEN() ignores trailing spaces
            IF LEN(REPLACE(
                    SUBSTRING(@String, @WhereWeAre, 4000), ' ', '!'
            )) > 0
              INSERT INTO @Results
                      ( Item, TokenType )
                      SELECT SUBSTRING(@String, @WhereWeAre, 4000), 2
            SELECT @WhereWeAre = @WhereWeAre + 4000
          END
      END
    RETURN
  END

/* Tests:
SELECT  '[' + item + ']', tokentype
FROM    dbo.uftWordTokens('This     has
 been relentlessly 
,^----tested', DEFAULT, DEFAULT)       
SELECT  '[' + item + ']', tokentype
FROM    dbo.uftWordTokens('This has been relentlessly tested        !',
                          DEFAULT, DEFAULT)        
SELECT  item, tokentype
FROM    dbo.uftWordTokens('This has been', DEFAULT, DEFAULT)   
SELECT  '[' + item + ']', tokentype
FROM    dbo.uftWordTokens(' <!-- 23 343.43  <div>Hello there  .... -->',
                          DEFAULT, DEFAULT)

*/
GO

IF OBJECT_ID(N'dbo.ufnDifferencesInText') IS NOT NULL
  DROP FUNCTION dbo.ufnDifferencesInText
GO
/*------------------------------------------------------------*/
CREATE FUNCTION dbo.ufnDifferencesInText
  (
    @Sample NTEXT,
    @Comparison NTEXT
  )
RETURNS INT
AS BEGIN
/*
This function returns the number of differences it found between two pieces
of text
*/
    DECLARE @results TABLE
      (
        token_ID INT IDENTITY(1, 1),
        sequenceNumber INT,
        Sample_ID INT,
        Item VARCHAR(255),
        TokenType INT
      )
    -- tokenise both pieces of text into the same table,
    -- distinguished by Sample_ID
    INSERT INTO @results
            ( SequenceNumber, Sample_ID, Item, TokenType )
            SELECT SeqNo, 1, Item, TokenType
            FROM   dbo.uftWordTokens(@Sample, DEFAULT, DEFAULT)

    INSERT INTO @results
            ( SequenceNumber, Sample_ID, Item, TokenType )
            SELECT SeqNo, 2, Item, TokenType
            FROM   dbo.uftWordTokens(@Comparison, DEFAULT, DEFAULT)

    DECLARE @closestMatch TABLE
      (
        sequenceNumber INT,
        skew INT
      )
    -- for every word, record the 'skew': how far the nearest matching word
    -- in the other sample has been displaced
    INSERT INTO @closestMatch
            ( sequenceNumber, skew )
            SELECT COALESCE(a.sequenceNumber, b.sequenceNumber),
                   COALESCE(MIN(ABS(COALESCE(b.sequenceNumber, 1000)
                                    - COALESCE(a.sequenceNumber, 1000))), -1)
            FROM   (SELECT *
                    FROM   @results
                    WHERE  Sample_ID = 1 AND TokenType = 1) a
                   FULL OUTER JOIN (SELECT *
                                    FROM   @results
                                    WHERE  Sample_ID = 2
                                       AND TokenType = 1) b
                   ON a.Item = b.Item
            GROUP BY COALESCE(a.sequenceNumber, b.sequenceNumber)
            ORDER BY COALESCE(a.sequenceNumber, b.sequenceNumber)

    -- every change of skew between one word and the next counts as one
    -- difference (word tokens sit two sequence numbers apart)
    RETURN (SELECT SUM(CASE WHEN a.skew = b.skew THEN 0
                            ELSE 1
                       END)
            FROM   @closestMatch a INNER JOIN @closestMatch b
                   ON b.sequenceNumber = a.sequenceNumber + 2
           )
END
GO

SELECT  dbo.ufnDifferencesInText('I am a piece of text',
                                 'I am a piece of text')
--0
SELECT  dbo.ufnDifferencesInText('I am a piece of text',
                                 'I am not a piece of text')
--1
SELECT  dbo.ufnDifferencesInText('I am a piece of text',
                                 'I am piece a a a of text')
--2
SELECT  dbo.ufnDifferencesInText('I  piece of text',
                                 'I am a piece of text')
--1
SELECT  dbo.ufnDifferencesInText('I  am a pot of jam',
                                 'I am a piece of text')
--3
SELECT  dbo.ufnDifferencesInText('I  am a pot of jam',
                                 'I  am a pot of jam beloved by humans')
--3
SELECT  dbo.ufnDifferencesInText('I am a piece of text',
                                 'text of piece a am I')
--4
SELECT  dbo.ufnDifferencesInText('I am a piece of text',
                                 'this is completely different')
--5
SELECT  dbo.ufnDifferencesInText('I am a piece of text', '')
--5
SELECT  dbo.ufnDifferencesInText('', 'I am a piece of text')
--5

SELECT  dbo.ufnDifferencesInText('Call me Ishmael. Some years ago -- never mind how long precisely -- having little or no money in my purse, and nothing particular to interest me on shore, I thought I would sail about a little and see the watery part of the world. It is a way I have of driving off the spleen, and regulating the circulation. Whenever I find myself growing grim about the mouth; whenever it is a damp, drizzly November in my soul; whenever I find myself involuntarily pausing before coffin warehouses, and bringing up the rear of every funeral I meet; and especially whenever my hypos get such an upper hand of me, that it requires a strong moral principle to prevent me from deliberately stepping into the street, and methodically knocking people''s hats off -- then, I account it high time to get to sea as soon as I can. This is my substitute for pistol and ball. With a philosophical flourish Cato throws himself upon his sword; I quietly take to the ship. There is nothing surprising in this. If they but knew it, almost all men in their degree, some time or other, cherish very nearly the same feelings towards the ocean with me.',
'Call me Ishmael. Some years ago -- never mind how long precisely -- having little or no money in my purse, and nothing particular to interest me on shore, I thought I would sail about a little and see the watery part of the world. It is a way I have of driving off the spleen, and regulating the circulation. Whenever I find myself growing grim about the mouth; whenever it is a damp, drizzly November in my soul; whenever I find myself involuntarily pausing before coffin warehouses, and bringing up the rear of every funeral I meet; and especially whenever my hypos get such an upper hand of me, that it requires a strong moral principle to prevent me from deliberately stepping into the street, and methodically knocking people''s hats off -- then, I account it high time to get to sea as soon as I can. This is my substitute for pistol and ball. With a philosophical flourish Cato throws himself upon his sword; I quietly take to the ship. There is nothing surprising in this. If they but knew it, almost all men in their degree, some time or other, cherish very nearly the same feelings towards the ocean with me.')
--0


-- =============================================
-- Description: A routine that returns a single-row which 
-- gives the context of the first difference between two 
-- strings
-- =============================================
IF OBJECT_ID(N'dbo.uftShowFirstDifference') IS NOT NULL
  DROP FUNCTION dbo.uftShowFirstDifference
GO
CREATE FUNCTION uftShowFirstDifference
  (
    -- Add the parameters for the function here
    @sample NTEXT,
    @comparison NTEXT
  )
RETURNS @result TABLE
  (
    -- Add the column definitions for the TABLE variable here
    first VARCHAR(2000),
    second VARCHAR(2000),
    [where] INT
  )
AS BEGIN
    DECLARE @results TABLE
      (
        token_ID INT IDENTITY(1, 1),
        sequenceNumber INT,
        Sample_ID INT,
        Item VARCHAR(255),
        TokenType INT
      )
    INSERT INTO @results
            ( SequenceNumber, Sample_ID, Item, TokenType )
            SELECT SeqNo, 1, Item, TokenType
            FROM   dbo.uftWordTokens(@sample, DEFAULT, DEFAULT)

    INSERT INTO @results
            ( SequenceNumber, Sample_ID, Item, TokenType )
            SELECT SeqNo, 2, Item, TokenType
            FROM   dbo.uftWordTokens(@comparison, DEFAULT, DEFAULT)

    DECLARE @closestMatch TABLE
      (
        sequenceNumber INT,
        skew INT
      )
    INSERT INTO @closestMatch
            ( sequenceNumber, skew )
            SELECT COALESCE(a.sequenceNumber, b.sequenceNumber),
                   COALESCE(MIN(ABS(COALESCE(b.sequenceNumber, 1000)
                                    - COALESCE(a.sequenceNumber, 1000))), -1)
            FROM   (SELECT *
                    FROM   @results
                    WHERE  Sample_ID = 1 AND TokenType = 1) a
                   FULL OUTER JOIN (SELECT *
                                    FROM   @results
                                    WHERE  Sample_ID = 2
                                       AND TokenType = 1) b
                   ON a.Item = b.Item
            GROUP BY COALESCE(a.sequenceNumber, b.sequenceNumber)
            ORDER BY COALESCE(a.sequenceNumber, b.sequenceNumber)

    DECLARE @first VARCHAR(2000)
    DECLARE @firstDifference INT
    DECLARE @second VARCHAR(2000)
    -- the first word whose skew is non-zero is where the two texts diverge
    SELECT @firstDifference = MIN(sequenceNumber)
    FROM   @closestMatch
    WHERE  skew <> 0
    SELECT @first = '', @second = ''
    -- collect up to ten tokens of context from each sample
    SELECT TOP 10
           @first = COALESCE(@first, '') + Item
    FROM   @results
    WHERE  Sample_ID = 1 AND sequenceNumber >= @firstDifference
    ORDER BY sequenceNumber
    SELECT TOP 10
           @second = COALESCE(@second, '') + Item
    FROM   @results
    WHERE  Sample_ID = 2 AND sequenceNumber >= @firstDifference
    ORDER BY sequenceNumber
    INSERT INTO @result
            ( first, second, [where] )
            SELECT [first] = @first, [second] = @second,
                   [where] = @firstDifference

    RETURN
END
GO

SELECT  *
FROM    dbo.uftShowFirstDifference('I am a piece of text',
                                   'I am a piece of text')
-- NULL
SELECT  *
FROM    dbo.uftShowFirstDifference('I am a piece of text',
                                   'I am not a piece of text')
--a piece of text     not a piece of text         5
SELECT  *
FROM    dbo.uftShowFirstDifference('I am a piece of text',
                                   'I am piece a a a of text')
--a piece of text     piece a a a of              5
SELECT  *
FROM    dbo.uftShowFirstDifference('I  piece of text',
                                   'I am a piece of text')
--piece of text       am a piece of text          3
SELECT  *
FROM    dbo.uftShowFirstDifference('I  am a pot of jam',
                                   'I am a piece of text')
--pot of jam          piece of text               7
SELECT  *
FROM    dbo.uftShowFirstDifference('I  am a pot of jam',
                                   'I  am a pot of jam beloved by humans')
--                    beloved by humans           13
SELECT  *
FROM    dbo.uftShowFirstDifference('I am a piece of text',
                                   'text of piece a am I')
--I am a piece of     text of piece a am          1
SELECT  *
FROM    dbo.uftShowFirstDifference('I am a piece of text',
                                   'this is completely different')
--I am a piece of     this is completely different 1
SELECT  *
FROM    dbo.uftShowFirstDifference('I am a piece of text', '')
--I am a piece of                                  1
SELECT  *
FROM    dbo.uftShowFirstDifference('', 'I am a piece of text')
--                    I am a piece of              1

William Brewer

Author profile:

William Brewer is a SQL Server developer who has worked as a Database consultant and Business Analyst for several Financial Services organisations in the City of London. True to his name, he is also an expert on real ale.




Subject: Very interesting--I can hardly wait to dive in deep to it...
Posted by: Jonathon Storm
Posted on: Friday, September 21, 2007 at 12:10 PM
Message: Thank you very much for writing this...

Subject: Diving in deep
Posted by: WBrewer
Posted on: Monday, September 24, 2007 at 5:28 AM
Message: I must admit that I've happily passed the time in some otherwise boring train journeys whilst trying to find set-based solutions to this problem!

Subject: comparing text
Posted by: Jayesh Dharamsey
Posted on: Friday, October 5, 2007 at 6:53 AM
Message: very useful and thoughtful

Subject: Case Sensitive?
Posted by: imaximus
Posted on: Wednesday, June 2, 2010 at 6:27 PM
Message: Could this be modified to be case sensitive? Let me know your thoughts. I would be willing to pay your hourly consulting rate for this modification if so. Look forward to hearing from you.

Chet

 