Click here to monitor SSC
  • Av rating:
  • Total votes: 66
  • Total comments: 29
Robyn Page and Phil Factor

TSQL Regular Expression Workbench

27 November 2007

/* This Workbench is about using Regular expressions with SQL Server via TSQL. It doesn't even attempt to teach how regular expressions work or how to pull them together. There are plenty of such resources on the Web. The aim is to demonstrate a few possibilities and try to persuade you to experiment with them if you don't already use Regex with SQL Server.

We suggest that, if you are an ordinary mortal like Phil or I, without special powers, you should use an application such as RegexBuddy to form, edit and interpret Regular expressions. It makes learning them a lot easier. In order that people with access only to SQL Server 2000 can use the workbench, we'll use OLE in the examples, but they are readily adapted to CLR

As always, the source code is in the speechbubble above. */

/*

-- Contents --

  1. Introduction
  2. The OLE Functions
    1. The OLE Regex Match function
    2. The OLE Regex Replace function
    3. The OLE Regex Find (Execute) function
  3. Combining two Regexs
  4. OLE Regex Performance

Regular Expressions can be very useful to the Database programmer, particularly for data validation, data feeds and data transformations. A lot of the time, tools such as grep and awk or Funduc's S&R will be the most suitable way of using regular expressions, but just occasionally, it is handy to be able to use them in TSQL as we'll try to show.

Regular Expressions are not regular in the sense that there is any common dialect of expression that is understood by all Regex engines. On the contrary, regular expresssions aren't always portable and there are many common, similar but incompatible, dialects in use, such as Perl 5.8, Java.util.regex, .NET, PHP, Python, Ruby, ECMA Javascript, PCRE, Apache, vi, Shell tools TCL ARE, POSIX BRE, Funduc and JGsoft.

Regular Expressions were never developed to be easy to understand. They are a condensed shorthand that, on preliminary inspection, looks as if someone has repeatedly sat on the keyboard. Even when interpreted, the logic isn't always easy to follow. If you don't agree, then explain this one!

Probably the best tutorial on the web for Regular Expressions is on www.regular-expressions.info but it is also worth reading Implementing Real-World Data Input Validation using Regular Expressions by Francis Norton for an introduction to regular expressions

A great deal can be done using commandline applications that work with regular expressions such as GREP and AWK. However, there are times where it is handy to use Regex directly from TSQL. There are two Regex engines available to SQL Server. These are

  • the .NET Regex which is in the system.text.regularexpression module
  • The ECMA Regex from VBScript.RegExp which is distributed with the IE browser and is used by Javascript and JScript.

Both of these are excellent standard implementations. Both work well in TSQL.

The .NET Regex requires the creation of CLR functions to provide regular expressions, and works only with SQL Server 2005, (and 2007) See CLR Integration by Christoffer Hedgate

The ECMA Regex can be used via VBScript.RegExp, which are available to SQL Server 2000 as well. The regex is compatible with Javascript.

The advantage of using CLR is that the regular expressions of the NET framework are very good, and performance is excellent. However, the techniques are well-known, whereas some of the more powerful uses of VBScript.RegExp have hardly ever been published, so this workbench will concentrate on the latter

The OLE functions
------------------

There are various properties to consider in these functions

IgnoreCase
By default, the regular expression is case sensitive. In the following functions, we have set the IgnoreCase property to True to make it case insensitive.
The Multiline property
The caret and dollar only match at the very start and very end of the subject string by default. If your subject string consists of multiple lines separated by line breaks, you can make the caret and dollar match at the start and the end of those lines by setting the Multiline property to True. (there is no option to make the dot match line break characters).
The Global property
If you want the RegExp object to return or replace all matches instead of just the first one, set the Global property to True.
Only the 'IgnoreCase is relevant in the first function but we've 'hardcoded' it to 1 as case-sensitive searches are a minority interest.

The OLE Regex Match function
-----------------------------

Let's start off with something simple, a function for testing a string against a regular expression */

IF OBJECT_ID (N'dbo.RegexMatch') IS NOT NULL
  
DROP FUNCTION dbo.RegexMatch
GO
CREATE FUNCTION dbo.RegexMatch
    
(
      
@pattern VARCHAR(2000),
      
@matchstring VARCHAR(MAX)--Varchar(8000) got SQL Server 2000
    
)
RETURNS INT
/* The RegexMatch returns True or False, indicating if the regular expression matches (part of) the string. (It returns null if there is an error).
When using this for validating user input, you'll normally want to check if the entire string matches the regular expression. To do so, put a caret at the start of the regex, and a dollar at the end, to anchor the regex at the start and end of the subject string.
*/
AS BEGIN
    DECLARE
@objRegexExp INT,
        
@objErrorObject INT,
        
@strErrorMessage VARCHAR(255),
        
@hr INT,
        
@match BIT

    
SELECT  @strErrorMessage = 'creating a regex object'
    
EXEC @hr= sp_OACreate 'VBScript.RegExp', @objRegexExp OUT
    
IF @hr = 0
        
EXEC @hr= sp_OASetProperty @objRegexExp, 'Pattern', @pattern
        
--Specifying a case-insensitive match
    
IF @hr = 0
        
EXEC @hr= sp_OASetProperty @objRegexExp, 'IgnoreCase', 1
        
--Doing a Test'
    
IF @hr = 0
        
EXEC @hr= sp_OAMethod @objRegexExp, 'Test', @match OUT, @matchstring
    
IF @hr <> 0
        
BEGIN
            RETURN
NULL
        
END
    EXEC
sp_OADestroy @objRegexExp
    
RETURN @match
  
END
GO
/* Now, with this routine, we can do some complex input validation*/
--IS there a repeating word
SELECT dbo.RegexMatch('\b(\w+)\s+\1\b','this has has been repeated')--1
SELECT dbo.RegexMatch('\b(\w+)\s+\1\b','this has not been repeated')--0

--find a word near another word (in this case 'for' and 'last' 1 or 2 words apart)
SELECT dbo.RegexMatch('\bfor(?:\W+\w+){1,2}?\W+last\b',
          
'You have failed me for the last time, Admiral')--1
SELECT dbo.RegexMatch('\bfor(?:\W+\w+){1,2}?\W+last\b',
          
'You have failed me for what could be the last time, Admiral')--0

--is this likely to be a valid credit card
SELECT dbo.RegexMatch('^(?:4[0-9]{12}(?:[0-9]{3})?|5[1-5][0-9]{14}|6011[0-9]{12}|3(?:0
[0-5]|[68][0-9])[0-9]{11}|3[47][0-9]{13}|(?:2131|1800)\d{11})$'
,'4953129482924435')          

--IS this a valid ZIP code
SELECT dbo.RegexMatch('^[0-9]{5,5}([- ]?[0-9]{4,4})?$','02115-4653')

--is this a valid Postcode
SELECT dbo.RegexMatch('^([Gg][Ii][Rr] 0[Aa]{2})|((([A-Za-z][0-9]{1,2})|(([A-Za-z][A-Ha
-hJ-Yj-y][0-9]{1,2})|(([A-Za-z][0-9][A-Za-z])|([A-Za-z][A-Ha-hJ-Yj-y][0-9]?[A-Za-z])))
) {0,1}[0-9][A-Za-z]{2})$'
,'RG35 2AQ')

--is this a valid European date
SELECT dbo.RegexMatch('^((((31\/(0?[13578]|1[02]))|((29|30)\/(0?[1,3-9]|1[0-2])))\/(1[
6-9]|[2-9]\d)?\d{2})|(29\/0?2\/(((1[6-9]|[2-9]\d)?(0[48]|[2468][048]|[13579][26])|((16
|[2468][048]|[3579][26])00))))|(0?[1-9]|1\d|2[0-8])\/((0?[1-9])|(1[0-2]))\/((1[6-9]|[2
-9]\d)?\d{2})) (20|21|22|23|[0-1]?\d):[0-5]?\d:[0-5]?\d$'
,'12/12/2007 20:15:27')

--is this a valid currency value (dollar)
SELECT dbo.RegexMatch('^\$(\d{1,3}(\,\d{3})*|(\d+))(\.\d{2})?$','$34,000.00')

--is this a valid currency value (Sterling)
SELECT dbo.RegexMatch('^\£(\d{1,3}(\,\d{3})*|(\d+))(\.\d{2})?$',
'£34,000.00'
)

--A valid email address?
SELECT dbo.RegexMatch('^(([a-zA-Z0-9!#\$%\^&\*\{\}''`\+=-_\|/\?]+(\.[a-zA-Z0-9!#\$%\^&
\*\{\}''`\+=-_\|/\?]+)*){1,64}@(([A-Za-z0-9]+[A-Za-z0-9-_]*){1,63}\.)*(([A-Za-z0-9]+[A
-Za-z0-9-_]*){3,63}\.)+([A-Za-z0-9]{2,4}\.?)+){1,255}$'
,'Phil.Factor@simple-Talk.com')

/*

With this function, the passing back of errors is rudimentary. If an OLE error occurs, then a null is passed back.

There are two other basic Regex functions available. With them, you can use regular expressions in all sorts of places in TSQL without having to get to direct grips with the rather awkward OLE interface.

The OLE Regex Replace function
-----------------------------

*/

IF OBJECT_ID(N'dbo.RegexReplace') IS NOT NULL
    
DROP FUNCTION dbo.RegexReplace
GO
CREATE FUNCTION dbo.RegexReplace
    
(
      
@pattern VARCHAR(255),
      
@replacement VARCHAR(255),
      
@Subject VARCHAR(MAX),
      
@global BIT = 1,
    
@Multiline bit =1
    
)
RETURNS VARCHAR(MAX)

/*The RegexReplace function takes three string parameters. The pattern (the regular expression) the replacement expression, and the subject string to do the manipulation to.

The replacement expression is one that can cause difficulties. You can specify an empty string '' as the @replacement text. This will cause the Replace method to return the subject string with all regex matches deleted from it (see "strip all HTML elements out of a string" below).
To re-insert the regex match as part of the replacement, include $& in the replacement text. (see "find a #comment and add a TSQL --" below)
If the regexp contains capturing parentheses, you can use backreferences in the replacement text. $1 in the replacement text inserts the text matched by the first capturing group, $2 the second, etc. up to $9. (e.g. see import delimited text into a database below) To include a literal dollar sign in the replacements, put two consecutive dollar signs in the string you pass to the Replace method.*/
AS BEGIN
    DECLARE
@objRegexExp INT,
        
@objErrorObject INT,
        
@strErrorMessage VARCHAR(255),
        
@Substituted VARCHAR(8000),
        
@hr INT,
        
@Replace BIT

    
SELECT  @strErrorMessage = 'creating a regex object'
    
EXEC @hr= sp_OACreate 'VBScript.RegExp', @objRegexExp OUT
    
IF @hr = 0
        
SELECT  @strErrorMessage = 'Setting the Regex pattern',
                
@objErrorObject = @objRegexExp
    
IF @hr = 0
        
EXEC @hr= sp_OASetProperty @objRegexExp, 'Pattern', @pattern
    
IF @hr = 0 /*By default, the regular expression is case sensitive. Set the IgnoreCase property to True to make it case insensitive.*/
        
SELECT  @strErrorMessage = 'Specifying the type of match'
    
IF @hr = 0
        
EXEC @hr= sp_OASetProperty @objRegexExp, 'IgnoreCase', 1
    
IF @hr = 0
        
EXEC @hr= sp_OASetProperty @objRegexExp, 'MultiLine', @Multiline
    
IF @hr = 0
        
EXEC @hr= sp_OASetProperty @objRegexExp, 'Global', @global
    
IF @hr = 0
        
SELECT  @strErrorMessage = 'Doing a Replacement'
    
IF @hr = 0
        
EXEC @hr= sp_OAMethod @objRegexExp, 'Replace', @Substituted OUT,
            
@subject, @Replacement
    
/*If the RegExp.Global property is False (the default), Replace will return the @subject string with the first regex match (if any) substituted with the replacement text. If RegExp.Global is true, the @Subject string will be returned with all matches replaced.*/  
    
IF @hr <> 0
        
BEGIN
            DECLARE
@Source VARCHAR(255),
                
@Description VARCHAR(255),
                
@Helpfile VARCHAR(255),
                
@HelpID INT
  
            EXECUTE
sp_OAGetErrorInfo @objErrorObject, @source OUTPUT,
                
@Description OUTPUT, @Helpfile OUTPUT, @HelpID OUTPUT
            
SELECT  @strErrorMessage = 'Error whilst '
                    
+ COALESCE(@strErrorMessage, 'doing something') + ', '
                    
+ COALESCE(@Description, '')
            
RETURN @strErrorMessage
        
END
    EXEC
sp_OADestroy @objRegexExp
    
RETURN @Substituted
  
END
GO
--remove repeated words in text
SELECT  dbo.RegexReplace('\b(\w+)(?:\s+\1\b)+', '$1',
                        
'Sometimes I cant help help help stuttering',1, 1)

--find a #comment and add a TSQL --
SELECT  dbo.RegexReplace('#.*','--$&','
# this is a comment
first,second,third,fourth'
,1,1)

--replace a url with an HTML anchor
SELECT  dbo.RegexReplace(
       
'\b(https?|ftp|file)://([-A-Z0-9+&@#/%?=~_|!:,.;]*[-A-Z0-9+&@#/%=~_|])',
        
'<a href="$2">$2</a>',
         
'There is  this amazing site at http://www.simple-talk.com',1,1)

--strip all HTML elements out of a string
SELECT  dbo.RegexReplace('<(?:[^>''"]*|([''"]).*?\1)*>',
  
'','<a href="http://www.simple-talk.com">Simle Talk is wonderful</a><!--This is a comment --> we all love it',1,1)

--import delimited text into a database, converting it into insert statements
SELECT  dbo.RegexReplace(
 '([^\|\r\n]+)[|\r\n]+([^\|\r\n]+)[|\r\n]+([^\|\r\n]+)[|\r\n]+([^\|\r\n]+)[|\r\n]+'
,
 
'Insert into MyTable (Firstcol,SecondCol, ThirdCol, Fourthcol)
select $1,$2,$3,$4
'
,'1|white gloves|2435|24565
2|Sports Shoes|285678|0987
3|Stumps|2845|987
4|bat|29862|4875'
,1,1)
/*

*/

The OLE Regex Find (Execute) function
-----------------------------

This is the most powerful function for doing complex finding and replacing of text. As it passes back detailed records of the hits, including the location and the backreferences, it allows for complex manipulations.

This is written as a table function. The Regex Routine actually passes back a collection for each 'hit'. In the relational world, you'd normally represent this in two tables, so we've returned a left outer join of the two logical tables so as to pass back all the information. This seems to cater for all the uses we can think of. We also append an error column, which should be blank!

*/

IF OBJECT_ID(N'dbo.RegexFind') IS NOT NULL
    
DROP FUNCTION dbo.RegexFind
GO
CREATE FUNCTION RegexFind(
    
@pattern VARCHAR(255),
    
@matchstring VARCHAR(MAX),
    
@global BIT = 1,
  
@Multiline bit =1)
RETURNS
    
@result TABLE
        
(
        
Match_ID INT,
          
FirstIndex INT ,
          
length INT ,
          
Value VARCHAR(2000),
          
Submatch_ID INT,
          
SubmatchValue VARCHAR(2000),
        
Error VARCHAR(255)
        )


AS -- columns returned by the function
  
BEGIN
    DECLARE
@objRegexExp INT,
        
@objErrorObject INT,
        
@objMatch INT,
        
@objSubMatches INT,
        
@strErrorMessage VARCHAR(255),
      
@error VARCHAR(255),
        
@Substituted VARCHAR(8000),
        
@hr INT,
        
@matchcount INT,
        
@SubmatchCount INT,
        
@ii INT,
        
@jj INT,
        
@FirstIndex INT,
        
@length INT,
        
@Value VARCHAR(2000),
        
@SubmatchValue VARCHAR(2000),
        
@objSubmatchValue INT,
        
@command VARCHAR(8000),
        
@Match_ID INT
        
    DECLARE
@match TABLE
        
(
          
Match_ID INT IDENTITY(1, 1)
                       NOT NULL,
          
FirstIndex INT NOT NULL,
          
length INT NOT NULL,
          
Value VARCHAR(2000)
        )    
    
DECLARE @Submatch TABLE
        
(
          
Submatch_ID INT IDENTITY(1, 1),
          
match_ID INT NOT NULL,
          
SubmatchNo INT NOT NULL,
          
SubmatchValue VARCHAR(2000)
        )
      


    
SELECT  @strErrorMessage = 'creating a regex object',@error=''
    
EXEC @hr= sp_OACreate 'VBScript.RegExp', @objRegexExp OUT
    
IF @hr = 0
        
SELECT  @strErrorMessage = 'Setting the Regex pattern',
                
@objErrorObject = @objRegexExp
    
IF @hr = 0
        
EXEC @hr= sp_OASetProperty @objRegexExp, 'Pattern', @pattern
    
IF @hr = 0
        
SELECT  @strErrorMessage = 'Specifying a case-insensitive match'
    
IF @hr = 0
        
EXEC @hr= sp_OASetProperty @objRegexExp, 'IgnoreCase', 1
    
IF @hr = 0
        
EXEC @hr= sp_OASetProperty @objRegexExp, 'MultiLine', @Multiline
    
IF @hr = 0
        
EXEC @hr= sp_OASetProperty @objRegexExp, 'Global', @global
    
IF @hr = 0
        
SELECT  @strErrorMessage = 'Doing a match'
    
IF @hr = 0
        
EXEC @hr= sp_OAMethod @objRegexExp, 'execute', @objMatch OUT,
            
@matchstring
    
IF @hr = 0
        
SELECT  @strErrorMessage = 'Getting the number of matches'    
    
IF @hr = 0
        
EXEC @hr= sp_OAGetProperty @objmatch, 'count', @matchcount OUT
    
SELECT  @ii = 0
    
WHILE @hr = 0
        
AND @ii &lt; @Matchcount
        
BEGIN
/*The Match object has four read-only properties.
The FirstIndex property indicates the number of characters in the string to the left of the match.
The Length property of the Match object indicates the number of characters in the match.
The Value property returns the text that was matched.*/
            
SELECT  @strErrorMessage = 'Getting the FirstIndex property',
                    
@command = 'item(' + CAST(@ii AS VARCHAR) + ').FirstIndex'    
            
IF @hr = 0
                
EXEC @hr= sp_OAGetProperty @objmatch, @command,
                    
@Firstindex OUT
            
IF @hr = 0
                
SELECT  @strErrorMessage = 'Getting the length property',
                        
@command = 'item(' + CAST(@ii AS VARCHAR) + ').Length'    
            
IF @hr = 0
                
EXEC @hr= sp_OAGetProperty @objmatch, @command, @Length OUT
            
IF @hr = 0
                
SELECT  @strErrorMessage = 'Getting the value property',
                        
@command = 'item(' + CAST(@ii AS VARCHAR) + ').Value'    
            
IF @hr = 0
                
EXEC @hr= sp_OAGetProperty @objmatch, @command, @Value OUT
            
INSERT  INTO @match
                    
(
                      
Firstindex,
                      
[Length],
                      
[Value]
                    
)
                    
SELLECT  @firstindex + 1,
                            
@Length,
                            
@Value
            
SELECT  @Match_ID = @@Identity        
/*The SubMatches property of the Match object is a collection of strings. It will only hold values if your regular expression has capturing groups. The collection will hold one string for each capturing group. The Count property (returned as SubmatchCount) indicates the number of string in the collection. The Item property takes an index parameter, and returns the text matched by the capturing group.
*/
            
IF @hr = 0
                
SELECT  @strErrorMessage = 'Getting the SubMatches collection',
                        
@command = 'item(' + CAST(@ii AS VARCHAR)
                        +
').SubMatches'    
            IF @hr = 0
                
SELECT  @strErrorMessage = 'Getting the number of submatches'    
            
IF @hr = 0
                
EXEC @hr= sp_OAGetProperty @objSubmatches, 'count',
                    
@submatchCount OUT
            
SELECT  @jj = 0
            
WHILE @hr = 0
                
AND @jj &lt; @submatchCount
                
BEGIN
                    IF
@hr = 0
                        
SELECT  @strErrorMessage = 'Getting the submatch value property',
                                
@command = 'item(' + CAST(@jj AS VARCHAR)
                                +
')' ,@submatchValue=NULL  
                    
IF @hr = 0
                        
EXEC @hr= sp_OAGetProperty @objSubmatches, @command,
                            
@SubmatchValue OUT
                    
INSERT  INTO @Submatch
                            
(
                              
Match_ID,
                              
SubmatchNo,
                              
SubmatchValue
                            
)
                            
SELECT  @Match_ID,
                                    
@jj+1,
                                    
@SubmatchValue
                    
SELECT  @jj = @jj + 1
                
END  
                EXEC @hr= sp_OAGetProperty @objmatch, @command,
                    
@objSubmatches OUT    
            SELECT  
@ii = @ii + 1
        
END
    IF
@hr &lt;&gt; 0
        
BEGIN
            DECLARE
@Source VARCHAR(255),
                
@Description VARCHAR(255),
                
@Helpfile VARCHAR(255),
                
@HelpID INT
  
            EXECUTE
sp_OAGetErrorInfo @objErrorObject, @source OUTPUT,
                
@Description OUTPUT, @Helpfile OUTPUT, @HelpID OUTPUT
            
SELECT  @Error = 'Error whilst '
                    
+ COALESCE(@strErrorMessage, 'doing something') + ', '
                    
+ COALESCE(@Description, '')
        
END
    EXEC
sp_OADestroy @objRegexExp
    
EXEC sp_OADestroy        @objMatch
    
EXEC sp_OADestroy        @objSubMatches

INSERT INTO @result
          
(Match_ID,
          
FirstIndex,
          
[length],
          
[Value],
          
Submatch_ID,
          
SubmatchValue,
        
error)


    
SELECT  m.[Match_ID],
          
[FirstIndex],
          
[length],
          
[Value],[SubmatchNo],
          
[SubmatchValue],@error
  
FROM    @match m
    
LEFT OUTER JOIN   @submatch s
    
ON m.match_ID=s.match_ID    
IF @@rowcount=0 AND LEN(@error)&gt;0
INSERT INTO @result(error) SELECT @error
RETURN
END
GO

--showing the context where two words 'for' and 'last' are found in proximity
DECLARE @sample VARCHAR(2000)
SELECT @Sample='You have failed me for the last time, Admiral.
We have not long to wait for your last gasp'
SELECT '...'+SUBSTRING(@Sample,Firstindex-8,length+16)+'...'
    
FROM dbo.RegexFind ('\bfor(?:\W+\w+){0,3}?\W+last\b',
          
@sample,1,1)

--finding repeated words, showing the repetition and the repeated word
SELECT [repetition]=value, [word]=SubmatchValue FROM dbo.RegexFind ('\b(\w+)\s+\1\b',
'this this is is a repeated word word word',1,1)

--Split lines based on a regular expression
SELECT value FROM dbo.regexfind('[^\r\n]*(?:[\r\n]*)',
'
This is the second line
This is the third
and the fourth'
,1,1) WHERE length&gt;0

--break up all words in a string into separate table rows
SELECT value FROM dbo.RegexFind ('\b[\w]+\b',
'Hickory dickory dock, the mouse ran up the clock',1,1)
--split text into keywords and values
SELECT Match_ID,
[keyword]=MAX (CASE WHEN submatch_ID=1 THEN  submatchValue ELSE '' END),
[value]=MAX (CASE WHEN submatch_ID=2 THEN  submatchValue ELSE '' END)
  
FROM dbo.RegexFind ('(\w+)\s*=\s*(.*)\s*',
'firstname=Phil
Lastname=Factor
Salary=$200,000
age=unknown to us
Post=DBA'
,1,1) GROUP BY Match_ID

SELECT * FROM dbo.RegexFind ('([^\|\r\n]+[\|\r\n]+)',
'1|white gloves|2435|24565
2|Sports Shoes|285678|0987
3|Stumps|2845|987
4|bat|29862|4875'
,1,1)

--get valid dates and convert to SQL Server format
SELECT DISTINCT CONVERT(DATETIME,value,103) FROM dbo.RegexFind ('\b(0?[1-9]|[12][0-9]|3[01])[- /.](0?[1-9]|1[012])[- /.](19|20?[0-9]{2})\b','
12/2/2006 12:30 <> 13/2/2007
32/3/2007
2-4-2007
25.8.2007
1/1/2005
34/2/2104
2/5/2006'
,1,1)

/*

Combining two Regexs
--------------------

Once you've experimented with the regex calls we've provided, you'll realise that you can create some really cool functions and procedures that combine regexs. Here we have a procedure that does a 'google-style' search on text to find the words you specify. It returns the 'context' in that it quotes the substring where the match occurred. You can specify how close the words need to be to specify a 'hit'

*/

IF OBJECT_ID(N'dbo.FindWordsInContext') IS NOT NULL
    
DROP FUNCTION dbo.FindWordsInContext
GO
CREATE FUNCTION dbo.FindWordsInContext
    
(
      
@words VARCHAR(255),--list of words you want searched for
      
@text VARCHAR(MAX),--the text you want searched
      
@proximity INT--the maximum distance in words between specified words
    
)
RETURNS @proximityList TABLE
    
(
      
Hit INT IDENTITY(1, 1),
      
context VARCHAR(2000)
    )
AS BEGIN
    DECLARE
@Pattern VARCHAR(512)
    
SELECT  @Pattern = COALESCE(@pattern + '(?:\W+\w+){0,'
                                
+ CAST(@proximity AS VARCHAR(5)) + '}?\W+',
                                
'\b') + value
    
FROM    dbo.RegexFind('\b[\w]+\b', @words, 1, 1)
    
INSERT  INTO @ProximityList ( context )
            
SELECT  '...' + SUBSTRING(@text, Firstindex - 8, length + 16)
                    +
'...'
            
FROM    dbo.RegexFind(@pattern+'\b', @text, 1, 1)
    
RETURN
   END

GO

SELECT * FROM dbo.FindWordsInContext('sadness farewell embark',
'Sunset and evening star,
And one clear call for me!
And may there by no moaning of the bar,
When I put out to sea,

But such a tide as moving seems asleep,
Too full for sound and foam,
When that which drew from out the boundless deep
Turns again home.

Twilight and evening bell,
And after that the dark!
And may there be no sadness of farewell,
When I embark;

For tho'' from out our bourne of Time and Place
The flood may bear me far,
I hope to see my Pilot face to face
When I have crost the bar.
'
,8)


/*

OLE Regex performance
-----------------

Whereas the use of the OLE VBScript.RegExp to scan large chunks of text is fine, it is good for complex validation, and it makes a great testbed for regexes, These OLE functions are too slow for use in queries. The overhead of making the calls is just too high because the performance of OLE in TSQL is not great. See Zach Nichter's excellent article on the subject 'Writing to a File Using the sp_OACreate Stored Procedure and OSQL' Here is an example, scanning a databases of nearly 50,000 names of public houses from out XML Jumpstart Workbench.
*/

SELECT COUNT(*) FROMpublichouses.dbo.publichouses WHERE dbo.RegexMatch ('\bred\b',name)=1
--5 minutes 28 secs
SELECT COUNT(*) FROM publichouses.dbo.publichouses WHERE name LIKE '%red %'
--less than 50 ms

/*You can reduce the overhead to a quarter of what it was by using a function like this and creating the Regex object before you do the call. This means the Regex Object does not get repeatedly created and destroyed on every call.*/

IF OBJECT_ID(N'dbo.OARegexMatch') IS NOT NULL
    
DROP FUNCTION dbo.OARegexMatch
GO
CREATE FUNCTION dbo.OARegexMatch /* very simple Function Wrapper around the call */
    
(
    
@objRegexExp INT,
      
@matchstring VARCHAR(MAX)
    )
RETURNS INT
AS BEGIN
    DECLARE
@objErrorObject INT,
        
@hr INT,
        
@match BIT
        
EXEC @hr= sp_OAMethod @objRegexExp, 'Test', @match OUT, @matchstring
    
IF @hr &lt;&gt; 0
        
BEGIN
            RETURN
NULL
        
END
    RETURN
@match
  
END
GO
/* and now embed the SQL Query within the life-cycle of the Regex object */

DECLARE @objRegexExp INT,
        
@objErrorObject INT,
        
@strErrorMessage VARCHAR(255),
        
@hr INT,
        
@match BIT

    
SELECT  @strErrorMessage = 'creating a regex object'
    
EXEC @hr= sp_OACreate 'VBScript.RegExp', @objRegexExp OUT
    
IF @hr = 0
        
EXEC @hr= sp_OASetProperty @objRegexExp,'pattern', '\bred\b'
        
--Specifying a case-insensitive match
    
IF @hr = 0
        
EXEC @hr= sp_OASetProperty @objRegexExp, 'IgnoreCase', 1
        
--Doing a Test'
    
IF @hr = 0
      
SELECT COUNT(*)
          
FROM publichouses.dbo.publichouses
          
WHERE dbo.OARegexMatch (@objRegexExp,name)=1
    
IF @hr &lt;&gt; 0
        
BEGIN
            DECLARE
@Source VARCHAR(255),
                
@Description VARCHAR(255),
                
@Helpfile VARCHAR(255),
                
@HelpID INT
  
            EXECUTE
sp_OAGetErrorInfo @objErrorObject, @source OUTPUT,
                
@Description OUTPUT, @Helpfile OUTPUT, @HelpID OUTPUT
            
SELECT  @strErrorMessage = 'Error whilst '
                    
+ COALESCE(@strErrorMessage, 'doing something') + ', '
                    
+ COALESCE(@Description, '')
            
RAISERROR( @strErrorMessage,16,1)
        
END
    EXEC
sp_OADestroy @objRegexExp
--1 minute 28 secs

/* it is no consolation for those who are stuck with SQL Server 2000, but the CLR functions are a lot quicker for this sort of usage. */

We've used a range of regex patterns from a number of sources in this workbench. Like a lot of programmers, we collect up snippets we come across, almost always forgetting to record the original author. We therefore apologise in advance for not crediting the source and original author, of regex patterns. As you can guess, they often take a long time and effort to develop. If you spot a regex which we should have cited, please add a comment and let us all know who originally wrote it!

Robyn Page and Phil Factor

Author profile:


Robyn Page has worked as a consultant with Enformatica and USP Networks with a special interest in the provision of broadcast services over IP intranets. She was also a well known actress, being most famous for her role as Katie Williams, barmaid and man-eater in the Television Series Family Affairs. She is currently having a career break to raise a young family.

Phil Factor (real name withheld to protect the guilty), aka Database Mole, has 20 years of experience with database-intensive applications. Despite having once been shouted at by a furious Bill Gates at an exhibition in the early 1980s, he has remained resolutely anonymous throughout his career.

Search for other articles by Robyn Page and Phil Factor

Rate this article:   Avg rating: from a total of 66 votes.


Poor

OK

Good

Great

Must read
Have Your Say
Do you have an opinion on this article? Then add your comment below:
You must be logged in to post to this forum

Click here to log in.


Subject: Excelent stuff!
Posted by: Anonymous (not signed in)
Posted on: Thursday, November 29, 2007 at 1:46 AM
Message: Perhaps You can considder making a similar one using CRL and .Net .-)

Subject: re: Excelent stuff!
Posted by: Phil Factor (view profile)
Posted on: Thursday, November 29, 2007 at 3:11 AM
Message: Thanks for the compliment.

If you mean CLR, we started out doing it with CLR but came to realise it had been done so many times already, and we like to come up with something new with these Workbenches. Also, the SQL Server 2000 crew appreciate it if they can join in too!
The examples should work with CLR pretty closely though the EXECUTE call (which has a lot of uses in TSQL) will take a bit of thought! Can anyone provide a CLR Version of RegexFind?

Subject: SP_OACREATE turned off in SQL2005
Posted by: Andy Clark (not signed in)
Posted on: Thursday, November 29, 2007 at 4:43 AM
Message: SQL 2005 users will need to enable Ole Automation Procedures using the Surface Area Configuration or sp_configure.

Subject: Error on line 27 in T-SQL 2000
Posted by: Bob Carretta (view profile)
Posted on: Monday, December 03, 2007 at 11:04 AM
Message: Robyn, I get an error:
Line 27: Incorrect syntax near ';'.

which is the line:
IF @hr <> 0

I have updated the varaiable @matchstring varchar(8000)as noted in the code.

Am Ioverlooking something else?


Subject: re: Error on line 27 in T-SQL 2000
Posted by: Robyn Page (view profile)
Posted on: Tuesday, December 04, 2007 at 4:13 AM
Message: Hmm. We don't get the same error. We've put an alternative SQL Server 200 source in the speech-bubble at the top of the page just in case anyone else hits problems.

Subject: Error on line 27 in T-SQL 2000
Posted by: Bob Carretta (view profile)
Posted on: Tuesday, December 04, 2007 at 4:03 PM
Message: Robyn,
I did indeed overlook the script in the bubble. The script works fine now.
Thanks

Subject: Catastrophic backtracking
Posted by: Phil Factor (view profile)
Posted on: Wednesday, December 05, 2007 at 3:47 AM
Message: Make sure you try stuff out comprehensively on a development server. A problem with a lot of Regex implementations is a mistake called 'Catastrophic Backtracking'. It seems incredible that one can cause applications to go into endless loops (the .net regex) or stack overflows (perl) just by making a mistake with a pattern but this is the case in the wacky world of Regex.

Catastrophic Backtracking, like much in Regular Expressions, takes a lot of explaining so I won't even attempt it here. see http://www.regular-expressions.info/catastrophic.html When it happens, the spid can't be killed, and the server slows to a crawl. The only way out I know of is to stop and restart the server.

Subject: VBScript.RegExp not there...
Posted by: Jeff Moden (view profile)
Posted on: Monday, December 10, 2007 at 2:01 AM
Message: VBScript.RegExp isn't anywhere to be found on my harddisk... and, I use IE7 (upgraded from 6 upgraded from 5, etc). I have both SQL Server 2000 and 2005 loaded, as well.

Subject: re: VBScript.RegExp not there...
Posted by: Phil Factor (view profile)
Posted on: Monday, December 10, 2007 at 5:07 AM
Message: Jeff,

The RegExp object was introduced in WSH 2 which is normally installed with the OS, (since NT) but was installed with IE5 onwards. A simple way of checking to see if it is installed is to see what icons are associated with a .js file, a .vbs file, and a .wsf file on your server.
If you hit any problems, you can install the latest version from Microsoft. This is on MSDN: Download Centre-Windows Scripting.

Subject: Nice
Posted by: Eider Mauricio Aristizabal (not signed in)
Posted on: Monday, December 10, 2007 at 7:20 AM
Message: Very nice!

Subject: Very Nice
Posted by: Giancarlo Delgado (not signed in)
Posted on: Monday, December 10, 2007 at 8:33 AM
Message: Years ago when i was looking for a sql2000 solution to regex, i could only find dlls to buy. It wasnt till the CLR that i started using regex at the database level. This article should have been written a very long time ago, but will still serve its use today i bet. Thanks guys.

http://www.RezoneSolutions.com

Subject: TSQL Regular Expression Workbench
Posted by: Anonymous (not signed in)
Posted on: Monday, December 10, 2007 at 10:44 AM
Message: I have used regular expression in font-end programs so many times but I have not used it in TSQL before. This article is very good and useful.

Subject: Nice if on SQL 2000
Posted by: Anonymous (not signed in)
Posted on: Wednesday, December 12, 2007 at 9:57 PM
Message: This would be good for SQL Server 2000 and earlier, but I believe it would be much simpler to create CLR UDFs to do this exact thing.

Subject: Re: Nice if on SQL 2000
Posted by: Robyn Page (view profile)
Posted on: Thursday, December 13, 2007 at 5:37 AM
Message: Sure, but we didn't want to leave out the SQL Server 2000 users (see our comments about this in the article). The article is about using Regex, not OLE.

I haven't come across a CLR solution that gives all the information that comes from RegexFind, so it would be nice if you can send one that we can publish.

Subject: Always the best Stuff
Posted by: Amit Tiwari (not signed in)
Posted on: Thursday, December 13, 2007 at 11:24 PM
Message: You always put up the best stuff. Your knowledge is incredible.
Cheers

Subject: Sweetness
Posted by: Maxkil (not signed in)
Posted on: Thursday, December 20, 2007 at 4:14 PM
Message: EXACTLY what I was looking for!
I scoured the web trying to find the right syntax to use Regular Expression's Replace method with sp_OACreate and sp_OAMethod. I know it was possible. I just didn't know the right snytax. Thanks!

Subject: Re: Sweetness
Posted by: Robyn Page (view profile)
Posted on: Saturday, December 22, 2007 at 3:04 AM
Message: Bless you.
Let us know of any other subjects that need the 'workbench' treatment and Grant, Adam, Phil or I will do our best.

Subject: We don't need no stinkin' CLR!
Posted by: Rowland Gosling (not signed in)
Posted on: Thursday, December 27, 2007 at 9:01 PM
Message: I like my rum, tea, coffee and SQL straight. I became troubled by the awkwardness of such a solution too. I also started down the CLR path--excited as a school boy I at last found a real use for the CLR-- only to realize it wasn't very portable: my current client is on MSSQL2K and MSSQL2K5.

Loved your solution--it's verra nice
</Steve Martin>

Rowland



Subject: Befuzzled
Posted by: William Meitzen (not signed in)
Posted on: Tuesday, February 26, 2008 at 9:37 PM
Message: My SQL Server 2000 regex UDF worked, but the same UDF does not. The one posted here does not, nor do the other 5 I've tried. OLE automation is turned on.

SELECT dbo.RegexMatch('\b(\w+)\s+\1\b','this has has been repeated')-- returns NULL
SELECT dbo.RegexMatch('\b(\w+)\s+\1\b','this has not been repeated')-- returns NULL

My SQL Server 2005 database login is tied to the sysadmin role, too.

Subject: count of the select statements in a stored procedure
Posted by: Anonymous (not signed in)
Posted on: Thursday, February 28, 2008 at 11:04 AM
Message: i have 500 sps and some of them have select statements in it...
i want to take the count of this select statements excluding the commented ones..
is there any query for that?..
or hoe can i get the count of the select statements in sps in my database

is there any regular expression to do this..
i am using mssql 2005 and 2000.
thanks in advance

Subject: WOW - thank you!
Posted by: Nick (view profile)
Posted on: Wednesday, April 16, 2008 at 11:48 AM
Message: You just saved me my job. This function worked on the second try. My sql user needed dbo rights to the master table, but after that it worked fantastic.

I really appreciate you time on this workbench item!

Subject: RegEx results limit
Posted by: RJSO (not signed in)
Posted on: Wednesday, June 25, 2008 at 1:48 PM
Message: Hi all, i created a function that returns the number of matches within a string.

For example:

Select Name, dbo.ufn_RegexObjCount('\b\w+?\b',[Name]) from #Table1

Name Count
-------------------------------------
1 'test 1 or 2 or 3' 7
2 'test 1 or 2 or 3' 7
3 'test 1 or 2 or 3' 7


It returns the number of words for every field on Name. The problem occurs when i have more then 128 registries in the Name field.

Name Count
-------------------------------------
1 'test 1 or 2 or 3' 7
2 'test 1 or 2 or 3' 7
3 'test 1 or 2 or 3' 7
...
129 'test 1 or 2 or 3' NULL
130 'test 1 or 2 or 3' NULL
131 'test 1 or 2 or 3' NULL

You can see my code bellow:

set ANSI_NULLS ON
set QUOTED_IDENTIFIER ON
GO
ALTER FUNCTION [dbo].[ufn_regexObjCount]
(
@pattern nvarchar(4000),
@source nvarchar(4000)
)
RETURNS INT AS
BEGIN
DECLARE @hr INT
DECLARE @objMatch INT
DECLARE @matchcount INT
DECLARE @objRegExp INT

EXEC @hr = sp_OACreate 'VBScript.RegExp', @objRegExp OUT
EXEC @hr = sp_OASetProperty @objRegExp, 'Pattern', @pattern
EXEC @hr= sp_OASetProperty @objRegExp, 'MultiLine', 1
EXEC @hr= sp_OASetProperty @objRegExp, 'IgnoreCase', 1
EXEC @hr = sp_OASetProperty @objRegExp, 'Global', 1
EXEC @hr = sp_OAMethod @objRegExp, 'execute', @objMatch OUT, @source
EXEC @hr = sp_OAGetProperty @objMatch, 'count', @matchcount OUT

IF @hr <> 0
BEGIN
SET @matchcount = NULL
END
RETURN @matchcount
END

Subject: regular expression problem
Posted by: tengtium (view profile)
Posted on: Saturday, September 05, 2009 at 5:10 AM
Message: good day sirs and madam,

i badly needed help.. i have this regular expression

^(?!.*--)[A-Za-z\d-]+$

it accepts alphanumeric character optionally a dash (single dash only, not consecutive dash) for exampl:

it accepts:
12a-3c-4f3fg
12ertgg2
1-2-3-3-4-3
dffgsfg
d-f-f-g-s-f-g

it does not accept:
12-3c-4f&3fg
12e%rt-gg2
d--f-f-g-s-f-g
1-2-3-3-4--3

now i have problem, i cant find any in the .net that accepts the above valid expression with space-a non consecutive space. meaning the it needs to accept alphanumeric with optionally a non-consecutive dash and optionally a non-consecutive space.

can someone help please.

Subject: thanks
Posted by: rajib.bahar (view profile)
Posted on: Tuesday, October 13, 2009 at 2:48 PM
Message: thanks for the quick refresher...

the bio forgot to include...

"Robyn Page played Katie Williams, Ben's sister, barmaid and man-eater" --- pulled from wikipedia

Subject: the multline does not work
Posted by: Jeffrey (view profile)
Posted on: Wednesday, November 28, 2012 at 2:53 AM
Message: i used this sentence "EXEC @hr= sp_OASetProperty @objRegExp, 'MultiLine', 1" to set MultiLine work
but i found that doesn't work

can someone help please.

Subject: the multline does not work PS
Posted by: Jeffrey (view profile)
Posted on: Wednesday, November 28, 2012 at 3:06 AM
Message: i used this sentence "EXEC @hr= sp_OASetProperty @objRegExp, 'MultiLine', 1" to set MultiLine work
but i found that doesn't work

i can use regexreplace to delete all the "\r\n" but it isn't an ideal solution
can someone help please.


Subject: Error trying to replace HTML tags
Posted by: mat41 (view profile)
Posted on: Sunday, April 21, 2013 at 6:23 PM
Message: Hi there. Wow this works amazing, thank you for your fine work!!

Locally I created the function and then ran this to replace all the HTML tags in a field:

SELECT dbo.RegexReplace('<(?:[^>''"]*|([''"]).*?\1)*>',
'',context,1,1) from hazHRA

But in the live environment I an getting this error:

The EXECUTE permission was denied on the object 'sp_OACreate', database 'mssqlsystemresource', schema 'sys'.

I am a DBO type user, is this the issue? Is there an instance level setting I need to ask to be changed? SQLServer 2008R2

Thanks

Subject: Zip code
Posted by: rocketpanda (view profile)
Posted on: Tuesday, July 16, 2013 at 2:52 PM
Message: Somewhat unrelated, but I thought it was funny that your example zip code (02115-4653) is about a mile from my apartment.

Subject: Bug in identifying submatches in regExFind
Posted by: LenBinns (view profile)
Posted on: Tuesday, April 15, 2014 at 11:41 AM
Message: BTW, THANK YOU, this was a great article!

Couple minor bugs in regExFind:

1) Typo: you have "SELLECT" instead of "SELECT"
2) Submatches don't work because you're missing a line of code.

Need to add the following line of code:
exec @hr = sp_OAGetProperty @objmatch, @command, @objSubmatches OUTPUT

Current Code:
-- The SubMatches property of the Match object is a collection of strings.
-- It will only hold values if your regular expression has capturing groups.
-- The collection will hold one string for each capturing group.
-- The Count property (returned as SubmatchCount) indicates the number of string in the collection.
-- The Item property takes an index parameter, and returns the text matched by the capturing group.
--
if @hr = 0 select @strErrorMessage = 'Getting the SubMatches collection'
if @hr = 0 select @command = 'item(' + cast(@ii as varchar) + ').SubMatches'
if @hr = 0 select @strErrorMessage = 'Getting the number of submatches'
if @hr = 0 exec @hr = sp_OAGetProperty @objSubmatches, 'count', @submatchCount OUTPUT

Revised Code:
-- The SubMatches property of the Match object is a collection of strings.
-- It will only hold values if your regular expression has capturing groups.
-- The collection will hold one string for each capturing group.
-- The Count property (returned as SubmatchCount) indicates the number of string in the collection.
-- The Item property takes an index parameter, and returns the text matched by the capturing group.
--
if @hr = 0 select @strErrorMessage = 'Getting the SubMatches collection'
if @hr = 0 select @command = 'item(' + cast(@ii as varchar) + ').SubMatches'
if @hr = 0 select @strErrorMessage = 'Getting the number of submatches'
if @hr = 0 exec @hr = sp_OAGetProperty @objmatch, @command, @objSubmatches OUTPUT -- << THIS IS THE NEW LINE
if @hr = 0 exec @hr = sp_OAGetProperty @objSubmatches, 'count', @submatchCount OUTPUT

 

Phil Factor
Searching for Strings in SQL Server Databases

Sometimes, you just want to do a search in a SQL Server database as if you were using a search engine like Google.... Read more...

 View the blog

Top Rated

Continuous Delivery and the Database
 Continuous Delivery is fairly generally understood to be an effective way of tackling the problems of... Read more...

The SQL Server Sqlio Utility
 If, before deployment, you need to push the limits of your disk subsystem in order to determine whether... Read more...

The PoSh DBA - Reading and Filtering Errors
 DBAs regularly need to keep an eye on the error logs of all their SQL Servers, and the event logs of... Read more...

MySQL Compare: The Manual That Time Forgot, Part 1
 Although SQL Compare, for SQL Server, is one of Red Gate's best-known products, there are also 'sister'... Read more...

Highway to Database Recovery
 Discover the best backup and recovery articles on Simple-Talk, all in one place. Read more...

Most Viewed

Beginning SQL Server 2005 Reporting Services Part 1
 Steve Joubert begins an in-depth tour of SQL Server 2005 Reporting Services with a step-by-step guide... Read more...

Ten Common Database Design Mistakes
 If database design is done right, then the development, deployment and subsequent performance in... Read more...

SQL Server Index Basics
 Given the fundamental importance of indexes in databases, it always comes as a surprise how often the... Read more...

Reading and Writing Files in SQL Server using T-SQL
 SQL Server provides several "standard" techniques by which to read and write to files but, just... Read more...

Concatenating Row Values in Transact-SQL
 It is an interesting problem in Transact SQL, for which there are a number of solutions and... Read more...

Why Join

Over 400,000 Microsoft professionals subscribe to the Simple-Talk technical journal. Join today, it's fast, simple, free and secure.