27 November 2007

TSQL Regular Expression Workbench

Robyn and Phil start by writing a gentle introduction to using Regular expressions for validation, data cleaning and data import in TSQL, and finally end up with a routine for doing google-style searches that show the context of hits. It's all done in the spirit of 'try it and see...'

This Workbench is about using Regular expressions with SQL Server via TSQL. It doesn’t even attempt to teach how regular expressions work or how to pull them together. There are plenty of such resources on the Web. The aim is to demonstrate a few possibilities and try to persuade you to experiment with them if you don’t already use Regex with SQL Server.

We suggest that, if you are an ordinary mortal like Phil or I, without special powers, you should use an application such as RegexBuddy to form, edit and interpret Regular expressions. It makes learning them a lot easier. In order that people with access only to SQL Server 2000 can use the workbench, we’ll use OLE in the examples, but they are readily adapted to CLR.

As always, the source code is available to download at the bottom of the article.

Contents

  1. Introduction
  2. The OLE Functions
    1. The OLE Regex Match function
    2. The OLE Regex Replace function
    3. The OLE Regex Find (Execute) function
  3. Combining two Regexes
  4. OLE Regex Performance

Regular Expressions can be very useful to the Database programmer, particularly for data validation, data feeds and data transformations. A lot of the time, tools such as grep and awk or Funduc’s S&R will be the most suitable way of using regular expressions, but just occasionally, it is handy to be able to use them in TSQL as we’ll try to show.

Regular Expressions are not regular in the sense that there is any common dialect of expression that is understood by all Regex engines. On the contrary, regular expresssions aren’t always portable and there are many common, similar but incompatible, dialects in use, such as Perl 5.8, Java.util.regex, .NET, PHP, Python, Ruby, ECMA Javascript, PCRE, Apache, vi, Shell tools TCL ARE, POSIX BRE, Funduc and JGsoft.

Regular Expressions were never developed to be easy to understand. They are a condensed shorthand that, on preliminary inspection, looks as if someone has repeatedly sat on the keyboard. Even when interpreted, the logic isn’t always easy to follow.

Probably the best tutorial on the web for Regular Expressions is on www.regular-expressions.info but it is also worth reading Implementing Real-World Data Input Validation using Regular Expressions by Francis Norton for an introduction to regular expressions.

A great deal can be done using commandline applications that work with regular expressions such as GREP and AWK. However, there are times where it is handy to use Regex directly from TSQL. There are two Regex engines available to SQL Server. These are:

  • The .NET Regex which is in the system.text.regularexpression module
  • The ECMA Regex from VBScript.RegExp which is distributed with the IE browser and is used by Javascript and JScript.

Both of these are excellent standard implementations. Both work well in TSQL.

The .NET Regex requires the creation of CLR functions to provide regular expressions, and works only with SQL Server 2005, (and 2007) See CLR Integration by Christoffer Hedgate.

The ECMA Regex can be used via VBScript.RegExp, which are available to SQL Server 2000 as well. The regex is compatible with Javascript.

The advantage of using CLR is that the regular expressions of the NET framework are very good, and performance is excellent. However, the techniques are well-known, whereas some of the more powerful uses of VBScript.RegExp have hardly ever been published, so this workbench will concentrate on the latter

The OLE functions

There are various properties to consider in these functions:

IgnoreCase
By default, the regular expression is case sensitive. In the following functions, we have set the IgnoreCase property to True to make it case insensitive.
Multiline property
The caret and dollar only match at the very start and very end of the subject string by default. If your subject string consists of multiple lines separated by line breaks, you can make the caret and dollar match at the start and the end of those lines by setting the Multiline property to True. (there is no option to make the dot match line break characters).
Global property
If you want the RegExp object to return or replace all matches instead of just the first one, set the Global property to True.

Only the IgnoreCase is relevant in the first function but we’ve ‘hardcoded’ it to 1 as case-sensitive searches are a minority interest.

The OLE Regex Match function

Let’s start off with something simple, a function for testing a string against a regular expression:

The RegexMatch returns True or False, indicating if the regular expression matches (part of) the string. (It returns null if there is an error).

When using this for validating user input, you’ll normally want to check if the entire string matches the regular expression. To do so, put a caret at the start of the regex, and a dollar at the end, to anchor the regex at the start and end of the subject string.

Now, with this routine, we can do some complex input validation.

With this function, the passing back of errors is rudimentary. If an OLE error occurs, then a null is passed back.

There are two other basic Regex functions available. With them, you can use regular expressions in all sorts of places in TSQL without having to get to direct grips with the rather awkward OLE interface.

The OLE Regex Replace function

The RegexReplace function takes three string parameters. The pattern (the regular expression) the replacement expression, and the subject string to do the manipulation to.

The replacement expression is one that can cause difficulties. You can specify an empty string '' as the @replacement text. This will cause the Replace method to return the subject string with all regex matches deleted from it (see “strip all HTML elements out of a string” below).

To re-insert the regex match as part of the replacement, include $& in the replacement text. (see “find a #comment and add a TSQL –” below).

If the regexp contains capturing parentheses, you can use backreferences in the replacement text. $1 in the replacement text inserts the text matched by the first capturing group, $2 the second, etc. up to $9. (e.g. see import delimited text into a database below) To include a literal dollar sign in the replacements, put two consecutive dollar signs in the string you pass to the Replace method.

The OLE Regex Find (Execute) function

This is the most powerful function for doing complex finding and replacing of text. As it passes back detailed records of the hits, including the location and the backreferences, it allows for complex manipulations.

This is written as a table function. The Regex Routine actually passes back a collection for each ‘hit’. In the relational world, you’d normally represent this in two tables, so we’ve returned a left outer join of the two logical tables so as to pass back all the information. This seems to cater for all the uses we can think of. We also append an error column, which should be blank!

Combining two Regexes

Once you’ve experimented with the regex calls we’ve provided, you’ll realise that you can create some really cool functions and procedures that combine regexes. Here we have a procedure that does a ‘google-style’ search on text to find the words you specify. It returns the ‘context’ in that it quotes the substring where the match occurred. You can specify how close the words need to be to specify a ‘hit’.

OLE Regex performance

Whereas the use of the OLE VBScript.RegExp to scan large chunks of text is fine, it is good for complex validation, and it makes a great testbed for regexes, These OLE functions are too slow for use in queries. The overhead of making the calls is just too high because the performance of OLE in TSQL is not great. See Zach Nichter’s excellent article on the subject ‘Writing to a File Using the sp_OACreate Stored Procedure and OSQL’.

Here is an example, scanning a databases of nearly 50,000 names of public houses from out XML Jumpstart Workbench.

It is no consolation for those who are stuck with SQL Server 2000, but the CLR functions are a lot quicker for this sort of usage.

We’ve used a range of regex patterns from a number of sources in this workbench. Like a lot of programmers, we collect up snippets we come across, almost always forgetting to record the original author. We therefore apologise in advance for not crediting the source and original author, of regex patterns. As you can guess, they often take a long time and effort to develop. If you spot a regex which we should have cited, please add a comment and let us all know who originally wrote it!

Keep up to date with Simple-Talk

For more articles like this delivered fortnightly, sign up to the Simple-Talk newsletter

Downloads

This post has been viewed 160690 times – thanks for reading.

Tags: , , , ,

  • Rate
    [Total: 74    Average: 4.6/5]
  • Share

Robyn Page and Phil Factor

Phil Factor (real name withheld to protect the guilty), aka Database Mole, has 20 years of experience with database-intensive applications. Despite having once been shouted at by a furious Bill Gates at an exhibition in the early 1980s, he has remained resolutely anonymous throughout his career.

View all articles by Robyn Page and Phil Factor

  • Anonymous

    Excelent stuff!
    Perhaps You can considder making a similar one using CRL and .Net .-)

  • Phil Factor

    re: Excelent stuff!
    Thanks for the compliment.

    If you mean CLR, we started out doing it with CLR but came to realise it had been done so many times already, and we like to come up with something new with these Workbenches. Also, the SQL Server 2000 crew appreciate it if they can join in too!
    The examples should work with CLR pretty closely though the EXECUTE call (which has a lot of uses in TSQL) will take a bit of thought! Can anyone provide a CLR Version of RegexFind?

  • Andy Clark

    SP_OACREATE turned off in SQL2005
    SQL 2005 users will need to enable Ole Automation Procedures using the Surface Area Configuration or sp_configure.

  • Bob Carretta

    Error on line 27 in T-SQL 2000
    Robyn, I get an error:
    Line 27: Incorrect syntax near ‘;’.

    which is the line:
    IF @hr <> 0

    I have updated the varaiable @matchstring varchar(8000)as noted in the code.

    Am Ioverlooking something else?

  • Robyn Page

    re: Error on line 27 in T-SQL 2000
    Hmm. We don’t get the same error. We’ve put an alternative SQL Server 200 source in the speech-bubble at the top of the page just in case anyone else hits problems.

  • Bob Carretta

    Error on line 27 in T-SQL 2000
    Robyn,
    I did indeed overlook the script in the bubble. The script works fine now.
    Thanks

  • Phil Factor

    Catastrophic backtracking
    Make sure you try stuff out comprehensively on a development server. A problem with a lot of Regex implementations is a mistake called ‘Catastrophic Backtracking’. It seems incredible that one can cause applications to go into endless loops (the .net regex) or stack overflows (perl) just by making a mistake with a pattern but this is the case in the wacky world of Regex.

    Catastrophic Backtracking, like much in Regular Expressions, takes a lot of explaining so I won’t even attempt it here. see http://www.regular-expressions.info/catastrophic.html When it happens, the spid can’t be killed, and the server slows to a crawl. The only way out I know of is to stop and restart the server.

  • Jeff Moden

    VBScript.RegExp not there…
    VBScript.RegExp isn’t anywhere to be found on my harddisk… and, I use IE7 (upgraded from 6 upgraded from 5, etc). I have both SQL Server 2000 and 2005 loaded, as well.

  • Phil Factor

    re: VBScript.RegExp not there…
    Jeff,

    The RegExp object was introduced in WSH 2 which is normally installed with the OS, (since NT) but was installed with IE5 onwards. A simple way of checking to see if it is installed is to see what icons are associated with a .js file, a .vbs file, and a .wsf file on your server.
    If you hit any problems, you can install the latest version from Microsoft. This is on MSDN: Download Centre-Windows Scripting.

  • Eider Mauricio Aristizabal

    Nice
    Very nice!

  • Giancarlo Delgado

    Very Nice
    Years ago when i was looking for a sql2000 solution to regex, i could only find dlls to buy. It wasnt till the CLR that i started using regex at the database level. This article should have been written a very long time ago, but will still serve its use today i bet. Thanks guys.

    http://www.RezoneSolutions.com

  • Anonymous

    TSQL Regular Expression Workbench
    I have used regular expression in font-end programs so many times but I have not used it in TSQL before. This article is very good and useful.

  • Anonymous

    Nice if on SQL 2000
    This would be good for SQL Server 2000 and earlier, but I believe it would be much simpler to create CLR UDFs to do this exact thing.

  • Robyn Page

    Re: Nice if on SQL 2000
    Sure, but we didn’t want to leave out the SQL Server 2000 users (see our comments about this in the article). The article is about using Regex, not OLE.

    I haven’t come across a CLR solution that gives all the information that comes from RegexFind, so it would be nice if you can send one that we can publish.

  • Amit Tiwari

    Always the best Stuff
    You always put up the best stuff. Your knowledge is incredible.
    Cheers

  • Maxkil

    Sweetness
    EXACTLY what I was looking for!
    I scoured the web trying to find the right syntax to use Regular Expression’s Replace method with sp_OACreate and sp_OAMethod. I know it was possible. I just didn’t know the right snytax. Thanks!

  • Robyn Page

    Re: Sweetness
    Bless you.
    Let us know of any other subjects that need the ‘workbench’ treatment and Grant, Adam, Phil or I will do our best.

  • Rowland Gosling

    We don’t need no stinkin’ CLR!
    I like my rum, tea, coffee and SQL straight. I became troubled by the awkwardness of such a solution too. I also started down the CLR path–excited as a school boy I at last found a real use for the CLR– only to realize it wasn’t very portable: my current client is on MSSQL2K and MSSQL2K5.

    Loved your solution–it’s verra nice
    </Steve Martin>

    Rowland

  • William Meitzen

    Befuzzled
    My SQL Server 2000 regex UDF worked, but the same UDF does not. The one posted here does not, nor do the other 5 I’ve tried. OLE automation is turned on.

    SELECT dbo.RegexMatch(‘b(w+)s+1b’,’this has has been repeated’)– returns NULL
    SELECT dbo.RegexMatch(‘b(w+)s+1b’,’this has not been repeated’)– returns NULL

    My SQL Server 2005 database login is tied to the sysadmin role, too.

  • Anonymous

    count of the select statements in a stored procedure
    i have 500 sps and some of them have select statements in it…
    i want to take the count of this select statements excluding the commented ones..
    is there any query for that?..
    or hoe can i get the count of the select statements in sps in my database

    is there any regular expression to do this..
    i am using mssql 2005 and 2000.
    thanks in advance

  • Nick

    WOW – thank you!
    You just saved me my job. This function worked on the second try. My sql user needed dbo rights to the master table, but after that it worked fantastic.

    I really appreciate you time on this workbench item!

  • RJSO

    RegEx results limit
    Hi all, i created a function that returns the number of matches within a string.

    For example:

    Select Name, dbo.ufn_RegexObjCount(‘bw+?b’,[Name]) from #Table1

    Name Count
    ————————————-
    1 ‘test 1 or 2 or 3’ 7
    2 ‘test 1 or 2 or 3’ 7
    3 ‘test 1 or 2 or 3’ 7

    It returns the number of words for every field on Name. The problem occurs when i have more then 128 registries in the Name field.

    Name Count
    ————————————-
    1 ‘test 1 or 2 or 3’ 7
    2 ‘test 1 or 2 or 3’ 7
    3 ‘test 1 or 2 or 3’ 7

    129 ‘test 1 or 2 or 3’ NULL
    130 ‘test 1 or 2 or 3’ NULL
    131 ‘test 1 or 2 or 3’ NULL

    You can see my code bellow:

    set ANSI_NULLS ON
    set QUOTED_IDENTIFIER ON
    GO
    ALTER FUNCTION [dbo].[ufn_regexObjCount]
    (
    @pattern nvarchar(4000),
    @source nvarchar(4000)
    )
    RETURNS INT AS
    BEGIN
    DECLARE @hr INT
    DECLARE @objMatch INT
    DECLARE @matchcount INT
    DECLARE @objRegExp INT

    EXEC @hr = sp_OACreate ‘VBScript.RegExp’, @objRegExp OUT
    EXEC @hr = sp_OASetProperty @objRegExp, ‘Pattern’, @pattern
    EXEC @hr= sp_OASetProperty @objRegExp, ‘MultiLine’, 1
    EXEC @hr= sp_OASetProperty @objRegExp, ‘IgnoreCase’, 1
    EXEC @hr = sp_OASetProperty @objRegExp, ‘Global’, 1
    EXEC @hr = sp_OAMethod @objRegExp, ‘execute’, @objMatch OUT, @source
    EXEC @hr = sp_OAGetProperty @objMatch, ‘count’, @matchcount OUT

    IF @hr <> 0
    BEGIN
    SET @matchcount = NULL
    END
    RETURN @matchcount
    END

  • tengtium

    regular expression problem
    good day sirs and madam,

    i badly needed help.. i have this regular expression

    ^(?!.*–)[A-Za-zd-]+$

    it accepts alphanumeric character optionally a dash (single dash only, not consecutive dash) for exampl:

    it accepts:
    12a-3c-4f3fg
    12ertgg2
    1-2-3-3-4-3
    dffgsfg
    d-f-f-g-s-f-g

    it does not accept:
    12-3c-4f&3fg
    12e%rt-gg2
    d–f-f-g-s-f-g
    1-2-3-3-4–3

    now i have problem, i cant find any in the .net that accepts the above valid expression with space-a non consecutive space. meaning the it needs to accept alphanumeric with optionally a non-consecutive dash and optionally a non-consecutive space.

    can someone help please.

  • rajib.bahar

    thanks
    thanks for the quick refresher…

    the bio forgot to include…

    “Robyn Page played Katie Williams, Ben’s sister, barmaid and man-eater” — pulled from wikipedia

  • Jeffrey

    the multline does not work
    i used this sentence "EXEC @hr= sp_OASetProperty @objRegExp, ‘MultiLine’, 1" to set MultiLine work
    but i found that doesn’t work

    can someone help please.

  • Jeffrey

    the multline does not work PS
    i used this sentence "EXEC @hr= sp_OASetProperty @objRegExp, ‘MultiLine’, 1" to set MultiLine work
    but i found that doesn’t work

    i can use regexreplace to delete all the "rn" but it isn’t an ideal solution
    can someone help please.

  • mat41

    Error trying to replace HTML tags
    Hi there. Wow this works amazing, thank you for your fine work!!

    Locally I created the function and then ran this to replace all the HTML tags in a field:

    SELECT dbo.RegexReplace(‘<(?:[^>”"]*|([”"]).*?1)*>’,
    ”,context,1,1) from hazHRA

    But in the live environment I an getting this error:

    The EXECUTE permission was denied on the object ‘sp_OACreate’, database ‘mssqlsystemresource’, schema ‘sys’.

    I am a DBO type user, is this the issue? Is there an instance level setting I need to ask to be changed? SQLServer 2008R2

    Thanks

  • rocketpanda

    Zip code
    Somewhat unrelated, but I thought it was funny that your example zip code (02115-4653) is about a mile from my apartment.

  • LenBinns

    Bug in identifying submatches in regExFind
    BTW, THANK YOU, this was a great article!

    Couple minor bugs in regExFind:

    1) Typo: you have "SELLECT" instead of "SELECT"
    2) Submatches don’t work because you’re missing a line of code.

    Need to add the following line of code:
    exec @hr = sp_OAGetProperty @objmatch, @command, @objSubmatches OUTPUT

    Current Code:
    — The SubMatches property of the Match object is a collection of strings.
    — It will only hold values if your regular expression has capturing groups.
    — The collection will hold one string for each capturing group.
    — The Count property (returned as SubmatchCount) indicates the number of string in the collection.
    — The Item property takes an index parameter, and returns the text matched by the capturing group.

    if @hr = 0 select @strErrorMessage = ‘Getting the SubMatches collection’
    if @hr = 0 select @command = ‘item(‘ + cast(@ii as varchar) + ‘).SubMatches’
    if @hr = 0 select @strErrorMessage = ‘Getting the number of submatches’
    if @hr = 0 exec @hr = sp_OAGetProperty @objSubmatches, ‘count’, @submatchCount OUTPUT

    Revised Code:
    — The SubMatches property of the Match object is a collection of strings.
    — It will only hold values if your regular expression has capturing groups.
    — The collection will hold one string for each capturing group.
    — The Count property (returned as SubmatchCount) indicates the number of string in the collection.
    — The Item property takes an index parameter, and returns the text matched by the capturing group.

    if @hr = 0 select @strErrorMessage = ‘Getting the SubMatches collection’
    if @hr = 0 select @command = ‘item(‘ + cast(@ii as varchar) + ‘).SubMatches’
    if @hr = 0 select @strErrorMessage = ‘Getting the number of submatches’
    if @hr = 0 exec @hr = sp_OAGetProperty @objmatch, @command, @objSubmatches OUTPUT — << THIS IS THE NEW LINE
    if @hr = 0 exec @hr = sp_OAGetProperty @objSubmatches, ‘count’, @submatchCount OUTPUT

  • della

    function calling
    Please show an example for the code to call the function dbo.RegexFind from a stored procedure with parameters ..