CLR Assembly RegEx Functions for SQL Server by Example

Phil Factor presents a simple CLR Assembly to allow SQL Server users to access all the powerful RegEx library methods in .NET. In the spirit of 'get something up and running', he takes a hands-on approach and demonstrates that you needn't be a C# hotshot or Visual Studio expert to develop a CLR assembly for SQL Server

Writing CLR-based functions and procedures for SQL Server isn’t necessarily a complex task. It doesn’t even need Visual Studio. It is possible to get things up and running very quickly.  For this example, I’ve chosen to revisit the hoary old example of the RegEx package. There is nothing heroic about doing so since it is a tiny smear of code that merely accesses the .NET Framework regular expression engine.  I decided on the CLR RegEx package because I’ve never found a complete example: The sample code that is readily available generally leaves something important out.

I decided to use VB.NET, so as to provide something fresh, and thought it might be interesting to do it without using Visual Studio at all, just to prove it could be done with just SQLExpress.  For the SQL programmer, there is no need to even look at the source code as it can be installed merely from the CREATE ASSEMBLY script which I’ve included. The assembly-creation script can be executed in the same way as any SQL.

The CLR assembly

The Visual Basic Source is here. It was too long to place in the article itself. You don’t need it if you just to try out the package, only if you want to modify or extend it.

This assembly gives you a reasonably full palette of RegEx functions within SQL Server such as…

RegExReplace 
replaces the substring matching the expression with a result string that can include the matches. This works the same way as the SQL Server REPLACE function, except it takes a full RegEx instead of the LIKE expression, and allows substitution of the matches in the replacement string using $0, $1…$n.
RegExReplaceX 
like RegExReplace, except that the parameters are in the more normal order of  pattern, input, replacement, options. The options parameter allows the RegEx to be used with a number of different options
RegExSplit
split a string into a table basing each row-delimiter on the substring that matches the expression, returns the table.
RegExMatch
returns the first substring that matches the expression
RegExIsMatch
returns non-zero if there is a match of the RegEx within the string
RegExMatches  
returns a table of all the substring that match the expression within the string
RegExEscape
returns the input string with all characters that could be interpreted as part of a RegEx expression ‘escaped’. This allows strings to be inserted into RegExes
RegExIndex
returns the index into the string of any substring that matches the RegEx. It works the same way as PatIndex
RegExOptionEnumeration
returns a bitmap of the options you select that controls the way that the RegEx is used.

Although this looks like a comprehensive list, it isn’t complete. The NET RegEx package allows you to specify the maximum number of matches or splits to be returned. I didn’t include it because it seemed less useful, and it would be a simple feature to add. I also avoided returning capturing groups, to RegExMatch and RegExMatches. This would have meant returning a de-normalized table. There are uses for it but the source of such a CLR routine it is already in the SQL Server  sample code if you need it.

SQL Server Assemblies can be written in any language that can produce a .NET assembly, though C# and VB .net are most often used.  They are loaded into SQL Server but are not directly visible to the database developer. Once they are in SQL Server, they can be easily distributed, saved, and loaded into other servers using TSQL.  Once you have them installed in SQL Server, you can get to the source very easily using .NET Reflector with an add-in, so it becomes very easy to modify them. You have no need for Visual Studio either. You can merely use the tools that are already provided within the .NET Framework to create and alter the code for these assemblies.

Creating and using a simple CLR in SQL Server

We’ll create a simple RegEx CLR function without Visual Studio and SSMS just to prove that it isn’t at all complex. Let’s take a very simple example first. We then have something we can expand on once we have it working. Here is a simple, but  useful, implementation of the  RegEx.IsMatch function from the .NET Regular Expression (RegEx) class. First, a bit of Visual Basic.

This just used the IsMatch method of the RegEx . It indicates whether the regular expression finds a match in the input string, using the regular expression specified in the pattern parameter and the matching options supplied in the options parameter. The real script is much longer and can be downloaded at the bottom of the article.

We simply create an assembly from the command line (just put it into your programmer’s  editor as a ‘tool’-  I did all the development work for this project in EditPad Pro.)

Where “%FILE%” is the full path and name of the file, and %PATH_TO_FRAMEWORK%\ is the path to the framework you are using.

Then a simple bit of SQL will have you up and running (this can be run by SQLCMD) Remember to substitute the file and path of the assembly for %FILE%

Running the full project

So now we are ready for the full project. You can either compile the DLL from the source code, or you can execute the CREATE ASSEMBLY from the build script if you are just interested in taking the code for a trip around the block.

To compile the full project, download the VB source and compile it using the following command in the command-line

Where “%FILE%” is the full path and name of the file, and %PATH_TO_FRAMEWORK%\ is the path to the framework you are using.

Then, from SSMS or SQLCMD run the …
        InstallRegExSample.sql file,
…which can be downloaded from the speech bubble

If you want to avoid compiling the DLL from the VB source, or you want to run the CLR assembly on a remote computer, run the build script,( which can be downloaded from the speech bubble)

followed by, from SSMS or SQLCMD, the …
        InstallRegExAssembly.sql file,
…which can be downloaded from the speech bubble

You can view the two files, the  InstallRegExSample.sql  and  InstallRegExAssembly.sql files. Amongst other things, they will tell you what parameters are passed, and in what order, for every SQLCLR function.

A Trip Around the Block.

The RegEx Options

RegEx functions can work in different ways by setting the option bitmap which is passed to the various functions. The most important of these is the flag that tells the RegEx to do case-insensitive searches. This project has a special function that translates these options into a bitmap. An example of its use was given in the ‘Trip around the block’, above

And the function that implements it  it is..

These flags, in order, are:

IgnoreCase
By default, the regular expression is case sensitive. In the examples we used, we set it to ignore case

The Multiline property
The caret and dollar only match at the very start and very end of the subject string by default. If your subject string consists of multiple lines separated by line breaks, you can make the caret and dollar match at the start and the end of those lines by setting the Multiline property to True. (there is no option to make the dot match line break characters).

Explicit Capture
Specifies that the only valid captures are explicitly named or numbered groups of the form (?<name>). This allows unnamed parentheses to act as noncapturing groups without the syntactic clumsiness of the expression (?:).

Compiled
Specifies that the regular expression is compiled to an assembly. This yields faster execution but increases startup time.

Singleline
Specifies single-line mode. Changes the meaning of the dot (.) so it matches every character (instead of every character except\n).

IgnorePatternWhitespace
Eliminates unescaped white space from the pattern and enables comments marked with #.
RightToLeft
Specifies that the search will be from right to left instead of from left to right
ECMAScript
Enables ECMAScript-compliant behaviour for the expression. This flag can be used only in conjunction with the IgnoreCase, Multiline, and Compiled flags. The use of this flag with any other flags results in an exception.
CultureInvariant
Specifies that cultural differences in language is ignored

 

So what next?

I’m hoping that this project can be appreciated on several levels. Firstly, it gives what I hope is a working implementation of a SQLCLR RegEx library of functions that is immediately useful. Secondly, I want to show how easy it is to develop  CLR routines in your favourite .NET language. Thirdly, it would be nice to think that it could be developed and extended to  support more complex RegEx operations, such as those involving matching groups .

If you are going to make extensive use of the package, then you’ll find everything you need to know in the

I’d be fascinated to hear of any improvements or bugfixes you make.

Downloads

Tags: , , , , ,

  • 99113 views

  • Rate
    [Total: 4    Average: 4.5/5]
  • Anonymous

    Tres Chic
    Very nice, and a useful CLR example as well! 🙂

  • CJ Randolph

    An idea whose time has come
    I have created a library of RegEx functions in C# that I use in SQL server. I find more and more uses for it everyday.

    I see mostly SELECT statements in your article but one of the coolest uses for regex through the CLR is in CHECK CONSTRAINTS. Have a field that is meant to hold company requisition form names that follow a naming convention like ABC-2009-X[Y,Z]? Write it up as a regex and then set the constraint:

    CREATE TABLE [RequisitionForms]
    (
    [FormName] varchar(10) CHECK (dbo.RegExMatch(
    [FormName], ‘^[A-Z]{3}-d{4}-[X-Z]{1,3}$’ ) = 1)
    )

    You can also do some very cool things in UNIONS, INTERSECT, EXCEPT and recursive CTE’s.

    Another adjacent tip: Since you are working in a database you can create a table to hold your commonly used expressions then have a function to call them based on a constant name.

    SELECT [ColumName]
    FROM [TableName]
    WHERE dbo.RegExMatch([ColumnName], dbo.RegExGetPattern(‘FREQUENTLY_USED_PATTERN_NAME’)) = 1

    That way as your expressions mature or you find a better one out on the web somehwere you can just update your regex table with the new expression. For instance, how many different regex patterns are there for email addresses? What if your pattern changes?

    Also note that in .NET if your expression causes catastrophic backtracking, the server will lock up and there is nothing to do except reset the entire server. So be careful letting inexperienced devs send regexes to your database with abandon.

    But honestly, this is an idea whose time has come. If you do a search on Regular Expressions in SQL Server you will see high demand for the feature, and some crazy implementations, going back ten years or more. And there are dozens or more implementations of regex and string manipulation through the CLR out there now. A standard library/API from Microsoft, like their xQuery implementation, would be a welcome sight to many developers and even more so to DBA’s nervous about developers putting assemblies in their databases.

  • Phil Factor

    Re: An idea whose time has come
    I’d be happy for this to move to Codeplex if there was enough interest in it. A couple of people have suggested this today. With NET Reflector, it is the work of a moment to create a C# version or anything else you fancy, and I somehow felt it would be cool to have the source in two different languages!
    Many thanks for all the suggestions. I agree entirely. The main thing I’d like to add is that for an old duffer like me, RegexBuddy and Editpad Pro are a godsend. RegexBuddy will even lay out your regEx expressions fully commented (there wasn’t enough space in the article to show this!)
    I’ve already published a OLE version of the Regex package, using the COM regex component. It would be nice to have the two versions compatible in syntax so one could use either depending on company policy.

  • tengtium

    regular expression problem
    good day sirs and madam,

    i badly needed help.. i have this regular expression

    ^(?!.*–)[A-Za-zd-]+$

    it accepts alphanumeric character optionally a dash (single dash only, not consecutive dash) for exampl:

    it accepts:
    12a-3c-4f3fg
    12ertgg2
    1-2-3-3-4-3
    dffgsfg
    d-f-f-g-s-f-g

    it does not accept:
    12-3c-4f&3fg
    12e%rt-gg2
    d–f-f-g-s-f-g
    1-2-3-3-4–3

    now i have problem, i cant find any in the .net that accepts the above valid expression with space-a non consecutive space. meaning the it needs to accept alphanumeric with optionally a non-consecutive dash and optionally a non-consecutive space.

    can someone help please.

  • Bob Boffin

    Very Minor Fixes
    This is an excellent solution to providing Regular Expression functions within SQL and has the merit of using Microsoft’s own .NET RegEx functionality.

    I have a couple of tiny comments to make to just tidy up the code slightly.

    Functions RegExReplace and RegExReplaceX are inconsistent as they have the pattern and input parameters in a different order from each other. My preference is for the pattern to appear first so I have changed them around in the VB source for RegExReplace but you may prefer to change RegExReplaceX.

    In the same functions the name of the replacement parameter is @Repacement not @Replacement as one would expect. This can be fixed by changing the names of the parameters in the installRegexSample.sql file which also needs the change in parameter order to be reflected in the RegExReplace or RegExReplaceX definition as required.

    Sorry for this rather picky comment but it’s not intended to be negative but rather in the spirit of making the code more consistent and even better to use.

  • Larry Smith

    Update for SQL 2008R2
    Thanks so much for your excellent library. I’ve put it to great use. I’d like to return the favor in a small way… deploying to a SQL 2008R2 server causes complaints for the two table valued functions (RegExMatches and RegExSplit). It wants the TableDefinition property of the SqlFunction attribute to be filled out. So with these revised headers, they work:

    <SqlFunction(DataAccess:=DataAccessKind.None, _
    IsDeterministic:=True, _
    IsPrecise:=True, _
    Name:=”RegExMatches”, _
    SystemDataAccess:=SystemDataAccessKind.None, _
    FillRowMethodName:=”NextMatchedRow”, _
    TableDefinition:=”match nvarchar(4000), matchIndex int, matchLength int”)> _
    Public Shared Function RegExMatches(ByVal pattern As SqlString, _
    ByVal input As SqlString, _
    ByVal Options As SqlInt32) _
    As IEnumerable

    ‘————

    <SqlFunction(DataAccess:=DataAccessKind.None, IsDeterministic:=True, _
    IsPrecise:=True, Name:=”RegExSplit”, _
    SystemDataAccess:=SystemDataAccessKind.None, _
    FillRowMethodName:=”NextSplitRow”, _
    TableDefinition:=”match nvarchar(4000)”)> _
    Public Shared Function RegExSplit( _
    ByVal pattern As SqlString, _
    ByVal input As SqlString, _
    ByVal Options As SqlInt32) _
    As IEnumerable

  • anto.justin

    Could not find Type ‘SimpleTalk.Phil.Factor.RegularExpressionFunctions’ error
    Dear Phil,

    Thank you for the wonderful information. I did try to download the code and execute the examples as you said.

    The assembly is getting registered. However while trying to create the SQL functions, I am getting the following error:

    Could not find Type ‘SimpleTalk.Phil.Factor.RegularExpressionFunctions’

    Is there anything I am doing wrong. It would be great if you could let me know…

    Regards

    Anto

  • anto.justin

    sorted out the problem
    Hi,

    Found the solution… was missing the parent root namespace in the SQL functions…

    Thanks

    Anto

  • niikola

    Bug or feature?
    RegExIndex returns index of first match where position of 1st character in string is 0, 2nd is 1, etc. (patindex returns 1 for 1st char, 2 for 2nd). Both functions return 0 for "not found".

    So RegExIndex will return 0 in 2 cases: "not found" and "found on 1st position".

    Print dbo.RegExIndex(‘a’, ‘abc’, 1) –> 0
    Print patindex(‘%a%’, ‘abc’) –> 1

    Print dbo.RegExIndex(‘b’, ‘abc’, 1) –> 1
    Print patindex(‘%b%’, ‘abc’) –> 2

    Print dbo.RegExIndex(‘x’, ‘abc’, 1) –> 0
    Print patindex(‘%x%’, ‘abc’) –> 0

  • liamcaffrey

    sql-server RegexIsMatch fails on case insensitive unicode match
    I am sanity checking addresses for higher unicode characters that I do not expect to find. I am using the following

    with
    regex_test as
    (select ‘919 LAKESHORE DRIVE’ as sample_address)
    select * from regex_test
    where master.dbo.RegExIsMatch(‘u0130’, isnull(sample_address, ”), master.dbo.RegExOptionEnumeration(1,0,0,0,1,0,0,0,0)) = 1;

    This is clearly wrong as u0130 is a capital I with a dot on top (i.e. ‘İ’)

    I run the same query again but now switching off the IgnoreCase option (i.e. first parameter in RegexOptionEnumeration)

    with
    regex_test as
    (select ‘919 LAKESHORE DRIVE’ as sample_address)
    select * from regex_test
    where master.dbo.RegExIsMatch(‘u0130’, isnull(sample_address, ”), master.dbo.RegExOptionEnumeration(0,0,0,0,1,0,0,0,0)) = 1;

    Now I get the expected result!

    The result makes no difference whether I specify a unicode string with the N string prefix and/or specify various collations with or without case sensitivity.

    I am running Microsoft SQL Server 2012 – 11.0.2100.60 (X64) Feb 10 2012 19:39:15 Copyright (c) Microsoft Corporation Standard Edition (64-bit) on Windows NT 6.1 <X64> (Build 7601: Service Pack 1) (Hypervisor)

    Server and database collation is Latin1_General_CI_AS

    Using this to compile….
    C:WindowsMicrosoft.NETFramework64v4.0.30319vbc.exe to compile

    This is strange behaviour. Anybody have any ideas?

    Liam

  • superJohn

    License Details
    I want to know about the license of your code, can tell me the details or give me your email so we can talk on this

  • superJohn

    lincese details
    I want to know about the license of your code, can tell me the details or give me your email so we can talk on this

  • jkibble

    RegexIsMatch working properly with Unicode
    I have been testing this code for use for the past 24 hours and I was bothered by the comment liamcaffrey posted.

    After further testing, I now understand this issue and I wanted to put anyone concerned about that issue at ease.

    with
    regex_test as
    (select ‘919 LAKESHORE DRIVE’ as sample_address)
    select * from regex_test
    where master.dbo.RegExIsMatch(‘u0130’, isnull(sample_address, ”), master.dbo.RegExOptionEnumeration(1,0,0,0,1,0,0,0,0)) = 1;

    The preceding query returns a match. Liam thought that this was incorrect, but I believe that result to be correct.

    The reason that the match is returned when case insensitivity is turned on, but not when it is turned off, is because the lower case version of ‘u0130’ is the standard lower case ‘i’. This means that the ‘I’ in ‘DRIVE’ does indeed match to the ‘u0130’ when both are converted to lower case. This also explains why they do not match when we do a case sensitive match.

    Hopefully, this clears up that issue and allows more people to use this code without worry.

    Thanks.

  • VinMack

    Is there a C# Version of the Source Code or somewhere I can download as I am a C# person and have never really delved into VB.

    • Alan R.

      Internet is amazing! http://converter.telerik.com/
      Not perfect, but very good start point for proper conversion.

      • vinmack

        Thanks for the help.
        I will try this out.