28 April 2008

SQL String User Function Workbench: part 2

In which Robyn and Phil continue with their popular series on TSQL String User-functions. In this final episode, they pull together the themes from their TSQL String Array Workbench and String User Function workbench, to provide a simple TSQL string-handling package.

The Return of the Killer Tuples

This workbench finishes of what has been a three-part series of string functions. In it, we introduce the idea of using XML to provide a very simple array for doing string handling. This allows us to use functions for searching and splitting strings that will be familiar to users of procedural languages such as PHP and Python.

The first part, TSQL String Array Workbench showed how the basics worked and then demonstrated how it could be used with a PHP-style string function. What inspired us to write this workshop was when Phil had to endure a PHP programmer sounding off about how much better PHPs string handling as than TSQL. He then made the discovery that it was actually possible to pass a string in an XML parameter, detect the fact and convert it to a single-item list, so as to emulate the facility of PHP and Python to pass either lists or single strings.

Of course, this principle could be extended to arrays and matrices. We don’t handle the representation of lists, arrays and matrices in any standard way, as this would be a distraction at this stage, and multi-dimensional lists of arrays aren’t used much for strings.

We then got rather diverted by the Python string functions (now string methods) and so wrote the SQL String User Function Workbench: part 1 that emulated all the python string functions that didn’t have lists as parameters so we didn’t use the XML array mechanism. Unfortunately there were a whole group that did, so here to round things up are the…

SQL String User Functions (from Python) that can use lists

Contents

  1. Split function
  2. SplitLines
  3. Within (not from python)
  4. EndsWith
  5. StartsWith
  6. Contains (not from python)
  7. Join
  8. Parts (not from python)
  9. Partition
  10. RPartition

Split Function

Return an array of the words in the string, using @delimiter as a delimiter. If @maxsplit is given, at most @maxsplit splits are done. (thus, the list will have at most maxsplit+1 elements). If @maxsplit is not specified, then there is no limit on the number of splits (all possible splits are made). Consecutive delimiters are not grouped together and are deemed to delimit empty strings. The sep argument may consist of several characters.

If @Delimiter is not specified or is None, a different splitting algorithm is applied. First, whitespace characters spaces, tabs, newlines, returns, and formfeeds) are stripped from both ends. Then, words are separated by arbitrary length strings of whitespace characters. Consecutive whitespace delimiters are treated as a single delimiter. Splitting an empty string or a string consisting of just whitespace returns an empty list.

P.S. We took this second ‘splitting algorithm’ to mean that a list of the words was required. Our solution is only tested for English and will need fine tuning for other languages. Phil swore ages ago that Hell would freeze over before he ever published yet another string-splitting algorithm. We may have hit on a solution to global warming here.

Now we simply output the temporary table variable as XML using our standard string-array format.

So now we test it out (The real test rig is longer and more boring).

SplitLines string Function

Return a list of the lines in the string, breaking at line boundaries. Line breaks are not included in the resulting list unless keepends is given and true.

p.s. This is such a simple modification to ‘Split’ that you wonder why they bothered.

within string Function

Return non-zero if the string contains the specified substring, otherwise return False. Suffix can also be a list of substrings to look for. With the optional start parameter, the test should begin at that position. With the optional end,the test should stop comparing at that position.

P.S. This isn’t a Python method, but it underpins the Startswith, and EndsWith routines. We add the contains function to hide the wildcard.

EndsWith string Function

Return non-zero if the string ends with the suffix, otherwise return False. The suffix can also be a list of suffixes to look for. With optional start, test string beginning at that position. With optional end, stop comparing string at that position.

StartsWith string Function

Return non-zero if the string starts with the prefix, otherwise return False. Prefix can also be a list of prefixes to look for. With optional start, test string beginning at that position. With optional end, stop comparing string at that position.

Contains string Function

Return non-zero if the string contains the substring, otherwise returns 0. substring can also be a list of substrings to look for. With optional start, test string beginning at that position. With optional end, stop comparing string at that position.

Join string Function

Joins together the given array as a string with the @separator as separator:

Parts string Function

Split the string at the first occurrence of sep, and return an array containing the part before the separator, the separator itself, and the part after the separator. IF the separator is not found, return an array containing the string itself, followed by two empty strings.

P.S. This is not part of the Python suite. It is used to support Partition and RPartition. Again, Phil required calming down before he knuckled down to write this, since he once swore he would never publish another string splitting routine.

Partition string Function

Split the string at the first occurrence of sep, and return an array containing the part before the separator, the separator itself, and the part after the separator. If the separator is not found, return an array containing the string itself, followed by two empty strings.

RPartition string Function

Split the string at the last occurrence of sep, and return an array containing the part before the separator, the separator itself, and the part after the separator. If the separator is not found, return an array containing the string itself, followed by two empty strings.

So, just to summarise:

…and so on, and so forth. The combinations and possibilities are endless. We find that having the functions there will speed development. We always say that, for speed-critical sections of code, we’ll re-code using the built-in functions; but it is surprising how seldom this is actually required.

Please remember that the code is, as ever, available to download below. Just to make things simpler, we’ve added the code for the previous string User-Function workbenches too.

Keep up to date with Simple-Talk

For more articles like this delivered fortnightly, sign up to the Simple-Talk newsletter

Downloads

This post has been viewed 32266 times – thanks for reading.

Tags: , , , ,

  • Rate
    [Total: 21    Average: 4.7/5]
  • Share

Robyn Page and Phil Factor

Phil Factor (real name withheld to protect the guilty), aka Database Mole, has 20 years of experience with database-intensive applications. Despite having once been shouted at by a furious Bill Gates at an exhibition in the early 1980s, he has remained resolutely anonymous throughout his career.

View all articles by Robyn Page and Phil Factor

  • Matthew Mark

    Suggestion
    You might want to belabor what you probably consider the obvious. I’m the SQL Server “guy” here and I work with 5 others who use SQL Server but are mainly C# and C programmers. We use C#.NET in most db apps. That said, I don’t have a lot of time to review articles like this. I believe if you made it a practice of giving a few “real-world” examples of “why” knowing this is useful. I usually stumble on this sort of thing when looking for a specific solution. It appears you’ve certainly done your homework, but maybe the relevance is lost on some of us.

    I am interested but need to “get” the application value.

    Thanks
    mm

  • Jesse

    try CLR?
    I have always been under the impression that string parsing is more efficiently done in a compiled language, not TSQL. For the few cases I’ve tested, that’s been true.

    Seems you might be better off implementing all of these in an assembly as CLR functions. Did you try that?

  • Phil Factor

    re: try CLR
    Hi, Jesse, nice to see you on the site again.

    Yes, where performance is an overriding concern, then there is definitely something to be said for leaving your current development platform (TSQL/SSMS in our case) and using something else. However, performance is generally only a concern in a small part of an application. We’ve used this package in an application and it has caused no performance deficit simply because we haven’t used it in speed-critical sections.

    We think that there is a long leap in logic from saying that because some part of an application is quicker to run in a database, or a compiled procedural language, it should always be done there. Our own particular priority is to produce applications that are neat, reliable, quick to develop, internally consistent, and easy to maintain. Hmm.. I’ve left out something….let me see.. oh yes… and quick!

  • Phil Factor

    Just don’t let those lists get too long
    This is fine for reasonably short lists but don’t let them get too long, say into the hundreds. See http://www.simple-talk.com/community/blogs/philfactor/archive/2012/01/05/105167.aspx