Relational databases aren’t really designed to deal easily with arbitrary sequence, though this is improving with the window functions. Strings and text are sequences. Lists are often sequenced. If you hear people describe an entity such as an invoice in terms of its ordinal sequence ‘the first invoice’ or ‘the fourth invoice’, then you know that it is.
Let’s take a typical ‘sequence’ problem that comes up in science, and occasionally in commerce. We’ll use strings for our example, but they could be words in text, tree-ring measurements, genes, or even crime statistics.
We have two strings, and we want to determine the longest sequence of characters contained the first string that are present in the same order in a second string. Unlike the ‘Longest Common Substring’ problem, we are not specifying that they must be adjacent in either string.
This looks easy, one would think. Not so fast; it is a minefield. If you take away the requirement that the characters need to be adjacent, then the number of alternative solutions become much greater.
Take the classic sample strings ‘thisisatest’ and ‘testing123testing’. If you just choose the first available matches in common you get the wrong solution. Tisi, you might get ‘test’ but there is a longer common substring ‘tsitest’. You can just do the brute-force approach of finding all common subsequences and choosing the best, but believe me it doesn’t scale! The best approach I know of is the dynamic programming solution. (see also the Hirschberg algorithm and the ‘greedy algorithm’) This has been described in great length elsewhere, even on Youtube. There are a number of solutions for different languages, and even in SQL, but I thought I’d do my own, based on a number table.
We’ll leave out the number table build, since I’ve already posted that in a previous blog
IF OBJECT_ID (N'LongestCommonSubsequence') IS NOT NULL
DROP FUNCTION LongestCommonSubsequence
Create FUNCTION LongestCommonSubsequence
The longest common subsequence (LCS) problem is the problem of finding the
longest subsequence common to all sequences in two sequences. It differs
from problems of finding common substrings: unlike substrings, subsequences
are not required to occupy consecutive positions within the original
sequences. For example, the sequences "1234" and "1224533324" have an LCS
Author: Phil Factor
date: 05 Dec 2014
Select dbo.LongestCommonSubsequence ('1234', '1224533324')
Select dbo.LongestCommonSubsequence ('thisisatest', 'testing123testing')
Select dbo.LongestCommonSubsequence ( 'XMJYAUZ', 'MZJAWXU')
Select dbo.LongestCommonSubsequence ( 'beginning-middle-ending',
the longest common subsequence as a string
Declare @Array Varchar(MAX)
Declare @ArrayMax int
Declare @west char(1)
Declare @Lines Varchar(max)
Declare @ii int, @jj int, @iiMax int, @jjMax int, @index int
Select @iiMax=len(@FirstString), --the length of the first string
@jjMax=len(@SecondString), --the length of the second string
@index=@jjMax+1 --where the first new item (matrix has an extra row & column)
Select @ArrayMax=(@iiMax)*(@jjMax)--total length of the array iterated
Select @Array=replicate(char(0), @jjMax+1) --add in the first nought-filled row
@Index=@Index+case when (number-1) % @jjMax = 0 then 2 else 1 end, --current index
@west=case when (number-1) % @jjMax = 0 then Char(0) else substring(@Array,@index-1,1) end,
--as we add an extra column in, this is a dangerous operation otherwise
@Array=@Array --we add a character to the string array, Char(0)--char
+ case when (number-1) % @jjMax = 0 then Char(0) else '' end -- begin new row always zero
+ case --check first to see if there was a match. If so, take north west number incremented
=substring(@SecondString, ((number-1)% @jjMax)+1 ,1)
--remember to set the appropriate collation for this comparison!
then Char(Ascii(substring(@Array,@index-@jjmax-2,1))+1)--increment the NW value
when Ascii(substring(@Array,@index-@jjmax-1,1))>ascii(@west) --compare west to north
then substring(@Array,@index-@jjmax-1,1) --and take the larger
else @west-- number to west if larger
from numbers where number<=@ArrayMax
--Now all we need to do is to backtrack through the matrix to find the best solution
Declare @commonString Varchar(max), @X_Y int
Select @CommonString ='' --this contains the string
while (@ii>1 and @jj>1)
Select @X_Y = ((@ii-1)*(@jjMax+1))+@jj --the current matrix location
if (substring(@firststring,@ii-1,1) = substring(@Secondstring,@jj-1,1))
BEGIN --if there is a match, add the character (we'll reverse it later)
select @jj=@jj-1, @ii=@ii-1 --go north-west
else if ascii(Substring(@Array,@X_Y,1)) = ascii(Substring(@Array, @X_Y-@jjMax-1, 1))
ELSE if ascii(Substring(@Array,@X_Y,1)) = ascii(Substring(@Array, @X_Y-1, 1))
if @@error>0 break
Declare @timing datetime Select @Timing=GetDate()
if dbo.LongestCommonSubsequence ('1234', '1224533324')<>'1234'
raiserror('test 1 failed',16,1)
if dbo.LongestCommonSubsequence ('thisisatest', 'testing123testing')<>'tsitest'
raiserror('test 2 failed',16,1)
if dbo.LongestCommonSubsequence ('XMJYAUZ', 'MZJAWXU')<>'MJAU'
raiserror('test 3 failed',16,1)
if dbo.LongestCommonSubsequence ('yab', 'xabyrbyab')<>'yab'
raiserror('test 4 failed',16,1)
if dbo.LongestCommonSubsequence ('beginning-middle-ending','beginning-diddle-dum-ending')
<>'beginning-iddle-ending' raiserror('test 5 failed',16,1)
select datediff(millisecond,@timing,GetDate()) as milliseconds
There are some very ingenious solutions in Oracle and some good purely iterative SQL Solutions. I’m certain that there is a good solution using window functions but I thought I ought to leave something for the readers of this blog!