11 July 2014

Calculating and Verifying Check Digits in T-SQL

A lot of numbers that we use everyday such as Bank Card numbers, Identification numbers, and ISBN codes, have check digits. As part of the routine data cleansing of such codes on data entry we must check that the code is valid- but do we? Dwain Camps shows how it can be done in SQL in such a way that it could even be used in a constraint, to keep bad data out of the database.

 Check digits have become ubiquitous in the digital age. Oftentimes we may not realize that the numbers that occur in many real-life situations contain a check digit. Things like the product code on our box of cereal, many national identity numbers, international bank account numbers and even the vehicle identification number on our personal cars all have a check digit embedded somewhere within that long string of digits and characters.

The idea of including a check digit in a number was invented to avoid common human transcription errors that easily occur when keying from a document into a data processing application. Think about it. If someone is keying in your bank account number so they can send you some money, wouldn’t you like to have some level of confidence that they got that number correct?

While it is relatively easy to come up with an algorithm to calculate a check digit on a string of numbers and letters, it is quite another thing to prove that it is a good one. The mathematics behind such a proof is spectacularly complex. Fortunately for us, since I am no mathematician, we will focus on the more mundane aspect of calculating check digits using some known, good algorithms that have been invented by those much smarter than me. We will also use our calculation techniques to show how they can be applied to validate that the check digit embedded in a string is correct, thus validating that the underlying coded string is at least potentially legitimate. Based on the specific examples we’ll provide, you should ultimately be able to apply the techniques we will use in the event that you need to calculate a check digit from a different algorithm using T-SQL.

All of the check digit calculation routines we’ll present use set-based T-SQL code, so should perform better than an equivalent looping solution (although we will not attempt to prove this).

Human Transcription Errors

First we’ll mention a few common, human-keying errors that can occur, for which check digits were invented to avoid.

  •   Single digit errors, such as simply entering a 2 when you meant to enter a 1.
  • Transposition errors (particularly challenging for those who are dyslexic) are when you enter two characters that are reversed in sequence, such as 54 when you meant to enter 45.
  • Twin errors, which are like a single digit error but occur in pairs, for example typing 33 when you meant to type 44.
  • Jump transposition errors, which involve incorrect arrangement of a multi-digit sequence, such as 729 instead of 972.
  • Jump twin errors, such as 282 instead of 181, which can easily occur if both hands are engaged in typing in the number using the number keys above the letters on the keyboard, when one of the hands is offset by one key position.
  •  Phonetic errors, which can occur when someone tells you a number over the phone, like hearing 17 when the caller said 70.

Numerical methods exist that can generate all of the possible transcription errors of the types described for a given length string, and these can then be assembled into test strings to be applied against the check digit algorithm. From this, you can arrive at a probability distribution that describes how efficient the algorithm is in avoiding these transcription errors. Such proofs are beyond the scope of this article.

Splitting a String

Most check digit calculation algorithms require applying some transformation to each digit in a string. In T-SQL, it is relatively awkward to do this directly to each character in a string, particularly when the string is not of fixed length. The best-practice approach is to split that string so each character appears as a separate row. To do this we’ll fall back on one of our favorite T-SQL constructs: the Tally table. In our examples, we’ll consistently assume that our string will be a VARCHAR(8000) data type and we’ll use one of our favorite tally tables that generates exactly 8,000 rows we can use to split the string. This tally table construct will work in SQL 2008 or later, but if you need to use a Tally table in an earlier version of T-SQL, you can consult the linked article for numerous examples.

Using LEN(@CheckString) in the TOP clause of the SELECT ensures we only return one row for each character in our string, i.e., returning no rows with empty strings at the end.

Another T-SQL syntax that we’ll draw upon heavily is CROSS APPLY. If you are not familiar with using CROSS APPLY, I’d highly recommend two articles by SQL MVP Paul White entitled Understanding and Using APPLY Part 1 and Part 2. Several of our examples will make use of Cascading CROSS APPLYs, for which Chris Morris has provided some excellent examples in his linked article.

We’ll now learn how to combine these techniques in several common check digit algorithms.

Calculating the Check Digit for a Universal Product Code (UPC)

Universal Product Codes or UPCs as they are commonly known, appear on just about every product that is bulk-manufactured/packaged and sitting in a display case in a retail store. Unbeknownst to many, the UPC includes a check digit at the end of the string. As it turns out, most check digits occur at the end of the string, but that is not universally  true, and we’ll provide one example that doesn’t.

For this and other check digit algorithms, you should make sure you’re familiar with the T-SQL operator (%) for modulo (remainder upon division).

We’ll begin by describing the algorithm and then giving an example.

  1.   Number each digit in the UPC from left to right, up to but not including the check digit.
  2. Multiply each odd-numbered digit by three and add them together.
  3.  Add the total of #2 to the sum of the even-numbered digits.
  4. Divide the total of #3 by 10 and subtract the remainder to get the check digit. If the remainder was 0, subtracting from 10 gives you 10, so 0 is the check digit.

Now let’s see how to compute a UPC check digit in practice.

UPC:

036000241457

 

Check Digit

Split the digits:

0

3

6

0

0

0

2

4

1

4

5

7

Number the digits:

1

2

3

4

5

6

7

8

9

10

11

12

Multiplier:

3

1

3

1

3

1

3

1

3

1

3

Total

Total:

0

3

18

0

0

0

6

4

3

4

15

53

 

Remainder of division by 10:

3

Check digit (10 – Remainder):

7

Our T-SQL algorithm will follow very closely the steps shown above (top to bottom in bold).

  1.  Remove the check digit and split the digits (numbering them is done in the same step from the tally table).
  2.  Apply the multiplier to get the total and then sum the totals. Note that we are using the distributive property of multiplication here instead of the method described in the original algorithm.
  3. Calculate the remainder (modulo) of the division by 10 and subtract that from 10.
  4.  If the result of #3 is 10 because the remainder was zero, the check digit is 0.

In our graphical representation of the algorithm and the description that follows it, we have attempted to convert the original algorithm into one that is more suitable for implementation in a set-based fashion. Note how we reuse the exact code above to split the string and develop the code in two steps.

The results columns are quite similar to each row in the table above. You can copy/paste them into Excel and then use copy/paste special-transpose if you are unclear.

In our second step, we’ll sum the totals column, subtract the remainder of the division by 10 from 10 and remove the extraneous, intermediate calculations.

And finally we can package this into an in-line Table Valued Function (iTVF) WITH SCHEMABINDING, to hide the logic and make it a performance-efficient tool we can add to our tool chest.

Our iTVF returns both the string to check (StringToCheck) and the check digit (CheckDigit). You can see from the example code how it can be called. Because it returns a table, it can also be referenced in a CROSS APPLY in case you need to perform this same validation across many rows. We’ll provide an example of this in the next section.

Validating the Check Digit in a Universal Product Code (UPC)

Now that we have a way to calculate the check digit for our UPC, we can construct a new function we’ll use to validate whether the check digit embedded within our UPC is correct or not. Let’s state some requirements for this new validation function:

  1. The arguments to the function will be the UPC code (including the check digit) and the position in the string where the check digit occurs.
  2. If the check digit position argument is invalid, i.e., it is not a number between one and the length of the string; assume that the check digit occurs at the end of the string.
  3. Return an error code that is 0 (no error or in other words the check digit is correct) or 1 if the check digit is not correct.

In the case of the check digit of a UPC, since it is always at the end, the second argument to the function stated in the requirements isn’t really needed. However it might be in a case of a more generic validation function, so we’ve included it here.

Here is that schema-bound iTVF.

You can see how, in the ValidateCheckDigitUPC function and in the sample call of it, you can CROSS APPLY on any iTVF as if it returns a table (because it does).

The same encapsulation technique can be used to validate a check digit based on the calculation of the check digit, as in our next examples. In the validation function, you may also wish to include other criteria, for example if the length of the string being checked must be of a certain length.

There is one other important point about the two iTVFs we’ve created because they are schema-bound. If for some reason you ever need to ALTER FUNCTION CalculateCheckDigitUPC, you must first DROP FUNCTION ValidateCheckDigitUPC, otherwise you’ll get this message:

Calculating a Check Digit using the Luhn Algorithm

The Luhn algorithm is an interesting and quite effective check digit algorithm. It is interesting because it is used in many different types of codes, for example credit card numbers and International Mobile Station Equipment Identity (IMEI) numbers for all mobile devices like smart phones and tablets.

It is quite effective because it is able to detect any single digit and almost all two-digit transposition errors. It is also elegant because left padding the string with zeroes results in the exact same check digit.

The algorithm proceeds as follows:

  1. Reverse the string (excluding the check digit) and number the position of each digit from left to right.
  2. For odd positions, multiply the digit by 2. If the result is greater than 9, sum the two digits.
  3. For even positions, multiply the digit by 1.
  4. Sum the results of #2 and #3 and then multiply by 9.
  5. Divide the result of #4 by 10 and the remainder (or in other words the last digit) is your check digit.

In a tabular format, the algorithm can be expressed as follows.

IMEI

860442020567338

 

Reverse IMEI w/o check digit

33765020244068

 

Check Digit

Split the digits:

3

3

7

6

5

0

2

0

2

4

4

0

6

8

8

Number the digits:

1

2

3

4

5

6

7

8

9

10

11

12

13

14

 

Multiplier:

2

1

2

1

2

1

2

1

2

1

2

1

2

1

Digit * Multiplier:

6

3

14

6

10

0

4

0

4

4

8

0

12

8

Total

Sum the Digits:

6

3

5

6

1

0

4

0

4

4

8

0

3

8

52

 

Multiply by 9:

468

Remainder after divide by 10:

8

Once again, we’ll use our string splitting tally table, this time on the REVERSE of the string, and add just a bit of code to produce columns that once again look very much like the rows in our table.

An interesting point to note in the code above is the implementation of “sum the digits.” For a one or two digit number (n), i.e., any non-negative integer less than 100, this formula works quite nicely:

Sum the digits (in n) = n/10 (integer division) + n%10 (remainder on division by 10)

Two examples:

  1. Sum the digits (in 56) = 56/10 + 56%10 = 5 + 6 = 11
  2. Sum the digits (in 8) = 8/10 + 8%10 = 0 + 8 = 8

All that remains is to total up the SumOfDigits column, multiply by 9 and then divide by 10 to get the remainder. Once again, in the code that follows we have eliminated the superfluous intermediate results.

We could have done this without the cascading CROSS APPLYs, but somehow it just looks cleaner to me with them. Once again, we can encapsulate and hide the complexity in a schema-bound iTVF that we can also add to our tool chest of check sum calculators.

Writing a validation function for IMEIs is just as easy as for UPCs, with essentially the same requirements and resulting function, with the added validation rule that IMEIs are always 15 characters in length. We’ll leave that as an exercise for the interested reader.

Calculating the Check Digit for an International Banking Account Number (IBAN)

For our final example of calculating a check digit, or in this case the two check digits, we’ll consider the case of an International Banking Account Number (IBAN) because it poses some additional challenges. According to Wikipedia, at least 64 countries have implemented this standard. Another name for this particular check digit calculation is “modulo 97,” which alludes to the reason why there are two check digits.

Here is an example of a fictitious but potentially valid IBAN (from the Wiki page): GB82 west 1234 5698 7654 32. The first two characters are the standard country code (ISO 3166-1 alpha-2) and the second two represent the check digit. “WEST” can be an arbitrary sequence of four letters. The next six and eight numbers are a sort code and the account number, respectively. So the total length of this IBAN (4+4+6+8) is 22, but this may vary by country.

The algorithmic description of calculating the check digits for an IBAN is:

  1.  Validate that the length of the IBAN is correct based on the country. We will ignore this and assume in our check digit calculator that the length is correct. This validation would be performed in a validation function, or would otherwise be incorporated into the business rules for constructing the base IBAN.
  2.  Replace the check digits by 00.
  3. Upper case the characters and move the four initial characters to the end of the string so now it appears as follows (with blank spaces removed): WEST12345698765432GB00.
  4. Convert letters to numbers using the following cipher: A=10, B=11, C=12, etc.
  5.  Convert the string to an integer. Since that would be a bit too large for T-SQL to handle, we’ll use a DECIMAL data type for this.
  6.  Calculate the remainder upon division by 97, subtract that remainder from 98 and left pad the result with a zero if needed so that it consists of exactly two digits.

Let’s look at this algorithm graphically:

IBAN:

GB82 west 1234 5698 7654 32

 

Remove blanks, upper case, rearrange and zero out the check digits:

WEST12345698765432GB00

 

Split the string:

W

E

S

T

1

2

3

4

5

6

9

8

7

6

5

4

3

2

G

B

0

0

Transform Characters to a Number:

32

14

28

29

1

2

3

4

5

6

9

8

7

6

5

4

3

2

16

11

0

0

Reassemble the pieces:

3214282912345698765432161100

 

Remainder after divide by 97:

16

 

98 – Remainder above

82

<– The check digit

For our first attempt at coding this in T-SQL, we’ll try to produce a row set (two columns) that represent the two widest of the rows in the table. There are multiple ways that this can be accomplished, pretty much all of which would use the tally table string splitter, but in different ways.

There are several interesting aspects to this new query’s form:

  •  We have moved the tally table from being in a Common Table Expression (CTE) to being a derived table for the CROSS APPLY, so we could use the length of the reduced IBAN (after spaces were removed) to limit the returned rows.
  •  We included an extra (the middle) column to show exactly how the PATINDEX creates a flag we can use in the CASE to decide whether the character is a letter that needs to be transformed.
  •  We have constructed an unusual ORDER BY (sort) that puts the first four characters at the end, regardless of the length of the string (although we could have used a number less than 8000 since we know that IBANs are never that long).

The first and third columns in our results set are exactly the values showing in the two wide rows of our graphic! We must now concatenate across the rows using the diminutive form of the technique described by Wayne Sheffield in Creating a comma-separated list. I call it the diminutive form because it omits the TYPE coding on FOR XML PATH, which is a little faster and not required because the XML won’t contain any special characters. We then CAST the concatenated string to a large DECIMAL and perform the remaining modulo and subtraction operations required by the algorithm.

Notice how we implemented step #6 of the algorithm. Instead of subtracting from 98, we subtract from 198 to ensure that we get a three digit number. We can then take the RIGHT two digits and be certain it contains leading zeroes. I have proven in my blog that this is just slightly faster than doing the same using string concatenation (see The One Million Row Test Harness).

To clarify the query just a little bit more, we’ve put the previous query, including the FOR XML PATH to perform the concatenation, into a CTE that returns a single row with a string, and removed the extraneous columns and calculations. Our result is exactly the check digits that were in the original IBAN, confirming the correctness of our implementation of the algorithm.

Naturally we’d like to encapsulate this bit of code magic into another schema-bound, iTVF to put into our tool chest.

Once again, we will leave the corresponding validation wrapper function to the interested reader who needs to create one.

Note that since an IBAN may be up to 34 alphanumeric characters in length, the CAST to DECIMAL(38, 0) may result in an arithmetic overflow in some cases of really long IBANs with lots of alphabetic characters. The alternative is to do a piece-wise modulo 97 on successive strings extracted from the IBAN. In the T-SQL function above, simply replace the last line (the CROSS APPLY) with this:

Note that you may need to add one or two more CROSS APPLYs to handle a worst case scenario, but we’ll leave that decision to your testing.

Summary

There are many existing algorithms for calculating check digits, and having check digits in coded values significantly reduces the chance of mis-keying those coded values when humans are doing the work. Perhaps you’ve considered adding a check digit to a coded value of your own definition, but were put off by the complexity of the check digit calculation or simply could not find a suitably high-performance, set-based method to implement the check digit in T-SQL.

We have demonstrated how intrinsically, procedural algorithms for calculating check digits can easily be converted into a set-based approach, by looking to a simple tabular example calculated in Excel. We have done this for three common check digit algorithms: UPC (technically UPC-A), the Luhn formula and finally the IBAN (even ones up to 34 characters in length).

We have not examined CLR-based methods for calculating check digits, but certainly if you have them at your disposal and you’re allowed to use them in you SQL Server they are likely to perform quite well (probably better than the set-based pure SQL solutions covered here).

Check digits are so common you probably carry one around in your pocket (your mobile phone’s IMEI) so we all should understand high-performance techniques for calculating them in T-SQL.

Keep up to date with Simple-Talk

For more articles like this delivered fortnightly, sign up to the Simple-Talk newsletter

This post has been viewed 25473 times – thanks for reading.

Tags: , , , ,

  • Rate
    [Total: 30    Average: 4.6/5]
  • Share

Dwain Camps

View all articles by Dwain Camps

  • crs2007

    realy good
    Thank you. Great post.

  • TheSQLGuru

    Marvelous!
    I always look forward to digging into your posts Dwain! Incredibly well thought out and great performance tips too.

    Best,

    Kevin