Implementing Real-World Data Input Validation using Regular Expressions

Francis Norton shows how to use regular expressions to fulfil some real world data validation requirements, demonstrating techniques ranging from simple number format checks, to complex string validation that requires use of regex's powerful "lookahead" feature.

This article explains how to use .NET regular expressions to enforce the kind of logically complex input validation requirements that we sometimes confront in real specifications. This will allow us to start with basics and go on to exploit some fairly advanced features.

Because regular expressions are powerful and complex enough to be the subject of entire books, I’m going to stick strictly to their use in validation. I will entirely ignore otherwise interesting and valid topics like performance, comparison with non-.NET implementations, token extraction and replacement, in order to take you somewhere new on this topic while keeping some clarity and focus.

I will test the regexes using the PowerShell command line, which you can download for free at http://www.microsoft.com/technet/scriptcenter/topics/msh/download.mspx. Because Microsoft’s architectural plan is that you can access the same .NET regex library whatever you’re writing, from ASP.NET (dead easy) to SQL Server 2005 (slightly greater difficulty – I include a reference at the end of the article that gives further details on this), the regular expression skills you learn in one context are directly transferable to another.

Some real validation requirements

These all come from real specs; I’ve simply selected some examples and arranged them in order of increasing logical complexity.

  1. Num: Numbers only. Can be negative or positive, for example 1234 or -1234.
  2. Dec: May be fixed length. A numeric amount (positive or negative) including a maximum of 2 decimal places unless stated otherwise, for example 12345.5, 12345, 12345.50 or -12345.50 are all valid Dec 7 inputs.
  3. UK Bank Sort Code: Six digits, either xx-xx-xx or xxxxxx input format allowed.
  4. House: Alphanumeric. Must not include the strings ‘PO Box’, ‘P.O. Box’, ‘P.O.Box’, ‘P.O Box’ or ‘POBox’ (any case).

Basics: Implementing NUM using “^”…”$”, “[“…”]”, “?” and “+”

This section will illustrate some core regex concepts and syntax, so if you’re familiar with the use of the above symbols in patterns, feel free to skip forwards.

Let’s take another look at the Num requirement:

  • Numbers only. Can be negative or positive, for example 1234 or -1234.

I take this to mean that we’ll accept anything consisting of an optional minus sign followed by one or more digits.

We can specify the “one or more digits” part by using square brackets and a dash for character ranges, and the plus sign (“+”) for repetition. Let’s start with character ranges, in this case the range of characters from “0” to “9”:

PS C:\Notes> [Regex]::IsMatch(“1”, “[0-9]”)
True
PS C:\Notes> [Regex]::IsMatch(“i”, “[0-9]”)
False
PS C:\Notes>

NOTE:
If you’re new to PowerShell, you can read “[Regex]::IsMatch” as “use the static method ‘IsMatch’ of the .NET class ‘Regex’”. In fact we could use PowerShell’s “-cmatch” operator, which gives the same True or False answer for a single string as a [Regex]::IsMatch() expression, but I like the clarity of using the .NET class directly.

The square bracket expression is a character class. In effect, it gives us a concise way of doing a character-level OR expression, so “[0-9]” can be understood as “does the input character equal 0, 1, 2…or 9?” The dash (“-“) acts as a range operator in this context so “[0-9]” is exactly equivalent to “[0123456789]”.

At the moment we’re simply testing whether the test string contains a match for the regex, which would be fine for searches, but when we’re doing validation we want to ensure that the test string doesn’t also contain non-matching text. For example:

PS C:\Notes> [Regex]::IsMatch(“1”, “[0-9]”)
True
PS C:\Notes> [Regex]::IsMatch(“ninety 9 point nine”, “[0-9]”)
True

We can stop that behaviour using the special characters “^” and “$” to specify that the regex pattern must match from the start to the end of the test string:

PS C:\Notes> [Regex]::IsMatch(“1”, “^[0-9]$”)
True
PS C:\Notes> [Regex]::IsMatch(“ninety 9 point nine”, “^[0-9]$”)
False

Now we’ll make the regex accept one or more digits by using the “+” modifier on the “[0-9]” character class. The “+” means, in general, “give me one or more matches for whatever I’ve been attached to”, so in this case means “give me one or more digits”.

PS C:\Notes> [Regex]::IsMatch(“123”, “^[0-9]$”)
False
PS C:\Notes> [Regex]::IsMatch(“123”, “^[0-9]+$”)
True

That just leaves the optional minus sign. The good news and the bad news is that outside a character class (like “[0-9]”) the dash is just a literal character (good news because it means we won’t have to escape it; bad news because treating the same character as a literal in some parts of a pattern and a special character in others is a triumph of terseness over readability). We’ll make it optional with the “?” modifier, which can be read as “give me zero or one matches”.

PS C:\Notes> [Regex]::IsMatch(“-123”, “^[0-9]+$”)
False
PS C:\Notes> [Regex]::IsMatch(“-123”, “^-?[0-9]+$”)
True
PS C:\Notes> [Regex]::IsMatch(“123”, “^-?[0-9]+$”)
True
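
The finished Num pattern can be wrapped in a small helper so that several inputs are checked in one go. This is only a sketch: the function name Test-Num is invented for illustration and is not part of PowerShell or .NET. The pattern is held in a single-quoted string so that PowerShell passes it to the regex engine untouched.

```powershell
# Hypothetical helper: the name Test-Num is an assumption, not a built-in.
function Test-Num {
    param([string]$Text)
    # Optional minus sign, then one or more digits, anchored at both ends.
    return [Regex]::IsMatch($Text, '^-?[0-9]+$')
}

# Print each sample input alongside its validation result.
foreach ($s in '1234', '-1234', '12-34', '', 'abc') {
    '{0,-6} {1}' -f $s, (Test-Num $s)
}
```

Only the first two inputs should be accepted; note that the empty string is rejected because “+” insists on at least one digit.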

Using “{” … “}”, “(” … “)”, “\”, and “d” to implement Repetition

These “?” and “+” modifiers are very nice and convenient, but suppose we have a counting system that can express more than None, One, and Many?

Let’s take another look at the DECIMAL format requirement:

  • Dec: May be fixed length. A numeric amount (positive or negative) including a maximum of 2 decimal places unless stated otherwise, for example 12345.5, 12345, 12345.50 or -12345.50 are all valid Dec 7 inputs

Ignoring the fixed length option for now, let’s look at the decimal section. It seems that we’re expected to accept numbers with a decimal point and one or two decimals, or with no decimal point and no decimals at all.

Our first challenge is the decimal point. We want to use the “.” sign, but this gives us some strange behaviour:

PS C:\Notes> [Regex]::IsMatch(“.”, “^.$”)
True
PS C:\Notes> [Regex]::IsMatch(“,”, “^.$”)
True

We’ve discovered that “.” is a special character in regular expressions – in fact it matches any character. We need to escape it with the “\” prefix to make it a literal:

PS C:\Notes> [Regex]::IsMatch(“.”, “^\.$”)
True
PS C:\Notes> [Regex]::IsMatch(“,”, “^\.$”)
False
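
When the literal text comes from somewhere else, for example a separator chosen at run time, the .NET Regex class offers a static Escape method that adds the backslashes for us. A short sketch, in which the variable names are purely illustrative:

```powershell
# Regex.Escape prefixes regex special characters with a backslash.
$literal = '.'
$escaped = [Regex]::Escape($literal)    # the two-character string \.

# Build the anchored pattern by concatenation to keep the quoting simple.
$pattern = '^' + $escaped + '$'

[Regex]::IsMatch('.', $pattern)    # the full stop itself
[Regex]::IsMatch(',', $pattern)    # any other character
```

The first test should return True and the second False, just like the hand-escaped pattern above.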

The next step is to use the braces modifier to specify that we want one to two digits following the decimal point – we can put the minimum and maximum number of matches (in our case 1 and 2, which we’ll test with zero to three) inside the “{” and “}” curly brackets:

PS C:\Notes> [Regex]::IsMatch(“.”, “^\.[0-9]{1,2}$”)
False
PS C:\Notes> [Regex]::IsMatch(“.0”, “^\.[0-9]{1,2}$”)
True
PS C:\Notes> [Regex]::IsMatch(“.01”, “^\.[0-9]{1,2}$”)
True
PS C:\Notes> [Regex]::IsMatch(“.012”, “^\.[0-9]{1,2}$”)
False

Now we can add the entire decimal suffix pattern, “\.[0-9]{1,2}”, to our existing number pattern, and test it:

PS C:\Notes> [Regex]::IsMatch(“123.45”, “^-?[0-9]+\.[0-9]{1,2}$”)
True
PS C:\Notes> [Regex]::IsMatch(“123”, “^-?[0-9]+\.[0-9]{1,2}$”)
False

Aha, we should still be accepting numbers with no decimal places, but we’re not. We know how to make a single character optional using the “?” modifier, but how can we do this to larger sub-patterns? The pleasantly obvious answer is to use parentheses to wrap the decimal suffix sub-pattern in “(” and “)”, and then apply the “?”.

PS C:\Notes> [Regex]::IsMatch(“123.45”, “^-?[0-9]+(\.[0-9]{1,2})?$”)
True
PS C:\Notes> [Regex]::IsMatch(“123”, “^-?[0-9]+(\.[0-9]{1,2})?$”)
True

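And before we leave this pattern, one more trick to make regular expressions more readable: we can replace “[0-9]” with “\d” (escape + d) which is pre-defined to mean “any digit”. Be aware that this is case-sensitive and “\D” means the opposite! Be aware, too, that in .NET “\d” also accepts decimal digits from other Unicode scripts (Arabic-Indic digits, for example) unless the ECMAScript option is set, so keep “[0-9]” if only the ASCII digits are acceptable.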

PS C:\Notes> [Regex]::IsMatch(“123.45”, “^-?\d+(\.\d{1,2})?$”)
True
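
To see the finished decimal pattern at work on a batch of inputs, we can run it over a list and tabulate the results. This is a sketch only: the variable names are invented, and the last three sample values are extra edge cases that do not come from the spec.

```powershell
# Decimal pattern from above: optional sign, digits, optional 1-2 decimals.
$decPattern = '^-?\d+(\.\d{1,2})?$'

# The first four samples come from the requirement; the rest are edge cases.
$samples = '12345.5', '12345', '12345.50', '-12345.50', '12345.', '.5', '1.234'

# Pair each input with its validation result.
$results = foreach ($s in $samples) {
    [pscustomobject]@{ Input = $s; Valid = [Regex]::IsMatch($s, $decPattern) }
}
$results | Format-Table -AutoSize
```

The four values from the requirement should be accepted. The other three should be rejected: a trailing point with no decimals, a missing integer part, and three decimal places.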

Using “|” to implement a logical OR

We know how to use character classes, i.e. the “[” … “]” expressions, to accept alternative single characters, but the requirement for UK Bank Sort Codes requires us to accept input strings that fall into one of two different patterns.

Let’s take another look at the requirement:

  • UK Bank Sort Code: Six digits, either xx-xx-xx or xxxxxx input format allowed.

Accepting either one of these on its own is straightforward (remembering that “-” is just a literal character outside character classes):

PS C:\Notes> [Regex]::IsMatch(“123456”, “^\d\d\d\d\d\d$”)
True
PS C:\Notes> [Regex]::IsMatch(“12-34-56”, “^\d\d-\d\d-\d\d$”)
True

We can match one pattern or the other using the “|” (or) operator. We’re going to have to use parentheses too, as we’ll discover when we start testing.

PS C:\Notes> [Regex]::IsMatch(“123456”,
“^\d\d\d\d\d\d|\d\d-\d\d-\d\d$”)
True
PS C:\Notes> [Regex]::IsMatch(“123456 la la la”,
“^\d\d\d\d\d\d|\d\d-\d\d-\d\d$”)
True

What happened when we matched that second value? The “$” sign at the end of the pattern was intended to reject input with text following the sort code itself, but the “|” meant that it was only applied to the right-hand sub-pattern. (Try working out how to get a sort code with leading junk accepted by the pattern above)

We can fix this by using parentheses again:

PS C:\Notes> [Regex]::IsMatch(“123456”,
“^(\d\d\d\d\d\d|\d\d-\d\d-\d\d)$”)
True
PS C:\Notes> [Regex]::IsMatch(“123456 la la la”,
“^(\d\d\d\d\d\d|\d\d-\d\d-\d\d)$”)
False
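
The braces notation from the previous section can also shorten the repeated “\d” runs: a single number inside the braces, as in “{6}”, asks for exactly that many matches. The following sketch is one possible rewrite of the sort code pattern, with illustrative variable names and a few extra test inputs of our own:

```powershell
# Same rule as above, written with exact-count quantifiers.
$sortCode = '^(\d{6}|\d{2}-\d{2}-\d{2})$'

# Two valid formats, then trailing junk, a mixed format and leading junk.
foreach ($s in '123456', '12-34-56', '123456 la la la', '12-3456', 'la 12-34-56') {
    '{0,-16} {1}' -f $s, ([Regex]::IsMatch($s, $sortCode))
}
```

Only the first two inputs should be accepted. Because the parentheses sit inside the “^” … “$” anchors, neither leading nor trailing junk gets through.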

Using “(?=” … “)” to implement a logical AND

You may have noticed that we have some unfinished business with the Decimal requirement, specifically that sentence “May be fixed length”. It’s clear from the examples that the fixed length refers to the number of digits, not the number of characters (which could include minus signs and decimal points).

We could adapt our existing decimal pattern, with its optional minus sign and decimal point, to restrict input to just seven digits, but this is inadvisable. It would be better to keep our existing pattern, which is relatively simple and well-tested, and apply a second regular expression to count the number of digits, each optionally preceded by a non-digit character.

Remembering that “\d” means “any digit” and “\D” means “any non-digit”, we can do this to restrict the input to, say, no more than seven digits:

PS C:\Notes> [Regex]::IsMatch(“-123456.7”, “^(\D?\d){1,7}$”)
True
PS C:\Notes> [Regex]::IsMatch(“-123456.78”, “^(\D?\d){1,7}$”)
False

This is fine if we’re in a position to validate a single input with multiple regular expressions, but sometimes we’re going to need to do it all in one regex. This raises a problem – both of our expressions necessarily start at the beginning of the input string and work their way, character by character, to the end. If we are going to do “logical and” patterns as opposed to simply “and then” patterns, we need a way of applying multiple sub-patterns to the same input.

Fortunately .NET regular expressions support the obscurely named, but very powerful, “lookahead” feature which allows us to do just that. Using this feature we can, from our current position in the input string, test a pattern over the rest of the string (all the way to the end if necessary), then resume testing from where we were.

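A lookahead sub-pattern is wrapped in “(?=” … “)” and here’s how we can use it to implement the requirement “up to seven digits AND a valid decimal number” by combining our two existing patterns. Note that the digit-counting sub-pattern now uses “\D*” rather than “\D?”: the “*” modifier means “give me zero or more matches”, so each digit may be preceded by any number of non-digits: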

PS C:\Notes> [Regex]::IsMatch(“-123456.7”,
“^(?=(\D*\d){1,7}$)-?\d+(\.\d{1,2})?$”)
True
PS C:\Notes> [Regex]::IsMatch(“-123456.78”,
“^(?=(\D*\d){1,7}$)-?\d+(\.\d{1,2})?$”)
False
PS C:\Notes> [Regex]::IsMatch(“-12345.6.7”,
“^(?=(\D*\d){1,7}$)-?\d+(\.\d{1,2})?$”)
False

And this completes our implementation of the decimal requirement.
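
Since the spec says “Dec 7” but hints that other lengths exist, it may be handy to build the pattern from the digit limit. The sketch below assumes a helper of our own invention: neither the name Test-Dec nor the -MaxDigits parameter comes from PowerShell or .NET.

```powershell
# Hypothetical helper: Test-Dec and -MaxDigits are illustrative names.
function Test-Dec {
    param(
        [string]$Text,
        [int]$MaxDigits = 7
    )
    # The lookahead counts the digits; the rest checks the decimal format.
    $pattern = '^(?=(\D*\d){1,' + $MaxDigits + '}$)-?\d+(\.\d{1,2})?$'
    return [Regex]::IsMatch($Text, $pattern)
}

Test-Dec '-123456.7'                 # 7 digits, default limit of 7
Test-Dec '-123456.78'                # 8 digits, default limit of 7
Test-Dec '-123456.78' -MaxDigits 8   # same input, wider limit
```

These three calls should return True, False and True respectively. Only the number inside the lookahead’s braces changes, while the format-checking part of the pattern stays as tested above.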

Using “(?!” … “)” to implement AND NOT

Our final input validation requirement was for address lines, to exclude any that used a PO Box instead of a real (residential) address.

As usual, let’s revisit the friendly spec:

  • House: Alphanumeric. Must not include the strings ‘PO Box’, ‘P.O. Box’, ‘P.O.Box’, ‘P.O Box’ or ‘POBox’ (any case).

Let’s first implement the rule that the string must be alphanumeric. This means that the string can contain alphabetic and numeric characters, spaces, dashes, full stops (period), commas or slashes. We can implement this rule quite easily, remembering that the space character is a literal, not a separator:

PS C:\Notes> [Regex]::IsMatch(“Platform 9 1/2,”, “^[-a-zA-Z\d .,/]*$”)
True

Now let’s write a pattern that will find any obvious variation of “P O Box” anywhere after the start of the input, which is where we test it. Remember from earlier that the space character is a literal, and that the “.” is a special character unless we escape it, “\.”

PS C:\Notes> [Regex]::IsMatch(“No PO Box here”, “^.*P\.? ?O\.? ?Box”)
True

Next, we’ll reverse the result by asking for the pattern not to be found, using the negative lookahead notation “(?!” … “)”, and combine it with our alphanumeric pattern:

PS C:\Notes> [Regex]::IsMatch(“Platform 9 1/2,”,
“^(?!.*P\.? ?O\.? ?Box)[-a-zA-Z\d .,/]*$”)
True
PS C:\Notes> [Regex]::IsMatch(“Platform 9 1/2, PO Box 64”,
“^(?!.*P\.? ?O\.? ?Box)[-a-zA-Z\d .,/]*$”)
False

Finally, we’ll make the PO Box rule case-insensitive. This can be done by setting a mode at the start of the expression that will apply to everything that follows it. We can specify “case insensitive mode” with the notation “(?i)” – notice that since we’re going to be case-insensitive anyway, I’ve also simplified the alpha bit of the alphanumeric pattern:

PS C:\Notes> [Regex]::IsMatch(“Platform 9 1/2,”,
“^(?i)(?!.*P\.? ?O\.? ?Box)[-a-z\d .,/]*$”)
True
PS C:\Notes> [Regex]::IsMatch(“Platform 9 1/2, po box 64”,
“^(?i)(?!.*P\.? ?O\.? ?Box)[-a-z\d .,/]*$”)
False
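
As a final check, we can run the finished pattern against every spelling listed in the requirement. This is a sketch under our own assumptions: the sample street address and the variable names are invented, and two lower-case spellings are added to exercise the “any case” rule.

```powershell
# Final House pattern from above.
$house = '^(?i)(?!.*P\.? ?O\.? ?Box)[-a-z\d .,/]*$'

# The five spellings from the requirement, plus two lower-case variants.
$banned = 'PO Box', 'P.O. Box', 'P.O.Box', 'P.O Box', 'POBox', 'po box', 'p.o.box'

foreach ($b in $banned) {
    # Embed each banned spelling in an otherwise valid address line.
    $address = "12 High Street, $b 64"
    '{0,-32} {1}' -f $address, ([Regex]::IsMatch($address, $house))
}

# The same address without a PO Box should still be accepted.
[Regex]::IsMatch('12 High Street', $house)
```

Every line in the loop should report False, and the last test should return True.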

Conclusion

Like any good tool, regular expressions can be used or abused. The purpose of this article is to help you write regular expressions that are fit for the purpose of validating inputs against typical business validation rules.

In order to do this we’ve covered writing straightforward patterns using literals, special characters and character classes, and applying them to the whole input using “^” … “$”. We’ve also seen how to combine simple patterns to implement logical OR, AND and NOT rules.

References

Using Regular expressions to validate input in ASP.NET:
http://www.devhood.com/tutorials/tutorial_details.aspx?tutorial_id=46

Using Regular expressions in SQL Server 2005:
http://msdn.microsoft.com/msdnmag/issues/07/02/SQLRegex/default.aspx

Regular expression options in the .NET library:
http://msdn2.microsoft.com/en-us/library/yd1hzczs(VS.80).aspx

A concise summary of all special characters recognised by .NET regular expressions:
http://regexlib.com/CheatSheet.aspx

  • Anonymous

    + and *
    Somehow, magically, the plus became an asterisk. I’ve re-read but haven’t found any explanations for it. Are they interchangeable?

  • Anonymous

    re: + and *
    You’re quite right – I explicitly introduced and explained the “+” but quietly slipped in the “*”.

    They’re similar but not interchangeable – while “+” means “match one or more of the previous sub-pattern”, “+” means “match none or more of the previous sub-pattern”. So “bi*g” matches “bg”, “big” and “biiiiiiig”, but not “bog” or “bing”.

    Well spotted – nice (if a little scary) to know that it’s being read so closely!

  • Anonymous

    Confusing explanation
    Congratulations on this excellent article.
    Just a minor thing:
    Can you re-read your own explanation of the differences between * and +?
    It seems you forgot to explain the use of *… or the +…

    Regards
    AR

  • Anonymous

    Great article!
    I particularly liked how you introduced each one step-by-step and showed the pitfalls you’d encounter if you got it wrong!

    A wonderful beginners’ article.

    Thanks.
    Paul

  • Anonymous

    Explanation of confusing explanation
    They’re similar but not interchangeable – while “+” means “match one or more of the previous sub-pattern”, “*” means “match none or more of the previous sub-pattern”. 😉

  • Peter de Marffy

    Greetings
    Dear Francis,

    I have learned about cybernetics and all kinds of computer languages prior to 1970. How the world has changed. I realized by reading your article how I have forgotten everything.

    Best wishes from all of us.

    Peter

  • Prasenjit

    Validator
    The RegularExpressionValidator to validate two different formats for the same text box

  • Anonymous

    pattern match
    can u give me a egrep which will match a nine digit account number from a file

  • Anonymous

    rules for bad characters
    if a white list of good characters is allowed in any data input field, i.e. 0-9, a-z, ., c, $, and someone wants to allow the & and ‘ (ampersand and apostrophe) characters for example, is it safe to have a rule that says if & is input it must be surrounded by only an alpha or numeric character, and if apostrophe it must be surrounded by alpha? This would be a global list used by the entire system for data input

  • ronniedobbs

    getting errors with the PO Box in .NET
    Hi, I’m trying to implement the PO Box regex on the page and it seems that it runs fine in Windows PowerShell, but throws syntax errors when used in either an asp.net regex validator or in actual code. The errors I’m getting are when i try to build with the regex in code are:

    Unrecognized escape sequence (this is erroring out on the \. (converting periods to literal) and the \d character).

    I tried removing them but still received syntax errors in the regex validator.

    Any ideas as to what might be going on?

  • roundand

    re: getting errors with the PO Box in .NET
    Sorry, I haven’t been monitoring this page recently. If you’re still interested, or for anyone with the same problem, I imagine that the “\” is being treated as an escape character by C#. Try escaping it by using “\\” in your code for each “\” in the article, or using an @string (@”…”) to turn off escaping.

  • roundand

    validation for a minimum number of distinct characters
    Something that I should have included in the article: how to use regex groups to enforce rules like “six characters, at least four of which must be different from each other”.

    The regex “group” feature is similar to a variable. .Net uses parenthesis notations to create named groups like “(?<first>.?)” and un-named groups like “(.)” – we’ll use the second, more concise version.

    Having created an unnamed group, you can reference it numerically – your first group is “\1”, the next “\2” etc.

    So we can use groups and negative lookahead to specify “(.)(?!.*\1)” meaning “match a character that doesn’t occur in the remainder of the string”.

    Repeat that another three times, and combine with a lookahead length check, and you have the following:

    PS C:\Users\Francis>
    [regex]::ismatch(“123555”, “(?=^.{6}$)(.)(?!.*\1).*(.)(?!.*\2).*(.)(?!.*\3).*(.)(?!.*\4)”)
    True

    PS C:\Users\Francis>
    [regex]::ismatch(“125555”, “(?=^.{6}$)(.)(?!.*\1).*(.)(?!.*\2).*(.)(?!.*\3).*(.)(?!.*\4)”)
    False

    PS C:\Users\Francis>
    [regex]::ismatch(“125556”, “(?=^.{6}$)(.)(?!.*\1).*(.)(?!.*\2).*(.)(?!.*\3).*(.)(?!.*\4)”)
    True

    That’s your test for “six characters, four of which must be distinct”, and it should be reasonably obvious how you can deal with all sorts of password variations using this approach as a starting point.

  • roundand

    re: validation for a minimum number of distinct characters
    (Unfortunately I can’t correct my own comment)

    There’s a little bug in the above regex:

    “(?=^.{6}$)(.)(?!.*\1).*(.)(?!.*\2).*(.)(?!.*\3).*(.)(?!.*\4)”

    should be:

    “(?=^.{6}$).*(.)(?!.*\1).*(.)(?!.*\2).*(.)(?!.*\3).*(.)(?!.*\4)”

    (otherwise the regex expects the first character of the input to be unique, regardless of how many other distinct characters there are)

  • rizwan6feb

    Common Regular Expressions
    http://www.qualitycodes.com/tutorial.php?articleid=28 lists all the common regular expressions

  • tengtium

    regular expression problem
    good day sirs and madam,

    i badly needed help.. i have this regular expression

    ^(?!.*--)[A-Za-z\d-]+$

    it accepts alphanumeric character optionally a dash (single dash only, not consecutive dash) for example:

    it accepts:
    12a-3c-4f3fg
    12ertgg2
    1-2-3-3-4-3
    dffgsfg
    d-f-f-g-s-f-g

    it does not accept:
    12-3c-4f&3fg
    12e%rt-gg2
    d--f-f-g-s-f-g
    1-2-3-3-4--3

    now i have problem, i cant find any in the .net that accepts the above valid expression with space-a non consecutive space. meaning the it needs to accept alphanumeric with optionally a non-consecutive dash and optionally a non-consecutive space.

    can someone help please.

  • roundand

    a useful PowerShell commandline regex tip
    Anyone using a PowerShell / command line approach like my article above might like the idea of using the foreach statement to test multiple inputs on a single line, e.g.:

    PS C:\Notes> “1”, “1.”, “1.0”, “1.00”, “1.000”, “1.00.” | foreach{“$_ ” + [regex]::ismatch($_,”^(\d+)$|(^\d+\.\d\d?$)”)}
    1 True
    1. False
    1.0 True
    1.00 True
    1.000 False
    1.00. False
    PS C:\Notes>