String Comparisons in SQL: The Metaphone Algorithm

When exploring the use of the Metaphone algorithm for fuzzy search, Phil couldn't find a SQL version of the algorithm so he wrote one. The Metaphone algorithm is built in to PHP, and is widely used for string searches where you aren't always likely to get exact matches, such as ancestral research and historical documents. It is particularly useful when comparing strings word-by-word. With a SQL version, it is easy to experiment on large quantities of data!

Metaphone algorithms are designed to produce an approximate phonetic representation, in ASCII, of regular “dictionary” words and names in English and some Latin-based languages. It is intended for indexing words by their English pronunciation. It is one of the more popular of the phonetic algorithms and was published by Lawrence Philips in 1990. A Metaphone is up to ten characters in length.

It is used for fuzzy searches for records where each string to be searched has an index with a Metaphone key. You search for all records with the same or similar metaphone key and then refine the search by some ranking algorithm such as Damerau–Levenshtein distance. Metaphone searches are particularly popular with ‘ancestor’ sites that search on surnames where spellings vary considerably for the same surname. The current version, Metaphone 3, is actively maintained by Lawrence Philips, developed to account for all spelling variations commonly found in English words, first and last names found in the United States and Europe, and non-English words whose native pronunciations are familiar to English-speakers. The source of Metaphone 3 is proprietary, and Lawrence charges a fee to supply the source.

There is a version of Metaphone that is built into in PHP, and I have used the original public domain source of Metaphone, and modified it slightly  to get the same result as the PHP version.

I have commented the source as clearly as I can to explain what is going on.

The source to this function is here on Github.

See also :-

  • 4470 views

  • Rate
    [Total: 4    Average: 4.8/5]
  • Ross Presser

    Minor nitpick: with the loop starting on line 35, you’re not stripping out all non-ASCII characters, you’re stripping out all non-alphabetic characters. Particularly since you’re using the Latin-1 encoding which marks some non-ASCII characters (accented letters like é, or n-tilde, etc.) as alphabetic.

    • Thanks for spotting that. I was a bit hurried with the documentation! I’ve updated it both here and on Github!

  • Robert Sterbal

    would filtering out or directly evaluating the top 10 to 100 words/names have a significant impact on performance?

    • You wouldn’t put this in the actual function that calculated the Metaphone. You’d pre-calculate the metaphone for all the columns that you wished to search, and place them is a separate indexed column. The function is actually faster than you’d imagine so even is you have quite a few strings to search for within your mataphone column, the conversion of these strings can be done on the fly. I’ll be publishing an article soon on the various tricks you can use with algorithms like this.

  • BillTalada

    It is quite fast for as long as the script is. I tried it against a list of 206,000 words in a dictionary and it took 5 seconds. It appears to generate about two thirds as many unique metaphones as words so it won’t be a one-to-one if you’re wondering. I just had to find duplicate customers for someone; I’ll try this to see what additional stuff this catches.

  • Keith Rowley

    This is really cool. I am thinking about all kinds of ways I could use something like this now.

  • Stuart Denyer

    It seems Metaphone 3 was inadvertently released under a permissive licence: https://news.ycombinator.com/item?id=8779517