Friday, 31 August 2007

What's this Soundex in Delphi?

Delphi has a Soundex routine found in StrUtils, so why am I giving you another? This blog is about how Soundex works. understanding Soundex will help you decide how and when to use it to enhance your program. I am not suggesting for a moment that you dump the already written Soundex routine within Delphi, unless you really want to, but rather I am suggesting that this may be a way to understand how a Soundex code is put together.

In the early 1980's I wrote a Soundex routine in Turbo Pascal and released it to the world through the bulletin boards that were available back then (before the Internet became readily available to the masses). By releasing it to the public domain, I was more interested in sharing than in copyrighting everything.

Over the following years, to my great amusement, I found my routine copied several times, sometimes word for word, line for line including comments, with my name replaced with someone else claiming to be the author (I would not be so amused these days). As far as I could tell with the limited searching available at the time, I was the first person to write a Soundex routine entirely in Turbo Pascal, but it is a simple routine so I could easily be wrong on that count.

The original Turbo Pascal source has been lost to .. well .. somewhere in that box of 5.5" floppies over in the corner there that I've been taking to the dump any day now for about the last 15 years.

Soundex converts a name to four characters (one letter and three numbers). The conversion will result in the same numbers for like-sounding names. For example all the following names will all result in a Soundex code of "S530" - smith, Smith, smythe, smitt, shmidt, shmidt, snith, snyth, snythe, smmith, etc. This means that if a user enters "Smith", and I search the database for the Soundex code rather than the name "Smith", I will be presented with all those and other similar sounding names.

Soundex was originally invented by Robert C. Russel and Margaret K. Odell around 1918-1922 and initially used for immigration and census information. Being before computers, the Soundex code is naturally simple so that people can calculate Soundex manually. It is used most in computer applications now. Searching your own database for names that sound like 'Smith', for spell checking and other areas.

While Soundex is clever, it is not infallible so common sense must prevail. 'Clark', and 'Klark' will come out with a different Soundex code. other examples are Bate (B300), will not be the same as Bates (B320), or Gate (G300).

In other instances names that really don't sound the same can sometimes, by sheer coincidence, have the same soundex code. Have you ever wondered why sometimes your spell checker wants to replace a mistyped word with something completely different and totally out of context? You are left wondering how on earth the spell checker came up with that word. Well, now you know. Its Soundex. However, it is still clever enough for most uses.

So what's the process? Well, it was simple for a reason (above) and still is.

  • Take the first letter
  • Replace the letters BFPV with '1'
  • Replace the letters CGJKQSXZ with '2'
  • Replace the letters DT with '3'
  • Replace the letter L with '4'
  • Replace the letters MN with '5'
  • Replace the letter R with '6'
  • Ignore all other letters and ignore double letters (e.g. the 2 T's in "Letterman").
  • If the result is less than 4 characters, then pad out with zeros.
  • If result is greater than 4 characters then use only the first 4

And there you have your Soundex code.

There are many ways to implement Soundex and I have chosen one that I hope will allow you to follow what is happening. Take a look at the Soundex routine in StrUtils for another way.

function MySoundex(sName: string): string;
var
Ch, LastCh: Char;
i : integer;
sx : string;
begin
sName := UpperCase(trim(sName));
if length(sName) < 1 then
sx := '' // got nothing, send nothing back
else
begin
LastCh := #0;
for i := 1 to length(sName) do
begin // step through each character in the name
if i = 1 then
sx := sName[1] // store the first character
else
begin
ch := #0;
if sName[i] <> LastCh then
begin
case sName[i] of
'B', 'F', 'P', 'V': ch := '1';
'C', 'G', 'J', 'K',
'Q', 'S', 'X', 'Z': ch := '2';
'D', 'T': ch := '3';
'L': ch := '4';
'M', 'N': ch := '5';
'R': ch := '6';
// Note no ELSE - ignore all other letters
end;
if ch <> #0 then
sx := sx + ch;
if length(sx) > 3 then
break; // we got all we need
end;
end;
LastCh := sname[i];
end;
while length(sx) < 4 do
sx := sx + '0'; // pad out remaining with zero
end;
result := sx;
end;


There are lots of variations and improvements on that depending on what you are doing.

So where would we use Soundex? One obvious way would be to give your users the ability to search for a name by the way it sounds.

Consider a database full of hundreds of thousands of customers. Simply add a field to the Customer table called .. oh I don't know, how about "Soundex". Then add the soundex code for the customer surname on each customer.

Because of the way Soundex works, you may find that finding all customers that sound like "Smith" may be a lot faster than finding all customers with the exact name of "Smith". How can this be? Well, since there will be less unique Soundex codes than unique surnames, the index will have less problems finding a Soundex code. Lets say that you are searching for a name in Auckland (approx 1.3 mill), an index on unique surnames will possibly total around 800,000 (rough guess). However an index on unique Soundex codes for surnames may only total about 100,000 (another wild guess with no substance at all, but you get the idea).

Ramblings


Its beautiful outside again today and although yesterday the wind was getting up a little, I can start to feel the rumblings of spring around here. Are we at last getting out of the deep dark winter that held our grip for the past few months and kept us close to our box of tissues, warm jerseys, and heaters? I hope so. It made me think of where I would be if I had accepted the offer to spend the next few years in Mongolia. I would have loved the experience of learning a new culture and language, but "The Mount" has a way of making me smile.

I climbed to the top of the mount again last weekend where the view is spectacular.
Mount Maunganui from the top of the Mount. It was sad to think I'll soon be moving inland to Hamilton, an hour and a half away.

Have a wonderful day wherever you are and God's blessings.

7 comments:

  1. it's 5 1/4" floppies, not 5.5 or 5.25 :)

    ReplyDelete
  2. Thanks Steve, I enjoyed that.

    ReplyDelete
  3. The Delphi's StrUtils Soundex is enough. No need in custom procedure, IMHO. Also hard to read too much text.

    ReplyDelete
  4. Anon: Yes, you are of course correct :)

    Delhihobbyist: Thanks. Love your site btw.

    Alex ... and good morning to you too :)

    ReplyDelete
  5. Having a Soundex code like "A4521" isn't very efficient, though.
    It's better to use an algo that produce a "pure integer" value code like "654521" or similar.
    In other words, avoid to store the first "char"; instead replace it with an integer.

    The main advantage is when the soundex value is to be stored into i.e. retrieved from a database - using indexing functionality makes things faster :)

    ReplyDelete
  6. Loïs, you are correct. Although I would suggest only the 4 characters in length, you could definately alter the Soundex system to suit your specific requirements. This would improve disparities like "Gates" and "Bates" having different codes.

    Other improvements include the "Daitch-Mokotoff" soundex system. In a book "SQL For Smarties", Joe Celko outlines a "Celko Improved Soundex" algorithm.

    The New York State Division of Criminal Justice Services introduced the NYSIIS Soundex system in 1970.

    Looking up Soundex in the Wikipedia, it states... "Lawrence Philips developed the Metaphone algorithm .. later developed an improvement to Metaphone, which he called Double-Metaphone". I hadn't come across either.

    Thanks for the insight Loïs.

    ReplyDelete
  7. What about another languages that use latin letters?

    ReplyDelete

Note: only a member of this blog may post a comment.