A while ago I wrote a little algorithm that tried to cope with the large amount of time that the Levenshtein edit-distance algorithm took to compare two strings. It was a really funny algorithm, but it was fast enough to show me the similarity between strings that, after all, should have been either very similar or different.
Meanwhile I started thinking if I could improve on it. I was sure I could, because it wasn't very scientific, it was empirical. And finally I did it. This is the Sift2 algorithm, along with an SQL server function that can make one million string comparisons in 2.5 minutes. Compared with Levenshtein, Sift2 performs 10 to 25 times faster and it is four times faster than Sift.
- Starting from the beginning of both words, compare the letters on the same position.
- If the same, move forward, else search the letter from the first word in the next maxOffset letters in the second word and viceversa.
- Offset the words according to the closest found letter, add 1 to distance
- Repeat until one of the words ends
- Add to the calculated distance half of the length of string remained unparsed from the other word
The algorithm seems to be working fine for letter insertions, typos, letter inversions and such. It gives slightly different values than Levenshtein when big words are inverted. When a lot of letters are inverted, the difference from Levenshtein increases. Example: abcdefghijkl vs. badcfehgjilk (every two letters inverted) results in 0.42 similarity in Levenshtein and 0.08 in Sift2. But let's face it, the two words are really different.
I've optimised the algorithm a little and also changed the formula so that it matches the Levenshtein distance as close as possible. The basic idea remains the same, but now the average error from Levenshtein (calculated or real company name and address data) is only 3%.
Some people saw the goto line and immediately laughed. Well, I tried replacing it with a break and a subtraction and even removed the subtraction altogether (thus maiming the algorithm) and the goto implementation was still the fastest. So I will continue to use goto, as it is more efficient.
I have seen a lot of people actually searched on Google and got here. I would really really really like some comments, links to implementations or whatever... It's my baby after all. Thanks!