
Tuesday, January 09, 2007

Super fast string distance algorithm: Sift2

Update!! Read the Sift3 post. It is an even better algorithm.


A while ago I wrote a little algorithm that tried to cope with the large amount of time the Levenshtein edit-distance algorithm takes to compare two strings. It was a really funny algorithm, but it was fast enough to show me the similarity between strings that, after all, should have been either very similar or very different.

Meanwhile I kept thinking about how I could improve on it. I was sure I could, because it wasn't very scientific; it was empirical. And finally I did it. This is the Sift2 algorithm, along with an SQL Server function that can make one million string comparisons in 2.5 minutes. Compared with Levenshtein, Sift2 performs 10 to 25 times faster, and it is four times faster than Sift.

The concept:
  • Starting from the beginning of both words, compare the letters at the same position.
  • If they are the same, move forward; otherwise, search for the letter from the first word within the next maxOffset letters of the second word, and vice versa.
  • Offset the words according to the closest found letter and add 1 to the distance.
  • Repeat until one of the words ends.
  • Add to the calculated distance half of the length of the string that remained unparsed in the other word.
That's it! Here is the code:


C# code (collapsed in the original page)

T-SQL code (collapsed in the original page)
You can find a nice OOP Javascript implementation at IT BASE.
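For readers who cannot expand the snippets above, here is a minimal Java sketch of the steps listed in the concept section. It is a reconstruction, not the original C#: maxOffset = 5 and the exact scoring are inferred from the distances discussed in the comments (guilford vs. guildford gives 1) and from the similarity quoted below (0.08 for abcdefghijkl vs. badcfehgjilk).

```java
// Sketch of the Sift2 idea in Java (not the author's original C# code).
public class Sift2Sketch {
    public static int distance(String s1, String s2, int maxOffset) {
        if (s1 == null || s1.isEmpty()) return s2 == null ? 0 : s2.length();
        if (s2 == null || s2.isEmpty()) return s1.length();
        int c = 0, offset1 = 0, offset2 = 0, dist = 0;
        // Walk both strings in lockstep, allowing a small relative offset.
        while (c + offset1 < s1.length() && c + offset2 < s2.length()) {
            if (s1.charAt(c + offset1) != s2.charAt(c + offset2)) {
                offset1 = 0;
                offset2 = 0;
                // Look for the mismatched letter up to maxOffset positions
                // ahead in either string, and re-align on the closest match.
                for (int i = 1; i <= maxOffset; i++) {
                    if (c + i < s1.length() && s1.charAt(c + i) == s2.charAt(c)) {
                        offset1 = i;
                        break;
                    }
                    if (c + i < s2.length() && s1.charAt(c) == s2.charAt(c + i)) {
                        offset2 = i;
                        break;
                    }
                }
                dist++; // each mismatch costs 1, re-aligned or not
            }
            c++;
        }
        // Add half of whatever remained unparsed in the longer string.
        return dist + (s1.length() - offset1 + s2.length() - offset2) / 2 - c;
    }

    public static void main(String[] args) {
        System.out.println(distance("guilford", "guildford", 5)); // 1, per the comments
        System.out.println(distance("ford", "guildford", 5));     // also 1 (the prefix quirk)
    }
}
```

With similarity defined as 1 - distance / max(len1, len2), the example below yields distance 11 for abcdefghijkl vs. badcfehgjilk, i.e. 1 - 11/12 ≈ 0.08, matching the figure quoted in the post.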

Performance:
The algorithm seems to work fine for letter insertions, typos, letter inversions and such. It gives slightly different values from Levenshtein when big words are inverted. When a lot of letters are inverted, the difference from Levenshtein increases. Example: abcdefghijkl vs. badcfehgjilk (every two letters inverted) results in 0.42 similarity in Levenshtein and 0.08 in Sift2. But let's face it, the two words are really different.

Update:
I've optimised the algorithm a little and also changed the formula so that it matches the Levenshtein distance as closely as possible. The basic idea remains the same, but now the average error from Levenshtein (calculated on real company name and address data) is only 3%.
Some people saw the goto line and immediately laughed. Well, I tried replacing it with a break and a subtraction, and I even removed the subtraction altogether (thus maiming the algorithm), yet the goto implementation was still the fastest. So I will continue to use goto, as it is more efficient.

Request:
I have seen a lot of people actually searched on Google and got here. I would really really really like some comments, links to implementations or whatever... It's my baby after all. Thanks!

39 comments:

Anonymous said...

I also had a problem with the Levenshtein implementation - it killed my CPU.

This is a great implementation. Thanks!

Simon said...

This is awesome, thanks a lot! I will use it to compare large numbers of names with different spellings and wordings, and it works beautifully.

Anonymous said...

Goto is extremely useful sometimes. People sometimes read something somewhere and accept it without even understanding why.

Great code. I am on my way to testing it, so you'll have some feedback ;)

Anonymous said...

I'm testing this algorithm with the intention of using it to correct spelling mistakes in town names. I've found that it reports the distance between "guilford" and "guildford" as 1, and the distance between "ford" and "guildford" also as 1. I can see how that's right in the first instance, but the edit distance between "guildford" and "ford" should surely be more than 1?

Siderite said...

That's a particular case where one word is prefixed with a few letters to form the second. I have no real solution for this inside Sift2. Maybe a future Sift3, but as I said in the article, Sift2 only detects very similar words; you could/should then use Levenshtein to make sure.

Anonymous said...

Hmm, if you are now down to the single-pass scan it might be cheap enough to keep a pointer (and length) to the longest common substring found.

That should allow handling guildford -- ford, and probably a few other big cases. OTOH, I wonder what would happen with guildford -- fordshire ...

Siderite said...

It would be pointless, since this algorithm only looks for a letter in the vicinity of its expected position, so finding the longest common continuous substring would only work if it were offset by a few characters.

I guess a combination of Sift3, Levenshtein, length comparison and counting the occurrences of each character in the strings could yield optimal results, but to the detriment of the speed Sift3 was conceived for.

christoph said...

Just came here searching for Levenshtein. I will try your algorithm in the future and give you feedback.
Thanks for the work.

Anonymous said...

nifty implementation.

I did wonder about the goto myself... what I did, and what might work and should be nearly as efficient is a state boolean value...

outer loop...
.. bFound = false
.. for each loop
.. set bFound = true, break out of for loop.
.. only increment dist if bfound = false.

I believe that maintains your logic integrity.
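In code, the pattern might look like this (a sketch; the method and variable names are illustrative, not from the original C#). The boolean stands in for the goto by making sure a mismatch is charged exactly once:

```java
public class FlagDemo {
    // Handle one mismatch step: search the window for a re-alignment offset.
    // The cost is 1 whether or not one is found; the boolean flag replaces
    // the goto that the original C# used to skip the second increment.
    static int mismatchCost(String s1, String s2, int c, int maxOffset, int[] offsets) {
        int dist = 0;
        boolean found = false;
        for (int i = 1; i <= maxOffset; i++) {
            if (c + i < s1.length() && s1.charAt(c + i) == s2.charAt(c)) {
                dist++;
                offsets[0] = i; // re-align string 1
                found = true;
                break;
            }
            if (c + i < s2.length() && s1.charAt(c) == s2.charAt(c + i)) {
                dist++;
                offsets[1] = i; // re-align string 2
                found = true;
                break;
            }
        }
        if (!found) dist++; // the line the goto used to skip
        return dist;
    }

    public static void main(String[] args) {
        int[] offsets = new int[2];
        // 'f' vs 'd' at position 4 of guilford / guildford: re-aligns, costs 1.
        System.out.println(mismatchCost("guilford", "guildford", 4, 5, offsets)); // 1
    }
}
```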

Anonymous said...

Hey there, I've just converted your code into VB.NET (.NET 3.5). I'm not sure if you have this in C#, but I swapped out the 'goto' with 'exit for'.

Siderite said...

vb.Net... is that like... the ugly C# ? :D

Anonymous said...

Hi Siderite

Deeds here. First of all, great job, your algorithms are quite nice.

However, after doing performance tests I have found that Sift2 is faster than Sift3, although you (and others) have posted that you found Sift3 to be faster.

No doubt that Sift3 is more accurate and a better solution than Sift2. I'm just curious why my implementation seems slower...

any ideas?

pelle said...

Thanks! Great work!

Anonymous said...

I'm using this to pick out RFID tags that have been written to a file. It's picking out partial reads perfectly, even though the IDs are sequential (probably helped by the CRC).

Siderite said...

Thank you for sharing your experience with me. May I ask what makes you use Sift2 rather than Sift3?

Anonymous said...

Is it possible to have this in vb.net..?

Siderite said...

If you Google for "online convert C# to VB" you will find a series of tools that do this automatically. Here is a link of the first thing I found:
http://www.developerfusion.com/tools/convert/csharp-to-vb/.

Anonymous said...

A nice function, written well. It just seems to give too broad a result for any given word if it's misspelled.

janapati said...

I am trying to compare two strings from Excel and Access (.mdb). I need to compare the Excel string with the mdb list of strings, fetch the closest match percentage, and group them based on the percentage match, like above 0.1 to group A and 0.50 - 0.99 to group B.

The problem is, the Excel string ‘abc capital pvt ltd’ is not matching ‘abc capital private limited’ in the mdb, hence this is going into group B, although both are actually the same.

Manual work is impossible for me as the compare list has over 40K strings. Any help on this is highly appreciated.

Thanks a lot.

Siderite said...

In your case, I think the best way is to try to bring the string to a canonical form: something that tries to find keywords like private and limited and whatever, order them correctly, then compare with the other strings brought to the same form.

Sift is used to find similar strings, best for looking for typos. In order to find similar semantics, you need to somehow define the grammar of strings.

janapati said...

Hi Siderite,

Thanks for your reply. I am trying to remove words like pvt, private, etc. and compare, but this is only a stopgap solution for me.

Could you please provide any useful links for me?

Thanks

Siderite said...

Well, I would have to Google for links, but that you can do yourself. If it were me, I would create a list of regular expressions that act like successive transformations. I would pass all strings through this and store the result (the canonical form). Then run some algorithm to compare the strings.

Of course, you can do this in a more complex way, where you split the string into company name, title, etc and then the comparison can be more specific than a simple string comparison.
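As a rough illustration of the canonical-form idea, something like the sketch below. The class name and the rule list are made up for the example; a real abbreviation table would be built from your own data:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class CompanyNameCanonicalizer {
    // Hypothetical transformation list, applied in order; extend it with
    // the abbreviations that actually occur in your data.
    private static final Map<String, String> RULES = new LinkedHashMap<>();
    static {
        RULES.put("\\bpvt\\b\\.?", "private");
        RULES.put("\\bltd\\b\\.?", "limited");
        RULES.put("\\bco\\b\\.?", "company");
        RULES.put("\\s+", " "); // collapse whitespace last
    }

    public static String canonical(String name) {
        String s = name.toLowerCase().trim();
        for (Map.Entry<String, String> rule : RULES.entrySet()) {
            s = s.replaceAll(rule.getKey(), rule.getValue());
        }
        return s;
    }

    public static void main(String[] args) {
        System.out.println(canonical("ABC Capital Pvt Ltd"));         // abc capital private limited
        System.out.println(canonical("abc capital private limited")); // abc capital private limited
    }
}
```

Strings that normalize to the same form can then be compared exactly, or passed to Sift/Levenshtein to catch the remaining typos.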

Another solution would be to split the strings into words, then compare the words. Imagine this algorithm for two strings:

1. split the strings into words
2. for each word in string1, find the closest match in the list of words in string2
3. add the found match to a new string (result)
4. run Sift3 or other string comparison algorithm on string1 and result

Of course, the problem here is the word-matching algorithm. I am sure there are words that are closer to pvt than private is, so you still have to do some manual transformations.

Also, this completely ignores word order.
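Steps 1-3 above might look like the sketch below (class and method names are mine). I use plain Levenshtein for the word metric, normalized by word length; normalization matters, because by raw edit distance "ltd" is closer to "abc" (3 edits) than to "limited" (4 edits):

```java
import java.util.ArrayList;
import java.util.List;

public class WordMatcher {
    // Plain Levenshtein distance; any string metric (e.g. Sift) works here too.
    static int lev(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] cur = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            cur[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                cur[j] = Math.min(Math.min(prev[j] + 1, cur[j - 1] + 1), prev[j - 1] + cost);
            }
            int[] t = prev; prev = cur; cur = t;
        }
        return prev[b.length()];
    }

    // Distance normalized by the longer word, so short words are not
    // spuriously "close" to every other short word.
    static double normalized(String a, String b) {
        return (double) lev(a, b) / Math.max(a.length(), b.length());
    }

    // Steps 1-3: split into words and, for each word of s1, keep the
    // closest word of s2.
    public static String closestMatches(String s1, String s2) {
        String[] words2 = s2.split("\\s+");
        List<String> result = new ArrayList<>();
        for (String w1 : s1.split("\\s+")) {
            String best = words2[0];
            for (String w2 : words2)
                if (normalized(w1, w2) < normalized(w1, best)) best = w2;
            result.add(best);
        }
        return String.join(" ", result);
    }

    public static void main(String[] args) {
        System.out.println(closestMatches("abc capital pvt ltd", "abc capital private limited"));
        // abc capital private limited
    }
}
```

The rebuilt string can then be fed, together with string1, to Sift3 or another comparison algorithm, as step 4 says.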

Duma said...

What is this MaxOffset thing? Can you please explain a bit ...

janapati said...

Hi Siderite,

Thanks for the hints, I will try to work it out. I will post the result to you when I complete it.

Thanks for your time.

Siderite said...

First of all, Sift2 is obsolete. Follow the link at the start of the post for Sift3.

MaxOffset is the maximum offset from the position of a character in string1 at which the same character is sought in string2.

Example: MaxOffset=5, strings abbbbb and bbbbba.
The algorithm will find 'a' in string2, because it is within the range of 5 characters. If there were another 'b' in string2 before the 'a', Sift would not have found the 'a' character.

That's why Sift is so fast, because it only looks for matches in a subsection of the other string.
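A tiny self-contained illustration of that windowed lookup (the helper name is mine, not part of Sift itself):

```java
public class OffsetWindowDemo {
    // Return the offset (1..maxOffset) at which ch occurs in s after
    // position c, or -1 if it is not inside the window.
    static int findWithinOffset(String s, int c, char ch, int maxOffset) {
        for (int i = 1; i <= maxOffset; i++)
            if (c + i < s.length() && s.charAt(c + i) == ch) return i;
        return -1;
    }

    public static void main(String[] args) {
        // The example from the reply: the 'a' of "abbbbb" is found in
        // "bbbbba" because it sits exactly 5 positions away...
        System.out.println(findWithinOffset("bbbbba", 0, 'a', 5)); // 5
        // ...but with a smaller window it is out of reach.
        System.out.println(findWithinOffset("bbbbba", 0, 'a', 4)); // -1
    }
}
```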

Duma said...

Thanks. Also, I want to know the maximum string length (total number of characters in each string) that Sift can handle...

Siderite said...

There is no limit. There is no memory table or anything like that. The algorithm just goes on, tries to greedily find matches and if it makes a mistake, it doesn't care and persists. Kind of like my style of programming ;)

Anonymous said...

I need code for finding plagiarism, one for text and one for source code. I'll do several text-comparison runs with your algorithm (ported to PHP, if possible) and report back. :D

Do you have an idea how code plagiarism detection differs from text?

Siderite said...

Rubbish! All code is open and free. There is no code plagiarism! ;)

I think that for what you need, you should build a specific algorithm. Imagine that I would take a piece of code and reformat it, change the name of the variables, add or remove white-space or even obfuscate it. What you say you need sounds akin to a compiler and even so, it would be easily circumvented.

However, considering a task as simple as finding pieces of code shared between one student and another, I think Sift (and by the way, Sift3 is newer and better than Sift2!! Doesn't anyone notice the first line of the blog entry?) would do nicely once you remove all whitespace and rename all variables consistently.

The problem could also be reduced to finding similarities in graphs. If you think of typical OOP code with classes, methods, properties and fields, then one could build a top-to-bottom tree of the functionality of the program, with subtrees for the functionality of the different methods. Comparing the resulting trees would be a bottom-up approach, trying to find similar methods and match them to see if the code at least looks the same in structure. Then one might go for detailed code analysis when execution trees are similar enough.

That being said, I have never done that before and thinking about it it seems that there must be tools to do exactly that out there.
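As a crude sketch of the "remove all whitespace and rename all variables consistently" idea (class name and tokenization are mine; a real tool would leave language keywords untouched instead of renaming them along with everything else):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CodeNormalizer {
    // Rename identifiers to v0, v1, ... in order of first appearance and
    // drop all whitespace; two trivially renamed copies then compare equal.
    // Note: this naive version renames keywords (int, for, ...) too.
    public static String normalize(String code) {
        Map<String, String> names = new HashMap<>();
        Matcher m = Pattern.compile("[A-Za-z_][A-Za-z0-9_]*|\\S").matcher(code);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            String tok = m.group();
            if (tok.matches("[A-Za-z_][A-Za-z0-9_]*")) {
                names.putIfAbsent(tok, "v" + names.size());
                out.append(names.get(tok));
            } else {
                out.append(tok); // operators, punctuation, digits pass through
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String a = "int total = price + tax;";
        String b = "int sum=cost+fee;";
        System.out.println(normalize(a).equals(normalize(b))); // true
    }
}
```

The normalized strings can then be compared exactly or run through Sift; of course, any real plagiarist who reorders statements defeats this, which is where the tree comparison above comes in.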

Anonymous said...

Thanks for your quick response :)

However, we've already found some code similarity detectors, but they are all bloated. Removal of variable names was what I had in mind too, but since we have to compare Java code, your OOP tree idea seems nifty. (I admire your coding skills.)

I will check with my teammate and try and report back, then. :)

Anonymous said...

Do you have a way to compare two text files for concept/semantic similarity instead of string similarity?

For instance,
text 1:
Between war and peace many choose peace without knowing its price. War is not necessarily irrational. However, to truly understand it, understand humanity is the starting point...

text 2:
History is made up of many many wars. War is great for strong nations. I'm a citizen of a strong nation, so, I'm all for it.

Siderite said...

Well, I would try a Google service for that :) I personally don't have something like this.

fRequEnCy said...

I just wanted to thank you for this wonderful algorithm. I've been using it with great success and the speed is ridiculously fast.

Siderite said...

I am very glad to hear that. I am curious, though, why don't you use Sift3? Everybody seems to be using version 2 for some reason...

fRequEnCy said...

Well, the reason I was using Sift2 was that it had been adapted to the AutoIt scripting language by another user. It met my needs, so I started using it pretty much a few days before I posted here. I will definitely take a look at Sift3 and adapt it as well, as I'm impressed with it as is. Your algorithm is exactly what I was looking for to use in an app I'm working on. So again, thanks!

Logan Murray said...

Don't worry about the GOTOs when they make sense. Some people have a zealous avoidance of them that borders on religion. But if changing it would produce less readable (or, in this case, less performant, which is the whole point of your post) code, I say stick with it.

Kirk Sayre said...

Here is a Java version of the C# code. The only functional change I made was dividing by the average length of the strings rather than by the max length, in order to punish short strings compared against long ones.

/**
 * This class defines fast ways of computing the
 * differences/similarity of 2 strings. This Java code is a Java
 * implementation of the C# code from
 * http://siderite.blogspot.com/2007/01/super-fast-string-distance-algorithm.html
 * originally written by Siderite Zackwehdex.
 */
public class StringSift2 {

    private int maxOffset = 5;

    public StringSift2() {}

    public StringSift2(int maxOffset) {
        this.maxOffset = maxOffset;
    }

    /**
     * Compute the # of edits needed to make one string equal to another
     * string.
     *
     * @param s1 The 1st string.
     * @param s2 The 2nd string.
     * @return The # of character edits needed to make s1 == s2.
     */
    public double NumEdits(String s1, String s2) {
        // With no s1, the distance is simply the length of s2, and vice versa.
        if (s1 == null || s1.length() == 0) {
            return (s2 == null || s2.length() == 0) ? 0.0 : s2.length();
        }
        if (s2 == null || s2.length() == 0) {
            return s1.length();
        }

        // If we get here we actually have strings for s1 and s2. Figure
        // out the edit distance to make s1 == s2.
        int c = 0;       // common cursor into both strings
        int offset1 = 0; // extra offset into s1 after a re-alignment
        int offset2 = 0; // extra offset into s2 after a re-alignment
        int dist = 0;
        while ((c + offset1 < s1.length()) && (c + offset2 < s2.length())) {
            if (s1.charAt(c + offset1) != s2.charAt(c + offset2)) {
                offset1 = 0;
                offset2 = 0;
                // Look for the mismatched character up to maxOffset positions
                // ahead in either string and re-align on the closest match.
                for (int i = 1; i <= maxOffset; i++) {
                    if ((c + i < s1.length()) && (s1.charAt(c + i) == s2.charAt(c))) {
                        offset1 = i;
                        break;
                    }
                    if ((c + i < s2.length()) && (s1.charAt(c) == s2.charAt(c + i))) {
                        offset2 = i;
                        break;
                    }
                }
                // A mismatch costs exactly 1, re-aligned or not (the original
                // C# used a goto to guarantee the single increment).
                dist++;
            }
            c++;
        }

        // Add half of the length that remained unparsed in the longer string.
        return dist + (s1.length() - offset1 + s2.length() - offset2) / 2 - c;
    }

    /**
     * Return a metric describing the similarity between 2 strings.
     *
     * @param s1 The 1st string.
     * @param s2 The 2nd string.
     * @return A metric describing the similarity between 2 strings, where
     * 1 means the strings are equal and 0 means the strings are
     * completely different.
     */
    public double Similarity(String s1, String s2) {
        double dis = NumEdits(s1, s2);
        int avgLen = (s1.length() + s2.length()) / 2;
        if (avgLen == 0) return 1;
        return 1 - dis / avgLen;
    }

    /**
     * Return a metric describing the difference between 2 strings.
     *
     * @param s1 The 1st string.
     * @param s2 The 2nd string.
     * @return A metric describing the difference between 2 strings, where
     * 0 means the strings are equal and 1 means the strings are
     * completely different.
     */
    public double Difference(String s1, String s2) {
        return 1 - Similarity(s1, s2);
    }
}

Anonymous said...

Ran your code on SQL 2008R2 and compared it to the Levenshtein formula in the Master Data Services pack. Yours takes about 8% longer to run on 10,0000 comparisons, which is quite impressive. All the best

James

Anonymous said...

I also had a problem with a Levenshtein implementation - I needed to calculate similarities in a big matrix of short strings (lengths between 10 and 200 characters) and it killed my CPU.

Thanks for the ideas and implementation.

PS: People who judge gotos learned programming from books; they have never been involved in real-world applications ;-)