Update!! Read the Sift3 post. It is an even better algorithm.
A while ago I wrote a little algorithm that tried to cope with the large amount of time that the Levenshtein edit-distance algorithm took to compare two strings. It was a really funny algorithm, but it was fast enough to show me the similarity between strings that, after all, should have been either very similar or different.
Meanwhile I started thinking if I could improve on it. I was sure I could, because it wasn't very scientific, it was empirical. And finally I did it. This is the Sift2 algorithm, along with an SQL server function that can make one million string comparisons in 2.5 minutes. Compared with Levenshtein, Sift2 performs 10 to 25 times faster and it is four times faster than Sift.
The concept:
Starting from the beginning of both words, compare the letters on the same position.
If the same, move forward, else search the letter from the first word in the next maxOffset letters in the second word and viceversa.
Offset the words according to the closest found letter, add 1 to distance
Repeat until one of the words ends
Add to the calculated distance half of the length of string remained unparsed from the other word
That's it! Here is the code:
You can find a nice OOP Javascript implementation at IT BASE.
Performance: The algorithm seems to be working fine for letter insertions, typos, letter inversions and such. It gives slightly different values than Levenshtein when big words are inverted. When a lot of letters are inverted, the difference from Levenshtein increases. Example: abcdefghijkl vs. badcfehgjilk (every two letters inverted) results in 0.42 similarity in Levenshtein and 0.08 in Sift2. But let's face it, the two words are really different.
Update: I've optimised the algorithm a little and also changed the formula so that it matches the Levenshtein distance as close as possible. The basic idea remains the same, but now the average error from Levenshtein (calculated or real company name and address data) is only 3%. Some people saw the goto line and immediately laughed. Well, I tried replacing it with a break and a subtraction and even removed the subtraction altogether (thus maiming the algorithm) and the goto implementation was still the fastest. So I will continue to use goto, as it is more efficient.
Request: I have seen a lot of people actually searched on Google and got here. I would really really really like some comments, links to implementations or whatever... It's my baby after all. Thanks!
36 comments:
Anonymous
said...
I also had a prblem with the Levenshtein implementation - it killed my CPU.
I'm testing this algorithm with the intention of using it to correct spelling mistakes in town names. I've found that it reports the distance between "guilford" and "guildford" as 1, and the distance between "ford" and "guildford" also as 1. I can see how that's right in the first instance, but the edit distance between "guildford" and "ford" should surely be more than 1?
That's a particular case when one word is prefixed with a few letters to form the second. I have no real solution for this inside Sift2. Maybe a future Sift3, but as I said in the article, Sift2 only finds out very similar words, then you could/should use Levenshtein to make sure.
It would be pointless, since this algorithm only looks for a letter in the vicinity of the position of the letter and thus finding the common longest continuous substring would only work if it is offset by a few characters.
I guess a combination of Sift3, Levenshtein, length comparison and counting the numbers of each character in the strings can yield optimal results, but in the detriment of the speed Sift3 was conceived for.
Deeds here, First of all great job you algoritms are quite nice.
However, after doing performance test i have found that sift2 is faster than sift3 although you (and others) have postet that you found sift3 to be faster.
no doubt that sift3 is more accurate and a better solution than sift2. I'm just curious why my implementation seems slower...
Im using this to pick out RFID tags that have been written to a file. Its picking out partial reads perfectly, even thouth the id's are sequential (probably helped by the crc)
I am trying to compare two strings from Excel and Access (.mdb). I need to compare excel string with mdb list of strings and fetch closest match percentage and group them based on percentage match like above 0.1 to group A and 0.50 - 0.99 to group B.
The problem is, excel string ‘abc capital pvt ltd’ is not matching with ‘abc capital private limited’ in mdb, hence this is going into group B, both are same actually.
Manual work is impossible for me as the compare list has over 40K strings. Any help on this is highly appreciated.
In your case, I think the best way is to try to bring the string to a canonical form, something that tries to find keywords like private and limited and whatever , orde them corectly, then compare with the other strings brought to the same form.
Sift is used to find similar strings, best for looking for typos. In order to find similar semantics, you need to somehow define the grammar of strings.
Well, I would have to Google for links, but that you can do yourself. If it were me, I would create a list of regular expressions that act like successive transformations. I would pass all strings through this and store the result (the canonical form). Then run some algorithm to compare the strings.
Of course, you can do this in a more complex way, where you split the string into company name, title, etc and then the comparison can be more specific than a simple string comparison.
Another solution would be to split the strings into words, then compare the words. Imagine this algorithm for two strings:
1. split the strings into words 2. for each word in string1, find the closest match in the list of words in string2 3. add the found match to a new string (result) 4. run Sift3 or other string comparison algorithm on string1 and result
Of course, the problem here is the word matching algorithm. I am sure there are words that are closest to pvt than private, so you still have to do some manual tranformations.
First of all, Sift2 is obsolete. Follow the link at the start of the post for Sift3.
MaxOffset is the maximum offset from the position of a character in string1 where the same character is seekd in string2.
Example: MaxOffset=5, strings abbbbb and bbbbba The algorithm will find 'a' in string2, because it is the range of 5 characters. if there was another 'b' in string 2, Sift would not have found the last 'a' character.
That's why Sift is so fast, because it only looks for matches in a subsection of the other string.
There is no limit. There is no memory table or anything like that. The algorithm just goes on, tries to greedily find matches and if it makes a mistake, it doesn't care and persists. Kind of like my style of programming ;)
I need a code for finding plagiarism, one for text and one for sourceCode. I'll run several text-comparing runs with your algorithm (ported to php, if this is possible) and report back. :D
Do you have an idea how code plagiarism detection differs from text?
Rubbish! All code is open and free. There is no code plagiarism! ;)
I think that for what you need, you should build a specific algorithm. Imagine that I would take a piece of code and reformat it, change the name of the variables, add or remove white-space or even obfuscate it. What you say you need sounds akin to a compiler and even so, it would be easily circumvented.
However, considering a simple task as simply finding the pieces of code that are shared between some students or another, I think Sift (and btw, Sift3 is newer and better than Sift2!! Doesn't anyone notice the first line of the blog entry?) would do nicely once you remove all whitespace and rename all variables consistently.
The problem could also be reduced to finding similarities in graphs. If you think of typical OOP code with classes, methods, properties and fields, then one could build a top-to-bottom tree of the functionality of the program, subtrees of the functionality of different methods. Comparing the resulting trees would come as a bottom-up aproach, trying to find similar methods and match them to see if the code at least looks the same in structure. Then one might go for detailed code analysis when execution trees are similar enough.
That being said, I have never done that before and thinking about it it seems that there must be tools to do exactly that out there.
However, we've already found some code similartiy detectors, but they all are bloated. Removal of VarNames was what i had in mind, too, but since we have to compare Java Code, your OOP-Tree-Idea seems nifty. (I admire your coding skills.)
I will check with my teammate and try and report back, then. :)
Do you have a way to compare two text files for concept/semantic similarity instead of string similarity?
For instance, text 1: Between war and peace many choose peace without knowing its price. War is not necessarily irrational. However, to truly understand it, understand humanity is the starting point...
text 2: History is made up of many many wars. War is great for strong nations. I'm a citizen of a strong nation, so, I'm all for it.
Well. The reason I was using Swift2 was that it was adapted to autoit scripting language by another user. It met my needs so starting using it pretty much a few days before I posted here. I will definitely take a look at Swift3 and adapt it as well as I'm impressed with it as is. You algorithm is exactly what I was looking for to use in an app I'm working on. So again, thanks!
Don't worry about the GOTOs when they make sense. Some people have a zealous avoidance of them that borders on religion. But if changing it would produce even less readable (or in this case, performant--which is the whole point of your post) code, I say stick with it.
36 comments:
I also had a prblem with the Levenshtein implementation - it killed my CPU.
This is a great implementation. Thanks!
This is awesome, thanks a lot! I will use it to compare large numbers names with different spellings and wordings, and it works beautifully.
Goto is extremely useful sometimes. People sometimes read something somewhere and they acknowledge without even understanding why.
Great code. I am on my way of testing it, so you`ll have some feedback ;)
I'm testing this algorithm with the intention of using it to correct spelling mistakes in town names. I've found that it reports the distance between "guilford" and "guildford" as 1, and the distance between "ford" and "guildford" also as 1. I can see how that's right in the first instance, but the edit distance between "guildford" and "ford" should surely be more than 1?
That's a particular case when one word is prefixed with a few letters to form the second. I have no real solution for this inside Sift2. Maybe a future Sift3, but as I said in the article, Sift2 only finds out very similar words, then you could/should use Levenshtein to make sure.
Hmm, if you are now down to the single-pass scan it might be cheap enough to keep a pointer (and length) to the longest common substring found.
That should allow handling guildford -- ford, and probably a few other big cases. OTOH, I wonder what would happen with guildford -- fordshire ...
It would be pointless, since this algorithm only looks for a letter in the vicinity of the position of the letter and thus finding the common longest continuous substring would only work if it is offset by a few characters.
I guess a combination of Sift3, Levenshtein, length comparison and counting the numbers of each character in the strings can yield optimal results, but in the detriment of the speed Sift3 was conceived for.
Just came here searching for Levenshtein. I will try your algorithm in the future and give you feedback.
Thanks for the work.
nifty implementation.
I did wonder about the goto myself... what I did, and what might work and should be nearly as efficient is a state boolean value...
outer loop...
.. bFound = false
.. for each loop
.. set bFound = true, break out of for loop.
.. only increment dist if bfound = false.
I believe that maintains your logic integrity.
Hey there, i've just converted your code into vb.net (.net 3.5). i'm not sure if you have this in C3, but i swapped out the 'goto' with 'exit for'
vb.Net... is that like... the ugly C# ? :D
Hi Siderite
Deeds here, First of all great job you algoritms are quite nice.
However, after doing performance test i have found that sift2 is faster than sift3 although you (and others) have postet that you found sift3 to be faster.
no doubt that sift3 is more accurate and a better solution than sift2. I'm just curious why my implementation seems slower...
any ideas?
Thanks! Great work!
Im using this to pick out RFID tags that have been written to a file. Its picking out partial reads perfectly, even thouth the id's are sequential (probably helped by the crc)
Thank you for sharing your experience with me. May I ask what makes you use Sift2 rather than Sift3?
Is it possible to have this in vb.net..?
If you Google for "online convert C# to VB" you will find a series of tools that do this automatically. Here is a link of the first thing I found:
http://www.developerfusion.com/tools/convert/csharp-to-vb/.
A nice function, written well. It just seems to give to broad of return for any given word if it's miss-spelled.
I am trying to compare two strings from Excel and Access (.mdb). I need to compare excel string with mdb list of strings and fetch closest match percentage and group them based on percentage match like above 0.1 to group A and 0.50 - 0.99 to group B.
The problem is, excel string ‘abc capital pvt ltd’ is not matching with ‘abc capital private limited’ in mdb, hence this is going into group B, both are same actually.
Manual work is impossible for me as the compare list has over 40K strings. Any help on this is highly appreciated.
Thanks a lot.
In your case, I think the best way is to try to bring the string to a canonical form, something that tries to find keywords like private and limited and whatever , orde them corectly, then compare with the other strings brought to the same form.
Sift is used to find similar strings, best for looking for typos. In order to find similar semantics, you need to somehow define the grammar of strings.
Hi Siderite,
Thanks for your reply, i am trying to remove words like pvt private etc and compare but, this is only a time being solution for me.
Could you please provide any useful links for me?
Thanks
Well, I would have to Google for links, but that you can do yourself. If it were me, I would create a list of regular expressions that act like successive transformations. I would pass all strings through this and store the result (the canonical form). Then run some algorithm to compare the strings.
Of course, you can do this in a more complex way, where you split the string into company name, title, etc and then the comparison can be more specific than a simple string comparison.
Another solution would be to split the strings into words, then compare the words. Imagine this algorithm for two strings:
1. split the strings into words
2. for each word in string1, find the closest match in the list of words in string2
3. add the found match to a new string (result)
4. run Sift3 or other string comparison algorithm on string1 and result
Of course, the problem here is the word matching algorithm. I am sure there are words that are closest to pvt than private, so you still have to do some manual tranformations.
Also, this completely ignores word order.
What is this MaxOffset thing? Can you please explain a bit ...
Hi Siderite,
Thanks for the hints, i will try work it out. Will post the result to you as i complete.
Thanks for your time.
First of all, Sift2 is obsolete. Follow the link at the start of the post for Sift3.
MaxOffset is the maximum offset from the position of a character in string1 where the same character is seekd in string2.
Example: MaxOffset=5, strings abbbbb and bbbbba
The algorithm will find 'a' in string2, because it is the range of 5 characters. if there was another 'b' in string 2, Sift would not have found the last 'a' character.
That's why Sift is so fast, because it only looks for matches in a subsection of the other string.
Thanks, Also I want to know the maximum string length ( Total number of characters in each string ) that sift can handle ...
There is no limit. There is no memory table or anything like that. The algorithm just goes on, tries to greedily find matches and if it makes a mistake, it doesn't care and persists. Kind of like my style of programming ;)
I need a code for finding plagiarism, one for text and one for sourceCode. I'll run several text-comparing runs with your algorithm (ported to php, if this is possible) and report back. :D
Do you have an idea how code plagiarism detection differs from text?
Rubbish! All code is open and free. There is no code plagiarism! ;)
I think that for what you need, you should build a specific algorithm. Imagine that I would take a piece of code and reformat it, change the name of the variables, add or remove white-space or even obfuscate it. What you say you need sounds akin to a compiler and even so, it would be easily circumvented.
However, considering a simple task as simply finding the pieces of code that are shared between some students or another, I think Sift (and btw, Sift3 is newer and better than Sift2!! Doesn't anyone notice the first line of the blog entry?) would do nicely once you remove all whitespace and rename all variables consistently.
The problem could also be reduced to finding similarities in graphs. If you think of typical OOP code with classes, methods, properties and fields, then one could build a top-to-bottom tree of the functionality of the program, subtrees of the functionality of different methods. Comparing the resulting trees would come as a bottom-up aproach, trying to find similar methods and match them to see if the code at least looks the same in structure. Then one might go for detailed code analysis when execution trees are similar enough.
That being said, I have never done that before and thinking about it it seems that there must be tools to do exactly that out there.
Thanks for your quick respone o:
However, we've already found some code similartiy detectors, but they all are bloated. Removal of VarNames was what i had in mind, too, but since we have to compare Java Code, your OOP-Tree-Idea seems nifty. (I admire your coding skills.)
I will check with my teammate and try and report back, then. :)
Do you have a way to compare two text files for concept/semantic similarity instead of string similarity?
For instance,
text 1:
Between war and peace many choose peace without knowing its price. War is not necessarily irrational. However, to truly understand it, understand humanity is the starting point...
text 2:
History is made up of many many wars. War is great for strong nations. I'm a citizen of a strong nation, so, I'm all for it.
Well, I would try a Google service for that :) I personally don't have something like this.
I just wanted to thank you for this wonderful algorithm. I've been using with great success and the speed is ridiculously fast.
I am very glad to hear that. I am curious, though, why don't you use Sift3? Everybody seems to be using version 2 for some reason...
Well. The reason I was using Swift2 was that it was adapted to autoit scripting language by another user. It met my needs so starting using it pretty much a few days before I posted here. I will definitely take a look at Swift3 and adapt it as well as I'm impressed with it as is. You algorithm is exactly what I was looking for to use in an app I'm working on. So again, thanks!
Don't worry about the GOTOs when they make sense. Some people have a zealous avoidance of them that borders on religion. But if changing it would produce even less readable (or in this case, performant--which is the whole point of your post) code, I say stick with it.
Post a Comment