Tuesday, June 25, 2013

Sift3 gone wild!

A while ago (geez, it's been 6 years!) I wrote an algorithm that was supposed to quickly and accurately find the distance between two strings. After a few iterations it got really simple to implement, understand and use, unlike more academic algorithms like Levenshtein, for example. I placed all the code in this blog and allowed everyone to use it in any way they saw fit. Let me make this clear that it is not the greatest invention since fire, but it is mine and I feel proud when people use it. And today I accidentally stumbled upon something called Mailcheck, by Derrick Ko and Wei Lu. Not only did they use my algorithm, but they also graciously linked to my blog. And, according to the description from their GitHub page, this javascript library is being used by the likes of Kicksend, Dropbox, Kickstarter, Minecraft and the Khan Academy. Talking about Sift going wild! Woohoo!

So I started to Google for other uses of Sift3. Here is a list:
  • Mailcheck, the software that I was talking about above.
  • Sift3 for AutoKey - Autokey does "Fast scriptable desktop automation with hotkeys". Toralf also published the result on GitHub Gist: AutoHotkey: StrDiff() and his implementation is now used in 7Plus, a software to improve usability in Windows
  • Longest common substring problem - wikibooks varient vs sift3 varient - which seems to make Wikibooks the winner. Drat! :) loser! On second look I noticed that the values did not show time, but operations per second, so more is better. Also, looking at the implementation I noticed that it uses a maxOffset not of 5, but of the minimum length of the compared strings, which makes it more accurate, but much slower (and still wins!)
  • A Java implementation on BitBucket
  • A PDF document suggesting the algorithm is being used in an Italian software called CRM Deduplica

All in all I am very satisfied of how Sift3 is being used in the wild and, I have to say, grateful to the people that trusted my work enough to include it in theirs. It took 6 years, but look how much it has grown!

Update: To celebrate the usage of my algorithm, I've added an improved Javascript version in the original post, a form of the algorithm that I call "3B", since there are only minor improvements.
Now I have a weird idea of an algorithm that would compute the similarity between lists of strings (which is the usual usage of string distance). Could it be done, in a simple and straightforward manner like Sift3? What do you think?


Anonymous said...

This helped me a lot. Thanks!