.NET (295) administrative (41) Ajax (42) AngularJS (2) ASP.NET (144) bicycle (2) books (180) browser (8) C# (133) cars (1) chess (28) CodePlex (10) Coma (8) database (47) deployment (3) Entity Framework (2) essay (110) flash/shockwave (2) flex (1) food (3) friend (2) game (20) idea (5) IIS (8) javascript (82) LInQ (2) Linux (6) management (4) manga (42) misc (670) mobile (1) movies (90) MsAccess (1) murder (2) music (64) mysql (1) news (99) permanent (1) personal (68) PHP (1) physics (2) picture (307) places (12) politics (13) programming (503) rant (120) religion (3) science (43) Sharepoint (3) software (58) space (1) T4 (2) technology (11) Test Driven Development (4) translation (2) VB (2) video (97) Visual Studio (44) web design (46) Windows API (8) Windows Forms (3) Windows Server (5) WPF/Silverlight (63) XML (11)

Wednesday, April 04, 2007

Super Fast and Accurate string distance algorithm: Sift3

Update June 25th 2013: I've decided to play a little with the suggestions in the comments and check for validity. This was spurned by the realization that a lot of people use my algorithm. So, in order to celebrate this, here is the "3B" version of the Sift3 algorithm:
function sift3B(s1, s2, maxOffset, maxDistance) {
    if (s1 == null||!s1.length) {
        if (s2 == null) {
            return 0;
        }
        return s2.length;
    }

    if (s2 == null||!s2.length) {
        return s1.length;
    }

    var c1 = 0;
    var c2 = 0;
    var lcs = 0;
    var temporaryDistance;

    while ((c1 < s1.length) && (c2 < s2.length)) {
        if (s1.charAt(c1) == s2.charAt(c2)) {
            lcs++;
        } else {
            if (c1<c2) {
                c2=c1;
            } else {
                c1=c2;
            }
            for (var i = 0; i < maxOffset; i++) {
                if ((c1 + i < s1.length) && (s1.charAt(c1 + i) == s2.charAt(c2))) {
                    c1+= i;
                    break;
                }
                if ((c2 + i < s2.length) && (s1.charAt(c1) == s2.charAt(c2 + i))) {
                    c2+= i;
                    break;
                }
            }
        }
        c1++;
        c2++;
        if (maxDistance)
        {
            temporaryDistance=(c1+c2)/1.5-lcs;
            if (temporaryDistance>=maxDistance) return Math.round(temporaryDistance);
        }
    }
    return Math.round((s1.length + s2.length) / 1.5 - lcs);
}

It is made in Javascript, this time, as it was easier to test and has the following extra features:
  • a maxDistance value that tells the algorithm to stop if the strings are already too different.
  • two pointers c1 and c2, rather than a single pointer c and two offsets
  • Instead of dividing to 2 the total length of the strings compared, now I divide it with 1.5. Why? Because this way the value is closer to the Levenshtein distance computed per random strings

Happy usage!
Update: the Sift algorithm is now on Codeplex.

A while ago I wrote an entry here about Sift2, an improvement of Sift, the original and silly string distance algorithm. Now I am publishing Sift3, which is way more accurate and even simpler as an algorithm.

I found out that my algorithm is part of a class of algorithms that solve the Longest Common Substring problem, therefore I calculated the LCS, not the distance, then the distance from the LCS. The result is way more robust, easy to understand and closer to the Levenshtein algorithm both on random strings and user databases. Not to mention that there is no goto in this one.

BTW, if you are looking for an algorithm that detects switched words, this is not it :) This just looks for typos and small regional differences between the strings. I mean, you could normalize the strings, so that words are ordered by some mechanism, then it would work because the words wouldn't be switched :)

I promise to work on a word switching algorithm, but not in the near future.
Without further due, here is the code:

The C# code is a method in an object that has a private member maxOffset. As in Sift2 maxOffset should be around 5 and it represents the range in which to try to find a missing character.

public float Distance(string s1, string s2, int maxOffset)
{
  if (String.IsNullOrEmpty(s1))
  {
    return String.IsNullOrEmpty(s2)
             ? 0
             : s2.Length;
  }
  if (String.IsNullOrEmpty(s2))
  {
    return s1.Length;
  }
  int c = 0;
  int offset1 = 0;
  int offset2 = 0;
  int lcs = 0;
  while ((c + offset1 < s1.Length)
         &&
         (c + offset2 < s2.Length))
  {
    if (s1[c + offset1] ==
        s2[c + offset2])
      lcs++;
    else
    {
      offset1 = 0;
      offset2 = 0;
      for (int i = 0;
           i < maxOffset;
           i++)
      {
        if ((c + i < s1.Length)
            &&
            (s1[c + i] == s2[c]))
        {
          offset1 = i;
          break;
        }
        if ((c + i < s2.Length)
            &&
            (s1[c] == s2[c + i]))
        {
          offset2 = i;
          break;
        }
      }
    }
    c++;
  }
  return (s1.Length + s2.Length)/2 - lcs;
}


And here is the T-Sql code. This version is actually an improvement of my original source, gracefully provided by Todd Wolf:

CREATE FUNCTION [DBO].[Sift3distance2]
(
 @s1 NVARCHAR(3999),@s2 NVARCHAR(3999),@maxOffset INT
)
RETURNS FLOAT
AS 
 BEGIN
    DECLARE @s1LEN INT,@s2LEN INT
    
    SELECT @s1LEN=Len(Isnull(@s1,'')),@s2LEN=Len(Isnull(@s2,''))
    
    IF @s1LEN=0 RETURN @s2LEN
    ELSE
    IF @s2LEN=0 RETURN @s1LEN
    
    IF Isnull(@maxOffset,0)=0 SET @maxOffset=5
    
    DECLARE @currPos INT,@matchCnt INT,@wrkPos INT,@s1Offset INT,@s1Char VARCHAR,@s1Pos INT,@s1Dist INT,@s2Offset INT,@s2Char VARCHAR,@s2Pos INT,@s2Dist INT
    
    SELECT @s1Offset=0,@s2Offset=0,@matchCnt=0,@currPos=0
    
    WHILE(@currPos+@s1Offset<@s1LEN AND @currPos+@s2Offset<@s2LEN)
    BEGIN
        SET @wrkPos=@currPos+1
        
        IF(Substring(@s1,@wrkPos+@s1Offset,1)=Substring(@s2,@wrkPos+@s2Offset,1)) SET @matchCnt=@matchCnt+1
        ELSE
        BEGIN
            SET @s1Offset=0
            
            SET @s2Offset=0
            
            SELECT @s1Char=Substring(@s1,@wrkPos,1),@s2Char=Substring(@s2,@wrkPos,1)
            
            SELECT @s1Pos=Charindex(@s2Char,@s1,@wrkPos)-1,@s2Pos=Charindex(@s1Char,@s2,@wrkPos)-1
            
            SELECT @s1Dist=@s1Pos-@currPos,@s2Dist=@s2Pos-@currPos
            
            IF(@s1Pos>0 AND (@s1Dist<=@s2Dist OR @s2Pos<1) AND @s1Dist<@maxOffset) SET @s1Offset=(@s1Pos-@wrkPos)+1
            ELSE
            IF(@s2Pos>0 AND (@s2Dist<@s1Dist OR @s1Pos<1) AND @s2Dist<@maxOffset) SET @s2Offset=(@s2Pos-@wrkPos)+1
        END
        
        SET @currPos=@currPos+1
    END
    
    RETURN(@s1LEN+@s2LEN)/2.0-@matchCnt
END

It doesn't give the same exact results as my own code, yet the result is close enough and the speed is about 20% higher.

And thanks to Diogo Nechtan, the version in PHP:
function sift3Plus($s1, $s2, $maxOffset) {
    $s1Length = strlen($s1); 
    $s2Length = strlen($s2);
    if (empty($s1)) {
        return (empty($s2) ? 0 : $s2Length);
    }
    if (empty($s2)) {
        return $s1Length;
    }
    $c1 = $c2 = $lcs = 0;

    while (($c1 < $s1Length) && ($c2 < $s2Length)) {
        if (($d = $s1{$c1}) == $s2{$c2}) {
            $lcs++;
        } else {
            for ($i = 1; $i < $maxOffset; $i++) {
                if (($c1 + $i < $s1Length) && (($d = $s1{$c1 + $i}) == $s2{$c2})) {
                    $c1 += $i;
                    break;
                }
                if (($c2 + $i < $s2Length) && (($d = $s1{$c1}) == $s2{$c2 + $i})) {
                    $c2 += $i;
                    break;
                }
            }
        }
        $c1++;
        $c2++;
    }
    return (($s1Length + $s2Length) / 2 - $lcs);
}

And thanks to Fernando Jorge Mota, the version in Python:


Also, here is the Javascript version, used in Mailcheck, by Derrick Ko and Wei Lu.

function sift3Distance(s1, s2) {
    if (s1 == null || s1.length === 0) {
        if (s2 == null || s2.length === 0) {
            return 0;
        } else {
            return s2.length;
        }
    }

    if (s2 == null || s2.length === 0) {
        return s1.length;
    }

    var c = 0;
    var offset1 = 0;
    var offset2 = 0;
    var lcs = 0;
    var maxOffset = 5;

    while ((c + offset1 < s1.length) && (c + offset2 < s2.length)) {
        if (s1.charAt(c + offset1) == s2.charAt(c + offset2)) {
            lcs++;
        } else {
            offset1 = 0;
            offset2 = 0;
            for (var i = 0; i < maxOffset; i++) {
                if ((c + i < s1.length) && (s1.charAt(c + i) == s2.charAt(c))) {
                    offset1 = i;
                    break;
                }
                if ((c + i < s2.length) && (s1.charAt(c) == s2.charAt(c + i))) {
                    offset2 = i;
                    break;
                }
            }
        }
        c++;
    }
    return (s1.length + s2.length) / 2 - lcs;
}

Another implementation, this time in Java, by Eclesia:


You might also be interested in a customised version in AutoKey, by Toralf:


Thanks all for your comments and I look forward to more. Just tell me it worked or not and, most important, why. Good luck!

54 comments:

Brad Wood said...

I like this. Using the SQL function, I have had a couple strange results right off the bat with a single back slash "\" making a really big difference. Is "\" a reserved character in SQL? This is SQL Server 2005.

Brad Wood said...

Nevermind my previous post as my problem appears to have nothing to do with SQL server. I re-wrote it in ColdFusion script and experienced the exact same behavior. I am comparing a larger amount of text like a couple of paragraphs.
If I were to compare two sentences which were idential with the exception of an extra letter being present at the beginning of on of the sentences, then the two would compare as being very different. Furthermore if I was to omit or add a single letter at the end of a string, they would compare as being very similar.
That would obviously not be the case, since it was only a single-character difference in each scenario. Perhaps I shouldn't use this algorithm for larger strings such as sentences.

Brad Wood said...

Sorry to be spamming you comments, but I wanted to say that it appears the behavior I reported in my second post was simply becase I had a maxOffset of 0. With a larger maxOffset I am able to insert several words into the middle or beginning of one string and it still produces a relativley close match.

Of course, I can still trick it by simply re-arranging the same peices of data.
For instance, if I was to compare the strings:
"I like spam. Let's go to the store."
and
"Let's go to the store. I like spam."

then they produce a pretty low comparison even though they are the same sentences, but rearranged.
To compare those would be much more complicated though.

Siderite said...

Thank you for your comments. I didn't have time to test this properly so I might have missed something ;)
Can you give me some examples of where it didn't quite work?

Also, this algorithm is basically finding typos. maxOffset should be around 5, and it detects (or it should :-/) mispelled words or small 4 letter word insertions and so on.

What you need for getting around the "trick" with the word rearranging is an algorithm that searches and compares groups of two or three letters or even entire words. But that's beyond the scope of Sift3.

Again, thank you for posting comments of your experiences with Sift.

Pawprint said...

In the C# sample, was maxOffset supposed to be a parameter like in the T-SQL version? Because it's not declared in the function itself.

Pawprint said...

oh, wait, I looked at the code for Sift2 and I see that it's a private field initialized during construction. This is an excellent example of why I am a proponent of decorating such fields, either with _ (Microsoft's recommendation) or m_ (the older, MFC-style version). If it had been called _maxOffset I probably would have figured it out myself.
Anyways, thanks for the code. I'm trying to implement matching/duplicate elimination in-house and this will be my starting point. There's a lot of talk on the 'Net about edit distance and a lot of mathematical formulas that I probably would have understood 20 years ago in school, but very little actual code. Again, thanks.

Siderite said...

Sorry about that. Published Sift3 in a bit of a hurry and I only copy pasted the method from an object.
I am glad I could help, I am no mathematician and after looking at some of the formulas you are talking about and not understanding anything, I started from scratch with my silly brain.
I am still wondering it worked :)

Pawprint said...

Okay, I have tested this function and I can say that it's not giving me the results I expect. For example, I compared the name "Dimas" to "Jim" and "Tina" and in both cases I got a distance of 2 when I should have gotten 3. To make sure we're both on the same page here, this is why I think I should have gotten three:
Jim-> Change first letter, delete two letters at the end: 3
Tina-> Change first letter, change third letter, delete one letter at the end: 3.
Do you agree?

Pawprint said...

One more thing: your function is typed to return float but the return statement is operating on ints, so you'll never send back anything but an int. I don't think edit distance has the concept of anything but ints, anyways, so I think the function ought to reflect that.

Siderite said...

Maybe it wasn't clear enough that this algorithm is aproximating the Levenshtein distance, it doesn't always get it right.

I tried to minimize the difference between the Levenshtein distance and the result of this function and this is what I came up with. However, the first priority was speed.

In order to improve on the accuracy, you should use Levenshtein for short words (let's say if they have less than 6 letters) or on the top similarity results from Sift. In this way you double check the results, but only on those strings similar enough to be worth it.

As for the int, it's a good catch, but I am returning float so I don't have to cast anything when computing similarity and to compare with Sift2, which returned a float. Dividing by 2.0 increased the difference from Levenshtein, too :)

Anonymous said...

witch one is faster? Sift2 or Sift3?

Tks

Siderite said...

Sift 3 is better and faster.

John Davies said...

I was looking for a method to compare two digital ham radio messages to see if they were the same message. Since there can be some line noise, it has to be a little forgiving.

I tried Swift3 and at the default maxOffset of 5 it worked like crap in my test cases. Before giving up and moving on, I tried 20. Now it works like a champ.

I'm very impressed.

Siderite said...

I am glad I could help. Thank you for your comment!

DaberElay said...

Dear,

I've downloaded and start goofing arround with your algo.
first off - great work, some things are best left of simple :-)

I'd like to propose 2 improvements that would not hit performance and may be of good use:

1. make maxOffset varing according to strings lengths. i've found that u best use lower maxoffset when u have short words, than in a sentence. - easy to implement, no cpu overload, and results seem better (didn't heavy test it though).

2. location priority - again simple implementation (i.e *log(pos)), as in some sentences the begining is more important than the end or vice versa

I'd be happy to discuss it with you if you think my ideas has any value to you.

best,
DaberElay.

Siderite said...

Thanks for the input. Your sugestions certainly seem reasonable. Alas, I lack the time to work on this algorithm at the moment.
If you do transform your ideas in code (and maybe some test data as well), please let me know, I will add your contribution to this post.

Todd Wolff said...

I have a store of possibly misspelled words with a count of around 300,000. I have a dictionary of around 1000 terms. Your algo gave me some great results but takes quite a bit of time because of the loooping. I modified the SQL UDF to be more efficient and thought that I would send you my results. (Yes, I know I'm way too anal about spacing ;-) )

CREATE FUNCTION [dbo].[Sift3Distance](@s1 nvarchar(3999), @s2 nvarchar(3999),@maxOffset int)
RETURNS FLOAT
AS
BEGIN

DECLARE @s1LEN INT,
@s2LEN INT

SELECT @s1LEN = LEN(ISNULL(@s1, '')),
@s2LEN = LEN(ISNULL(@s2, ''))

IF @s1LEN = 0
RETURN @s2LEN
ELSE IF @s2LEN = 0
RETURN @s1LEN

IF ISNULL(@maxOffset,0)=0
SET @maxOffset=5

DECLARE @currPos INT,
@matchCnt INT,
@wrkPos INT,

@s1Offset INT,
@s1Char VARCHAR,
@s1Pos INT,
@s1Dist INT,

@s2Offset INT,
@s2Char VARCHAR,
@s2Pos INT,
@s2Dist INT

SELECT @s1Offset=0,
@s2Offset=0,
@matchCnt=0,
@currPos=0


WHILE (@currPos+@s1Offset<@s1LEN AND @currPos+@s2Offset<@s2LEN)
BEGIN
SET @wrkPos = @currPos + 1
IF (SUBSTRING(@s1,@wrkPos+@s1Offset,1)=SUBSTRING(@s2,@wrkPos+@s2Offset,1))
SET @matchCnt=@matchCnt+1
ELSE
BEGIN
SET @s1Offset=0
SET @s2Offset=0

SELECT @s1Char = SUBSTRING(@s1,@wrkPos,1),
@s2Char = SUBSTRING(@s2,@wrkPos,1)

SELECT @s1Pos = CHARINDEX(@s2Char, @s1, @wrkPos) - 1,
@s2Pos = CHARINDEX(@s1Char, @s2, @wrkPos) - 1

SELECT @s1Dist = @s1Pos-@currPos,
@s2Dist = @s2Pos-@currPos

IF(@s1Pos > 0 AND (@s1Dist <= @s2Dist OR @s2Pos<1) AND @s1Dist < @MaxOffset)
SET @s1Offset = (@s1Pos-@wrkPos) + 1
ELSE IF(@s2Pos > 0 AND (@s2Dist < @s1Dist OR @s1Pos<1) AND @s2Dist < @MaxOffset)
SET @s2Offset = (@s2Pos-@wrkPos) + 1
END

SET @currPos=@currPos+1
END

RETURN (@s1LEN+@s2LEN)/2.0-@matchCnt
END


I'm still trying to determine if this will give me the throughput that I'm looking for but I sure appreciate the hard work that you did!

Siderite said...

Ah, the good old days when I had the time to pursue my own ideas and research. Thank you for giving me feedback. For me it was a pleasure to create the algorithm and most of all to see it used.

I will try to test your improvement and also to integrate the other cool ideas from the others, maybe change version to Sift4. If only I had the time...

Todd Wolff said...

I totally understand. It seems these days that I only learn new stuff for which I have a need. No more time to just go have some "fun". :-)

I've been toying with the idea of adding a MaxDistance param to short circuit the loop when it falls into range to maybe shave a few more milliseconds off.

Probably won't do much for the casual app but when your comparing 13,000 known terms to 300,000 possible misspellings each iteration means something!

Will post anything else that comes out of my testing.

Siderite said...

I have a few ideas. Already got a small algorithm that is twice as fast as Sift3, but terribly innacurate, partly based on your optimisation. Made it today.

David Larson said...

One minor suggestion...include maxOffset as an input parameter and make the method Static.

Brad Wood said...

Hey, it's me again. This post is pretty old, but I figured I owed you a heads up. Quite some time ago I ported your SIFT3 over to ColdFusion script and then modified the heck out of it as an experiment. I came up with a version that I used to highlight difference in longer strings of text (1 or 2 paragraphs). With a maxoffset of anywhere from 10 to 50 it would recognize spelling differences as well as minor insertions/deletions. It would highlight the differences by inserting HTML tags into the strings for display in a web page as well as reporting the difference count and percentage. The function is nowhere as small and fast as SIFT3 was, but since I based my work off of SIFT3 I wanted to be fair and let you know.

If you want to see it, I have a small example of it along with the ColdFusion code posted on my site so I could share it with people. I credited your name and blog in the comments.

http://www.bradwood.com/string_compare/


Thanks!

~Brad

Siderite said...

You are most welcome. I really appreciate you leaving this comment. I am glad I helped.

Simon said...

Thank you; an interesting piece of code. I expect to test/use it in near future and report on my experience.

Simon

Simon Hawkin said...

Sorry, I did not test it much, cannot judge how good it is. (You left a message on my blog asking for feedback, so here it is.)

Mikkel said...

Hi Siderite!

Your 'Sift3' algortihm is just what I was looking for -- thank you very much for having made it! I immediately made a Java port of it for my project, where I was stuck with the traditionel and FAR too slow O(n^2) dynamic programming algorithm for string distance calculation. Sift3 was WAY faster and well suited to my kind of data, i.e. inaccurately OCR'ed text. However, studying Sift3 a bit deeper, I came to wonder why it makes use of just a single position pointer 'c' for both strings being compared, instead of two pointers "c1" and "c2", one for each string. The single-postition-pointer approach of Sift3 implies a fairly small limit (i.e., maxOffset) on the distance between equivalent indices in the strings. Why this limit? I wrote a two-position-pointers version of Sift3; 96,75% of the time it yields the same results as Sift3 with my data, and it is somewhat faster than Sift3, too. Here's the Java code (don't know how to make it look more readable in this forum; sorry!):

public static int sift3Plus(CharSequence s1, CharSequence s2, int maxOffset) {
final int s1Length = s1.length(), s2Length = s2.length();
if(s1 == null || s1Length == 0)
return((s2 == null || s2Length == 0) ? 0 : s2Length);
if(s2 == null || s2Length == 0)
return(s1Length);
int c1 = 0;
int c2 = 0;
int lcs = 0;
while((c1 < s1Length) && (c2 < s2Length)) {
if((d = s1.charAt(c1)) == s2.charAt(c2))
lcs++;
else {
for(int i = 1; i < maxOffset; i++) {
if((c1 + i < s1Length) && ((d = s1.charAt(c1+i)) == s2.charAt(c2))) {
c1 += i;
break;
}
if((c2 + i < s2Length) && ((d = s1.charAt(c1)) == s2.charAt(c2+i))) {
c2 += i;
break;
}
}
}
c1++;
c2++;
}
return((s1Length + s2Length)/2 - lcs);
}

Thanks again,
Mikkel

Siderite said...

It is a matter of semantics rather than code.

The c pointer shows a general position from which offsets are computed. Assuming both strings are identical, then the c pointer just moves along, while the offsets are preserved. If they are not, offsets are changed to skip the differences, then c advances as before.

The algorithm changes a little when no match is found. In that case the pointer is advanced to a position where there is a match between any two letters.

You are correct that using two pointers would speed things up a little, but also you would find a problem when trying to reset the counter to a previous position.

It's a matter of personal choice, I think.

Mikkel said...

Thanks for replying!

Indeed it's a matter of semantics: Why would you want to reset, at any time, the position pointer to a previous position? I mean, once the algorithm has matched the character indexed 'c1' in the one string with the character indexed 'c2' in the other string, it seems weird, philosophically, to allow the algorithm to try, at some point, to re-match either c1 or c2 with yet another index -- which Sift3's resetting mechanism does. After all, any index of the one string should correspond at most to one index of the other one. This is why I consider a two-pointers version of the algorithm more straightforward.

Siderite said...

Because the algorithm always looks at the first match for one of the letter at the current position. If a match is found, it moves the offset accordingly, up to maxOffset. If then it finds no match, it should start with the characters it skipped when it moved the offset.

For something like AXXXXB and XXXXAB, first it matches A, then if must reset the offset to continue with B.

But thinking of this I just realized the algorihtm has a small bug. If the offset is large, it might reach the end of the string. In that case it will not reset the offsets.

Mikkel said...

The two-pointer version has no need to do any such resetting to match A with A and then B with B in such a pair of strings. With >>A1234B<< and >>5678AB<<, c1 and c2 are first 0 and 4, respectively; both pointers are then incremented, to 1 and 5, and finally c1 is set to 5 as well. Cf. this "spypointish" output from a test run:

Sift3Plus comparing >>A1234B<<, >>5678AB<<:
c1 is 0, c2 is 0
>>A1234B<<[0]=A doesn't match >>5678AB<<[0]=5
-- >>A1234B<<[1]=1 doesn't match >>5678AB<<[0]=5
-- >>A1234B<<[0]=A doesn't match >>5678AB<<[1]=6
-- >>A1234B<<[2]=2 doesn't match >>5678AB<<[0]=5
-- >>A1234B<<[0]=A doesn't match >>5678AB<<[2]=7
-- >>A1234B<<[3]=3 doesn't match >>5678AB<<[0]=5
-- >>A1234B<<[0]=A doesn't match >>5678AB<<[3]=8
-- >>A1234B<<[4]=4 doesn't match >>5678AB<<[0]=5
-- >>A1234B<<[0]=A matches >>5678AB<<[4]=A
c1 is 0, c2 is 4
c1 is 1, c2 is 5
>>A1234B<<[1]=1 doesn't match >>5678AB<<[5]=B
-- >>A1234B<<[2]=2 doesn't match >>5678AB<<[5]=B
-- >>A1234B<<[3]=3 doesn't match >>5678AB<<[5]=B
-- >>A1234B<<[4]=4 doesn't match >>5678AB<<[5]=B
-- >>A1234B<<[5]=B matches >>5678AB<<[5]=B
c1 is 5, c2 is 5
LCS is AB

Here it's always a matter of incrementing the position pointers, never one of resetting them to previous values.

Siderite said...

What about AxxxB and xBAxxx ? wouldn't the incrementing of the A pointer prohibit the matching of B?

Mikkel said...

It sure would, it sure would. But why would you want to match the Bs when they aren't on the same side of the matched As? If we imagine that A and B have been switched, a string distance algorithm should certainly impose a penalty for this. Also, if we consider the algorithm an LCS finder, the two strings have no LCS "AB", just the LCS "A" (or "B").

Siderite said...

The story of the algorithm is pretty funny. It started with removing the identical letters at the same index, then shifting the strings with an offset, then repeating the operation, until a number of operations has been completed. Hence the name of "sift", as the strings where moving back and forth.

Sift2 and Sift3 are natural evolutions of the algorithm, by optimizing the code, not the concept. Later on, someone wrote about it belonging to the class of LCS algorithms, so I changed the sense of a variable and renamed it to lcs, as someone might want to compute the LCS and there is a clear formula relating distance and lcs.

Now, I also remember (it was a long time ago after all) that one version of the algorithm used two pointers and no offset.

In other words, I didn't invent this based on some previous work or on any knowledge of string distance algorithms, it's something I did out of my own head. Also the "logic" of the algo came only later, when I tried to explain to myself why it worked. So it's all a little backwards :)

You may be right about possible optimizations, but right now I don't have the time to work on it. I am interested, though, as this is one of my soul children. So I will give a complete new analysis, maybe in another blog entry, when I can find a spare moment.

Again, thank you for your interest and please let me know about any other ideas that you have about Sift. After all, Sift4 is long overdue :)

Major Tal said...

Thanks for your post!
I converted it to Python: http://pastebin.com/iDUN0G1G

Sadly, it is just slightly faster than my Damerau Levenshtein implementation...

Siderite said...

What maxOffset did you use? Do you think there is a performance issue in the algorithm?

Siderite said...

I have a suggestion, you could cache the length of s1 and s2. I don't know what the performance is for the len function in Python.

Anonymous said...

Damn it is so fast. It is not very accurate though. Or maybe the interpretation of Accuracy is different :)

Siderite said...

:-P Perhaps. It was designed to find typos, not compare long texts and it performed well in (and was tuned by) tests on randomly generated strings.

You use it to find if strings are basically the same with small changes rather than a perfect distance score.

What sort of metric did you have in mind? Maybe a discussion on the desired functionality of the algorithm might help improve it.

Diogo Nechtan said...

I converted it to PHP code:

function sift3Plus($s1, $s2, $maxOffset) {
$s1Length = strlen($s1); $s2Length = strlen($s2);
if (empty($s1)) return (empty($s2) ? 0 : $s2Length);
if (empty($s2)) return $s1Length;
$c1 = $c2 = $lcs = 0;

while (($c1 < $s1Length) && ($c2 < $s2Length)) {
if (($d = $s1{$c1}) == $s2{$c2}) $lcs++;
else {
for ($i = 1; $i < $maxOffset; $i++) {
if (($c1 + $i < $s1Length) && (($d = $s1{$c1 + $i}) == $s2{$c2})) {
$c1 += $i;
break;
}
if (($c2 + $i < $s2Length) && (($d = $s1{$c1}) == $s2{$s2 + $i})) {
$c2 += $i;
break;
}
}
}
$c1++;
$c2++;
}
return (($s1Length + $s2Length) / 2 - $lcs);
}


echo sift3Plus('nechtan', 'nectan', 4); // returns 1.5;

Siderite said...

Thank you, Diogo! I will update the post. Please let me know how you used the algorithm and if it worked for you.

Fernando Jorge Mota said...

Here is a simple version of this script in Python: https://gist.github.com/3067867

Siderite said...

Thanks, Fernando! I've updated the post with your code.

thibjl said...

Thanks you for this algorithm.

There is just an error in PHP code.
$s2{$s2 + $i} should be $s2{$c2 + $i}

Siderite said...

Thank you for the bug report! I am updating the post.

darran thomas said...

Hi Siderite - thank you for this algorithm. I hope you don't mind, but I have cited you and your algorithm in my Dissertation, and I have implemented it in MYSQL in a system I am writing (I hope that's OK?)

Here is the MYSQL implementation for your blog. Once again - thank you.

/***************************************** FN_Sift3Distance **************************************/

/*6000 is guessed as the upper limit size of a FONT LIST or MIME INFO - either way, its enough for a fingerprint*/

DROP FUNCTION IF EXISTS FN_Sift3Distance;

$$CREATE FUNCTION `FN_Sift3Distance`(s1 varchar(6000), s2 varchar(6000), maxOffset int)

RETURNS float



BEGIN

DECLARE s1LEN INT;

DECLARE s2LEN INT;

DECLARE currPos INT;

DECLARE matchCnt INT;

DECLARE wrkPos INT;

DECLARE s1Offset INT;

DECLARE s1Char CHAR;

DECLARE s1Pos INT;

DECLARE s1Dist INT;

DECLARE s2Offset INT;

DECLARE s2Char CHAR;

DECLARE s2Pos INT;

DECLARE s2Dist INT;

DECLARE returnVar FLOAT;



SET s1LEN = LENGTH(s1), s2LEN = LENGTH(s2), maxOffset = IFNULL(maxOffset,2);

SET s1Offset = 0, s2Offset = 0, matchCnt=0, currPos=0;



WHILE ( (currPos+s1Offset < s1LEN) AND (currPos+s2Offset < s2LEN) ) DO

SET wrkPos = currPos + 1;

IF ( SUBSTRING(s1,wrkPos+s1Offset,1) = SUBSTRING(s2,wrkPos+s2Offset,1) ) THEN

SET matchCnt = matchCnt + 1;

ELSE

SET s1Offset = 0;

SET s2Offset = 0;

SET s1Char = SUBSTRING(s1,wrkPos,1), s2Char=SUBSTRING(s2,wrkPos,1);

SET s1Pos = LOCATE(s2Char,s1,wrkPos)-1, s2Pos = LOCATE(s1Char,s2,wrkPos)-1;

SET s1Dist = s1Pos-currPos, s2Dist = s2Pos-currPos;

IF (s1Pos > 0 AND (s1Dist <= s2Dist OR s2Pos < 1) AND s1Dist < maxOffset) THEN

SET s1Offset = (s1Pos-wrkPos) + 1;

ELSEIF (s2Pos > 0 AND (s2Dist < s1Dist OR s1Pos < 1) AND s2Dist < maxOffset) THEN

SET s2Offset= (s2Pos-wrkPos) + 1;

END IF;

END IF;

SET currPos = currPos + 1;

END WHILE;



SET returnVAR = (s1LEN + s2LEN) / 2.0 - matchCnt;

RETURN returnVAR;



END$$







/***************************************** START FN_Sift3Similarity **************************************/

DROP FUNCTION IF EXISTS FN_Sift3Similarity;

$$CREATE FUNCTION `FN_Sift3Similarity`(s1 varchar(6000), s2 varchar(6000), maxOffset int)

RETURNS float



BEGIN



DECLARE distance float;

DECLARE maxLen int;

DECLARE returnVar float;



SET distance= FN_Sift3Distance(s1,s2,maxOffset);

SET maxLen = CASE WHEN LENGTH(s1) > LENGTH(s2) THEN

LENGTH(s1)

ELSE LENGTH(s2)

END;



SET returnVar = CASE WHEN maxLen = 0 THEN

1

ELSE

(1- (distance / maxLen)) * 100

END;

RETURN returnVar;

END$$

/***************************************** END FN_Sift3Similarity **************************************/


Siderite said...

Of course it's OK, buddy! Just tell me how it worked out in the end.

Unknown said...

hello there...
Looking for a very fast fuzzy compare function I stepped on yours.
I found it really great BUT I could not find things like clermond ferrand using the captcha string clermon feran.
After some research I found that it was because your formula was not taking into account the length of the two strings in the results...
This can be easily corrected just by dividing the result by the sum of the lentgh of the chains...
The exact formula would be :
(length(string1) +length(string2)-matchcount)/(length(string1) +length(string2))...
this will give you a number betwen 0 and 1 that will give 0 if the two string are exactly the same and one if they are 'totaly' different (not a char in common)...
This would be something like a probabilty distance creating a cloud arround the string...
you could use 1 minus the values if you want a fuzzy logic paramter. It would give you 1 if the two chains are exactly the same...
I thinkt this formula will bring a great improvement over you algorythm and that it is giving it more sense... Just try it and see...
If you need more explanation, you can join me on this email : catodique@hotmail.com (Carmelo Guarneri ).

Carmelo Guarneri said...

sorry, this email will be better.
email : catodique@gmail.com (Carmelo Guarneri ).

Siderite said...

You are describing a similarity function, one that is computed from a distance function. For example the similarity formula I use when having the distance is 1-distance/max(l1,l2). You propose (l1 +l2-matchcount)/(l1+l2) which is 1-matchcount/(l1+l2) which is similar (pardon the pun).

A distance is a length, a similarity is a percentage and there should be some relationship between them.

I can quote (formula used with current variables) from http://en.wikipedia.org/wiki/Longest_common_subsequence_problem#Relation_to_other_problems: "The edit distance when only insertion and deletion is allowed (no substitution), or when the cost of the substitution is the double of the cost of an insertion or deletion, is: l1+l2-2*LCS" which also inspired me to refactor Sift to use the ending formula as it is now.

Виктор Загорский said...

Hello, thanks much for the algorithm. Switching from levenshtein distance to it caused my application to run 100 times faster.

I can provide ruby version of the code.


By the way, do you have any standard example strings pairs for testing purpose?

Виктор Загорский said...

Gist with ruby implementation:

https://gist.github.com/shaggyone/8645347

Siderite said...

Thanks, Victor!
I don't have standard strings because I am lazy :) I tested it with randomly created strings - which I know don't really test anything - and with the data in different projects where I needed the algorithm.

Incidentally I am now researching string kernels and something called http://en.wikipedia.org/wiki/Ukkonen%27s_algorithm. I don't know your specific requirements, though.

Anonymous said...

A java version , thanks for the algorithm

public class SIFT3 extends StringDistance {
private final static byte CASE_BITWISE_MASK = (byte)0x20;
private final static float CASE_PUNISHMENT = 0.05f;
private static SIFT3 instance;

private SIFT3(){
}

public final static SIFT3 getInstance(){
if(null==instance){
instance = new SIFT3();
}
return instance;
}

@Override
public float computeDistance(String s1, String s2 , int maxOffset) {
if(null==s1){
return null==s2?0:s2.length();
}
else if(null==s2){
return null==s1?0:s1.length();
}

boolean isCaseDiff = false;

int l1 = s1.length();
int l2 = s2.length();

int c = 0;
int offset1 = 0;
int offset2 = 0;
float lcs = 0f;

while((c + offset1 < l1) && (c + offset2 < l2)){
if ((s1.charAt(c + offset1)|CASE_BITWISE_MASK) == (s2.charAt(c + offset2)|CASE_BITWISE_MASK)){
lcs++;
if(!isCaseDiff && s1.charAt(c + offset1)!=s2.charAt(c + offset2)){
isCaseDiff=true;
}
}
else{
offset1 = 0;
offset2 = 0;

for (int i = 0;i< maxOffset;i++){
if ((c + i < l1) && ((s1.charAt(c + i)|CASE_BITWISE_MASK) == (s2.charAt(c)|CASE_BITWISE_MASK))){
offset1 = i;
if(!isCaseDiff && s1.charAt(c + i)!=s2.charAt(c)){
isCaseDiff=true;
}
break;
}

if ((c + i < l2) && ((s1.charAt(c)|CASE_BITWISE_MASK) == (s2.charAt(c + i)|CASE_BITWISE_MASK))){
offset2 = i;
if(!isCaseDiff && s1.charAt(c)!=s2.charAt(c+i)){
isCaseDiff=true;
}
break;
}
}
}
c++;
}

return (l1+l2)/2.0f - lcs + (isCaseDiff?CASE_PUNISHMENT:0);
}

@Override
public float computeSimilarity(String s1, String s2 , int maxOffset){
if(null==s1){
return null==s2?0:1f/s2.length();
}
else if(null==s2){
return null==s1?0:1f/s1.length();
}

return 1.0f-computeDistance(s1,s2,maxOffset) / (float)Math.max(s1.length(), s2.length());
}
}

Siderite said...

Thank you! But you added some extra stuff to it. It would be nice if you could explain it to others.