Click to show private chat

Want to thank me? Put this in your blog:

Siderite Zackwehdex's Blog

↑ Grab this Headline Animator

Low band version

Wednesday, April 04, 2007

Super Fast and Accurate string distance algorithm: Sift3
24 comments

Update: the Sift algorithm is now on Codeplex.

A while ago I wrote an entry here about Sift2, an improvement of Sift, the original and silly string distance algorithm. Now I am publishing Sift3, which is way more accurate and even simpler as an algorithm.

I found out that my algorithm is part of a class of algorithms that solve the Longest Common Substring problem, therefore I calculated the LCS, not the distance, then the distance from the LCS. The result is way more robust, easy to understand and closer to the Levenshtein algorithm both on random strings and user databases. Not to mention that there is no goto in this one.

BTW, if you are looking for an algorithm that detects switched words, this is not it :) This just looks for typos and small regional differences between the strings. I mean, you could normalize the strings, so that words are ordered by some mechanism, then it would work because the words wouldn't be switched :)

I promise to work on a word switching algorithm, but not in the near future.
Without further due, here is the code:

The C# code is a method in an object that has a private member maxOffset. As in Sift2 maxOffset should be around 5 and it represents the range in which to try to find a missing character.

        public float Distance(string s1, string s2, int maxOffset)
        {
            if (String.IsNullOrEmpty(s1))
                return
                    String.IsNullOrEmpty(s2) ? 0 : s2.Length;
            if (String.IsNullOrEmpty(s2))
                return s1.Length;
            int c = 0;
            int offset1 = 0;
            int offset2 = 0;
            int lcs = 0;
            while ((c + offset1 < s1.Length)
                   && (c + offset2 < s2.Length))
            {
                if (s1[c + offset1] == s2[c + offset2]) lcs++;
                else
                {
                    offset1 = 0;
                    offset2 = 0;
                    for (int i = 0; i < maxOffset; i++)
                    {
                        if ((c + i < s1.Length)
                            && (s1[c + i] == s2[c]))
                        {
                            offset1 = i;
                            break;
                        }
                        if ((c + i < s2.Length)
                            && (s1[c] == s2[c + i]))
                        {
                            offset2 = i;
                            break;
                        }
                    }
                }
                c++;
            }
            return (s1.Length + s2.Length)/2 - lcs;
        }


And here is the T-Sql code. This version is actually an improvement of my original source, gracefully provided by Todd Wolf:

CREATE FUNCTION [DBO].[SIFT3DISTANCE2] 
(@s1 NVARCHAR(3999),
@s2 NVARCHAR(3999),
@maxOffset INT)
RETURNS FLOAT
AS
BEGIN

DECLARE @s1LEN INT,
@s2LEN INT

SELECT @s1LEN = LEN(ISNULL(@s1,'')),
@s2LEN = LEN(ISNULL(@s2,''))

IF @s1LEN = 0
RETURN @s2LEN
ELSE
IF @s2LEN = 0
RETURN @s1LEN

IF ISNULL(@maxOffset,0) = 0
SET @maxOffset = 5

DECLARE @currPos INT,
@matchCnt INT,
@wrkPos INT,

@s1Offset INT,
@s1Char VARCHAR,
@s1Pos INT,
@s1Dist INT,

@s2Offset INT,
@s2Char VARCHAR,
@s2Pos INT,
@s2Dist INT

SELECT @s1Offset = 0,
@s2Offset = 0,
@matchCnt = 0,
@currPos = 0

WHILE (@currPos + @s1Offset < @s1LEN
AND @currPos + @s2Offset < @s2LEN)
BEGIN
SET @wrkPos = @currPos + 1

IF (SUBSTRING(@s1,@wrkPos + @s1Offset,1) =
SUBSTRING(@s2,@wrkPos + @s2Offset,1))
SET @matchCnt = @matchCnt + 1
ELSE
BEGIN
SET @s1Offset = 0

SET @s2Offset = 0

SELECT @s1Char = SUBSTRING(@s1,@wrkPos,1),
@s2Char = SUBSTRING(@s2,@wrkPos,1)

SELECT @s1Pos = CHARINDEX(@s2Char,@s1,@wrkPos) - 1,
@s2Pos = CHARINDEX(@s1Char,@s2,@wrkPos) - 1

SELECT @s1Dist = @s1Pos - @currPos,
@s2Dist = @s2Pos - @currPos

IF (@s1Pos > 0
AND (@s1Dist <= @s2Dist
OR @s2Pos < 1)
AND @s1Dist < @MaxOffset)
SET @s1Offset = (@s1Pos - @wrkPos) + 1
ELSE
IF (@s2Pos > 0
AND (@s2Dist < @s1Dist
OR @s1Pos < 1)
AND @s2Dist < @MaxOffset)
SET @s2Offset = (@s2Pos - @wrkPos) + 1
END

SET @currPos = @currPos + 1
END

RETURN (@s1LEN + @s2LEN) / 2.0 - @matchCnt
END





It doesn't give the same exact results as my own code, yet the result is close enough and the speed is about 20% higher.

Thanks all for your comments and I look forward to more. Just tell me it worked or not and, most important, why. Good luck!

24 comments:

Brad Wood said...

I like this. Using the SQL function, I have had a couple strange results right off the bat with a single back slash "\" making a really big difference. Is "\" a reserved character in SQL? This is SQL Server 2005.

Brad Wood said...

Nevermind my previous post as my problem appears to have nothing to do with SQL server. I re-wrote it in ColdFusion script and experienced the exact same behavior. I am comparing a larger amount of text like a couple of paragraphs.
If I were to compare two sentences which were idential with the exception of an extra letter being present at the beginning of on of the sentences, then the two would compare as being very different. Furthermore if I was to omit or add a single letter at the end of a string, they would compare as being very similar.
That would obviously not be the case, since it was only a single-character difference in each scenario. Perhaps I shouldn't use this algorithm for larger strings such as sentences.

Brad Wood said...

Sorry to be spamming you comments, but I wanted to say that it appears the behavior I reported in my second post was simply becase I had a maxOffset of 0. With a larger maxOffset I am able to insert several words into the middle or beginning of one string and it still produces a relativley close match.

Of course, I can still trick it by simply re-arranging the same peices of data.
For instance, if I was to compare the strings:
"I like spam. Let's go to the store."
and
"Let's go to the store. I like spam."

then they produce a pretty low comparison even though they are the same sentences, but rearranged.
To compare those would be much more complicated though.

Siderite said...

Thank you for your comments. I didn't have time to test this properly so I might have missed something ;)
Can you give me some examples of where it didn't quite work?

Also, this algorithm is basically finding typos. maxOffset should be around 5, and it detects (or it should :-/) mispelled words or small 4 letter word insertions and so on.

What you need for getting around the "trick" with the word rearranging is an algorithm that searches and compares groups of two or three letters or even entire words. But that's beyond the scope of Sift3.

Again, thank you for posting comments of your experiences with Sift.

Pawprint said...

In the C# sample, was maxOffset supposed to be a parameter like in the T-SQL version? Because it's not declared in the function itself.

Pawprint said...

oh, wait, I looked at the code for Sift2 and I see that it's a private field initialized during construction. This is an excellent example of why I am a proponent of decorating such fields, either with _ (Microsoft's recommendation) or m_ (the older, MFC-style version). If it had been called _maxOffset I probably would have figured it out myself.
Anyways, thanks for the code. I'm trying to implement matching/duplicate elimination in-house and this will be my starting point. There's a lot of talk on the 'Net about edit distance and a lot of mathematical formulas that I probably would have understood 20 years ago in school, but very little actual code. Again, thanks.

Siderite said...

Sorry about that. Published Sift3 in a bit of a hurry and I only copy pasted the method from an object.
I am glad I could help, I am no mathematician and after looking at some of the formulas you are talking about and not understanding anything, I started from scratch with my silly brain.
I am still wondering it worked :)

Pawprint said...

Okay, I have tested this function and I can say that it's not giving me the results I expect. For example, I compared the name "Dimas" to "Jim" and "Tina" and in both cases I got a distance of 2 when I should have gotten 3. To make sure we're both on the same page here, this is why I think I should have gotten three:
Jim-> Change first letter, delete two letters at the end: 3
Tina-> Change first letter, change third letter, delete one letter at the end: 3.
Do you agree?

Pawprint said...

One more thing: your function is typed to return float but the return statement is operating on ints, so you'll never send back anything but an int. I don't think edit distance has the concept of anything but ints, anyways, so I think the function ought to reflect that.

Siderite said...

Maybe it wasn't clear enough that this algorithm is aproximating the Levenshtein distance, it doesn't always get it right.

I tried to minimize the difference between the Levenshtein distance and the result of this function and this is what I came up with. However, the first priority was speed.

In order to improve on the accuracy, you should use Levenshtein for short words (let's say if they have less than 6 letters) or on the top similarity results from Sift. In this way you double check the results, but only on those strings similar enough to be worth it.

As for the int, it's a good catch, but I am returning float so I don't have to cast anything when computing similarity and to compare with Sift2, which returned a float. Dividing by 2.0 increased the difference from Levenshtein, too :)

Anonymous said...

witch one is faster? Sift2 or Sift3?

Tks

Siderite said...

Sift 3 is better and faster.

John Davies said...

I was looking for a method to compare two digital ham radio messages to see if they were the same message. Since there can be some line noise, it has to be a little forgiving.

I tried Swift3 and at the default maxOffset of 5 it worked like crap in my test cases. Before giving up and moving on, I tried 20. Now it works like a champ.

I'm very impressed.

Siderite said...

I am glad I could help. Thank you for your comment!

DaberElay said...

Dear,

I've downloaded and start goofing arround with your algo.
first off - great work, some things are best left of simple :-)

I'd like to propose 2 improvements that would not hit performance and may be of good use:

1. make maxOffset varing according to strings lengths. i've found that u best use lower maxoffset when u have short words, than in a sentence. - easy to implement, no cpu overload, and results seem better (didn't heavy test it though).

2. location priority - again simple implementation (i.e *log(pos)), as in some sentences the begining is more important than the end or vice versa

I'd be happy to discuss it with you if you think my ideas has any value to you.

best,
DaberElay.

Siderite said...

Thanks for the input. Your sugestions certainly seem reasonable. Alas, I lack the time to work on this algorithm at the moment.
If you do transform your ideas in code (and maybe some test data as well), please let me know, I will add your contribution to this post.

Todd Wolff said...

I have a store of possibly misspelled words with a count of around 300,000. I have a dictionary of around 1000 terms. Your algo gave me some great results but takes quite a bit of time because of the loooping. I modified the SQL UDF to be more efficient and thought that I would send you my results. (Yes, I know I'm way too anal about spacing ;-) )

CREATE FUNCTION [dbo].[Sift3Distance](@s1 nvarchar(3999), @s2 nvarchar(3999),@maxOffset int)
RETURNS FLOAT
AS
BEGIN

DECLARE @s1LEN INT,
@s2LEN INT

SELECT @s1LEN = LEN(ISNULL(@s1, '')),
@s2LEN = LEN(ISNULL(@s2, ''))

IF @s1LEN = 0
RETURN @s2LEN
ELSE IF @s2LEN = 0
RETURN @s1LEN

IF ISNULL(@maxOffset,0)=0
SET @maxOffset=5

DECLARE @currPos INT,
@matchCnt INT,
@wrkPos INT,

@s1Offset INT,
@s1Char VARCHAR,
@s1Pos INT,
@s1Dist INT,

@s2Offset INT,
@s2Char VARCHAR,
@s2Pos INT,
@s2Dist INT

SELECT @s1Offset=0,
@s2Offset=0,
@matchCnt=0,
@currPos=0


WHILE (@currPos+@s1Offset<@s1LEN AND @currPos+@s2Offset<@s2LEN)
BEGIN
SET @wrkPos = @currPos + 1
IF (SUBSTRING(@s1,@wrkPos+@s1Offset,1)=SUBSTRING(@s2,@wrkPos+@s2Offset,1))
SET @matchCnt=@matchCnt+1
ELSE
BEGIN
SET @s1Offset=0
SET @s2Offset=0

SELECT @s1Char = SUBSTRING(@s1,@wrkPos,1),
@s2Char = SUBSTRING(@s2,@wrkPos,1)

SELECT @s1Pos = CHARINDEX(@s2Char, @s1, @wrkPos) - 1,
@s2Pos = CHARINDEX(@s1Char, @s2, @wrkPos) - 1

SELECT @s1Dist = @s1Pos-@currPos,
@s2Dist = @s2Pos-@currPos

IF(@s1Pos > 0 AND (@s1Dist <= @s2Dist OR @s2Pos<1) AND @s1Dist < @MaxOffset)
SET @s1Offset = (@s1Pos-@wrkPos) + 1
ELSE IF(@s2Pos > 0 AND (@s2Dist < @s1Dist OR @s1Pos<1) AND @s2Dist < @MaxOffset)
SET @s2Offset = (@s2Pos-@wrkPos) + 1
END

SET @currPos=@currPos+1
END

RETURN (@s1LEN+@s2LEN)/2.0-@matchCnt
END


I'm still trying to determine if this will give me the throughput that I'm looking for but I sure appreciate the hard work that you did!

Siderite said...

Ah, the good old days when I had the time to pursue my own ideas and research. Thank you for giving me feedback. For me it was a pleasure to create the algorithm and most of all to see it used.

I will try to test your improvement and also to integrate the other cool ideas from the others, maybe change version to Sift4. If only I had the time...

Todd Wolff said...

I totally understand. It seems these days that I only learn new stuff for which I have a need. No more time to just go have some "fun". :-)

I've been toying with the idea of adding a MaxDistance param to short circuit the loop when it falls into range to maybe shave a few more milliseconds off.

Probably won't do much for the casual app but when your comparing 13,000 known terms to 300,000 possible misspellings each iteration means something!

Will post anything else that comes out of my testing.

Siderite said...

I have a few ideas. Already got a small algorithm that is twice as fast as Sift3, but terribly innacurate, partly based on your optimisation. Made it today.

David Larson said...

One minor suggestion...include maxOffset as an input parameter and make the method Static.

Brad Wood said...

Hey, it's me again. This post is pretty old, but I figured I owed you a heads up. Quite some time ago I ported your SIFT3 over to ColdFusion script and then modified the heck out of it as an experiment. I came up with a version that I used to highlight difference in longer strings of text (1 or 2 paragraphs). With a maxoffset of anywhere from 10 to 50 it would recognize spelling differences as well as minor insertions/deletions. It would highlight the differences by inserting HTML tags into the strings for display in a web page as well as reporting the difference count and percentage. The function is nowhere as small and fast as SIFT3 was, but since I based my work off of SIFT3 I wanted to be fair and let you know.

If you want to see it, I have a small example of it along with the ColdFusion code posted on my site so I could share it with people. I credited your name and blog in the comments.

http://www.bradwood.com/string_compare/


Thanks!

~Brad

Siderite said...

You are most welcome. I really appreciate you leaving this comment. I am glad I helped.

Simon said...

Thank you; an interesting piece of code. I expect to test/use it in near future and report on my experience.

Simon