
Tuesday, November 11, 2014

Super Fast and Accurate string distance algorithm: Sift4

Warning: the algorithm works well and is better than Sift3; however, you might want to consider it in beta, as I am still looking for better implementation solutions that might change the structure of the code.

A really long time ago I wrote the third version of Sift, the string distance algorithm. It so happens that I am going to give a small presentation, here in Ispra, about this algorithm, so I had the opportunity to review it. I found some inconsistencies and did some research in the field that gave me more ideas. So before giving the presentation I thought of publishing what I think is the fourth version. What's new:
  • 33% more accurate
  • three different variants: simple, common and general
  • new concepts added
  • support for custom value and matching functions, different tokenizer functions, etc.
  • actually tested with a (slightly more) serious test
  • more robust, working better for large values of maxOffset

Before I get into the details, I am publishing the algorithm here for the moment, no CodePlex or PasteBin or GitHub or whatever. Also, it is written in Javascript for now, with the C# and T-SQL versions pending. Of course, it would be great if, as before, the community of people using the algorithm would implement it in various programming languages, however I am a bit apprehensive because more often than not people came up with their own improvements or interpretations when translating the algorithm into another language. But support is always welcome!

I created a test that used random strings, but also a huge list of commonly used English phrases, as well as mutations on these strings, adding or removing small bits and so on. I then implemented Sift3, Levenshtein and the new algorithm and computed the error distance between the Levenshtein distance and the two Sift variants. This allowed me to see how the error evolves when changing the algorithm and the parameters. One thing I noticed is that when increasing the maxOffset value to large values like 15 or 20, the accuracy of Sift3 was going down. Also, as pointed out by one commenter on the Sift3 post, there are cases when Sift3(a,b) is different from Sift3(b,a). These are edge cases, but this one in particular grated on me.
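Since the test compares against Levenshtein, here is a minimal sketch of the kind of exact distance function used as ground truth (a standard two-row dynamic programming implementation; this is an illustration, not my actual test harness):

```javascript
// Classic Levenshtein distance with two rolling rows: O(n*m) time, O(m) space.
// Used only as the exact reference against which the Sift error is measured.
function levenshtein(s1, s2) {
    var prev = [], curr = [], i, j, tmp;
    for (j = 0; j <= s2.length; j++) prev[j] = j;
    for (i = 1; i <= s1.length; i++) {
        curr[0] = i;
        for (j = 1; j <= s2.length; j++) {
            var cost = s1.charAt(i - 1) == s2.charAt(j - 1) ? 0 : 1;
            curr[j] = Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost);
        }
        tmp = prev; prev = curr; curr = tmp;
    }
    return prev[s2.length];
}

console.log(levenshtein('kitten', 'sitting')); // 3
```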

After implementing Sift4, I can now tell you that the simple version is slightly better than Sift3 for small maxOffset values like 5, and it gets better as the value increases. The common version is a bit more complex, but the error decreases by 33% and stays low for large maxOffset values. The extended or general version receives an options object that can change almost everything, but the most important option is the tokenizer function. Imagine that you want to compute the distance based not on letters, but on n-grams (groups of n characters). Or that you want to compare strings by the words in the text, maybe even their synonyms. This can all be achieved just by changing the tokenizer function. The other parameters involve defining what it means for two tokens to match and what the value of their match is, etc.

One of the new concepts implemented is taken from the Jaro distance. Jaro seems a lot like Sift in the way that it considers two characters to match if they are in close proximity. Also, if "the streams cross", like 'ab' vs 'ba', it considers them transpositions and removes some of their value from the distance. Actually, if I look at the implementation, it might be that I have independently discovered the Jaro distance; I will research this further. I don't know if the transposition calculation is optimal. At the moment it uses an array of all matches found up to a point, clearing it of values as the cursors move along the strings. The difference between the simple and the common versions of Sift4 is that the simple version does not compute transpositions at all and has no concept of maxDistance. In that respect it is a slightly fixed-up Sift3.
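To illustrate the transposition check in isolation, here is a standalone sketch (countTranspositions is a name I made up for this demo; the real logic lives inline in the common and extended versions, fed by the cursors as matches are found):

```javascript
// Standalone illustration of the transposition check: 'pairs' is the list of
// (c1,c2) match positions in the order they were found. A new match counts as
// a transposition when it does not advance both cursors past a pending pair.
function countTranspositions(pairs) {
    var trans = 0, offset_arr = [];
    for (var k = 0; k < pairs.length; k++) {
        var c1 = pairs[k][0], c2 = pairs[k][1];
        while (offset_arr.length) { // see if the current match is a transposition
            if (c1 <= offset_arr[0][0] || c2 <= offset_arr[0][1]) {
                trans++;
                break;
            } else {
                offset_arr.splice(0, 1); // pair is fully passed, discard it
            }
        }
        offset_arr.push([c1, c2]);
    }
    return trans;
}

// 'ab' vs 'ba': 'a' matches at (0,1) and 'b' at (1,0) - the streams cross
console.log(countTranspositions([[0, 1], [1, 0]])); // 1
// 'ab' vs 'ab': matches at (0,0) and (1,1) advance together - no transposition
console.log(countTranspositions([[0, 0], [1, 1]])); // 0
```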

Another new concept added is that of the local substring. Imagine that the Largest Common Subsequence that Sift is actually trying to find in order to determine the distance is made of substrings, separated by non-matching characters. Each of these substrings can be used to improve the distance function. For example, one could argue that 'abcdex' is closer to 'abcde' than 'abcxde', because even if the largest common subsequence is 5 in both cases, the largest common substring is 5 for the first string and only 3 for the second. The extended version of the algorithm allows for changing the value of each substring individually.
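To check the example, here is a standalone sketch (a simple dynamic programming routine, not part of Sift4) that computes the largest common substring length:

```javascript
// Largest common substring (contiguous), computed with a rolling DP row.
// Only used here to verify the 'abcdex' vs 'abcxde' example above.
function largestCommonSubstring(s1, s2) {
    var best = 0, prev = [], curr, i, j;
    for (j = 0; j < s2.length; j++) prev[j] = 0;
    for (i = 0; i < s1.length; i++) {
        curr = [];
        for (j = 0; j < s2.length; j++) {
            if (s1.charAt(i) == s2.charAt(j)) {
                curr[j] = (j > 0 ? prev[j - 1] : 0) + 1; // extend the diagonal run
                if (curr[j] > best) best = curr[j];
            } else {
                curr[j] = 0; // a mismatch breaks the contiguous run
            }
        }
        prev = curr;
    }
    return best;
}

console.log(largestCommonSubstring('abcde', 'abcdex')); // 5
console.log(largestCommonSubstring('abcde', 'abcxde')); // 3
```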

Well, here they are, the three versions. The extended version has some examples at the end for possible parameters.

Simplest Sift4:
// Sift4 - simplest version
// online algorithm to compute the distance between two strings in O(n)
// maxOffset is the number of characters to search for matching letters
function sift4(s1, s2, maxOffset) {
    if (!s1||!s1.length) {
        if (!s2) {
            return 0;
        }
        return s2.length;
    }

    if (!s2||!s2.length) {
        return s1.length;
    }

    var l1=s1.length;
    var l2=s2.length;

    var c1 = 0;  //cursor for string 1
    var c2 = 0;  //cursor for string 2
    var lcss = 0;  //largest common subsequence
    var local_cs = 0; //local common substring

    while ((c1 < l1) && (c2 < l2)) {
        if (s1.charAt(c1) == s2.charAt(c2)) {
            local_cs++;
        } else {
            lcss+=local_cs;
            local_cs=0;
            if (c1!=c2) {
                c1=c2=Math.max(c1,c2); //using max to bypass the need to compute transpositions ('ab' vs 'ba')
            }
            for (var i = 0; i < maxOffset; i++) {
                if ((c1 + i < l1) && (s1.charAt(c1 + i) == s2.charAt(c2))) {
                    c1+= i;
                    local_cs++;
                    break;
                }
                if ((c2 + i < l2) && (s1.charAt(c1) == s2.charAt(c2 + i))) {
                    c2+= i;
                    local_cs++;
                    break;
                }
            }
        }
        c1++;
        c2++;
    }
    lcss+=local_cs;
    return Math.round(Math.max(l1,l2)- lcss);
}

Common Sift4:
// Sift4 - common version
// online algorithm to compute the distance between two strings in O(n)
// maxOffset is the number of characters to search for matching letters
// maxDistance is the distance at which the algorithm should stop computing the value and just exit (the strings are too different anyway)
function sift4(s1, s2, maxOffset, maxDistance) {
    if (!s1||!s1.length) {
        if (!s2) {
            return 0;
        }
        return s2.length;
    }

    if (!s2||!s2.length) {
        return s1.length;
    }

    var l1=s1.length;
    var l2=s2.length;

    var c1 = 0;  //cursor for string 1
    var c2 = 0;  //cursor for string 2
    var lcss = 0;  //largest common subsequence
    var local_cs = 0; //local common substring
    var trans = 0;  //number of transpositions ('ab' vs 'ba')
    var offset_arr=[];  //offset pair array, for computing the transpositions

    while ((c1 < l1) && (c2 < l2)) {
        if (s1.charAt(c1) == s2.charAt(c2)) {
            local_cs++;
            while (offset_arr.length) {  //see if current match is a transposition
                if (c1<=offset_arr[0][0] || c2 <= offset_arr[0][1]) {
                    trans++;
                    break;
                } else {
                    offset_arr.splice(0,1);
                }
            }
            offset_arr.push([c1,c2]);
        } else {
            lcss+=local_cs;
            local_cs=0;
            if (c1!=c2) {
                c1=c2=Math.min(c1,c2);  //using min allows the computation of transpositions
            }
            //if matching characters are found, remove 1 from both cursors (they get incremented at the end of the loop)
            //so that we can have only one code block handling matches 
            for (var i = 0; i < maxOffset; i++) {
                if ((c1 + i < l1) && (s1.charAt(c1 + i) == s2.charAt(c2))) {
                    c1+= i-1; 
                    c2--;
                    break;
                }
                if ((c2 + i < l2) && (s1.charAt(c1) == s2.charAt(c2 + i))) {
                    c1--;
                    c2+= i-1;
                    break;
                }
            }
        }
        c1++;
        c2++;
        if (maxDistance)
        {
            var temporaryDistance=Math.max(c1,c2)-lcss+trans/2;
            if (temporaryDistance>=maxDistance) return Math.round(temporaryDistance);
        }
    }
    lcss+=local_cs;
    return Math.round(Math.max(l1,l2)- lcss +trans/2); //remove half the number of transpositions from the lcss
}

Extended/General Sift4:
// Sift4 - extended version
// online algorithm to compute the distance between two strings in O(n)
// maxOffset is the number of positions to search for matching tokens
// options: the options for the function, allowing for customization of the scope and algorithm:
//         maxDistance: the distance at which the algorithm should stop computing the value and just exit (the strings are too different anyway)
//         tokenizer: a function to transform strings into vectors of tokens
//        tokenMatcher: a function to determine if two tokens are matching (equal)
//        matchingEvaluator: a function to determine the way a token match should be added to the local_cs. For example a fuzzy match could be implemented.
//        localLengthEvaluator: a function to determine the way the local_cs value is added to the lcss. For example longer continuous substrings could be awarded.
//        transpositionsEvaluator: a function to determine the way the number of transpositions affects the final result
// the options can and should be implemented at a class level, but this is the demo algorithm
function sift4(s1, s2, maxOffset, options) {

    options=extend(options,{
        maxDistance:null,
        tokenizer: function(s) { return s?s.split(''):[]; },
        tokenMatcher: function(t1,t2) { return t1==t2; },
        matchingEvaluator: function(t1,t2) { return 1; },
        localLengthEvaluator: function(local_cs) { return local_cs; },
        transpositionsEvaluator: function(lcss,trans) { return lcss-trans/2; }
    });

    var t1=options.tokenizer(s1);
    var t2=options.tokenizer(s2);

    var l1=t1.length;
    var l2=t2.length;

    if (l1==0) return options.localLengthEvaluator(l2);
    if (l2==0) return options.localLengthEvaluator(l1);

    var c1 = 0;  //cursor for string 1
    var c2 = 0;  //cursor for string 2
    var lcss = 0;  //largest common subsequence
    var local_cs = 0; //local common substring
    var trans = 0;  //number of transpositions ('axb' vs 'xba')
    var offset_arr=[];  //offset pair array, for computing the transpositions

    while ((c1 < l1) && (c2 < l2)) {
        if (options.tokenMatcher(t1[c1],t2[c2])) {
            local_cs+=options.matchingEvaluator(t1[c1],t2[c2]);
            while (offset_arr.length) {  //see if current match is a transposition
                if (c1<=offset_arr[0][0] || c2 <= offset_arr[0][1]) {
                    trans++;
                    break;
                } else {
                    offset_arr.splice(0,1);
                }
            }
            offset_arr.push([c1,c2]);
        } else {
            lcss+=options.localLengthEvaluator(local_cs);
            local_cs=0;
            if (c1!=c2) {
                c1=c2=Math.min(c1,c2);  //using min allows the computation of transpositions
            }
            //if matching tokens are found, remove 1 from both cursors (they get incremented at the end of the loop)
            //so that we can have only one code block handling matches 
            for (var i = 0; i < maxOffset; i++) {
                if ((c1 + i < l1) && options.tokenMatcher(t1[c1+i],t2[c2])) {
                    c1+= i-1; 
                    c2--;
                    break;
                }
                if ((c2 + i < l2) && options.tokenMatcher(t1[c1],t2[c2+i])) {
                    c1--;
                    c2+= i-1;
                    break;
                }
            }
        }
        c1++;
        c2++;
        if (options.maxDistance)
        {
            var temporaryDistance=options.localLengthEvaluator(Math.max(c1,c2))-options.transpositionsEvaluator(lcss,trans);
            if (temporaryDistance>=options.maxDistance) return Math.round(temporaryDistance);
        }
    }
    lcss+=options.localLengthEvaluator(local_cs);
    return Math.round(options.localLengthEvaluator(Math.max(l1,l2)) - options.transpositionsEvaluator(lcss,trans)); //remove half the number of transpositions from the lcss
}

function extend(obj,def) {
    var result={};
    for (var prop in def) {
        if (!obj||!obj.hasOwnProperty(prop)) {
            result[prop]=def[prop];
        } else {
            result[prop]=obj[prop];
        }
    }
    return result;
}

// possible values for the options
// tokenizers:
function nGramTokenizer(s,n) { //tokenizer:function(s) { return nGramTokenizer(s,2); }
    var result=[];
    if (!s) return result;
    for (var i=0; i<=s.length-n; i++) {
        result.push(s.substr(i,n));
    }
    return result;
}

function wordSplitTokenizer(s) { //tokenizer:wordSplitTokenizer
    if (!s) return [];
    return s.split(/\s+/);
}

function characterFrequencyTokenizer(s) { //tokenizer:characterFrequencyTokenizer (letters only)
    var result = [];
    for (var i=0; i<=25; i++) {
        var val=0;
        if (s) {
            for (var j=0; j<s.length; j++) {
                var code=s.charCodeAt(j);
                if (code==i+65 || code==i+97) val++;
            }
        }
        result.push(val);
    }
    return result;
}

//tokenMatchers:
function sift4TokenMatcher(t1,t2) { //tokenMatcher:sift4TokenMatcher
    var similarity=1-sift4(t1,t2,5)/Math.max(t1.length,t2.length);
    return similarity>0.7;
}

//matchingEvaluators:
function sift4MatchingEvaluator(t1,t2) { //matchingEvaluator:sift4MatchingEvaluator
    var similarity=1-sift4(t1,t2,5)/Math.max(t1.length,t2.length);
    return similarity;
}

//localLengthEvaluators:
function rewardLengthEvaluator(l) {
    if (l<1) return l; //0 -> 0
    return l-1/(l+1);  //1 -> 0.5, 2 -> 1.66, 9 -> 8.9
}

function rewardLengthEvaluator2(l) {
    return Math.pow(l,1.5); // 0 -> 0, 1 -> 1, 2 -> 2.83, 10 -> 31.62
}

As always, I will be most happy to know if you used my algorithm and how it performed, as well as receive any suggestion that you might have.


Here is some explanation for the options of the general algorithm.

It no longer searches for characters, but for tokens. That is why the default tokenizer function splits the values into characters, so that the algorithm works on an array of one-character-long tokens. Other options are possible, like splitting the strings by whitespace so that the comparisons are done on words, or transforming a string into an array of strings N characters long, the so-called N-grams. The tokenizer can be anything, like the characterFrequencyTokenizer, which turns each word into an array of 26 values representing the number of occurrences of each letter a-z in the word.

The tokenMatcher function returns true if two tokens are matching. They can be fuzzy matched; for example, the sift4TokenMatcher uses Sift inside Sift to determine the character distance between two tokens and returns true if they match more than 70%.

The matchingEvaluator is a function that returns the value that will be added to the "common substring" length when two tokens match. The default is 1, but one can use some other metric, like the similarity, for example. Of course, the common substring length loses its literal meaning when these functions change, but the variable local_cs is still used.

The localLengthEvaluator takes the length value of the local common substring and returns a value that will be added to the longest common subsequence value. Usually it returns the same value as the one provided, but some functions could reward longer substrings.

FAQ:

Q: Can you make Sift4 work case insensitively?
A: Just turn the strings to lower or upper case before you compare them. Since this algorithm is more general, the concept of 'case' might not apply.

Q: Can you make Sift4 compare strings based on their meaning, like using synonyms?
A: Use a tokenizer function that splits the strings into words, then replaces each with the first of its synonyms. A more complex solution would require analysing the strings beforehand and turning them into some ordered list of synonyms or equivalent expressions, then using Sift4 with a word tokenizer (one is provided in the Extended algorithm source).

Q: I need an implementation for this programming language, can you help?
A: I can, but I might not have the time. Ask anyway, maybe I can be persuaded :)

Q: I have been using Sift3 until now, how do I upgrade to Sift4?
A: The best way I can think of is to implement Sift4 Simplest, as it needs only the Sift3 code and some minor changes. Since you never needed tokens before, I doubt you need them now. But if you do, I can help, see the above question.

Q: How can I reward you for this fantastic piece of software engineering?
A: While I did this for free and I don't expect to make any money out of it and while this algorithm is completely free to use and change as you see fit, I don't mind having a beer every now and then ;)

Q: Your algorithm really sucks because... reasons.
A: It may. I would be glad to discuss the reasons, though, and try to fix any problem you encounter.

Q: I compared Sift4 with another algorithm that is much more exact and there are differences.
A: Of course, they are different algorithms. This is a fuzzy distance calculator; it doesn't give you the exact value. There are still edge cases. But the idea of Sift is to be fast and relatively accurate, rather than very accurate. If you need more accuracy, try combining Sift with Levenshtein, for example, computing Levenshtein only where Sift says the strings are above a certain similarity.
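That combination could look something like this sketch (hybridDistance and the threshold value are my own names and choices, not part of Sift4; fastDist stands in for a Sift4 variant and exactDist for a Levenshtein implementation):

```javascript
// Use the fast, fuzzy distance as a pre-filter and pay for the exact
// computation only when the strings look similar enough to matter.
function hybridDistance(s1, s2, fastDist, exactDist, threshold) {
    var quick = fastDist(s1, s2);
    var maxLen = Math.max(s1.length, s2.length);
    if (maxLen === 0) return 0;
    var similarity = 1 - quick / maxLen;
    return similarity >= threshold ? exactDist(s1, s2) : quick;
}

// demo with toy functions standing in for Sift4 and Levenshtein
var toyFast = function (a, b) { return Math.abs(a.length - b.length); };
var toyExact = function (a, b) { return 42; };
console.log(hybridDistance('abcd', 'abce', toyFast, toyExact, 0.5)); // 42 (exact path taken)
console.log(hybridDistance('abcd', 'x', toyFast, toyExact, 0.5));   // 3 (fast path is enough)
```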

Q: I want to make maxOffset dependent on the length of the strings compared. Can you do that?
A: That is a perfect example of why maxOffset should be a parameter of the function rather than a member of the class. Since this implementation is so far Javascript only, just compute the maxOffset that is convenient for you before you compare.

Q: I want to vary the weight of matches based on the position of the match, for example matches at the beginning of the string could be more valuable than those at the end.
A: The position of the match is indeed not sent to the functions that can be specified in the options object of the Sift4 Extended, but that can be trivially changed in the code. I don't think this particular request is very common, though, and I prefer to keep it out of the published implementation to make the code easier to understand.

Q: I found a bug!
A: Let me know and I will try to fix it.

Q: If you need to compare large lists of strings, it is better to precompute some things, like specific hashes or suffix trees, etc. This will speed up the comparison tremendously!
A: Sift is what is called an online algorithm. It does not precompute anything; it just gets the two strings and the parameters for its functioning and returns the distance. You are correct in what you are saying, but that kind of solution is not in the scope of Sift, at least not version 4.

Q: What are the edge cases for Sift?
A: There are probably several, but I didn't really spot them. One of them is that one might find both letters at a position matching letters at other positions, but only one will count. Example: 'abxx' and 'bayy'. The algorithm will look at position 0, find no match, then try to find the closest match for each letter. Starting with position 0 in the first string, it will find 'a' matched at position 1 in the second. It will increase both cursors and lcss will be increased as well. The next check will be 'b', the character at position 1 in the first string, against position 2 in the second string. No match, therefore both cursors will be reset to 1 and the search starts again. The 'b' match is lost and the distance is 3 instead of 2. Also, I think there might be some situations where the cursors are not equal and the bigger of them reaches the end of its string, thus terminating the algorithm, even though there could have been more matches. Incidentally, I tried to fix both of these issues and the error from Levenshtein was not really affected, but I am not 100% sure of the implementation.

Q: The algorithm continues to be asymmetric, Sift4(s1,s2) can be different from Sift4(s2,s1).
A: Yes. This is one of the artifacts of the linear nature of the algorithm. There is a function that is symmetric and that is Math.min(Sift4(a,b),Sift4(b,a)), however it is twice as slow, obviously.
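A sketch of that wrapper (symmetricDistance is my name for it; dist stands in for any Sift4 variant):

```javascript
// Wrap an asymmetric distance function into a symmetric one by taking
// the minimum of both orderings - at the cost of computing it twice.
function symmetricDistance(dist) {
    return function (s1, s2, maxOffset) {
        return Math.min(dist(s1, s2, maxOffset), dist(s2, s1, maxOffset));
    };
}

// demo with a toy asymmetric function standing in for Sift4
var toy = function (a, b) { return a.length - b.length; };
var sym = symmetricDistance(toy);
console.log(sym('abc', 'a')); // min(2, -2) = -2
console.log(sym('a', 'abc')); // min(-2, 2) = -2
```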

Implementations in other languages:

C# implementation:
public class Sift4
    {
        private Options _options;

        public Sift4(Options options)
        {
            if (options == null) options = new Options();
            if (options.Tokenizer == null)
            {
                options.Tokenizer = (s) => s == null
                     ? new string[0]
                     : s.ToCharArray().Select(c => c.ToString()).ToArray();
            }
            if (options.TokenMatcher == null)
            {
                options.TokenMatcher = (t1, t2) => t1 == t2;
            }
            if (options.MatchingEvaluator == null)
            {
                options.MatchingEvaluator = (t1, t2) => 1;
            }
            if (options.LocalLengthEvaluator == null)
            {
                options.LocalLengthEvaluator = (l) => l;
            }
            if (options.TranspositionsEvaluator == null)
            {
                options.TranspositionsEvaluator = (l, t) => l - t / 2;
            }
            _options = options;
        }

        /// <summary>
        /// General distance algorithm uses all the parameters in the options object and works on tokens
        /// </summary>
        /// <param name="s1"></param>
        /// <param name="s2"></param>
        /// <param name="maxOffset"></param>
        /// <returns></returns>
        public double GeneralDistance(string s1, string s2, int maxOffset)
        {
            var t1 = _options.Tokenizer(s1);
            var t2 = _options.Tokenizer(s2);

            var l1 = t1.Length;
            var l2 = t2.Length;

            if (l1 == 0) return _options.LocalLengthEvaluator(l2);
            if (l2 == 0) return _options.LocalLengthEvaluator(l1);

            var c1 = 0;  //cursor for string 1
            var c2 = 0;  //cursor for string 2
            var lcss = 0.0;  //largest common subsequence
            var local_cs = 0.0; //local common substring
            var trans = 0;  //number of transpositions ('axb' vs 'xba')
            var offset_arr = new LinkedList<OffsetPair>();  //offset pair array, for computing the transpositions

            while ((c1 < l1) && (c2 < l2))
            {
                if (_options.TokenMatcher(t1[c1], t2[c2]))
                {
                    local_cs += _options.MatchingEvaluator(t1[c1], t2[c2]);
                    var op = offset_arr.First;
                    while (op != null)
                    {  //see if current match is a transposition
                        if (c1 <= op.Value.C1 || c2 <= op.Value.C2)
                        {
                            trans++;
                            break;
                        }
                        else
                        {
                            var next_op = op.Next;
                            offset_arr.Remove(op);
                            op = next_op;
                        }
                    }
                    offset_arr.AddLast(new OffsetPair(c1, c2));
                }
                else
                {
                    lcss += _options.LocalLengthEvaluator(local_cs);
                    local_cs = 0;
                    if (c1 != c2)
                    {
                        c1 = c2 = Math.Min(c1, c2);  //using min allows the computation of transpositions
                    }
                    //if matching tokens are found, remove 1 from both cursors (they get incremented at the end of the loop)
                    //so that we can have only one code block handling matches 
                    for (var i = 0; i < maxOffset; i++)
                    {
                        if ((c1 + i < l1) && _options.TokenMatcher(t1[c1 + i], t2[c2]))
                        {
                            c1 += i - 1;
                            c2--;
                            break;
                        }
                        if ((c2 + i < l2) && _options.TokenMatcher(t1[c1], t2[c2 + i]))
                        {
                            c1--;
                            c2 += i - 1;
                            break;
                        }
                    }
                }
                c1++;
                c2++;
                if (_options.MaxDistance != null)
                {
                    var temporaryDistance = _options.LocalLengthEvaluator(Math.Max(c1, c2)) - _options.TranspositionsEvaluator(lcss, trans);
                    if (temporaryDistance >= _options.MaxDistance) return Math.Round(temporaryDistance, MidpointRounding.AwayFromZero);
                }
            }
            lcss += _options.LocalLengthEvaluator(local_cs);
            return Math.Round(_options.LocalLengthEvaluator(Math.Max(l1, l2)) - _options.TranspositionsEvaluator(lcss, trans), MidpointRounding.AwayFromZero); //remove half the number of transpositions from the lcss
        }

        /// <summary>
        /// Static distance algorithm working on strings, computing transpositions as well as stopping when maxDistance was reached.
        /// </summary>
        /// <param name="s1"></param>
        /// <param name="s2"></param>
        /// <param name="maxOffset"></param>
        /// <param name="maxDistance"></param>
        /// <returns></returns>
        public static double CommonDistance(string s1, string s2, int maxOffset, int maxDistance=0)
        {
            var l1 = s1 == null ? 0 : s1.Length;
            var l2 = s2 == null ? 0 : s2.Length;

            if (l1 == 0) return l2;
            if (l2 == 0) return l1;

            var c1 = 0;  //cursor for string 1
            var c2 = 0;  //cursor for string 2
            var lcss = 0;  //largest common subsequence
            var local_cs = 0; //local common substring
            var trans = 0;  //number of transpositions ('axb' vs 'xba')
            var offset_arr = new LinkedList<OffsetPair>();  //offset pair array, for computing the transpositions

            while ((c1 < l1) && (c2 < l2))
            {
                if (s1[c1] == s2[c2])
                {
                    local_cs++;
                    var op = offset_arr.First;
                    while (op != null)
                    {  //see if current match is a transposition
                        if (c1 <= op.Value.C1 || c2 <= op.Value.C2)
                        {
                            trans++;
                            break;
                        }
                        else
                        {
                            var next_op = op.Next;
                            offset_arr.Remove(op);
                            op = next_op;
                        }
                    }
                    offset_arr.AddLast(new OffsetPair(c1, c2));
                }
                else
                {
                    lcss += local_cs;
                    local_cs = 0;
                    if (c1 != c2)
                    {
                        c1 = c2 = Math.Min(c1, c2);  //using min allows the computation of transpositions
                    }
                    //if matching tokens are found, remove 1 from both cursors (they get incremented at the end of the loop)
                    //so that we can have only one code block handling matches 
                    for (var i = 0; i < maxOffset; i++)
                    {
                        if ((c1 + i < l1) && s1[c1 + i] == s2[c2])
                        {
                            c1 += i - 1;
                            c2--;
                            break;
                        }
                        if ((c2 + i < l2) && s1[c1] == s2[c2 + i])
                        {
                            c1--;
                            c2 += i - 1;
                            break;
                        }
                    }
                }
                c1++;
                c2++;
                if (maxDistance>0)
                {
                    var temporaryDistance = Math.Max(c1, c2) - (lcss - trans / 2.0);
                    if (temporaryDistance >= maxDistance) return Math.Round(temporaryDistance,MidpointRounding.AwayFromZero);
                }
            }
            lcss += local_cs;
            return Math.Round(Math.Max(l1, l2) - (lcss - trans / 2.0), MidpointRounding.AwayFromZero); //remove half the number of transpositions from the lcss
        }

        /// <summary>
        /// Standard Sift algorithm, using strings and taking only maxOffset as a parameter
        /// </summary>
        /// <param name="s1"></param>
        /// <param name="s2"></param>
        /// <param name="maxOffset"></param>
        /// <returns></returns>
        public static int SimplestDistance(string s1, string s2, int maxOffset)
        {
            var l1 = s1 == null ? 0 : s1.Length;
            var l2 = s2 == null ? 0 : s2.Length;

            if (l1 == 0) return l2;
            if (l2 == 0) return l1;

            var c1 = 0;  //cursor for string 1
            var c2 = 0;  //cursor for string 2
            var lcss = 0;  //largest common subsequence
            var local_cs = 0; //local common substring

            while ((c1 < l1) && (c2 < l2))
            {
                if (s1[c1] == s2[c2])
                {
                    local_cs++;
                }
                else
                {
                    lcss += local_cs;
                    local_cs = 0;
                    if (c1 != c2)
                    {
                        c1 = c2 = Math.Max(c1, c2);  
                    }
                    //if matching tokens are found, remove 1 from both cursors (they get incremented at the end of the loop)
                    //so that we can have only one code block handling matches 
                    for (var i = 0; i < maxOffset; i++)
                    {
                        if ((c1 + i < l1) && s1[c1 + i] == s2[c2])
                        {
                            c1 += i - 1;
                            c2--;
                            break;
                        }
                        if ((c2 + i < l2) && s1[c1] == s2[c2 + i])
                        {
                            c1--;
                            c2 += i - 1;
                            break;
                        }
                    }
                }
                c1++;
                c2++;
            }
            lcss += local_cs;
            return Math.Max(l1, l2) - lcss; 
        }

        private class OffsetPair
        {
            public int C1 { get; set; }
            public int C2 { get; set; }

            public OffsetPair(int c1, int c2)
            {
                this.C1 = c1;
                this.C2 = c2;
            }
        }

        public class Options
        {
            /// <summary>
            /// If set, the algorithm will stop if the distance reaches this value
            /// </summary>
            public int? MaxDistance { get; set; }

            /// <summary>
            /// The function that turns strings into a list of tokens (also strings)
            /// </summary>
            public Func<string, string[]> Tokenizer { get; set; }

            /// <summary>
            /// The function that determines if two tokens are matching (similar to characters being the same in the simple algorithm)
            /// </summary>
            public Func<string, string, bool> TokenMatcher { get; set; }

            /// <summary>
            /// The function that determines the value of a match of two tokens (the equivalent of adding 1 to the lcss when two characters match)
            /// This assumes that the TokenMatcher function is a lot less expensive than this evaluator. If that is not the case, 
            /// you can optimize the speed of the algorithm by using only the matching evaluator and then calculating if two tokens match on the returned value.
            /// </summary>
            public Func<string, string, double> MatchingEvaluator { get; set; }

            /// <summary>
            /// Determines if the local value (computed on subsequent matched tokens) must be modified.
            /// In case one wants to reward longer matched substrings, for example, this is what you need to change.
            /// </summary>
            public Func<double, double> LocalLengthEvaluator { get; set; }

            /// <summary>
            /// The default transposition function is to remove from the value of the lcss half of the number of transpositions.
            /// Change it if it doesn't suit you.
            /// </summary>
            public Func<double, double, double> TranspositionsEvaluator { get; set; }
        }
    }
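To make the customization hooks in the Options class above more concrete, here is a sketch of what an equivalent options object might look like in the JavaScript version. The property names (tokenizer, tokenMatcher, matchingEvaluator and so on) simply mirror the C# Options members and are assumptions for illustration, not the published JavaScript API:

```javascript
// Hypothetical options object mirroring the C# Options class above.
// The property names are assumptions chosen for illustration.
var sift4Options = {
    // split on whitespace, so the algorithm compares words instead of characters
    tokenizer: function (s) { return s ? s.split(/\s+/) : []; },
    // case-insensitive token matching
    tokenMatcher: function (t1, t2) { return t1.toLowerCase() === t2.toLowerCase(); },
    // a match is worth 1, just like a character match in the simple version
    matchingEvaluator: function (t1, t2) { return 1; },
    // identity by default; change it to reward longer matched substrings
    localLengthEvaluator: function (value) { return value; },
    // default behavior: remove half the number of transpositions from the lcss
    transpositionsEvaluator: function (lcss, trans) { return lcss - trans / 2; }
};

// The hooks behave like their C# counterparts:
console.log(sift4Options.tokenizer("Super Fast algorithm")); // word tokens
console.log(sift4Options.tokenMatcher("Sift", "SIFT"));      // true
```

The point of the general version is exactly this: the distance loop stays the same, while matching, valuing and tokenizing are pluggable.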

PHP implementation
// Sift4 - common version
// online algorithm to compute the distance between two strings in O(n)
// $maxOffset is the number of characters to search for matching letters
// $maxDistance is the distance at which the algorithm should stop computing the value and just exit (the strings are too different anyway)
function sift4($s1, $s2, $maxOffset, $maxDistance) {
    if (!$s1||!strlen($s1)) {
        if (!$s2) {
            return 0;
        }
        return strlen($s2);
    }

    if (!$s2||!strlen($s2)) {
        return strlen($s1);
    }

    $l1=strlen($s1);
    $l2=strlen($s2);

    $c1 = 0;  //cursor for string 1
    $c2 = 0;  //cursor for string 2
    $lcss = 0;  //largest common subsequence
    $local_cs = 0; //local common substring
    $trans = 0;  //number of transpositions ('ab' vs 'ba')
    $offset_arr=[];  //offset pair array, for computing the transpositions

    while (($c1 < $l1) && ($c2 < $l2)) {
        if (substr($s1,$c1,1) == substr($s2,$c2,1)) {
            $local_cs++;
            while (sizeof($offset_arr)) {  //see if current match is a transposition
                if ($c1<=$offset_arr[0][0] || $c2 <= $offset_arr[0][1]) {
                    $trans++;
                    break;
                } else {
                    array_splice($offset_arr,0,1);
                }
            }
            array_push($offset_arr,array($c1,$c2));
        } else {
            $lcss+=$local_cs;
            $local_cs=0;
            if ($c1!=$c2) {
                $c1=$c2=min($c1,$c2);  //using min allows the computation of transpositions
            }
            //if matching characters are found, remove 1 from both cursors (they get incremented at the end of the loop)
            //so that we can have only one code block handling matches 
            for ($i = 0; $i < $maxOffset; $i++) {
                if (($c1 + $i < $l1) && (substr($s1,$c1 + $i,1) == substr($s2,$c2,1))) {
                    $c1+= $i-1; 
                    $c2--;
                    break;
                }
                if (($c2 + $i < $l2) && (substr($s1,$c1,1) == substr($s2, $c2 + $i,1))) {
                    $c1--;
                    $c2+= $i-1;
                    break;
                }
            }
        }
        $c1++;
        $c2++;
        if ($maxDistance)
        {
            $temporaryDistance=max($c1,$c2)-$lcss+$trans/2.0;
            if ($temporaryDistance>=$maxDistance) return round($temporaryDistance);
        }
    }
    $lcss+=$local_cs;
    return round(max($l1,$l2)- $lcss +$trans/2.0); //remove half the number of transpositions from the $lcss
}
The Simplest and General versions of the algorithm are left as an exercise for the reader, since I haven't worked in PHP for more than a decade.
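For anyone translating the Simplest variant into other languages, here is a direct JavaScript port of the C# SimplestDistance method above; it makes a convenient reference because it drops transpositions and maxDistance and keeps only the core cursor and offset-search logic:

```javascript
// JavaScript port of the SimplestDistance C# method above: no transpositions,
// no maxDistance, just the cursors, the local common substrings and the offset search.
function sift4Simplest(s1, s2, maxOffset) {
    if (!s1 || !s1.length) return s2 ? s2.length : 0;
    if (!s2 || !s2.length) return s1.length;

    var l1 = s1.length, l2 = s2.length;
    var c1 = 0;       // cursor for string 1
    var c2 = 0;       // cursor for string 2
    var lcss = 0;     // largest common subsequence
    var local_cs = 0; // local common substring

    while (c1 < l1 && c2 < l2) {
        if (s1.charAt(c1) === s2.charAt(c2)) {
            local_cs++;
        } else {
            lcss += local_cs;
            local_cs = 0;
            if (c1 !== c2) {
                c1 = c2 = Math.max(c1, c2);
            }
            // if matching characters are found, remove 1 from both cursors
            // (they get incremented at the end of the loop)
            for (var i = 0; i < maxOffset; i++) {
                if (c1 + i < l1 && s1.charAt(c1 + i) === s2.charAt(c2)) {
                    c1 += i - 1;
                    c2--;
                    break;
                }
                if (c2 + i < l2 && s1.charAt(c1) === s2.charAt(c2 + i)) {
                    c1--;
                    c2 += i - 1;
                    break;
                }
            }
        }
        c1++;
        c2++;
    }
    lcss += local_cs;
    return Math.max(l1, l2) - lcss;
}

console.log(sift4Simplest("smtih", "smith", 5)); // 1
```

Because this variant does not track transpositions, a swap like "smtih" vs "smith" still costs 1, but only because the offset search recovers the match, not because half a transposition was subtracted.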

T-SQL implementation
-- <summary>
-- Static distance algorithm working on strings, computing transpositions as well as stopping when maxDistance is reached.
-- </summary>
-- <param name="s1"></param>
-- <param name="s2"></param>
-- <param name="maxOffset"></param>
-- <param name="maxDistance"></param>
-- <returns></returns>
CREATE FUNCTION Sift4Common(@s1          NVARCHAR(max), 
                           @s2          NVARCHAR(max), 
                           @maxOffset   INT, 
                           @maxDistance INT) RETURNS INT 
AS 
  BEGIN 
      DECLARE @l1 INT = Len(ISNULL(@s1, '')); 
      DECLARE @l2 INT = Len(ISNULL(@s2, '')); 

      IF ( @l1 = 0 ) 
        RETURN @l2; 

      IF ( @l2 = 0 ) 
        RETURN @l1; 

      DECLARE @c1 INT = 0; 
      DECLARE @c2 INT = 0; 
      DECLARE @lcss INT = 0; 
      DECLARE @local_cs INT = 0; 
      DECLARE @trans INT = 0; 
      DECLARE @offset_arr TABLE 
        ( 
           C1 INT, 
           C2 INT 
        ) 
      DECLARE @i INT 
      DECLARE @temporaryDistance FLOAT 
      DECLARE @result INT 

      WHILE ( ( @c1 < @l1 ) 
              AND ( @c2 < @l2 ) ) 
        BEGIN 
            IF ( Substring(@s1, @c1 + 1, 1) = Substring(@s2, @c2 + 1, 1) ) 
              BEGIN 
                  SET @local_cs=@local_cs + 1; 
                  SET @trans = @trans + 
                      CASE WHEN EXISTS(SELECT 1 
                                       FROM @offset_arr o 
                                       WHERE @c1 <= o.C1 
                                          OR @c2 <= o.C2) 
                           THEN 1 ELSE 0 END; 

                  DELETE FROM @offset_arr 
                  WHERE  @c1 > C1 
                         AND @c2 > C2; 

                  INSERT INTO @offset_arr 
                  VALUES     (@c1, 
                              @c2); 
              END 
            ELSE 
              BEGIN 
                  SET @lcss = @lcss + @local_cs; 
                  SET @local_cs = 0; 

                  IF ( @c1 != @c2 ) 
                    -- using min allows the computation of transpositions 
                    BEGIN 
                        IF ( @c1 < @c2 ) 
                          BEGIN 
                              SET @c2=@c1; 
                          END 
                        ELSE 
                          BEGIN 
                              SET @c1=@c2; 
                          END 
                    END 

                  --if matching tokens are found, remove 1 from both cursors (they get incremented at the end of the loop)
                  --so that we can have only one code block handling matches  
                  SET @i=0; 

                  WHILE ( @i < @maxOffset ) 
                    BEGIN 
                        IF ( ( @c1 + @i < @l1 ) 
                             AND Substring(@s1, @c1 + @i + 1, 1) = 
                                 Substring(@s2, @c2 + 1, 1) 
                           ) 
                          BEGIN 
                              SET @c1 = @c1 + @i - 1; 
                              SET @c2=@c2 - 1; 

                              BREAK; 
                          END 

                        IF ( ( @c2 + @i < @l2 ) 
                             AND Substring(@s1, @c1 + 1, 1) = Substring(@s2, 
                                                              @c2 + @i + 1, 1 
                                                              ) 
                           ) 
                          BEGIN 
                              SET @c1 = @c1 - 1; 
                              SET @c2=@c2 + @i - 1; 

                              BREAK; 
                          END 

                        SET @i=@i + 1; 
                    END 
              END 

            SET @c1=@c1 + 1; 
            SET @c2=@c2 + 1; 

            IF ( @maxDistance > 0 ) 
              BEGIN 
                  IF ( @c1 > @c2 ) 
                    BEGIN 
                        SET @temporaryDistance = @c1 - ( @lcss - @trans / 2.0 ); 
                    END 
                  ELSE 
                    BEGIN 
                        SET @temporaryDistance = @c2 - ( @lcss - @trans / 2.0 ); 
                    END 

                  IF ( @temporaryDistance >= @maxDistance ) 
                    RETURN Round(@temporaryDistance, 0); 
              END 
        END 

      SET @lcss = @lcss + @local_cs; 

      --remove half the number of transpositions from the lcss 
      IF ( @l1 > @l2 ) 
        BEGIN 
            SET @result = Round(@l1 - ( @lcss - @trans / 2.0 ), 0); 
        END 
      ELSE 
        BEGIN 
            SET @result = Round(@l2 - ( @lcss - @trans / 2.0 ), 0); 
        END 

      RETURN @result; 
  END 

Clearly a general version of the algorithm is not possible in Transact SQL.

Monday, November 10, 2014

By default, IISExpress runs in 32bit mode, even on 64bit Windows.

I had this memory problem that I could not understand. OK, the design of the application was not the best in the world, but why would it always give me a damn OutOfMemoryException when I have a super duper computer with a lot of memory and I had already solved issues like heap memory allocation? The reason is that IISExpress, the web server Visual Studio uses by default for debugging ASP.Net sites, runs as a 32bit process, meaning it can use a maximum of 4GB of memory no matter what you do. Luckily, you can make IISExpress run in 64bit mode by simply switching it on in the Windows registry. I don't want to copy the content of the article from someone else, so here is a link towards how to do it: Debugging VS2013 Websites Using 64-bit IIS Express.

Just in case you want the ultra fast version, copy this into a file with the .reg extension and execute it:
Windows Registry Editor Version 5.00

[HKEY_CURRENT_USER\Software\Microsoft\VisualStudio\12.0\WebProjects]
"Use64BitIISExpress"=dword:00000001

Cheers!

Aggregate function to concatenate strings in T-SQL

You know how many times you need to get the results of a query as a string list, let's say a CSV, and there is nothing in Microsoft SQL Server to help you? What you would like is an aggregate function, similar to SUM, that returns the concatenated value of all strings in the SELECT query. Before SQL Server 2008 that was indeed the case: you needed to use variables, as in SELECT @S=@S+[Value] FROM .... But SQL Server 2008 added more XML support and with it the data() XML PATH method. Note that this method adds a space between atomic values. So, without further ado, here is the code to concatenate several values into a comma separated value string:
DECLARE @T TABLE(S NVARCHAR(Max))
INSERT INTO @T VALUES('Filet'),('T-bone'),('Sausage'),('Würstel') -- enough with the fruity examples!

SELECT CONVERT(NVARCHAR(Max),(SELECT S+',' AS 'data()' FROM @T t
FOR XML PATH('')))

Result: Filet, T-bone, Sausage, Würstel, - of course, remove the last comma.
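Removing that last comma can be done in whatever layer consumes the result. As a small sketch of the cleanup step (written in JavaScript here, assuming the query result has already been fetched as a string):

```javascript
// Trim the trailing separator that the SELECT ... FOR XML PATH('') trick leaves behind.
function trimTrailingComma(csv) {
    return csv.replace(/,\s*$/, "");
}

console.log(trimTrailingComma("Filet, T-bone, Sausage, Würstel,"));
// Filet, T-bone, Sausage, Würstel
```

Alternatively, the same trimming can be done in T-SQL itself with SUBSTRING or STUFF before the string ever leaves the database.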

Friday, November 07, 2014

After 15 years of Naruto, Naruto Shippuden manga ends at chapter 700

It was inevitable: both Naruto and Sasuke were getting ridiculously strong. In the end they fought the mother of all chakra and... of course they won, then they fought each other, but it was kind of underwhelming, since their power prevented any subtlety and they just went cowboy, punching each other. The last color chapter is about how they leave it all to the next generation, although it is hard to think of anything more they could do to top their parents. I loved the entire series and it is easy to understand why: a simple concept, positive feelings like friendship and camaraderie, and weird magical ninja fights. I was a teen when I started watching the anime and now I am freakishly old. Well, life happens. After I got kind of tired of watching the anime, even though it was really well done and followed the manga faithfully, I went on to reading the manga. I like to use Mangastream for my reading purposes, so you can read the entire thing here: Naruto Shippuden. Even if it appears they are writing some Naruto side stories, I am not sure I will ever read them. I am still looking for a manga that can grab me like Naruto did.

Thursday, November 06, 2014

Crazy cool computer moves in a game analysis (King's gambit accepted)

I was watching a video from GM Niclas Huschenbeth in which he played lytura in an online game. Amazingly, he lost, and I think what actually went wrong was that he underestimated his opponent. At the end of the video it was difficult to see exactly what White could have done after a certain point, so I analysed the game with the computer and found some amazing moves. First I will show you the original game. I urge you to think it through, like a chess puzzle, and see which moves you would have played differently, before you watch the game as the computer suggested it. You can also watch the video online at the end of the post and, if you like chess, I really recommend Huschenbeth's channel. Not only is he a great player, but also a decent and nice guy, and young, too. His Blitz & Talk GM Special videos are especially cool, since he plays with other world class grandmasters.

But enough of that. Here is the game, up to a point where it didn't really matter what happened: 1. e4 e5 2. f4 exf4 3. Nf3 g5 4. Nc3 g4 5. Ne5 Qh4+ 6. g3 fxg3 7. Qxg4 g2+ 8. Qxh4 gxh1=Q 9. Qh5 Nh6 10. d3 d6 11. Bxh6 Be6 12. Bxf8 Rxf8 13. Nf3 Nd7 14. O-O-O c6 15. Bh3 Nf6 16. Rxh1 Nxh5 17. Bf1 Rg8 18. Ne2 Kd7 19. Kd2 Rg7 20. Ke3 Rag8 21. a3 Nf6 22. h3 b6 23. d4 a5 24. Nf4 Rg3 25. Ne2

Here is the video of the game, to give you the time to think it through:


And finally, here are two lines that the computer recommended. This is considering that the fateful 10. d3 was played already: 1. e4 e5 2. f4 exf4 3. Nf3 g5 4. Nc3 g4 5. Ne5 Qh4+ 6. g3 fxg3 7. Qxg4 g2+ 8. Qxh4 gxh1=Q 9. Qh5 Nh6 10. d3 d6 11. Bxh6 Be6 12. Bxf8 Rxf8 13. Nf3 Nd7 14. O-O-O c6 15. Nb5 Nf6 (15. .. cxb5 16. Bh3 Qxd1+ 17. Kxd1 O-O-O 18. Bxe6 fxe6 19. Nd4 {White +1.2}) 16. Nxd6+ Kd7 17. Qh3 Bxh3 18. Bxh3+ Kxd6 19. Rxh1 {Black +0.3}

Did you see those Nb5 and Qh3 moves?! Who does that? :)