Tuesday, May 26, 2015

Determine the best matching regular expression for a string

I had this situation where I had to match a string by several regular expressions and choose the most specific match. Let me explain it a little bit further: you have a string and several regular expressions that match that string. The requirement is to choose which one matches better. This would be useful if, for example, you want to match mobile manufacturers and so you define one configurations for Apple and another for Android and another for the rest. The regular expressions would be "Apple", "Android" and ".*" (which matches everything). So you would want to choose only "Apple" for Apple devices and not ".*".

Now, I've studied regular expressions a bit, but I don't want to implement a regular expression parser in order to compute the complexity of the expression tree. What I want is to have various Regex objects and, somehow, compare their complexity. I am not going to pretend that I understand what goes on within the Regex object, instead I will say that I noticed the complexity of a regular expression is directly proportional to the length of the pattern and, in the case of the Regex object, with the number of items in the _codes int32 array found in the code RegexCode object found as a private field of the Regex object.

So, the simplest code for that is this:
private double regexComplexity(string pattern)
{
    var reg = new Regex(pattern, RegexOptions.Singleline | RegexOptions.IgnoreCase);
    var codeField = reg.GetType().GetField("code", BindingFlags.Instance | BindingFlags.NonPublic);
    var code = codeField.GetValue(reg);
    var _codesField = code.GetType().GetField("_codes", BindingFlags.Instance | BindingFlags.NonPublic);
    var codes = (int[])_codesField.GetValue(code);
    return codes.Length;
}

However we must take into consideration several cases:
  • The regular expression pattern might be wrong
  • Reflection is slow

So, let's first catch all exceptions and use the length of the pattern as the complexity and then find a way to cache the complexity of a string, in case we use it more than once. Also, cache the FieldInfo objects for use at every call.

Here is the final code:
public static class RegularExpressionExtensions
{
    private static ConcurrentDictionary<string, int> _dict;
    private static FieldInfo _codeField;
    private static FieldInfo _codesField;

    static RegularExpressionExtensions()
    {
        _dict = new ConcurrentDictionary<string, int>();
        _codeField = typeof(Regex).GetField("code", BindingFlags.Instance | BindingFlags.NonPublic);
        _codesField = _codeField.FieldType.GetField("_codes", BindingFlags.Instance | BindingFlags.NonPublic);
    }

    public static int Complexity(this Regex reg)
    {
        if (reg == null) return 0;
        var pattern = reg.ToString();
        var complexity = _dict.GetOrAdd(pattern, p => getComplexity(reg));
        return complexity;
    }

    private static int getComplexity(Regex reg)
    {
        var code = _codeField.GetValue(reg);
        var codes = (int[])_codesField.GetValue(code);
        return codes.Length;
    }
}

Basically we encapsulate the Regex complexity in an static extension class that we can use on any Regex instance as reg.Complexity().

0 comments: