Tuesday, May 22, 2012

On RegexOptions.IgnoreCase performance

I found a bit of code today that tested if a bunch of strings were found in another. It used IndexOf for each of the strings and continued to search if not found. The code, a long list of ifs and elses, looked terrible. So I thought I would refactor it to use regular expressions. I created a big Regex object, using the "|" regular expression operator and I tested for speed.

( Actually, I took the code, encapsulated it into a method that then went into a new object, then created the automated unit tests for that object and only then I proceeded in writing new code. I am very smug because usually I don't do that :) )

After the tests said the new code was good, I created a new test to compare the increase in performance. It is always good to have a metric to justify the work you have been doing. So the old code worked in about 3 seconds. The new code took 10! I was flabbergasted. Not only that I couldn't understand how that could happen, how several scans of the same string could be faster than a single one, but I am the one that wrote the article that said IndexOf is slower than Regex search (at least it was so in the .Net 2.0 times and I could not replicate the results in .Net 4.0). It was like a slap in the face, really.

I proceeded to change the method, having now a way to determine increases in performance, until I finally figured out what was going on. The original code was first transforming the text into lowercase, then doing IndexOf. It was not even using IndexOf with StringComparison.OrdinalIgnoreCase which was, of course, a "pfff" moment for me. My new method was, of course, using RegexOptions.IgnoreCase. No way this option would slow things down. But it did!

You see, when you have a search of two strings, separated by the "|" regular expression operator, inside there is a tree of states that is created. Say you are searching for "abc|abd", it will search once for "a", then once for "b", then check the next character for "c" or "d". If any of these conditions fail, the match will fail. However, if you do a case ignorant match, for each character there will be at least two searches per letter. Even so, I expected only a doubling of the processing length, not the whooping five times decrease in speed!

So I did the humble thing: I transformed the string into lowercase, then did a normal regex match. And the whole thing went from 10 seconds to under 3. I am yet to understand why this happens, but be careful when using the case ignorant option in regular expressions in .Net.


cbp said...

From what I understand IgnoreCase doesn't just equate 'a' with 'A', but also with 'À', 'Ã', 'Ä' etc. It also does a culture lookup to work out what it should do.
That's where your performance is going!

Siderite said...

Exactly! That's why I wrote "at least two searches per letter".

I still am not sure what I think of this. In Romanian we have several types of "a" characters, with a sign on top to indicate the difference, that represent particular sounds. But I wouldn't consider any of them "a"s. They have their own lower and upper case, too. I know in French there are accents that make a vowel sound differently, but I also remember they give different meanings to words, as well.

Also, I have to consider the possibility that when a regular expression has need of backtracking, it will repeat some of these searches per character. Until today, if I had seen something like: text=text.ToLower(); reg.Match(text); I would have refactored it to using RegexOptions.IgnoreCase.