Thursday, May 29, 2008

Text searching in .NET, not as I expected.

Update: the title of the post should have been "Text searching in general, not as I expected". I did a quick test in javascript. Do 1000 indexOf and then 1000 search(RegExp object) in a long text and measure time. Here are the results:
Internet Explorer 7: search 90% of indexOf time!
FireFox: search 355% of indexOf time!?!
Chrome: search and indexOf same speed.

Update February 2016: I've tried it again, with Firefox 44.0.2, Chrome 48.0.2564.109 m and Internet Explorer 11.0.9600.18163 and indexOf is marginally faster than regex for all of them. For the case insensitive search, Firefox and Chrome take 50% than indexOf, while Internet Explorer is not only the fastest, but also only adds 4% when searching case insensitive.

It makes sense to have the same implementation of indexOf and regular expressions as in .Net for the Microsoft browser, so I am not that surprised, but what happened with FireFox? My hypothesis is that they used Boyer-Moore in indexOf (as any decent programmer should have!) yet used a simpler implementation for their regular expression engine.

A few days ago I was preaching that IndexOf with StringComparison.Ordinal is the fastest thing for text searching. I was even implying that it uses the Boyer-Moore algorithm from a Windows system library written in c++. I was dead wrong!

Thanks to Meaflux who told me that the Internet said things differently, I delved even further into the code, past the .NET code and into the library code of mscorwks.dll where the IndexOfString method that ordinal IndexOf uses, written in c++, appears to be a pathetic linear algorithm!!

Even worse, the Boyer-Moore algorithm is used in the .NET framework in the Regex class! The Regex class is written entirely in managed code, even if it uses all sorts of cool hacks. I've also tried a Boyer-Moore implementation from Codeproject that also included the Apostolico-Giancarlo, BCL, Horspool and Turbo Boyer Moore algorithms. IndexOf was slower!

Anyway, the code is simple: take a one megabyte Harry Potter text book and search a 100 byte text from somewhere near the middle of the text. Use IndexOf with StringComparison.Ordinal, then Regex with both variants, compiled and ad-hoc creation, then the class from CodeProject.

    Results for 100 tries:
  • Ordinal IndexOf - 390 milliseconds
  • Normal IndexOf - 2380 milliseconds (6 times slower!!)
  • AdHoc Regex - 190 milliseconds (take into acount the creation of the regex 100 times plus the use of Regex.Escape 100 times on the search pattern)
  • Compiled Regex - 160 milliseconds
  • Codeproject Boyer-Moore - 230 milliseconds

Would you have believed this?

Maybe the algorithm was better because the search string was 100 bytes, so I made it 12 bytes and redid the test:

    Results for 100 tries:
  • IndexOf - 440 milliseconds(more than with 100 bytes!)
  • Normal IndexOf - 2410 milliseconds!!
  • AdHoc Regex - 235 milliseconds
  • Compiled Regex - 230 milliseconds
  • Codeproject Boyer-Moore - 368 milliseconds (a lot more, similar to IndexOf, yet still less)

Ok, maybe the problem was that the string in which I searched was big. Let's make it 5000 bytes!

    Results for 100000 (1000 times more operations):
  • Ordinal IndexOf - 1109 milliseconds
  • Normal IndexOf - 1109 milliseconds
  • AdHoc Regex - 2500 milliseconds
  • Compiled Regex - 375 milliseconds!!!
  • Codeproject Boyer-Moore - 1312 milliseconds (finally something slower than IndexOf)

So Regex appears to be best for string searching in most cases!

Update: after optimizing the Boyer-Moore class and adding my own goodness to it, I made more tests. The Boyer-Moore algorithm can be improved by adding more preprocessing, therefore it is more and more efficient as the pattern size increases and the number of pattern searches decreases. In those cases (pattern length of a few thousand characters) Regex is not so good, especially if one has to Escape the entire sequence and instantiate the class on each search. Anything over 500 characters in search pattern length made Regex perform less than the most preprocessing BM I've used.


Madhur said...

This is interesting results. How did you measure the time ?

There are many other things to be taken in context like no. of processes, process switch, thread priority etc. ..

Siderite said...

I did a classic DateTime.Now-DateTime.Then :) But on 100 runs of the same method and with no other programs running in the background.

Make your own test with IndexOf, then replace input.IndexOf(search) with:
Match m=Regex.Match(input,Regex.Replace(search),RegexOptions.CultureInvariant|RegexOptions.SingleLine);
return m.Success?m.Index:-1;

Siderite said...

Oh, and when you use this code, use different search strings everytime as .Net caches about 15 of the latest Regex used. Use New Regex if you want to use the same string.