Monday, May 26, 2014

Detecting the language of a text in C#

This is a short post about adding language detection to your application. I will be demonstrating NTextCat, a .NET port of text_cat, a Perl script that itself implements the approach described in the 1994 paper N-Gram-Based Text Categorization.
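For some intuition about the technique, here is a minimal sketch of the rank-order idea: build a list of the most frequent character n-grams for a text, then measure how far out of place each n-gram sits compared to a language profile. This is not NTextCat's code. The class NGramSketch, its method names and the limits (n-grams up to length 3, the top 300 entries) are assumptions made for illustration only.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, not part of NTextCat.
public static class NGramSketch
{
    // Returns the most frequent character n-grams (n = 1..3), most frequent first.
    public static List<string> BuildProfile(string text, int maxNGrams = 300)
    {
        var counts = new Dictionary<string, int>();
        var words = text.ToLowerInvariant()
            .Split(new[] { ' ', '\t', '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries);
        foreach (var word in words)
        {
            // Pad each word so that n-grams at word boundaries are distinguishable.
            var padded = "_" + word + "_";
            for (var n = 1; n <= 3; n++)
            {
                for (var i = 0; i + n <= padded.Length; i++)
                {
                    var gram = padded.Substring(i, n);
                    int current;
                    counts.TryGetValue(gram, out current);
                    counts[gram] = current + 1;
                }
            }
        }
        // Ties are broken by ordinal comparison so the ranking is deterministic.
        return counts
            .OrderByDescending(pair => pair.Value)
            .ThenBy(pair => pair.Key, StringComparer.Ordinal)
            .Take(maxNGrams)
            .Select(pair => pair.Key)
            .ToList();
    }

    // "Out-of-place" distance: the sum of rank differences between the two profiles.
    // An n-gram missing from the language profile gets the maximum penalty.
    public static int Distance(List<string> languageProfile, List<string> documentProfile)
    {
        var total = 0;
        for (var rank = 0; rank < documentProfile.Count; rank++)
        {
            var position = languageProfile.IndexOf(documentProfile[rank]);
            total += position < 0 ? languageProfile.Count : Math.Abs(position - rank);
        }
        return total;
    }
}
```

The language whose profile has the smallest distance to the document's profile is the best guess. A real library works from profiles trained on much larger samples, so treat this only as an illustration of the principle.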

There are four steps to language detection using NTextCat:
  • Reference NTextCat - the library is now available as a NuGet package as well
  • Instantiate an identifier factory (usually RankedLanguageIdentifierFactory)
  • Get an instance of a RankedLanguageIdentifier from the factory by loading a language XML file
  • Call the Identify method on your text to get a list of languages, ordered by how likely it is that the text is in that language

Here is a piece of code that does that using the core XML file published with the library. Remember to add the XML file to your project (the code expects it in a LanguageModels folder) and set its Copy to Output Directory property so that it ends up next to your binaries.

using System;
using System.IO;
using System.Linq;
using NTextCat;

public class LanguageProcessor
{
    // Loaded on first use and then reused. Note that this lazy load is not thread-safe.
    private RankedLanguageIdentifier _identifier;

    // Returns the ISO 639-3 code of the most likely language, or null if none was found.
    public string IdentifyLanguage(string text)
    {
        if (_identifier == null)
        {
            var file = new FileInfo(Path.Combine(AppDomain.CurrentDomain.BaseDirectory, "LanguageModels", "Core14.profile.xml"));
            if (!file.Exists)
            {
                throw new FileNotFoundException("Could not find LanguageModels/Core14.profile.xml to detect the language", file.FullName);
            }
            using (var readStream = File.OpenRead(file.FullName))
            {
                var factory = new RankedLanguageIdentifierFactory();
                _identifier = factory.Load(readStream);
            }
        }
        // Identify returns the candidate languages, most likely first.
        var languages = _identifier.Identify(text);
        var mostCertainLanguage = languages.FirstOrDefault();
        if (mostCertainLanguage != null)
        {
            return mostCertainLanguage.Item1.Iso639_3;
        }
        return null;
    }
}

The library comes with a lot of XML files, some built from Wikipedia, for example, and handling 280+ languages, but for the casual purpose of finding non-English text in a list, the core one will suffice.
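As a rough sketch of that casual use case, the helper below filters a list down to the entries that a detector does not report as English. NonEnglishFilter and FindNonEnglish are hypothetical names, the detector is passed in as a delegate so the snippet does not depend on NTextCat, and it assumes that English is reported with the ISO 639-3 code "eng".

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, not part of NTextCat.
public static class NonEnglishFilter
{
    // detectLanguage is expected to return an ISO 639-3 code such as "eng",
    // or null when no language could be determined.
    public static List<string> FindNonEnglish(IEnumerable<string> texts, Func<string, string> detectLanguage)
    {
        return texts
            // Empty entries carry no signal, so skip them.
            .Where(text => !string.IsNullOrWhiteSpace(text))
            .Where(text =>
            {
                var code = detectLanguage(text);
                // Undetermined entries (null) are not flagged; only a definite non-English result is.
                return code != null && code != "eng";
            })
            .ToList();
    }
}
```

With the class from this post, a call could look like NonEnglishFilter.FindNonEnglish(lines, new LanguageProcessor().IdentifyLanguage). Very short strings give an n-gram based detector little to work with, so the result is probably best treated as a list of candidates to review rather than a final verdict.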
