Thursday, January 23, 2014

Comparing the content of two similar web pages

For a personal project of mine I needed to gather a lot of data and condense it into a newsletter. What I needed was to take information from selected blogs, google queries and various pages that I find and take only what was relevant into account. Great, I thought, I will make a software to help me do that. And now, proverbially, I have two problems.

The major issue is that after getting all the info I needed, I was stuck on reading thousands of web pages to get to the information I needed. I was practically spammed. The thing is that there aren't even so many stories, it's just the same content copied from news site to news site, changing only the basic structure of the text, maybe using other words or expanding and collapsing terms in and out of abbreviations and sometimes just pasting it exactly as it was in the source, but displayed in a different web page, with a different template.

So the challenge was to compare two or more web pages for the semantic similarity of the stories. While there is such theory as semantic text analysis, just google for semantic similarity and you will get mostly PDF academic white papers and software that is done in Python or some equally disgusting language used only in scientific circles. And while, true, I was intrigued and for a few days I entertained the idea of understanding all that and actually building a C# library up to the task, I did not have the time for it. Not to mention that the data file I was supposed to parse was growing day by day while I was dallying in arcane algorithms.

In conclusion I used a faster and more hackish way to the same end. Here is how I did it.

The first major hurdle was to clear the muck from the web page and get to the real information. A simple html node innerText would not do. I had to ignore not only HTML markup, but such lovely things as menus, ads, sidebars with blog information, etc. Luckily there is already a project that does that called Boilerpipe. And before you jump at me for linking to a Java project, there is also a C# port, which I had no difficulties to download and compile.

At the time of the writing, the project would not compile well because of its dependency to a Mono.Posix library. Fortunately the library was only used for two methods that were never used, so I just removed the reference and the methods and all was well.

So now I would mostly have the meaningful text of both web pages. I needed an algorithm to quickly determine their similarity. I skipped the semantic bit of the problem altogether (trying to detect synonyms or doing lexical parsing) and I resorted to String Kernels. Don't worry if you don't understand a lot of the Wikipedia page, I will explain how it works right away. My hypothesis was that even if they change some words, the basic structure of the text remains the same, so while I am trying to find the pages with the same basic meaning, I could find them by looking for pages with the same text structure.

In order to do that I created for each page a dictionary with string keys and integer values. The keys would be text n-grams from the page (all combinations of three characters that are digits and letters) and the values the count of those kernels in the Boilerpipe text. At first I also allowed spaces in the character list of kernels, but it only complicated the analysis.

To compare a page to others, I would take the keys in the kernel dictionary for my page and look for them in the dictionaries of other pages, then compute a distance out of the counts. And it worked! It's not always perfect, but sometimes I even get pages that have a different text altogether, but reference the same topic.

You might want to know what made me use 3-grams and not words. The explanation comes mostly from what I read first when I started to look for a solution, but also has some logic. If I would have used words, then abbreviations would have changed the meaning of the text completely. Also, I did not know how many words would have been in a few thousand web pages. Restricting the length to three characters gave me an upper limit for the memory used.

Conclusion: use the .Net port of Boilerpipe to extract text from the html, create a kernel dictionary for each page, then compute the vector distance between the dictionaries.

I also found a method to compare the dictionaries better. I make a general kernel dictionary (for all documents at once) and then the commonality of a bit of text is the number of times it appears divided by the total count of kernels. Or the number of documents in which it is found divided by the total number of documents. I chose commonality as the product of these two. Then, one computes the difference between kernel counts in two documents by dividing the squared difference for each kernel by its commonality and adding the result up. It works much better like this. Another side effect of this method is that one can compute how "interesting" a document is, by adding up the counts of all kernels divided by their commonality, then dividing that to the length of the text (or the total count of kernels). The higher the number, the less common its content would be.