Friday, January 31, 2014

Deserializing/Serializing XML that contains xsi:type attributes (and other XML adventures)

I wanted to take an arbitrary XML format and turn it into C# classes. For a while I even considered writing my own IXmlSerializable implementation of the classes, but I quickly gave up because there were so many of them and they were so deeply nested. Before we proceed, you should know that there are several ways to turn XML into C# classes. Here is a short list (google them to learn more):
  • In Visual Studio 2012, all you have to do is copy the XML, then go to Edit -> Paste Special -> Paste XML as classes. There is an option for pasting JSON there as well.
  • There is the xsd.exe option. This is usually shipped with the Windows SDK and you have to either add the folder to the PATH environment variable so that the utility works everywhere, or use the complete path (which depends on which version of SDK you have).
  • xsd2Code is an add-on for Visual Studio which adds a menu option when you right-click an .xsd file in the Solution Explorer to transform it into classes
  • A zillion other custom-made tools that transform the XML into whatever

Anyway, here is how to turn this XML into classes manually (since I didn't like the output of any of the tools above and some of them even crashed):
  • Create a class that is decorated with the XmlRoot attribute. If the root element has a namespace, don't forget to specify the namespace as well. Example:
    [XmlRoot(ElementName = "RootElement", Namespace = "", IsNullable = false)]
  • For each descendant element you create a class. You add a get/set property to the parent element class, then you decorate it with the XmlElement (or XmlAttribute, XmlText, etc.) attribute. Specify the ElementName as the exact name of the element in the source XML, and the Namespace URL if it is different from the namespace of the document root. Example:
    [XmlElement(ElementName = "Integer", Namespace = "")]
  • If there are supposed to be more children elements of the same type, just set the type of the property to an array or a List of the class type representing one element
  • Create an instance of an XmlSerializer using the type of the root element class as a parameter. Example:
    var serializer = new XmlSerializer(typeof(RootElementEntity));
  • Create an XmlSerializerNamespaces instance and add all the namespaces in the document to it. Example:
    var ns = new XmlSerializerNamespaces(); ns.Add("ss", ""); ns.Add("ds", "");
  • Use the namespaces instance to serialize the class. Example: serializer.Serialize(stream, instance, ns);

The above technique serializes a RootElementEntity instance to something similar to:
<ss:RootElement xmlns:ss="" xmlns:ds="">
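Putting the steps above together, here is a minimal, self-contained sketch. Note that the class names, the Integer property and the namespace URL are placeholders of my own (the real namespace URLs were elided from the snippets above):

```csharp
using System;
using System.IO;
using System.Xml.Serialization;

[XmlRoot(ElementName = "RootElement", Namespace = "http://example.com/ss", IsNullable = false)]
public class RootElementEntity
{
    [XmlElement(ElementName = "Integer", Namespace = "http://example.com/ss")]
    public int Integer { get; set; }
}

public static class Program
{
    public static void Main()
    {
        var serializer = new XmlSerializer(typeof(RootElementEntity));

        // map the prefixes to the namespaces actually used in your document
        var ns = new XmlSerializerNamespaces();
        ns.Add("ss", "http://example.com/ss");

        var instance = new RootElementEntity { Integer = 10 };
        using (var writer = new StringWriter())
        {
            serializer.Serialize(writer, instance, ns);
            Console.WriteLine(writer.ToString()); // <ss:RootElement ...> with the ss prefix
        }
    }
}
```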

Now, everything is almost good. The only problem I met doing this was trying to deserialize an XML containing xsi:type attributes. An InvalidOperationException was thrown with the message "The specified type was not recognized: name='TheType', namespace='', at ", followed by the XML element that caused the exception. (Note that this is the inner exception of the first InvalidOperationException thrown, which only says there was an error in the XML.)

I finally found the solution, even if it is not the most intuitive one. You need to create a type that inherits from the type you want associated with the element. Then you need to decorate it (and the original element) with an XmlRoot attribute specifying the namespace (even if the namespace is the same as that of the document root element). And then you need to decorate the base type with the XmlInclude attribute. Here is an example.

The XML:

<ss:RootElement xmlns:ss="" xmlns:ds="" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <ss:MyType xsi:type="ss:TheType">10</ss:MyType>
</ss:RootElement>

You need to create the class for MyType, inherit TheType from it, and decorate both as described above:

[XmlInclude(typeof(TheType))]
[XmlRoot(Namespace = "")]
public class MyTypeEntity {}

[XmlRoot(Namespace = "")]
public class TheType : MyTypeEntity {}
Removing any of these attributes makes the deserialization fail.
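With those attributes in place, deserialization becomes the usual call (the file name here is illustrative):

```csharp
var serializer = new XmlSerializer(typeof(RootElementEntity));
using (var stream = File.OpenRead("data.xml"))
{
    var root = (RootElementEntity)serializer.Deserialize(stream);
    // elements declared with xsi:type="ss:TheType" come back as TheType instances
}
```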

Hope this helps somebody.

Friday, January 24, 2014

T-SQL Convert and Cast turn empty string to default value, NOT null

It is a bit embarrassing not knowing this at my level of software development, but I was stunned to see that other people, even more experienced than I, had the same gap in their knowledge. Apparently Microsoft SQL Server converts empty or whitespace strings to default values when using CONVERT or CAST. So CONVERT(INT,''), equivalent to CAST('' AS INT), equals 0. DATETIME conversion leads to a value of 1900-01-01. And so on. That means that a good practice for data conversion, when you don't know what data you may be getting, is to always turn whitespace into null before using CONVERT or CAST. Also, in related news, newline is NOT whitespace in T-SQL, so LTRIM(CHAR(10)) and LTRIM(CHAR(13)) are not empty strings!

Bottom line: instead of CONVERT(<type>,<unknown string value>) use the cumbersome CONVERT(<type>,CASE WHEN LTRIM(RTRIM(<unknown string value>))!='' THEN <unknown string value> END). Same with CAST.
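To illustrate (the values here are my own; the TRY_CONVERT line needs SQL Server 2012 or later):

```sql
SELECT CONVERT(INT, '');        -- 0, not NULL
SELECT CAST('' AS DATETIME);    -- 1900-01-01 00:00:00.000
SELECT TRY_CONVERT(FLOAT, '');  -- 0, even though TRY_CONVERT(FLOAT, 'abc') is NULL

-- the cumbersome but safe version: whitespace becomes NULL before conversion
DECLARE @s VARCHAR(100) = '   ';
SELECT CONVERT(FLOAT, CASE WHEN LTRIM(RTRIM(@s)) != '' THEN @s END); -- NULL
```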

Here is a table of conversions for some values converted to FLOAT:

Value                            CONVERT/CAST       CONVERT with the CASE trick   TRY_CONVERT/TRY_CAST
'' (empty string)                0                  NULL                          0
' ' (whitespace)                 0                  NULL                          0
' ' (whitespace and newlines)    Conversion error   Conversion error              NULL

You might think this is not such a big deal, but in Microsoft SQL Server 2012 they introduced TRY_CONVERT and the similar TRY_CAST, which return NULL if there is a conversion error. This means that for an incorrect string value the functions return NULL for most inputs, but for an empty string they return the default value of the chosen type, resulting in inconsistent behavior.

Thursday, January 23, 2014

Comparing the content of two similar web pages

For a personal project of mine I needed to gather a lot of data and condense it into a newsletter. What I needed was to take information from selected blogs, google queries and various pages that I find, and take only what was relevant into account. Great, I thought, I will write some software to help me do that. And now, proverbially, I have two problems.

The major issue is that after getting all the info I needed, I was stuck on reading thousands of web pages to get to the information I needed. I was practically spammed. The thing is that there aren't even so many stories, it's just the same content copied from news site to news site, changing only the basic structure of the text, maybe using other words or expanding and collapsing terms in and out of abbreviations and sometimes just pasting it exactly as it was in the source, but displayed in a different web page, with a different template.

So the challenge was to compare two or more web pages for the semantic similarity of their stories. While there is a theory of semantic text analysis, google for semantic similarity and you will get mostly academic PDF white papers and software written in Python or some equally disgusting language used only in scientific circles. And while, true, I was intrigued and for a few days I entertained the idea of understanding all that and actually building a C# library up to the task, I did not have the time for it. Not to mention that the data file I was supposed to parse was growing day by day while I was dallying in arcane algorithms.

In conclusion I used a faster and more hackish way to the same end. Here is how I did it.

The first major hurdle was to clear the muck from the web page and get to the real information. A simple HTML node innerText would not do. I had to ignore not only HTML markup, but also such lovely things as menus, ads, sidebars with blog information, etc. Luckily, there is already a project that does that, called Boilerpipe. And before you jump at me for linking to a Java project, there is also a C# port, which I had no difficulty downloading and compiling.

At the time of writing, the project would not compile because of its dependency on a Mono.Posix library. Fortunately, the library was only used in two methods that were never called, so I just removed the reference and the methods and all was well.

So now I would mostly have the meaningful text of both web pages. I needed an algorithm to quickly determine their similarity. I skipped the semantic bit of the problem altogether (trying to detect synonyms or doing lexical parsing) and I resorted to String Kernels. Don't worry if you don't understand a lot of the Wikipedia page, I will explain how it works right away. My hypothesis was that even if they change some words, the basic structure of the text remains the same, so while I am trying to find the pages with the same basic meaning, I could find them by looking for pages with the same text structure.

In order to do that, I created for each page a dictionary with string keys and integer values. The keys would be the text n-grams of the page (all three-character combinations of digits and letters) and the values the counts of those kernels in the Boilerpipe text. At first I also allowed spaces in the kernels' character list, but it only complicated the analysis.

To compare a page to others, I would take the keys in the kernel dictionary for my page and look for them in the dictionaries of other pages, then compute a distance out of the counts. And it worked! It's not always perfect, but sometimes I even get pages that have a different text altogether, but reference the same topic.
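A minimal sketch of the idea (the names are my own, and the distance here is a plain Euclidean distance over the count vectors, one of several reasonable choices):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class Kernels
{
    // Build a dictionary mapping each 3-gram (letters and digits only) to its count.
    public static Dictionary<string, int> GetKernels(string text)
    {
        var chars = new string(text.Where(char.IsLetterOrDigit).ToArray()).ToLowerInvariant();
        var dict = new Dictionary<string, int>();
        for (var i = 0; i + 3 <= chars.Length; i++)
        {
            var gram = chars.Substring(i, 3);
            dict.TryGetValue(gram, out var count);
            dict[gram] = count + 1;
        }
        return dict;
    }

    // Distance between two kernel dictionaries; missing keys count as zero.
    public static double Distance(Dictionary<string, int> a, Dictionary<string, int> b)
    {
        var keys = new HashSet<string>(a.Keys);
        keys.UnionWith(b.Keys);
        double sum = 0;
        foreach (var key in keys)
        {
            a.TryGetValue(key, out var ca);
            b.TryGetValue(key, out var cb);
            sum += (ca - cb) * (double)(ca - cb);
        }
        return Math.Sqrt(sum);
    }
}
```

The smaller the distance, the more similar the text structure of the two pages.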

You might want to know what made me use 3-grams and not words. The explanation comes mostly from what I read first when I started looking for a solution, but it also has some logic to it. If I had used words, abbreviations would have changed the meaning of the text completely. Also, I did not know how many distinct words there would be in a few thousand web pages. Restricting the length to three characters gave me an upper limit for the memory used.

Conclusion: use the .Net port of Boilerpipe to extract text from the html, create a kernel dictionary for each page, then compute the vector distance between the dictionaries.

I also found a method to compare the dictionaries better. I make a general kernel dictionary (for all documents at once) and then the commonality of a bit of text is the number of times it appears divided by the total count of kernels. Or the number of documents in which it is found divided by the total number of documents. I chose commonality as the product of these two. Then, one computes the difference between kernel counts in two documents by dividing the squared difference for each kernel by its commonality and adding the result up. It works much better like this. Another side effect of this method is that one can compute how "interesting" a document is, by adding up the counts of all kernels divided by their commonality, then dividing that to the length of the text (or the total count of kernels). The higher the number, the less common its content would be.
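The weighting scheme described above could be sketched like this (again, the naming is my own; these are plain static methods over the kernel count dictionaries):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class WeightedKernels
{
    // Commonality of a kernel: (its total count / total count of all kernels)
    // multiplied by (number of documents containing it / total number of documents).
    public static Dictionary<string, double> GetCommonality(List<Dictionary<string, int>> documents)
    {
        var totalCounts = new Dictionary<string, int>();
        var documentCounts = new Dictionary<string, int>();
        long grandTotal = 0;
        foreach (var doc in documents)
        {
            foreach (var pair in doc)
            {
                totalCounts.TryGetValue(pair.Key, out var c);
                totalCounts[pair.Key] = c + pair.Value;
                grandTotal += pair.Value;
                documentCounts.TryGetValue(pair.Key, out var d);
                documentCounts[pair.Key] = d + 1;
            }
        }
        return totalCounts.ToDictionary(
            p => p.Key,
            p => (double)p.Value / grandTotal
                 * documentCounts[p.Key] / documents.Count);
    }

    // Squared difference of counts for each kernel, divided by its commonality:
    // differences in rare kernels weigh more than differences in common ones.
    public static double WeightedDistance(
        Dictionary<string, int> a, Dictionary<string, int> b,
        Dictionary<string, double> commonality)
    {
        double sum = 0;
        foreach (var pair in commonality)
        {
            a.TryGetValue(pair.Key, out var ca);
            b.TryGetValue(pair.Key, out var cb);
            sum += (ca - cb) * (double)(ca - cb) / pair.Value;
        }
        return sum;
    }
}
```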

Monday, January 13, 2014

A compressed string class

I admit this is not a very efficient class for my purposes, but it was a quick and dirty fix for a personal project, so it didn't matter. The class presented here stores a string in a compressed byte array if the length of the string exceeds a given value. I used it to solve an annoying XmlSerializer OutOfMemoryException when deserializing a very large XML (400MB) into a list of objects. My objects had a Content property that stored the content of HTML pages, and it went completely overboard when loaded into memory. The class uses the System.IO.Compression.GZipStream class that was introduced in .Net 2.0 (you have to add a reference to System.IO.Compression.dll). Enjoy!

    public class CompressedString
    {
        private byte[] _content;
        private int _length;
        private bool _compressed;
        private int _maximumStringLength;

        public CompressedString() : this(0)
        {
        }

        public CompressedString(int maximumStringLengthBeforeCompress)
        {
            _length = 0;
            _maximumStringLength = maximumStringLengthBeforeCompress;
        }

        public string Value
        {
            get
            {
                if (_content == null) return null;
                if (!_compressed) return Encoding.UTF8.GetString(_content);
                using (var ms = new MemoryStream(_content))
                using (var gz = new GZipStream(ms, CompressionMode.Decompress))
                using (var ms2 = new MemoryStream())
                {
                    gz.CopyTo(ms2);
                    return Encoding.UTF8.GetString(ms2.ToArray());
                }
            }
            set
            {
                if (value == null)
                {
                    _content = null;
                    _compressed = false;
                    _length = 0;
                    return;
                }
                _length = value.Length;
                var arr = Encoding.UTF8.GetBytes(value);
                if (_length <= _maximumStringLength)
                {
                    _compressed = false;
                    _content = arr;
                    return;
                }
                using (var ms = new MemoryStream())
                {
                    // the GZipStream must be closed before reading the buffer,
                    // otherwise the compressed data may be incomplete
                    using (var gz = new GZipStream(ms, CompressionMode.Compress))
                    {
                        gz.Write(arr, 0, arr.Length);
                    }
                    _compressed = true;
                    _content = ms.ToArray();
                }
            }
        }

        public int Length
        {
            get { return _length; }
        }
    }
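
Usage is as simple as assigning to Value (the threshold here is my own arbitrary choice):

```csharp
var cs = new CompressedString(100);  // compress anything longer than 100 characters
cs.Value = new string('x', 10000);   // stored internally as a gzipped byte array
Console.WriteLine(cs.Length);        // the original length, without decompressing
Console.WriteLine(cs.Value.Length);  // decompressed on demand
```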

Mysteries of the Microscopic World - a The Great Courses err.. course

I can't emphasize enough how cool the video courses from The Teaching Company (The Great Courses) are. They are in the format of a university course, but no one is there to take notes, so the pace of presentation is natural; it is all recorded on video. No black or white boards either, as the visualizations of what the presenter is saying are added later via computer. Most courses have from 10 to 40 lectures, all in easy to understand language, with no trace of the ridiculous tricks and populist stupidities of TV documentaries.

This course - Mysteries of the Microscopic World, presented by Bruce E. Fleury - is particularly interesting, as it discusses microorganisms in relation to human culture. Especially interesting are lectures 11 to 13, discussing the hideous pandemic of 1918, which nobody seems to talk about, make heroic movies about, or even remember, even though it killed from 50 to 100 million people. In comparison, the First World War killed a measly 8.5 million. Why is that? Is it, as Dr. Fleury suggests, that the pandemic was a horrible and completely unstoppable phenomenon from which no one felt they had escaped, or in the face of which there were no heroes? I find this almost as disgusting as the disease itself, that people would only want to document their triumphs.

Anyway, for an old guy, Bruce is a funny man. He is very eloquent and not at all boring, despite his fears. The course covers what microorganisms are, how they evolved, the perpetual arms race against other organisms (including us), how they influenced history, how they were used in biological warfare, AIDS and even allergies, all in 24 lectures. I think a lot of the information in this course is unlikely to be something you have accidentally overheard or been exposed to, and therefore of high quality.

As an additional bonus, you get to understand not only the evolution of medicine, but of all the quack snake oil ideas that are periodically emerging in "naive populations", truly epidemics in their own right, and even the source of some of the most common sayings and symbols. For example the symbol of medicine has little to do with the wisdom of snakes, but more with the procedure to remove nematode worms from someone's flesh by wrapping them slowly around a stick.

All in all a wonderful course, created and presented by a guy who is clearly averse to bullshit and who has read and worked quite a bit to make it. Give it a try!

Friday, January 10, 2014

TV Series I've been watching - Part 17

Happy New Year, everybody! A new year is starting and, with it, a lot of TV shows start and meet their demise. It is time to tell you what I've watched - as community service for you, of course :) - and what I think about them.

Let's start with the already described ones:

  • Doctor Who - The new season with Capaldi in the role of the 13th doctor has not started yet, but there was the 50th anniversary special, which pretty much saved Gallifrey and showed all the Doctors so far and even a bit of the new one, then the usual Christmas special, where Gallifrey saved The Doctor. Kind of a quid pro quo. Anyway, it seems to me that in less than 9 years they added at least 600 years to the venerable age of The Doctor. Soon he will become older than the Face of Boe, if he continues that way :)
  • True Blood - I will be watching the new season, but I don't have much hopes for the show now. It was good while it lasted, though.
  • The Good Wife - The old formula of the show started squeaking, as I observed in the last TV series post, so they cooked up something else. Alicia is leaving the company to form her own, to the dismay and hurt of Will, who is now intent on hurting her back. A game of legal cat and mouse ensues (see what I did there?). It is a breath of fresh air for the show, but I don't know how long it can last. Will's character cannot maintain its value if he loses to Alicia in court all day.
  • Haven - the fourth season sees Sheriff Carter from Eureka as a psychotic bad guy. Good for him. They finally defeat him, only to find a very changed Audrey Parker. Parker and Carter sitting in a tree...
  • Southpark - Southpark had some very funny episodes lately. Not the best, but pretty cool.
  • Homeland - Season three was weird. High tension, quick turns of the situation, great acting. Unfortunately the characters themselves lost their charisma and empathy value. I had no idea why anyone did what they did, even when they explained it several times. Not that it is confusing, I just can't relate to the characters. The end of the season made me think it was the end of the show, but it seems there will be a fourth.
  • The Walking Dead - A plague, the Governor returning for a while and a lot of the characters leaving or dying. This can be good for the show, as some sort of renewal was desperately needed, but we'll have to see how it goes...
  • Game of Thrones - waiting for the new season. There is a lot of tension about Martin not writing his books fast enough. The show is going to catch up and then what?
  • Copper - Copper got cancelled. It kind of deserved it, though, after a boring and pointless second season.
  • Arrow - I spoke too soon. I kind of like Arrow. I have no good reason for it, but there are a lot of characters, beautiful women, weird magical stuff that has nothing to do with logic and they even added Flash in the series. What I am most happy about, enough to take it from the notwant list, is the return of Manu Bennett, who I noticed in Spartacus as being a very good actor.
  • Elementary - New dynamic. Lucy Liu's character is getting more and more attention as Sherlock himself starts to show all kinds of vulnerabilities and a human side. It works, I think, but they'd better not push it too much.
  • The Tomorrow People (2013) - Ridiculously good looking actors also have superpowers, while being hunted by an evil agency. I am glad the show changed almost everything else in this remake, but the show could use more logic in it.
  • The Legend of Korra - The second season ended in victory for the forces of good, naturally, but a bit better than it started. Korra seems to slowly mature. Really slowly.
  • Rewind - A strange move to cancel this show after its pilot aired. It was a promising one, even if the base concept was a bit too morally unstable.
  • Serangoon Road - The first season ended and I had the feeling of loss and of wanting more that indicates I really liked the show and its characters. There are a lot more facets of the world in it to be explored and I eagerly await the second season.
  • Siberia - Siberia got cancelled and for good reasons, I think. I stopped watching it anyway.
  • Sleepy Hollow - Magic, witches, demons and American history. The concept could have gone in so many ways, all good, but they chose the melodramatic way.
  • The Bridge - Interesting show, but I wonder how they intend to continue it in the second season, since all the major story arcs of the first one got completed.
  • The Originals - I am close to stopping watching it. It feels like Dynasty with vampires.
  • The Psychopath Next Door - I really liked how this started, but it seems the pilot got converted into a movie and that was the end of it. Too bad!
  • Under The Dome - I want to believe that King's story was better than this watered down, incoherent crap. I may stop watching it altogether.
  • Witches of East End - The show got a little darker, but only a little. New characters and connections appear, but not much of a show besides the eye candy.
  • The Last Witch - I liked the first episode, but there was no second. I really want this to be picked up for a series, but I don't know if it didn't or the plan to continue it later or what...

And the new or restarted shows:

  • Marvel's Agents of S.H.I.E.L.D. - A show that combines the idea of superheroes, which seems to stick nicely with the public, with the one about a good government agency: S.H.I.E.L.D. Their purpose: to police the entire world in order to protect it from itself and from contamination by alien technology. I can't possibly subscribe to the mission statement, but the show is decent.
  • Dracula - What started as a ridiculous concept caught up with me. Dracula is no angel, but he fights an even darker agency: the order Draco. Everybody is a bit conspiratorial and over dramatic, but I like the show so far.
  • Killer Women - A US remake of an Argentinian show, it features Tricia Helfer as a Ranger! She uses her womanly powers to fight crime, alas. I really hoped for a better premise. I would predict the show is going to be cancelled pretty quickly, as woman police shows usually are, but I could be wrong, seeing that it is a remake of another show, so at least part of it should stick with audiences.
  • Misfits - Misfits ended. It was a nice show, but a little too pointless. I will still recommend you watch the first seasons.
  • Ripper Street - This show was also cancelled, after just two seasons. Me and an entire bunch of people protested against its cancellation because it was a good show! A bit inconsistent, true, but good actors and a nice starting point. Bring it back, you wankers!
  • Wizards vs Aliens - Yes! The second season of the show features yet again wizards taking a stand against the aliens hungering for magic: the Necross! There is even an episode about the world where magic originates, where the wizard kid and the female Necross (there transformed into a woman) live together and have a child. Is that weird or what? :)
  • Sherlock - The third season of the British reinvention of the Holmes mythos just started. I am going to watch it, but as you may remember, I don't particularly like the way they did it, even if it stars Benedict Cumberbatch.
  • Ghost in the Shell: Arise - I watched the second episode of this Japanese anime series. If you don't know what Ghost in the Shell is, you should start watching it immediately. Arise is just a reinvention of the series, with better graphics and a change in technology and character stories. It doesn't seem to be as poignant as the films or the Stand Alone Complex series, but it may change in the future.
  • A Young Doctor's Notebook and Other Stories - The second season appeared! Just like the first, four episodes of 20 minutes each. This time the humor is almost not present, instead terrible despair. The main character's ... err.. character is so awful and pathetic that even the viewer has to loathe him. His older alter ego is prepared to forgive him, only even he can't! A very good show, with a completely different structure and feel from anything I've seen so far.

There have been a lot of new shows lately, but many of them I just skimmed or downright refused to even try.

Monday, January 06, 2014

The fallacy of "real life"

I am writing this post because I sometimes get fed up with all these self-righteous people who explain to me, condescendingly of course, what "real" means and how important it is compared to what I may be doing, which has a lower value of reality, often approaching zero. I am hearing that texting or using an instant messenger is not carrying a conversation. That love is attention and that I should always focus that attention on one thing or another (mainly on their person, though). I watch too many movies instead of going out to parties, I read books instead of taking walks, I throw myself into an online game or some news item instead of attending to my wife's needs, I stay indoors instead of going out. You see, for these people, going outside the home, physically interacting with other humans with no hope of escape and watching events unfold with your eyes (smelling them with your nose, touching them with your own skin) rather than seeing them on a screen is what is "real". Well, I am here to tell you all: bullshit! There is no such thing as real since the time a brain was invented.

Now, I could be as condescending as these people are and explain to you how neurologically a brain is trying to project the world, as perceived by the senses, so that it can fit in the head and can simulate events before they happen, thus leading to informed decisions. Or I could bore you to death by demonstrating that two people can never ever have access to the same reality. I won't do that, though :) What I will do is just give you some counter-examples that will prove, I hope, that there was never a common reality to begin with and that technology only enables a process that is too old and too human to ever stop.

When I was a child my parents were thinking that going out would be good for me. I, however, wanted to stay indoors and read books. Not on a PDA or on a computer, but on actual paper, the only things that were then available to me in Romania. They would talk to me, you see, ask me to come to lunch, or ask me a question or try to interact with me for some reason or another. I, however, was lost 20000 leagues under the sea or on some alien planet or in some cave, running from a crazed killer. I couldn't hear them. More, I didn't really want to. They could, of course, smack me in the head and that would certainly feel more real than what was in the book, but does that mean it was not real to begin with? And I will have to say that, even if some written scenario was complete fantasy, I was interacting with it, remembering it, training my mind on it, maybe even believing it could be real or that it was real already. The contents of the book were changing my personality and my knowledge and, on any further "real" interactions with other people, changed them a bit, too. It's the same thing as believing the things said in an electoral campaign and then changing your life's course to account for that. At least sci-fi has a small chance to happen!

My point is that the process of losing myself into a parallel world, whether of my own creation or somebody else's, is something that people have been doing for a long time. Technology is not creating this phenomenon, it only enables it.

And then there is the hypocrisy. Some fantasy book is something not real and I should do something that counts, you say, but you don't have the balls to say the same thing to a religious nut who advocates prayer every Sunday (or perhaps a small war). That would be insensitive to their beliefs, you say then. They have the right to lose themselves in a complete fabrication because they are not the only ones. There is a whole pack ready to tear you to bits if you try to stop it. I have news for you! The readers of books may not be a tight knit pack, but their set includes the set of people who read religious books and believe in them, too. The book readers' group is a bigger, if less ferocious, tribe. We are not to be feared, but that doesn't mean you are not insensitive to us.

So now it is easier to watch a movie or a series to become lost in some fantastic universe. It is easier to split communication into small text bits that are sent only when and where you want them. It is a lot easier to imagine you are in a circle of friends, even if you've never actually met most of them. Is that bad? It's like accusing the inventor of writing of making people listen less to other people speak. Don't get me wrong, I am not advocating replacing the old and tried methods of human interaction with technological means; I am instead revolting against attempts to limit the methods I find best for me.

And literary fantasy is not necessarily the only stuff that shapes your thoughts for a while. It can be something acutely technical, like a recipe for cake, or a legal contract, or a video explaining how to do something. None of these are "real", they are just information. Then comes your decision to bake the cake, memorize the recipe or just forget the whole thing. And when has anything you've read in a legal contract ever had anything to do with reality?

I believe that all this propaganda for the concept of reality - itself just a fantasy of the accuser - is used to hide a more brutal thing, one that is harder to accept. I submit to you that when someone prefers to read a book or watch a movie rather than talk to you, it is because you are less interesting. When children prefer to text on their smartphones while ignoring their parents, it's because their parents are boring. When someone prefers indoor activities to outdoor activities, it's because the things you did outside when you were young, the things that made you feel healthy and proud, are becoming less and less relevant. A conversation is two-sided only and continuous only if both participants are incredibly interesting, otherwise there are other options now. Eye contact doesn't communicate the amount and quality of information that makes it worthwhile anymore. And love, the ultimate feeling, the thing that makes the world go round, the stuff of dreams and fairy tales, love just has to be of a certain quality nowadays before it becomes attractive. Reality is boring, it's the low bandwidth information flow of yesterday, the only people living almost exclusively in it are termed savages and peasants and other derogatory terms that you don't want attached to you. Be Zen! Be aware of and absorb everything that is happening to you, instead of choosing the things you want to see and hear and not smell. What pretentious crap!

Learning is now multithreaded, a web of fantasy and fact that just comes at you from all directions and that needs you to determine at every step how reliable, interesting or "real" it is. Other people are just data points and tools to help you achieve goals. Friendship is distributed. Identity is multiple and depends on context. People choose to live in fantasies now, because they can do it more easily and better than before, when they still would have chosen it, but they didn't quite know how. There is an app for everything because we thought of it first, someone created the app and people find the need to use it.

Technology does not ultimately change humanity in unwanted directions because technology has no desires. If humanity changes - or gives technology desires :), it is because it chooses so. It might be a bad choice, but it's a choice nonetheless. And people that find themselves overwhelmed by that choice should refrain from trying to rebrand past as reality.