Thursday, December 16, 2010

Programming Collective Intelligence by Toby Segaran

Book coverProgramming Collective Intelligence is easy to read, small but concise, and its only major flaw is the title; and that is because it is misleading. The book touches quite heavily on using collective information and social site APIs, but what it is really about is data mining. It may not be a flaw with the majority of readers, but personally I wouldn't care about the collective, the Facebook API or anything like that, but I was really interested in the different ways to analyse data. In that sense, this book can be taken as a reference guide on data mining.

Each algorithm and idea is accompanied by Python sources. I personally dislike Python as a language, but the author afirms he chose it intentionally because the algorithms look clear and the source is small, with its purpose unhindred by many language artefacts. The book was so interesting, though, that I plan (if I ever find the time :( ) to take all the examples and do them in C#, then place them on Codeplex.

The book covers classification and feature extraction, supervised and unsupervised algorithms, filtering and discovery and it also has exercises at the end of each chapter. Here is a short list:
  • Making Recommendations - about the way one can use data from user preferences in order to create recommendations. Distance metrics and finding similar items to the ones we like or people with similar tastes.
  • Discovering Groups - about classifying data into different groups. Supervised and unsupervised methods are described, hierarchical clustering, dendograms, column clustering, K-Means clustering and diferent methods of visualisation.
  • Searching and Ranking - it basically explains step by step how to make a search engine. Word frequency, word distance, location of a document, counting methods, artifical neural networks, the Google PageRank algorithm, extraction of information from link text, and learning from user clicks can be found in this chapter.
  • Optimization - simulated annealing, hill climbing, genetic algorithms are described and exampled here. The chapter talks about optimizing problems like travel schedules and the example uses data from Kayak.
  • Document Filtering - a chapter about filtering documents based on preferences or getting rid of spam. You can find here Bayesian filtering and the Fisher method.
  • Decision Trees - a very interesting method of splitting information items into groups that have a hierarchical connection between them. The examples use the Zillow API
  • Bulding Price Models - k-Nearest neighbours, weighted neighbours, scaling.
  • Advanced Classification - Kernel Methods and Support Vector Machines. This is a great chapter and it show some pretty cool uses of data mining using the Facebook API
  • Finding Independent Features - reviews Bayesian classification and clustering, then proposes Non-Negative Matrix Factorisation, a method invented circa the late 90s, a powerful algorithm which uses matrix algebra to find features in a data set
  • Evolving Intelligence - bingo! Genetic Programming made easy. Really cool.
  • Algorithm Summary, Third Party Libraries and Mathematical Formulas - if you had any doubts you can use this book as a data mining reference book, the last three chapters eliminate them. An even more concise summary of the methods explained in the book, listing every math formula and obscure library used in the book

Conclusion: I really loved the book and I can hardly wait to take it apart with a computer in hand.