Monday, February 24, 2014

Quickly clustering a large number of items

It just happens that I have two different projects that have the need of cluster analysis, applied in two different ways: one has uses on maps, where a large number of items needs to be displayed quickly, while another implies finding clusters of news items, where the distance between them is determined by their content. The most used clustering algorithm and the first to be found by searching the web is the k-means clustering. Its purpose is to categorize a list of items into a number of k clusters, hence the name. Setting aside the use value of the algorithm for my purposes, the biggest problem I see is the complexity: in its general form it is at least O(n2), and most of the time a lot higher. The net abounds with scientific papers investigating the k-means complexity and suggesting improvements, but they are highly mathematical and I didn't have the time to investigate further. So I just built my own algorithm. It is clearly fuzzy, imperfect, may even be wrong in some situations, but at least it is fast. I will certainly investigate this area more, maybe even try to understand the math behind it and analyse my results based on this research. When I do that I will update this post or write others. But until then, let me present my work so far.

The first problem I had was, as I said, complexity. For one million points on the map, any algorithm that takes into account the distance between any two items will have to make at least one trillion comparisons. So my solution was to limit the number of items by grouping them in a grid:
Step 1: find the min and max on each dimension (that means going through the entire item collection once or knowing beforetime the map bounds)
Step 2: determine a number of cells that would be a bit more than what I need in the end. (that's a decision I have to take, no algorithmic complexity)
Example: for my map example I have only two dimensions: X and Y. I want to display an upper bound of 1000 clusters. Therefore I find the minimum and maximum X and Y and then split each dimension into 100 slots. That means I would cluster the items I have into 10000 cells.
Step 3: for each item, find its cell based on X,Y and add the item to the cell. This is done by simple division: (X-minX)/(maxX-minX). (again that means going once through the collection)
Step 4: find the smallest cell (the complexity is reduced now to working with cells)
Step 5: find its smallest neighbour (the complexity of this on the implementation)
Step 6: merge the two cells
Until the number of cells is larger than the desired number of clusters, repeat from Step 4.
In the end, the algorithm is O(n+p*log(p)), I guess, where p is the number of cells chosen at step 2.

Optimizations are the next issue.
  • How does one find the neighbours of a cell? On Step 3 we also create a list of neighbors for each new cluster by looking for a cluster that is at coordinates immediately above, below, left or right. When we merge two clusters, we get a cluster that is a neighbour to all the neighbours of the merged clusters.
  • How does one quickly get the cluster at a certain position? We create a dictionary that has the coordinates as the key. What about when we merge two clusters? Then the new cluster will be accessible by any of the original cluster keys (that implied that each cluster has a list of keys, as well)
  • How does one find the smallest cell in the cluster list? After Step 3 we sort the cluster list by the number of items they contain and each time we perform a merge we find the first item larger than the merged result and we insert it in the list at that location, so that the list always remains sorted.
  • How do we easily find the first item larger than an item? By employing a divide-et-impera method of splitting the list in two at each step and choosing to look into one bucket based on the item count of the cluster at the middle position

Before you use the code note that there are specific scenarios where this type of clustering would look a bit off, like items in a long line or an empty polygon (the cluster will appear to be in its center). But I needed speed and I got it.


Update: The performance of removing or adding items from a List is very poor, so I created a LinkedList version that seems to be even faster. Here it is. The old List version is at the end

/// <summary>
/// Generic x,y positioned item class
/// </summary>
public class Item
    public double X { get; set; }
    public double Y { get; set; }

public class ClusteringDataProcessor
    /// <summary>
    /// the squared root of the initial cell number (100*100 = 10000)
    /// </summary>
    private const int initialClustersSquareSide = 100;
    /// <summary>
    /// the desired number of resulting clusters
    /// </summary>
    private const int numberOfFinalClusters = 1000;
    private static Random _rnd = new Random();

    /// <summary>
    /// In this implementation, the Cluster inherits from Item, so the result is a list of Item
    /// In the case of one Item Clusters, we actually return the original Item
    /// </summary>
    /// <param name="list"></param>
    /// <returns></returns>
    public List<Item> Process(List<Item> list)
        if (list.Count <= numberOfFinalClusters) return list;
        // find bounds. If already known, this can be provided as parameters
        double minY = double.MaxValue;
        double minX = double.MaxValue;
        double maxY = double.MinValue;
        double maxX = double.MinValue;
        foreach (var item in list)
            var y = item.Y;
            var x = item.X;
            minY = Math.Min(minY, y);
            maxY = Math.Max(maxY, y);
            minX = Math.Min(minX, x);
            maxX = Math.Max(maxX, x);
        // the original list of clusters
        var clusterArr = new List<Cluster>();

        // the dictionary used to index clusters on their position
        var clusterDict = new Dictionary<string, Cluster>();

        // the unit for finding the cell position for the initial clusters
        var qX = (maxX - minX) / initialClustersSquareSide;
        var qY = (maxY - minY) / initialClustersSquareSide;

        foreach (var item in list)
            // compute cell coordinates (integer and the values used as keys in the dictionary)
            var cx = Math.Min((int)((item.X - minX) / qX), initialClustersSquareSide - 1);
            var cy = Math.Min((int)((item.Y - minY) / qY), initialClustersSquareSide - 1);
            var key = getKey(cx, cy);
            Cluster cluster;
            // if the cluster for this position does not exist, create it
            if (!clusterDict.TryGetValue(key, out cluster))
                cluster = new Cluster
                    Keys = new List<string> { key },
                    X = minX + cx * qX + qX / 2,
                    Y = minY + cy * qY + qY / 2,
                    //Items = new List<Item>(),
                    Count = 0,
                    Neighbors = new List<string>()
                // the neighbours of this cluster are the existing clusters that are below, above, left or right. If they exist, this cluster also is added to their neighbour list
                var nkeys = new[] { getKey(cx - 1, cy), getKey(cx + 1, cy), getKey(cx, cy - 1), getKey(cx, cy - 1) };
                for (var j = 0; j < 4; j++)
                    Cluster nc;
                    if (clusterDict.TryGetValue(nkeys[j], out nc))
                clusterDict[key] = cluster;
            // add the item to the cluster (note that the commented lines hold the items in a list, while the current implementation only remember the number of items)
            cluster.Item = item;
            // add the item position to the sums, so that we can compute the final position of the cluster at the Finalize stage without enumerating the items (or having to hold them in an Items list)
            cluster.SumX += item.X;
            cluster.SumY += item.Y;
        // if the number of items is already smaller than the desired number of clusters, just return the clusters
        if (clusterArr.Count <= numberOfFinalClusters)
            return clusterArr.Select(c => c.Finalize()).ToList();

        // sort the cluster list so we can efficiently find the smallest cluster
        //clusterArr.Sort(new Comparison<Cluster>((c1, c2) => c1.Items.Count.CompareTo(c2.Items.Count)));
        LinkedList<Cluster> clusterLinkedList = new LinkedList<Cluster>(clusterArr.OrderBy(c => c.Count));

        // remember last merged cluster, as next merged clusters might have similar sizes
        var lastCount = int.MaxValue;
        LinkedListNode<Cluster> lastLinkedNode = null;

        // do this until we get to the desired number of clusters
        while (clusterLinkedList.Count > numberOfFinalClusters)
            // we need to get the smallest (so first) cluster that has any neighbours
            var cluster1 = clusterLinkedList.First(c => c.Neighbors.Any());
            Cluster cluster2 = null;
            // then find the smallest neighbour
            var min = int.MaxValue;
            foreach (var nkey in cluster1.Neighbors)
                var n = clusterDict[nkey];
                //var l = n.Items.Count;
                var l = n.Count;
                if (l < min)
                    min = l;
                    cluster2 = n;
            // join the clusters
            var keys = cluster1.Keys.Union(cluster2.Keys).ToList();
            var cluster = new Cluster
                Keys = keys,
                // approximate cluster position, not needed
                //X = (cluster1.X + cluster2.X) / 2,
                //Y = (cluster1.Y + cluster2.Y) / 2,

                // the result holds the count of both clusters
                //Items = cluster1.Items.Union(cluster2.Items).ToList(),
                Count = cluster1.Count + cluster2.Count,
                // the neighbors are in the union of their neighbours that does not contain themselves
                Neighbors = cluster1.Neighbors.Union(cluster2.Neighbors)
                // compute the sums for the final position
                SumX = cluster1.SumX + cluster2.SumX,
                SumY = cluster1.SumY + cluster2.SumY
            foreach (var key in keys)
                clusterDict[key] = cluster;
            // efficiently remove clusters since LinkedList removals are fast

            // a little bit of magic to make the finding of the insertion point faster (LinkedLists go through the entire list to find an item)
            // if the last merged cluster is smaller or equal to the new merged cluster, then start searching from it.
            // this halves the insert time operation, but I am sure there are even better implementations, just didn't think it's so important
            LinkedListNode<Cluster> start;
            if (lastCount <= cluster.Count && lastLinkedNode.Value != cluster1 && lastLinkedNode.Value != cluster2)
                start = lastLinkedNode;
                start = clusterLinkedList.First;
            var insertionPoint = nextOrDefault(clusterLinkedList, start, c => c.Count >= cluster.Count);
            // remember last merged cluster
            LinkedListNode<Cluster> link;
            if (insertionPoint == null)
                link = clusterLinkedList.AddLast(cluster);
                link = clusterLinkedList.AddBefore(insertionPoint, cluster);
            lastLinkedNode = link;
            lastCount = cluster.Count;
        return clusterLinkedList.Select(c => c.Finalize()).ToList();

    private LinkedListNode<T> nextOrDefault<T>(LinkedList<T> list, LinkedListNode<T> start, Func<T, bool> condition)
        while (start.Next != null)
            if (condition(start.Value)) return start;
            start = start.Next;
        return null;

    private string getKey(int cx, int cy)
        return cx + ":" + cy;

    private class Cluster : Item
        public Cluster()
            SumX = 0;
            SumY = 0;

        public double SumX { get; set; }
        public double SumY { get; set; }

        public List<string> Keys { get; set; }

        //public List<Item> Items { get; set; }
        public int Count { get; set; }
        public Item Item { get; set; }

        public List<string> Neighbors { get; set; }

        /// <summary>
        /// the function that finalizes the computation of the values or even decides to return the only item in the cluster
        /// </summary>
        /// <returns></returns>
        public Item Finalize()
            //if (Items.Count == 1) return Items[0];
            if (Count == 1) return Item;
            /*Y = SumY / Items.Count;
            X = SumX / Items.Count;
            Count = Items.Count;*/
            Y = SumY / Count;
            X = SumX / Count;
            Count = Count;
            return this;

old List based code (click to show)