Tuesday, July 31, 2012

Ensemble Clustering with VisuMap

A common question in data clustering is: how stable are the clustering results? It often happens that an algorithm assigns some data points to a common cluster, but when you run the algorithm again on the same data with slightly different initial conditions, these data points are assigned to different clusters. This phenomenon seems to be a consequence of the nondeterministic behavior of the algorithm, but it is rather a manifestation of the clustering structure in the data. From a certain point of view, these unstable data points actually tell us more interesting information about the data than the stable ones.

One way to answer the stability question is to use the boosting method (also called ensemble clustering), where we aggregate the results of multiple runs of a clustering algorithm. One of the simplest aggregation methods is as follows: we assign two points to a common cluster (i.e. give them the same color in a map) only if all runs assigned the two points to a common cluster.
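The aggregation rule above can be sketched in a few lines of Python. This is only an illustration, not VisuMap's scripting interface: the function name `aggregate_clusterings` and the representation of each run as a plain list of cluster labels are assumptions made for the example. The idea is that two points agree in every run exactly when their tuples of per-run labels are identical, so each distinct tuple becomes one aggregated cluster. Because only agreement within a run matters, the label numbers do not have to match up between runs.

```python
def aggregate_clusterings(runs):
    """Combine several clusterings of the same points into one.

    runs: list of label lists, one list per run (hypothetical input format).
    Two points receive the same aggregated label only if every run
    placed them in a common cluster.
    """
    if not runs:
        return []
    n = len(runs[0])
    if any(len(run) != n for run in runs):
        raise ValueError("all runs must label the same number of points")

    signature_to_label = {}
    aggregated = []
    # zip(*runs) yields, for each point, its tuple of labels across all runs.
    for signature in zip(*runs):
        if signature not in signature_to_label:
            signature_to_label[signature] = len(signature_to_label)
        aggregated.append(signature_to_label[signature])
    return aggregated


# Three made-up runs over six points; points 2 and 3 sit on a shifting boundary.
runs = [
    [0, 0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1, 1],
    [0, 0, 0, 0, 1, 1],
]
print(aggregate_clusterings(runs))  # [0, 0, 1, 2, 3, 3]
```

In this toy input, the two boundary points each end up in a cluster of their own, while the points on which all runs agree keep a shared label.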

To visualize ensemble clustering, let us consider the sample shown in the following picture. The picture shows how the k-means algorithm clustered a data set in three different runs (i.e. with different random initializations). When we look closely at these three maps, we can notice, as marked by the circles, that the boundary between two clusters varies among the three maps. This means that data points in this region are unstable. But none of the three maps tells us this on its own, and even with all three maps displayed together, it is rather hard to find these unstable points by visual comparison.

Now, when we aggregate the three maps with the simple aggregation method described above, we get a map as shown in the following picture. We can easily recognize the 3 circled unstable regions: data points in these regions have different colors than those of the major clusters. We can also easily see that there are some stable boundaries between some clusters.
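In the picture, the unstable points are recognized by eye because their colors differ from those of the major clusters. A rough programmatic counterpart, offered only as a sketch, is to flag points whose aggregated cluster is small. The function name `flag_unstable` and the size threshold `min_size` are assumptions introduced for this example; they are not part of VisuMap, and a suitable threshold depends on the data.

```python
from collections import Counter


def flag_unstable(aggregated, min_size):
    """Mark points that belong to small aggregated clusters.

    aggregated: list of aggregated cluster labels, one per point.
    min_size:   hypothetical threshold; clusters with fewer members
                are treated as unstable regions.
    Returns a list of booleans, True for points flagged as unstable.
    """
    sizes = Counter(aggregated)
    return [sizes[label] < min_size for label in aggregated]


# Made-up aggregated labels: clusters 1 and 2 each contain a single point.
aggregated = [0, 0, 1, 2, 3, 3]
print(flag_unstable(aggregated, min_size=2))
# [False, False, True, True, False, False]
```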

Although this kind of ensemble clustering is fairly simple to implement with the scripting interface of VisuMap, VisuMap has lacked a simple, user-friendly interface for it. The new release of VisuMap, version 3.5.881, resolves this problem with a new utility, called cluster manager, that offers services to capture, store and explore clustering results. In particular, the cluster manager makes it very straightforward to do ensemble clustering. As illustrated in the following screenshot, the user just needs to select the clustering results, called named colorings, then choose the appropriate context menu command to compose the aggregated clustering:
