Wednesday, March 7, 2012

On Mean Shift and K-Means Clustering

Mean shift and K-Means algorithm are two similar clustering algorithms; both of them extract information from data with some kind of mean vector operations. Whereas the K-Mean algorithm has been widely popular, the mean shift algorithm has found only limited applications (e.g. for image segmentation.) In this note I'll briefly compare these two algorithms and show a way, with VisuMap software, to combine them to get much better clustering tools.

The mean shift clustering algorithm has two main drawbacks. Firstly, the algorithm is pretty calculation intensive; it requires in general O(kN2) operations (which are mainly calculations of Euclidean distance,) where N is the number of data points and k is the number of average iteration steps for each data point. Secondly, the mean shift algorithm relies on sufficient high data density with clear gradient to locate the cluster centers. In particular, the main shift algorithm often fails to find appropriate clusters for so called data outliers, or those data points locating between natural clusters. The second issue normally manifests itself in some small uncertain clusters together with the wanted major clusters.

The K-Means algorithm does not have the above two problems. The K-Means algorithm normally requires only O(kN) operations, so that K-Means algorithm can be applied to relatively large dataset. However, K-Means has two significant limitations. Firstly, K-Means algorithm requires that the number of clusters to be pre-determined. In practise, it is often difficult to find the right cluster number, so that we often just pick a sufficiently large cluster number. This will result in situations that some natural clusters will be represented by multiple clusters fund by the K-Means algorithm. We have discussed this issue in a previous blog entry. Secondly, the K-Means algorithm is in general incapable to find non-convex clusters (as clusters of K-Means algorithm form a voronoi segmentation of the data space.) The second limitation makes the K-Means algorithm inadequate for complex non-linear data.

Fortunately, there is simple way to overcome the mentioned problems of the two algorithms: we simply combine them together. That is, as illustrated in the following picture, we first apply the K-Means algorithm on a dataset; then apply the mean shift algorithm on the cluster centers found by the K-Means algorithm. The mean shift clustering step can be done quickly as  its input data (which are the cluster centers of the K-Mean algorithm ) is much smaller than the original dataset. We don't have to know the right cluster number for the K-Means algorithm: a number that is significantly larger than the number of natural clusters will be good enough for the task. The mean shift step will merge appropriate cluster centers to match the natural clusters. Since the clusters of the K-Means algorithm are normally located in high density sites, the mean shift step will normally not produce small uncertain clusters.



Furthermore, the above combination is also capable to find non-convex clusters. The following video clip demonstrates a such case with the VisuMap software.




The following is a series of maps for a data set from flow cytometry. The data set contains 500 000 data points that has 12 numeric attributes for 500 000 cells. The first map is a typical bi-variate map in flow cytometry analysis; The second map is a PCA map produced with VisuMap; The third map is a PCA map colored with K-Mean and Mean Shift algorithm. We see that the colored PCA map reveals much more details and sub-populations. It took about 2 minutes to produce the colored PCA map, whereas more than 10 hours with mean-shift algorithm alone.