Tuesday, November 29, 2011

K-Mean clustering under no-optimal conditions

K-Mean algorithm (KMA) is a widely used clustering algorithm for multidimensional data. In this note, I'll show some intuitive behaviors of KMA with a very simple quantization task: Grouping a series of numeric values into multiple groups. An intuitive requirement for this task is: if some values are concentrated at a location, those values should be grouped together.

I have generated 5 datasets, each with 5000 values between -5.0 and +20.0. Data points in the 5 datasets are successively concentrated at 1 to 5 locations according to Gaussian distribution. I used KMA to cluster these data into 3 clusters. I ran KMA on each dataset twice; the result is shown in the following picture:

In above picture, each spectrum shows the result of one run of KMA on a dataset. Each bar in the spectrum indicates a data point in a dataset. The curve on the spectrum shows the density of the data points. The bar colors indicate the clusters created by the KMA. We notice that all bars are colored by 3 different colors as the number of clusters, K, used KMA is 3.

Let N be the natural clusters present in the dataset (i.e. N=1,2,3,4,5 for these 5 datasets,) it would be helpful to distinguish the behaviors of KMA in to the following three cases:

(1) N < K : In this case the KMA has to group fewer natural clusters into larger number of clusters. As be shown in spectrum 1.a, 1.b, 2.a and 2.b, KMA assigned more than one clusters (colors) to a natural cluster, but didn't assign one to multiple natural clusters. For instance, in spectrum 2.a, the first natural cluster has been split evenly into two clusters.

(2) N = K : In this case the KMA has to group exact N natural clusters in to N clusters. We see in spectrum 3.a and 3.b that KMA did a nice job in this case. KMA acutally found the natural clusters.

(3) N > K : KMA has to group more natural clusters into fewer clusters. As be shown in spectrum 4.a, 4.b, 5.a and 5.b, KMA, in this case, grouped multiple natural clusters together. KMA has not, at least not significantly, split natural clusters. How KMA groups the natural clusters can change from run to run, depending on the initialization of the algorithm. For instance, spectrum 4.a and 4.b show two different grouping created by two different KMA runs.

In above test, our datasets all have clearly separated natural clusters. I would say that KMA did a pretty good job: its results reflect pretty much our intuitive expectation. However, if the natural cluster are not so clearly separated, the result of KMA would not be such nice. The following picture shows how KMA groups less separated natural clusters:

In above picture, the 5 datasets are created similarly as before except that these natural clusters overlap each other somewhat. We see that, only for the case K=N, KMA found the exact natural clusters. For other causes, KMA does not respect the natural clusters. Actually, we can visually do a better job based on the density curves.

No comments: