Tuesday, October 21, 2008

Clustering clusters

Many clustering algorithms, like k-mean algorithm, are of stochastic nature in the sense that they produce different partitions (of a data set) depending on the random initialization. In this blog I am going to use the mapping service provided by VisuMap to investigate the partitions generated by a particular clustering algorithm. In other word, I want to visualize the clusters among partitions generated by a clustering algorithm.

The clustering algorithm I am going to look at is the metrical sampling (MS) clustering algorithm which is basically a k-mean algorithm with some randomization extension. For a given set of data points with a distance matrix, MS algorithm tries to find a set of sample points, so that minimizes the average error, that is average distance of the data points to their nearest sample point (also called exemplar). Thus, MS algorithm is one of many available algorithms to solve the so called k-medoid clustering problem (which is an NP complete problem according to certain literature).

For a dataset of N data points, MS algorithm produces a N dimensional binary vectors. Each component is a binary value that decides whether a data point is selected as a sample point. So, when we run the MS algorithm many times with different random initializations, we will get a list of N dimensional binary vectors. I am interested to see clusters or patterns among those binary vectors. For this study, I picked up the dataset PharMine171 that contains 1025 fingerprints of 171 pharmacologically interesting compounds. I have run MS algorithm on the dataset for 1000 times. The following is a map of the 1000 partitions ( produced with the t-SNE mapping algorithm):


In above map, each spot represents a partition generated by one run of MS algorithm. Closely located partitions means they share more sample points. The coloring is done manually to highlight the clusters. We can easily see that there are 8 or 9 clusters among the 1000 partitions.

Now, one question that raises naturally is: Do the partitions in one cluster have similar qualities (i.e. average error)? In order to answer this questions I opened a bar view window that displays the average error of each partition in an ordered bar view as follows:

In above picture, each bar (in yellow or red color) represents a partition. The height of the bar indicates their average error. The red bars represent those partitions selected manually on the partition maps. So, when we select a cluster in the partition map, we can easily see the corresponding distribution of average error of that clusters. The following two pictures show two such distributions:




We notice that the average error in a cluster is spread all over the whole error range (from very bad to very good). This means, similar partitions can have very different qualities. Thus, the answer to above questions is negative.

More in general, I suppose this study indicates that any algorithm trying to find good solutions for the k-medoid problem need a mechanism to allow global variations in the searching space. The MS algorithm implemented in VisuMap does this through controlled randomness. Another interesting algorithm, named affinity propagation seems to do that with help of 2N2 addition system states (called messages). Affinity propagation is also supported in VisuMap. I will do some comparison between MS and AP in another blog.

No comments: