First, a few words about the implementations of these algorithms. MS is basically a variation of the k-means algorithm that uses the expectation-maximization (EM) method to solve the k-medoid problem. Such algorithms are sometimes called k-center or k-medoids algorithms. Simple, direct implementations of the k-medoids algorithm often find only poor solutions (i.e., local minima), so one normally has to repeat the algorithm many times to find an acceptable solution. The MS algorithm implemented in VisuMap uses a stochastic mechanism to avoid, or at least reduce, this local-minimum problem.
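To make the local-minimum issue concrete, here is a minimal k-medoids sketch in Python with random restarts. This is an illustration only, not VisuMap's actual MS implementation; the function names are mine:

```python
import numpy as np

def k_medoids(D, k, n_iter=100, seed=None):
    """Naive alternating k-medoids on a precomputed distance matrix D (n x n).
    Returns the indices of the k exemplars. This scheme easily gets stuck
    in local minima, which is why it is usually restarted many times."""
    rng = np.random.default_rng(seed)
    n = D.shape[0]
    medoids = rng.choice(n, size=k, replace=False)
    for _ in range(n_iter):
        # Assignment step: attach every point to its nearest medoid.
        labels = np.argmin(D[:, medoids], axis=1)
        # Update step: within each cluster, pick the member that
        # minimizes the total distance to all other members.
        new_medoids = medoids.copy()
        for j in range(k):
            members = np.where(labels == j)[0]
            if len(members) > 0:
                within = D[np.ix_(members, members)].sum(axis=1)
                new_medoids[j] = members[np.argmin(within)]
        if np.array_equal(new_medoids, medoids):
            break  # converged to a (possibly local) minimum
        medoids = new_medoids
    return medoids

def best_of_restarts(D, k, restarts=50):
    """Repeat k-medoids from random initializations and keep the best run."""
    best, best_err = None, np.inf
    for r in range(restarts):
        m = k_medoids(D, k, seed=r)
        err = D[:, m].min(axis=1).mean()  # average distance to nearest exemplar
        if err < best_err:
            best, best_err = m, err
    return best, best_err
```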
AP uses a message-passing mechanism to find representative exemplars within a dataset that carries a similarity structure. AP is basically a deterministic algorithm: it requires no random initialization and no randomness during the optimization process. However, AP often suffers from an oscillation problem, in the sense that the algorithm wanders around several local minima for quite a long time. To reduce the oscillation we can use a large damping factor, as suggested in the original AP paper. A more effective approach, however, seems to be adding a small amount of randomness when assigning preferences to the data points. By default, the AP implementation in VisuMap enables this randomness.
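For readers who want to experiment with this idea, damping plus slightly jittered preferences can be tried with scikit-learn's AffinityPropagation. This is a hedged sketch of the approach described above, not VisuMap's internal implementation:

```python
import numpy as np
from sklearn.cluster import AffinityPropagation

def run_ap(S, damping=0.9, jitter=1e-4, seed=0):
    """Run AP on a precomputed similarity matrix S (e.g. negative squared
    distances). The damping factor tames oscillation; the small random
    jitter on the preferences breaks ties between equally good exemplars."""
    rng = np.random.default_rng(seed)
    base = np.median(S)  # the customary default preference
    pref = base * (1.0 + jitter * rng.standard_normal(S.shape[0]))
    ap = AffinityPropagation(affinity="precomputed", damping=damping,
                             preference=pref, random_state=seed)
    ap.fit(S)
    return ap.cluster_centers_indices_  # indices of the chosen exemplars
```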
Thus, both MS and AP in VisuMap are stochastic in nature. I applied each of the two algorithms 500 times to the SP500 dataset (each data point contains the weekly price history of one stock over a year) and obtained altogether 1000 different clustering solutions (each solution being a selection of 25 exemplar points). Using the t-SNE algorithm I created a map of these 1000 solutions, as shown in the following picture:
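One way to build such a map (my own encoding, not necessarily how VisuMap does it) is to represent each solution as a binary indicator vector over the 500 stocks, marking its 25 exemplars, and feed those vectors to t-SNE:

```python
import numpy as np
from sklearn.manifold import TSNE

def map_solutions(solutions, n_points=500, seed=0):
    """Embed clustering solutions in 2D. `solutions` is a list of exemplar
    index sets (25 indices each); each solution becomes a 0/1 vector over
    the n_points data points, compared via the Hamming metric."""
    X = np.zeros((len(solutions), n_points))
    for i, exemplars in enumerate(solutions):
        X[i, list(exemplars)] = 1.0
    return TSNE(n_components=2, metric="hamming",
                random_state=seed).fit_transform(X)
```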
In the above map each red/green dot represents a solution found by the AP/MS algorithm, respectively. As a special characteristic of the t-SNE algorithm, identical solutions are arranged in one or more circles. That means the dots in the rectangle in the above map actually represent identical solutions.
We notice first that the two algorithms found very different solutions, as the red and green dots are clearly separated from each other; there are only two small regions (marked by ovals) where the two algorithms found similar solutions. Secondly, we notice that MS found a large number of different solutions, while AP found just a few. This means that the AP optimization process often converged to the same solutions.
The following table shows some statistics about the solutions found by AP and MS:
| | Metric Sampling | Affinity Propagation |
|---|---|---|
| Min Error | 0.4972 | 0.4901 |
| Max Error | 0.5058 | 0.5035 |
| Average Error | 0.5004 | 0.4960 |
| Process Time (seconds) | 5 | 4 |
In the above table the error of a solution is the average distance of the data points to their cluster centers (i.e., the exemplars). The average error is calculated over the 500 solutions found by each of the two algorithms. We can see that AP found, on average, better solutions than MS, but the improvement is only about 0.9% (0.4960 vs. 0.5004 average error). So it is fair to say that the two algorithms performed similarly on this dataset.
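For reference, this error measure can be computed from a precomputed distance matrix in a couple of lines (a sketch under my own naming conventions):

```python
import numpy as np

def solution_error(D, exemplars):
    """Average distance from each data point to its nearest exemplar,
    given a precomputed n x n distance matrix D."""
    return D[:, list(exemplars)].min(axis=1).mean()
```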
The situation changed considerably when I applied the two algorithms to a different dataset, namely the PharMine171 dataset discussed in a previous blog entry. This dataset contains about 1000 binary 171-dimensional vectors, whereas the SP500 dataset contains 500 real-valued 40-dimensional vectors. The following shows the map of the clustering solutions and the corresponding statistics:
| | Metric Sampling | Affinity Propagation |
|---|---|---|
| Min Error | 2.1775 | 2.1101 |
| Max Error | 2.2321 | 2.1686 |
| Average Error | 2.1972 | 2.1603 |
| Process Time (seconds) | 6 | 30 |
From the above map we can see that AP and MS still produced different sets of solutions. AP also produced more distinct solutions than it did for the previous dataset. This means that, on this dataset, AP is more sensitive to small variations or noise in the input data. The phenomenon might be explained as follows: because this dataset contains binary vectors, many similarities between data points have exactly equal values, and these ties are broken differently by small variations in the input data. Since AP uses operations like max() that distinguish even infinitesimal differences, small variations may lead to significantly different solutions.
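A quick sanity check with synthetic binary data (not the actual PharMine171 set) illustrates how common such exact ties are:

```python
import numpy as np
from scipy.spatial.distance import pdist

# Distances between 171-dimensional binary vectors can take at most 172
# distinct values, so exact ties among the ~500,000 pairs are unavoidable.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(1000, 171)).astype(float)
d = pdist(X, metric="euclidean")
print("distinct distance values:", np.unique(np.round(d, 9)).size,
      "out of", d.size, "pairs")
```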
We also notice that AP is substantially slower than MS, and for larger datasets the performance difference grows even larger. For instance, on a dataset with about 4000 data points, MS is more than 100 times faster than AP. It should be pointed out that both the AP and the MS implementations use the full similarity matrix; it remains to be seen how much speedup can be achieved by using a sparse matrix.
More interestingly, we see that the solutions found by AP are, on average, about 1.7% better than those found by MS; indeed, even the worst solution found by AP is better than the best solution found by MS.
Now, one might ask: does AP always find better solutions than MS? Unfortunately, this is not the case. The following picture shows a 3D presentation of the spheroid dataset, which contains 1825 data points that together form a spheroid:
As shown in the following test statistics, the solutions found by MS are on average about 3.7% better than those found by AP:
| | Metric Sampling | Affinity Propagation |
|---|---|---|
| Min Error | 5.9029 | 5.9029 |
| Max Error | 6.4034 | 6.6483 |
| Average Error | 6.0948 | 6.3199 |
| Process Time (seconds) | 9 | 284 |
Conclusions:
I believe AP is a useful and promising addition to the arsenal of clustering algorithms. Not only can AP find better solutions, it also finds solutions different from those of existing algorithms. As we know, there are many different kinds of clustering problems. AP does not seem to be designed to attack the k-medoid problem directly, yet it is competitive with algorithms that directly target it. This suggests that AP may have the potential to attack a much larger class of clustering problems.