Monday, October 29, 2007

New paper comparing CCA and manifold learning algorithms

The journal Information Visualization has recently published a very interesting paper by J. Venna and S. Kaski, titled "Comparison of visualization methods for an atlas of gene expression data sets".

The paper compares the performance of many algorithms for mapping various kinds of data sets. The algorithms considered include PCA, LLE, Laplacian Eigenmap, Isomap and CCA. The comparison is done with trustworthiness and continuity diagrams, which, like the Shepard diagram, visualize the discrepancies between the input and output distance matrices.
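For readers who have not met these measures, here is a minimal Python sketch of the usual rank-based definitions of trustworthiness and continuity for a neighborhood size k. It is an illustration only, not code from the paper: the function names are made up for this post, Euclidean distances are assumed in both spaces, and the normalization used here assumes k < n/2.

```python
import numpy as np

def rank_matrix(X):
    """ranks[i, j] = rank of point j among the neighbors of point i
    (1 = nearest neighbor, 0 = the point itself), using Euclidean distance."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, -1.0)  # make sure each point sorts first in its own row
    order = np.argsort(d, axis=1, kind="stable")
    n = len(X)
    ranks = np.empty((n, n), dtype=int)
    ranks[np.arange(n)[:, None], order] = np.arange(n)[None, :]
    return ranks

def _rank_score(ranks_a, ranks_b, k):
    """1 minus the normalized penalty for points that are k-neighbors
    under ranking b but not under ranking a (penalized by their rank in a)."""
    n = ranks_a.shape[0]
    in_b = (ranks_b >= 1) & (ranks_b <= k)
    not_in_a = ranks_a > k
    penalty = np.sum((ranks_a - k)[in_b & not_in_a])
    return 1.0 - 2.0 * penalty / (n * k * (2 * n - 3 * k - 1))

def trustworthiness(X_in, X_out, k):
    # Penalizes points that look like neighbors on the map but are not in the data.
    return _rank_score(rank_matrix(X_in), rank_matrix(X_out), k)

def continuity(X_in, X_out, k):
    # Penalizes points that are neighbors in the data but are torn apart on the map.
    return _rank_score(rank_matrix(X_out), rank_matrix(X_in), k)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(40, 5))   # hypothetical input data
    Y = X[:, :2]                   # stand-in "map": keep the first two coordinates
    for k in (1, 5, 10):
        print(k, trustworthiness(X, Y, k), continuity(X, Y, k))
```

Plotting the two values against k gives one curve per measure for the whole map. Both measures equal 1 when the k-neighborhoods are preserved exactly.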

The advantage of the trustworthiness and continuity diagrams over the Shepard diagram is that they aggregate the discrepancy information uniformly over all data points, so that you get a single curve showing the quality of a map. With a Shepard diagram you get a curve for each data point. Thus, trustworthiness/continuity diagrams are much easier to apply in practice. On the other hand, the Shepard diagram provides more detailed information that allows users, for instance, to investigate mapping quality with respect to individual data points (not just the whole map).
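To make the per-point view concrete, here is a small Python sketch of the raw data behind a Shepard diagram: pairs of input distance and output distance, either for all point pairs or for the pairs involving one chosen data point. The helper name `shepard_points` is hypothetical, and Euclidean distance is an assumption on my side; other distance measures work the same way.

```python
import numpy as np

def pairwise_distances(X):
    """Full Euclidean distance matrix of the rows of X."""
    return np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)

def shepard_points(X_in, X_out, i=None):
    """Return (input distances, output distances) to plot against each other.

    i=None: all unordered point pairs (the diagram for the whole map).
    i=int:  only the pairs involving data point i (the per-point view).
    """
    d_in = pairwise_distances(X_in)
    d_out = pairwise_distances(X_out)
    n = len(X_in)
    if i is None:
        upper = np.triu_indices(n, k=1)  # each pair once, no self-pairs
        return d_in[upper], d_out[upper]
    others = np.arange(n) != i
    return d_in[i, others], d_out[i, others]

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = rng.normal(size=(10, 4))   # hypothetical input data
    Y = X[:, :2]                   # stand-in "map"
    din, dout = shepard_points(X, Y, i=3)
    # Points far from the diagonal din == dout are distorted relative to point 3.
    print(np.column_stack([din, dout]))
```

A perfect distance-preserving map would put every pair on the diagonal; looking at one point at a time shows where on the map the distortion is concentrated.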

Although those diagrams provide objective measures of mapping quality, I think they should be used with care. They may not always reflect the subjective mapping quality perceived by humans, and the ultimate goal should be helping people, not machines. Blindly trusting these numbers might discourage the development of new, useful algorithms. One main problem with these diagrams is, for instance, that they have no concept of partition. Algorithms (like RPM) which simplify data by partitioning (in addition to dimensionality reduction) are heavily penalized. Partition, as a perception method, is probably as fundamental as focusing-by-proximity.

A main message of this paper is that the CCA algorithm clearly and significantly outperformed the other algorithms based on explicit unfolding. Our experience supports this assessment. We have not encountered a single data set on which CCA performed noticeably worse than algorithms like LLE and Isomap. Sammon map and PCA cannot be compared directly with CCA, as they preserve long-distance information and visualize the overall structure of the data set (instead of unfolding non-linear structure).
