Tuesday, October 30, 2007

Transformation between rectangle and torus

I have found today a nice video clip from YouTube that visualizes the transformation between rectangle and torus surface. Torus surface is the base information space used by relational perspective map (RPM). This is main characteristics of RPM that distinguish RPM from other conventional mapping methods which use open Euclidean space as image space.

There is also good clips showing the construction of Klein-Bottle. But I could not find a good clip for the construction of real projective space. In order to visualize projective space and other more complex 2/3-D manifolds we probably need to partition the objects in some intuitive way.

Monday, October 29, 2007

New paper comparing CCA and Manifold learning arlgorithms

The journal Information Visualization has recently published a very interesting paper of J. Venna and S. Kaski with the title "Comparison of Visualization methods for an atlas of gene expression data sets".

This paper has compared the performance of many algorithms for mapping various kind of data sets. Algorithms considered include: PCA, LLE, Laplacian Eigenmap, Isomap and CCA. The performance comparison are done with diagram of trustworthiness and continuity which visualize, like Shepard diagram, the discrepancies between input- and output-distance matrices.

The advantage of the trustworthiness and continuity diagrams over the Shepard diagram is that they aggregate the discrepancy information uniformly over all data points, so that you get a single curve to show the quality of a map. With Shepard diagram you get a curve for each data point. Thus, trustworthiness/continuity diagrams are much easier to apply in practice. On the other side, the Shepard diagram provides more detailed information that allows, for instance, users to investigate mapping quality with respect to individual data points (not just the whole map).

Whereas those diagrams provide objective measures for mapping quality, I think they should be used with care. They may not always reflect the subjective mapping quality perceived by human, and the ultimately goal should be helping people not machines. Blindly trusting these numbers might discourage development of new useful algorithms. One main problem with these diagrams is, for instance, they don't have the concept of partition. Algorithms (like RPM) which simplify data by partitioning (apart from dimensionality reduction) are greatly penalized. Partition, as a perception method, is probably as fundamentally as focusing-by-proximity.

A main message of this paper is that CCA algorithm clearly and significantly out-performed other algorithms based on explicit unfolding. Our experience supports this assessment. We have not encountered a single data set that CCA performed noticeably worse than algorithms like LLE and Isomap. Sammon map and PCA cannot be compared directly with CCA as they preserve long distance information and visualize the over-all structure of the data set (instead of unfolding no-linear structure).

Tuesday, October 16, 2007

Symmetry between dimensionality reduction and data clustering

Dimensionality reduction and data clustering are two main types of services offered by VisuMap to support visual exploration of high dimensional data. Although there was no explicit plan in the initial design of VisuMap to develop these two types of services in parallel, they have evolved nicely together in the software architecture.

We might argue that the reason for the nice coexistence are because they are both indispensable to explore high dimensional no-linear data. But there is another more profound reason for this: namely, they are conceptually symmetry to each other.

In order to see this "symmetry" let us consider a high dimensional dataset as a data table with rows and columns. The number of columns is normally also considered as the dimension of the data. A dimensionality reduction method basically tries to reduce the number of columns without losing much relevant information. On the other side, a clustering algorithm tries to group similar rows together so that the clusters preserve as much as possible information. In other words, the purpose of clustering algorithm is to reduce the number rows by replacing them with clusters.

In terms of "symmetry" we can say, dimensionality reduction algorithms reduce the number of columns whereas clustering algorithms reduce the number rows.

So, what does this symmetry brings us? One application of this symmetry is that we can transfer any clustering algorithm to a dimensionality reduction method (and vice versa) by transposing the data table as a matrix. For instance, we can apply a clustering method on the columns of dataset to get three clusters and use the centroids of the clusters as 3D coordinates for the rows. By doing so we reduce the dataset's dimension to 3.