Visualizing High Dimensional Data

Thursday, February 21, 2008

VisuMap 2.6 Released

We released VisuMap 2.6 today. This release includes some significant internal architecture changes that may simplify future development.

We considered implementing the DirectX-based code with WPF or XNA, but both libraries turned out to be inadequate for our purposes. WPF does not support sprites, whereas XNA requires shader 1.1 as a minimum, which is too much to ask of our customers.

Apart from many incremental improvements, we have implemented a new data view for high dimensional data, called the mountain view. This feature can be considered a 3D extension of the existing value diagram. The mountain view also uses DirectX for fast navigation. The following is a snapshot of the mountain view window:

Tuesday, January 8, 2008

P. Dirac on the geometric and algebraic ways

Recently, while searching for information about projective space, I came across an interesting transcript of a talk by P. Dirac titled "Projective Geometry, Origin of Quantum Equations".
Dirac gave the talk in 1972. He seemed to be addressing a general audience, so the talk was rather informal.

In his talk, Dirac briefly described projective geometry and argued that it is more appropriate than Euclidean geometry as a mathematical structure for quantum theory. I was not able to fully understand the link between projective geometry and quantum theory, but I believe his view is fundamentally correct for theoretical physics.

What interested me most were his philosophical comments about geometry and algebra. In mathematical work you will, according to Dirac, prefer either the algebraic way or the geometrical way. The algebraic way is more deductive: you start with equations or axioms and follow rules to reach interesting results. The geometrical way is more inductive: you start with some concrete pictures and try to find relationships among them.

Dirac put himself in the geometrical camp. But he lamented that his work appears rather algebraic (e.g. lots of equations) although he used a lot of geometrical methods. The reason is that it was pretty awkward to produce and print pictures in published papers. Writing equations was simply much easier for him (and probably for most other theorists of his time) than drawing pictures. So it was a technical limitation that led to the heavily algebraic appearance of his work.

That technical limitation seems to be disappearing with the availability of modern computers. I wonder what impact it would have had if modern computers had been available to those great scientists. On the other hand, in the current academic world the algebraic way still seems to dominate the stage: a paper with a lot of differential equations would be considered more scientific than a paper with a lot of pictures.

The mission statement of VisuMap Technologies is to unleash human visualization power for complex high dimensional data. It is indeed our grandiose ambition to revive visualization as a first-rank tool for doing science, a tool that has somehow lost its glory since René Descartes invented his coordinate system. To achieve this we need not only new software, new methodologies and new theories, but also people who embrace a visual way of doing scientific research. So there is still a long way to go!

I am pleased that Dirac considered Euclidean space inadequate for quantum theory, since I have felt for a long time that open Euclidean space is not appropriate for the generic study of high dimensional data. For instance, concepts like left/right and center/periphery, which make sense in Euclidean space, cannot be applied intuitively to high dimensional data. For this reason, I have described RPM (relational perspective map) as MDS on closed manifolds. RPM basically simulates a non-stationary dynamic system on a closed manifold. Interestingly, modern physics offers some useful tools for doing that. I have looked into several low dimensional manifolds as our image space (such as the sphere, the torus and the real projective space). Our long term plan is to evolve the image space toward more expressive structures for visualizing high dimensional data.
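To illustrate why a closed manifold has no privileged center or periphery, here is a minimal Python sketch of the shortest distance between two points on a flat torus. It is not the RPM algorithm itself; the function name `torus_distance` and the `width`/`height` parameters are made up for this illustration.

```python
import math

def torus_distance(p, q, width=1.0, height=1.0):
    """Shortest distance between 2D points p and q on a flat torus.

    Hypothetical helper for illustration: the torus is modeled as a
    width x height rectangle whose opposite edges are glued together.
    """
    dx = abs(p[0] - q[0]) % width
    dy = abs(p[1] - q[1]) % height
    # Going "around the edge" may be shorter than going straight across.
    dx = min(dx, width - dx)
    dy = min(dy, height - dy)
    return math.hypot(dx, dy)

# Two points near opposite edges of the rectangle are close on the torus.
a, b = (0.05, 0.5), (0.95, 0.5)
flat = math.hypot(a[0] - b[0], a[1] - b[1])   # distance in the open rectangle
wrapped = torus_distance(a, b)                # distance on the closed torus
```

In the open rectangle these two points sit on opposite sides, while on the torus they are neighbors. Every point on the torus sees the same kind of surroundings, so no point is more central than any other.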

Tuesday, October 30, 2007

Transformation between rectangle and torus

Today I found a nice video clip on YouTube that visualizes the transformation between a rectangle and the surface of a torus. The torus surface is the base information space used by the relational perspective map (RPM). This is the main characteristic that distinguishes RPM from conventional mapping methods, which use open Euclidean space as the image space.
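The gluing shown in such a clip can also be written down as the standard parametrization of a torus in 3D. The following Python sketch maps a point of the rectangle to the torus surface; the function name and the radii `R` and `r` are arbitrary choices for this illustration and are not taken from RPM.

```python
import math

def rectangle_to_torus(u, v, width=1.0, height=1.0, R=2.0, r=1.0):
    """Map a point (u, v) of a width x height rectangle onto a torus in 3D.

    R is the distance from the torus center to the middle of the tube and
    r is the tube radius (both are illustrative values).
    """
    theta = 2.0 * math.pi * u / width    # angle around the central axis
    phi = 2.0 * math.pi * v / height     # angle around the tube
    x = (R + r * math.cos(phi)) * math.cos(theta)
    y = (R + r * math.cos(phi)) * math.sin(theta)
    z = r * math.sin(phi)
    return (x, y, z)

# Points on opposite edges of the rectangle land on the same torus point.
left = rectangle_to_torus(0.0, 0.3)
right = rectangle_to_torus(1.0, 0.3)
```

Because both angles are periodic, the left and right edges of the rectangle coincide on the torus, and so do the top and bottom edges. This is why a map drawn on the rectangle should be read as wrapping around in both directions.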

There are also good clips showing the construction of the Klein bottle, but I could not find a good clip for the construction of the real projective space. In order to visualize the projective space and other more complex 2D and 3D manifolds, we probably need to partition the objects in some intuitive way.

Monday, October 29, 2007

New paper comparing CCA and manifold learning algorithms

The journal Information Visualization has recently published a very interesting paper by J. Venna and S. Kaski with the title "Comparison of Visualization methods for an atlas of gene expression data sets".

The paper compares the performance of many algorithms for mapping various kinds of data sets. The algorithms considered include PCA, LLE, Laplacian Eigenmap, Isomap and CCA. The comparison is done with trustworthiness and continuity diagrams which, like the Shepard diagram, visualize the discrepancies between the input and output distance matrices.

The advantage of the trustworthiness and continuity diagrams over the Shepard diagram is that they aggregate the discrepancy information uniformly over all data points, so that you get a single curve showing the quality of a map. With a Shepard diagram you get a curve for each data point. Thus, trustworthiness and continuity diagrams are much easier to apply in practice. On the other hand, the Shepard diagram provides more detailed information that allows users, for instance, to investigate mapping quality with respect to individual data points (not just the whole map).

Although those diagrams provide objective measures of mapping quality, I think they should be used with care. They may not always reflect the subjective mapping quality perceived by humans, and the ultimate goal should be helping people, not machines. Blindly trusting these numbers might discourage the development of new, useful algorithms. One main problem with these diagrams is, for instance, that they have no concept of partition. Algorithms (like RPM) that simplify data by partitioning, in addition to dimensionality reduction, are heavily penalized. Partition, as a perception method, is probably as fundamental as focusing by proximity.

A main message of this paper is that the CCA algorithm clearly and significantly outperformed the other algorithms based on explicit unfolding. Our experience supports this assessment. We have not encountered a single data set on which CCA performed noticeably worse than algorithms like LLE and Isomap. Sammon map and PCA cannot be compared directly with CCA, as they preserve long distance information and visualize the overall structure of the data set (instead of unfolding non-linear structure).