Visualizing High Dimensional Data: 2009

Saturday, October 24, 2009

New productivity features in VisuMap 3.0

I have just posted a short video that shows some new productivity features implemented in VisuMap version 3.0.

Wednesday, September 23, 2009

A layman's introduction to principal component analysis.

I have just uploaded a very short introduction for PCA to youtube. The video is probably the shortest introduction to PCA (1.5min). This introduction emphasizes the geometrical aspects, instead of the usual statistical nature. I especially like the statement that the covariance matrix is just a method to measure the average extend of an object (a point cloud) along any axis.

Monday, August 31, 2009

VisuMap 3.0 Released

Today we have released VisuMap 3.0. This is a major upgrade from the previous version 2.7.

The main change in this release is the implementation of the Atlas service. An atlas in VisuMap 3.0 is frame-less window that enables users to organize and compose data views of different types. The following picture shows the snapshot of a sample atlas:

Atlas service in VisuMap is thus similar as dashboard service in some other visualization centric software applications. It allows users to group visual information according to the subject instead of information type itself. And more importantly for VisuMap users, atlas service enables users to combine multiple algorithms to generate and compose heterogeneously structured information visualization. For instance, this release includes a script (PartitionAtlas.js) that automatically partitions a dataset in several clusters, and generates various maps for each cluster.

Together with this release we have extensive updated the scripting interface. Unfortunately, some old scripts may need minor adjustments to convert to this new release. We have also extended the data format extensively to accommodate the new service. However, the data format has kept compatible in the sense that old dataset files can be loaded into new version, but datasets of new format can not be read by previous versions.

Click the following image to watch a short video that shows how to create an atlas interactively:

Here is another video that shows how a script uses kmean algorithm to partition a dataset and generate different data views for the partitions.

This video shows using pre-configured SOG cluster to create mutiple data views for a dataset.

Wednesday, July 1, 2009

Data Clustering with Self-Organizing Graph

We have just released VisuMap version 2.7.832. In this release we have added a very interesting clustering service called the self-organizing graph (SOG). SOG is basically an extension to the self-ogranizing map (SOM) that simulates a homogeneous artificial neural network. Unlike SOM, SOG simulates a network of arbitrary structure. The network can be in fact any weighted undirected graph. The network is a kind of parameter for the algorithm: the user defines the network depending on his/her knowledge or interest about the dataset.

The following picture illustrates an application of SOG. On the left side is a map of a dataset that shows roughly two data point clouds. The right side shows the network we used to classify the dataset. During the learning process each node of the network will become associated with some data points. After the learning process has finished each data point will be classified as their corresponding nodes in the SOG network.

We see that the SOG algorithm not only correctly classified the two data point clouds, but also captured those data points located a the peripheries of the clouds (data points shown as blue triangles).

The following is a short video tutorial for the SOG clustering service in VisuMap:

Comparing with traditional clustering algorithms, like k-Mean and hierarchical clustering algorithms, SOG provides more flexible clustering service. With SOG the user can not only specify the number of clusters, but also the size and other geometric or topological relationships between the clusters. In fact, SOG enables the user to specify a model to reflect existing knowledge and interests about the data.

There are many variants of the self-organizing map that use irregular networks. Those algorithms often employ adaptive strategies to generate the irregular networks depending on the data. Those irregular network are results of those algorithms. Contrary to those algorithms SOG uses the network as input for the algorithm, the user can express his/her interest and knowledge about the data through the network.

SOG is our first attempt to provide a new kind of clustering service. While its framework resembles to provide general pattern matching capability. The learning algorithm it employed has only limited capability to capture structural information. In view of this there are still many challenges and potentials for SOG algorithm in the future.

Friday, March 13, 2009

Visualizing the real projecitve plane.

I have just uploaded a video clip to YouTube that shows some ways to explorer the real projective plane with VisuMap. The link to this video is included here:

This dataset contains 14,358 data points (4-dimensional vectors) sampled from the surface of RP2 embedded into R⁴. The analytical formula is given in the video clip. A smaller sample dataset in VisuMap format can be downloaded at KleinBottle4D that also contains the dataset for the Klein-Bottle surface.

It is interesting to notice that RPM algorithm flattens the 2D sphere by cutting it into two equal sized half; but it flattens the real projective plane in the way that the whole plane is kept in a single connected region.

Monday, March 9, 2009

Projective and Injective Exploration Methods

In a recent blog post Terence Tao has put mathematical approaches to study a space X (which can be a space, a manifold etc.) into two fundamental camps:

1. By looking into a map f: Y->X that maps certain (better known) space Y into X.

2. By looking into a map f: X->Y that maps X into certain (better known) space Y.

I think that these statements not only apply to mathematical studies, they also provide useful framework for the exploration of high dimensional data. In our case, the unknown space X becomes a high dimensional dataset that we wish to understand. The space Y is usually a lower dimensional space that we know better and is normally easy to visualize.

For the sake of simplicity I would like to refer to methods from the above two camps as injective and projective methods respectively. That means, injective methods inject known space into the target space, whereas the projective methods project the target space into a better known space.

Following this point of view, algorithms to visualize high dimensional data can be grouped as injective or projective. Most MDS (multidiemsnional scaling) based visualization algorithms are projective. Algorithm like Sammon map and PCA (principal component analysis) map the high dimensional dataset into a low dimensional Euclidean space. The RPM algorithm (relational perspective map) maps high dimensional to flat torus.

On the other side, most clustering algorithms can be considered as injective algorithm. For instance, the method of vector quantization maps a finite discrete set into the target space, whereas the Kohonen net (or Self-organizing map) maps a discrete set with a mesh topology into the target space. The hierarchical clustering algorithms (like the agglomerative clustering algorithm) basically map binary tree structures into the target space.

One major benefit of this classification of algorithms is that it can guide us to apply existing algorithms to new data, in the sense that algorithms from the same camp might be able to share some methods/concepts/tricks. For instance, we might be apply an optimization method used by one clustering algorithm to another injective algorithm.

Furthermore, as be pointed out in Tao's blog post, there is certain "duality" between injective and projective methods. That means, for a given method in one camp you can sometimes find its corresponding counter part in the other camp. This duality can then help us to transfer a method from camp to anther one. As an example for this "duality", the RPM algorithm is basically a dual algorithm of the self-organizing map: they are in different camp, but are dual to each there as they share the flat torus as the image space (i.e. the space Y mentioned at the beginning of this post).

Similar to the duality concept I have blogged before about the "symmetry" between these two camps in more restricted situations, namely when the datasets are multidimensional vector spaces (e.g. data tables). In these cases, we can often transfer a method from one camp to the another one in a straightforward way (e.g. by transposing the data matrix).

The injective and projective algorithm are realized in VisuMap by clustering and mapping services, which form the two main groups of services in the VisuMap application.

Thursday, March 5, 2009

Dimension analysis on mulitdimensional scaling

This post brings forward an inconsistency problem in the study of multidimensional scaling from the perspective of dimensional analysis. This post is rather of philosophical nature. Nevertheless, I find it is worthwhile being aware of this issue
for further theoretical investigations.

Dimensional analysis is a simple and useful rule to check the consistence of a mathematical formula or equation. Roughly speaking, this rule demands that both sides of an equation (that has any physical or chemical bearings) must have the same dimension, or unit of measurement. So for instance the equation E = mC² is consistent with the dimensional analysis, since both sides of the equation have Energy as dimension. The equation E=mC is not consistent, since the right side would have momentum as dimension.

Similarly, if a formula is composed of multiple terms by addition, then all terms must have the same dimension. So for instance, the formula: S:=π·r² + r²/2 makes sense, but not S:=π·r² + r/2.

Now, let's turn to a problem of multidimensional scaling (MDS) from the perspective of the dimensional analysis. Roughly speaking, MDS maps data points from an abstract space with a distance function to a low dimensional, normally, Euclidean space like R². The goal of a MDS algorithm is to construct the map in a way, so that the distance in the low dimensional space approximates the distance in the original abstract space.

More formally, let D_ij be the distance between two data points in the original space, d_ij be the distance low dimensional space after MDS mapping; Then a MDS algorithm tries to minimize the following quantity (often called stress):

S := Σ ( D_ij - d_ij)²

Now, from the view point of dimensional analysis the stress function in its above form is not consistent, since D_ij and d_ij have arguably different dimensions as they measure distance in different spaces.

One work-around to solve this inconsistency issue is to add a special constant or apply a monotonous function on d_ij like:

S := Σ ( D_ij - C·d_ij)² or S := Σ ( D_ij - f(d_ij))²

The constant C or the function f should magically convert the dimension of d_ij to that of D_ij. This kind of ad-hoc techniques look pretty artificial. Most scientists probably won't like this method.

It should be noticed that the RPM (relational perspecitve map) algorithm avoided this inconsistency by using the following stress function:

S := Σ D_ij/d_ij

The dimensional analysis rule does not prohibit composing terms of different dimensions by multiplication. In above formula, both sides of the equation have a completely new dimension, that is the dimension of D_ij divided by the dimension of d_ij.

Thursday, January 15, 2009

Experiements with kernel PCA

The newly released VisuMap software version 2.7 has added the kernel PCA (kPCA) service. From the user's perspective kPCA basically provides the same service as MDS methods, namely maping an abstract set of data points with a similarity distance to a multidimensional vector space, so that the Euclidean distance in the dimensional space somehow reflect the original similarity distance. Thus, the input for kPCA is a dataset with a similarity distance (or matrix), the output is a table of numerical values with one row vector for each data point.

As there are many effective analysis methods operating on dimensional vector space (like k-mean clustering, support vector machine), kPCA enables us, like MDS, to apply effective methods to a broader range of data.

In order see how kPCA works, I created a dataset with about 3000 data points which form together a sphere in the 3D space. With the Gaussian kernel (a way to measure similarity between data points) kPCA of VisuMap created a 50 dimensional datasets within about 100 seconds. That means, kPCA mapped a 3 D dataset into the 50 D space. The first 3 dimensions in the 50 D space basically mirror the original 3 dimensions. In order to see how other dimensions look like, I fixed the x- and y-axises of a 3D view window to the first two dimensions, then assign the z-axis to other dimensions one after the other while rotating the 3D view window. Then following small video clip shows how the spherical dataset looks like with those extra dimensions:

The first few seconds of the video shows how the first 3 dimension re-constructs the original sphere. Then, each time when the z-axis switched to another dimension, the sphere turned in to another geometrical shape. I am not sure how those geometrical shape get formed and how to take advantage of those new dimensions. But, those dimensions surely look interesting and a little mysterious.

Visualizing High Dimensional Data