Visualizing High Dimensional Data

Wednesday, August 29, 2012

On Gauge Principal Component Analysis

Principal Component Analysis (PCA) is a widely used method to investigate high dimensional data. Basically, PCA, as a dimensionality reduction method, rotates a data set in the high dimensional space so that it shows most information (i.e. variance) from certain view direction. In this note, I am going to describe a very simple yet effective extension of PCA. The method, which I call Guage PCA (GPCA), works as follows: Instead a of using a single global rotation as in normal PCA, GPCA first decomposes the dataset into multiple clusters, then find a rotation for each clusters; and then compose a global map from the rotated clusters. The following picture illustrates the steps of GPCA:

We can image that above maps illustrate the scenario to take snapshot of a fish tank with 4 different fishes. The top map is a random snapshot in which the fishes face different directions. The middle section are the 4 fishes rotated to show maximal information to the viewer. The last map is a snapshot of the fish tank with all fishes magically rotated, so that each fish shows the maximal information to the viewer. This scenario is sort of an extension of the scenario described in the video "A layman's introduction to principal component analysis".

In certain sense, the relationship of PCA to GPCA is analogous to the relationship between linear regression and linear spline: the former uses a single straight line to approximate a curve, whereas the latter uses multiple line segments. It is obvious that linear spline, as approximation method, is much more powerful than linear regression. The following picture illustrates how linear regression and linear spline approximate a set of points:

A key requirement for linear spline is that the line segments have to be joined together to form a single polyline. Similarly for GPCA, we require that the composed resulting map to preserve variance of the map in major directions. It should be noticed that there are many PCA related approaches under the term localized PCA. Those approaches mostly focus on how to segment the data, but ignore the step to compose a single global map for visualization purpose. In contrast, the composition step in GPCA is the key step. The creation of the initial map and the segmentation of the data is actually not part of the algorithm but just initial conditions.

VisuMap has been supporting GPCA for a while now. In order to use GPCA, we first create a MDS map and cluster the data with any available clustering algorithm of VisuMap; then open the PCA view for a selected cluster and click on the capture button to embed the local PCA map back into to original MDS map. The following video shows the process to create GPCA map for a sample dataset from pharmaceutics :

I have borrowed the term gauge from modern physics in which the gauge principle plays a fundamental role. The gauge principle states that a global system behavior is invariant under local gauge rotation. So for instance, when we calculate orbits of planets in our solar system we don't have to care about the orientation of individual planets. The orientation of planets is an additional freedom that has no impact on structure of orbits. This kind of extra degree of freedom has turned out to be the core structure underlying many laws in modern physics.

Thursday, August 23, 2012

Tracking Attributes in a MDS map.

When using MDS (multidimensional scaling) maps in the practice, a frequently asked question is how does a particular attribute, i.e. a data column in the input table, impact the resulting MDS map? One simple way to visualize the effect of an attribute is just create another MDS map without that attribute under investigation. The difference between the two maps can be then ascribed to that attribute. For instance, the following two maps are MDS maps (created with CCA aglrotihm) of the VisuMap sample dataset yeast.xvm: the fisrt is created with the first 5 attributes; second one with just the first 4 attributes.

The difference between the two maps above is thus caused by leaving out the fifth attribute (i.e. the attribute spo2). We notice that the two major data point clusters (colored as yellow and cyan) are more separated in the first map. Thus, we can see that the attribute spo2 provides significant separation for these two clusters.

With the VisuMap plugin module ClipRecorder we can do a much better job to visualize the effect of a attribute. The newly released ClipRecorder version 1.2 includes scripts that help users to create a sequence of MDS maps with gradually decreasing weight for a selected attributes. The difference between two successive maps provides direction vectors which indicate how data points move when the weight for the selected attributes decreases. The following map is one of such map that visualizes the movement direction of all data points at certain moment during the running process of the script:

In above map, each bi-colored bar represents a data point; the red side of the bar points to the moving direction of a data point and the length of the bar indicates the speed of the movement.

The ClipRecorder plugin also records all the map sequences, so that we can replay them any time later as simple map animation. The following short video clip, for instance, shows such an animation:

Thursday, August 2, 2012

New VisuMap Release: More fun with scripting

We have just released VisuMap version 3.5.882. Worth mentioning in this release are two enhancements to the scripting interface. The first enhancement is the auto-complete editing. We have added script analyzer to the editor so that it will automatically suggest class members with documentation when it notices that user is about to add code to access class methods or properties. The following shows a screen short of the auto-complete editing:

The second enhancement to the scripting interface is the so-called property spinor: within an editor window the user can now double click a property value to link the property to the mouse wheel. When the mouse-wheel is linked to a particular value in a script and when the user spins the mouse wheel, the property spinor will automatically change the linked value and execute the whole script. With this feature VisuMap provides a very generic control mechanism to probe different settings, so that user can see immediately the consequence of changed settings.

The following is short video to demonstrate the usage of property spinor:

Tuesday, July 31, 2012

Ensemble Clustering with VisuMap

A common question by data clustering is: How stable is the clustering results? It often happens that some data points are classified by an algorithm to a common cluster, but when you run the algorithm again on the same data with slightly different initialization conditions these data points might be classified into different clusters. This phenom seems to be a consequence of the nondeterministic behavior of the algorithm, but it is rather a manifestation of the clustering structure among the data. From certain point of view, these unstable data points tells us actually more interesting information about data than those stable data points.

One way to answer the stability question is using the boosting method (also called ensemble clustering) where we aggregate the results of multiple runs of a clustering algorithm. One of the simplest aggregation method is as following: we classify two points to a common cluster (i.e. assign the same color to two points in a map) if all runs classified the two points to a common cluster.

To visualize the ensemble clustering let us consider a sample as by the following picture. Th picture shows how the k-Mean algorithm clustered a data set in three different runs (i.e. with different random initialization.) When we look closely at these three maps, we can notice, as marked in the circles, that the boundary between two clusters varies among the three maps. This means that data points in this region are unstable. But each individual one of the three maps does not tell us this information; and even with all three maps displayed together, it is rather hard to find these unstable points by visual comparisons.

Now, when we aggregate the three maps with the simple aggregation method mentioned above, we get a map as show in the following picture. We can recognize easily that, as circled in 3 regions, that there are 3 unstable regions: data points in these regions have different colors than those major clusters. We also notice here easily that there are some stable boundaries between some clusters.

Although this kind of ensemble clustering is fairly simple to implement with the scripting interface of VisuMap, it lacked a simple friendly user interface for this service. The new release of VisuMap, the version 3.5.881, resolved this problem with a new utility, called cluster manager, that offers the services to capture, store and explore clustering results. And especially, the cluster manager makes it very straightforward to do ensemble clustering. As illustrated in the following screenshot, the user just need to select the clustering results, called named coloring, then choose an appropriate context menu to compose the aggregated clustering: