Wednesday, August 29, 2012

On Gauge Principal Component Analysis

Principal Component Analysis (PCA) is a widely used method to investigate high dimensional data. Basically, PCA, as a dimensionality reduction method, rotates a data set in the high dimensional space so that it shows most information (i.e. variance) from certain view direction. In this note, I am going to describe a very simple yet effective extension of PCA. The method, which I call Guage PCA  (GPCA), works as follows: Instead a of using a single global rotation as in normal PCA,  GPCA first decomposes the dataset into multiple clusters, then find a rotation for each clusters; and then compose a global map from the rotated clusters. The following picture illustrates the steps of GPCA:


We can image that above maps illustrate the scenario to take snapshot of a fish tank with 4 different fishes.  The top map is a random snapshot in which the fishes face different directions. The middle section  are the 4 fishes rotated to show maximal information to the viewer. The last map is a snapshot of the fish tank with all fishes magically rotated, so that each fish shows the maximal information to the viewer. This scenario is sort of an extension of the scenario described in the video "A layman's introduction to principal component analysis".

In certain sense, the relationship of PCA to GPCA is analogous to the relationship between linear regression and linear spline: the former uses a single straight line to approximate a curve, whereas the latter uses multiple line segments. It is obvious that linear spline, as approximation method, is much more powerful than linear regression. The following picture illustrates how linear regression and linear spline approximate a set of points:



A key requirement for linear spline is that the line segments have to be joined together to form a single polyline. Similarly for GPCA, we require that the composed resulting map to preserve variance of the map in major directions. It should be noticed that there are many PCA related approaches under the term localized PCA. Those approaches mostly focus on how to segment the data, but ignore the step to compose a single global map for visualization purpose. In contrast, the composition step in GPCA is the key step. The creation of the initial map and the segmentation of the data is actually not part of the algorithm but just initial conditions.

VisuMap has been supporting GPCA for a while now. In order to use GPCA, we first create a MDS map and cluster the data with any available clustering algorithm of VisuMap; then open the PCA view for a selected cluster and click on the capture button to embed the local PCA map back into to original MDS map. The following video shows the process to create GPCA map for a sample dataset from pharmaceutics :



I have borrowed the term gauge from modern physics in which the gauge principle plays a fundamental role. The gauge principle states that a global system behavior is invariant under local gauge rotation. So for instance, when we calculate orbits of planets in our solar system we don't have to care about the orientation of individual planets. The orientation of planets is  an additional freedom that has no impact on structure of orbits. This kind of extra degree of freedom has turned out to be the core structure underlying many laws in modern physics.


No comments: