Wednesday, August 13, 2008

New mapping algorithm t-SNE

Recently I spent some time to experiment with the t-SNE mapping algorithm proposed by L. Maaten and G. Hinton in Visualizing Data Using t-SNE. The algorithm is a variation of the SNE algorithm developed few years ago.

The t-SNE home site provides source code, technical papers and many samples, so that it is pretty easy to test the algorithm. The matlab source code is provided as a component in a matlab toolbox that also implements many other popular methods for dimensionality reduction. The web site also provides a binary executable that is optimized for the Intel/Windows platform. It took me a while to figure out its binary data format, but the time spent is worthwhile, as it runs much faster than the matlab version.

As demonstrated in its technical paper, t-SNE has the nice capability to show the cluster structure of high dimensional datasets in low dimensional maps. More interestingly, I noticed that t-SNE is able to segment dataset with complex structure in a natural way to preserve relevant information from certain perspective. For instance, the left map in the following picture shows a dataset that forms two crossing squares in 3D space, the map on the right side is a 2D map created with t-SNE. We notice that t-SNE segmented the connected shape into 5 clusters. The segmentation occurred roughly along the center line where the two squares crossing each other.

In general, I think there are two major challenges for mapping algorithms intended to visualize high dimensional complex data. The first one, I call it the singularity problem, is present in the above sample. The dataset is mostly a 2D surface except the center region where the two square cross each other. In order to map this dataset into 2D map in non-overlapping manner, an algorithm has to find a non-trivial way to segment the dataset along the singularity region. t-SNE did a good job here compared to other algorithms like CCA (curvilinear component analysis) which segments the dataset rather randomly.

The second challenge, I call it the closedness problem, occurs when the dataset forms certain kind of closed manifold in high dimensional space, like sphere, torus (e.g. surfaces of 3D objects). In this case, a mapping algorithm needs to have a mechanism to cut-off the closed structure underlying the dataset. The t-SNE algorithm behaviors also pretty well in addressing this kind of problems. The following video clip (produced with VisuMap 2.6.800) shows how t-SNE gradually deforms two overlapping spheres to a 2D map.

We notice that t-SNE first captures the global structure of dataset (like the Sammon algorithm or PCA), then deforms the map locally to achieve non-overlapping mappings. This kind of dynamical behavior is quite similar as the CCA algorithm.

The implementation of t-SNE is pretty straightforward, and it can be easily tuned for a wide range of data. Even with its current simple implementation, some of its features already surpassed many other popular mapping algorithms. So, I believe it will become a valuable tool for analyzing high dimensional data.