Monday, November 4, 2013

On Multidimensional Sorting

In a previous blog post I have talked about a method to convert photos to high dimensional datasets for analysis with MDS methods. This post will address the opposite problem: given a high dimensional dataset, can we convert it to a photo alike 2D image for easy feature recognition by human?

Let's consider our sample dataset S&P 500. A direct approach to create a 2D map is to simply convert the matrix of real values to a matrix of gray-scale pixels. This can be easily done in VisuMap with the heatmap view and the gray-scale spectrum. The following picture shows the S&P 500 dataset in normal line diagram view on the left side and the corresponding heatmap on the right side:


Each curve in the left picture corresponds to the price history of a S&P-500 component stock in a year. We notice heavy overlaps among the curves. On the right side, the heatmap represents the stock prices with row strips with different brightness. In the heatmap, we can spot roughly the two systematical downturns at the two slightly darkened vertical strips. There is however no discernible pattern between the rows, as the rows are just ordered by their stock ticks, so that neighboring relationship don't indicate any relationships between their price history.

One simple way to improve the heatmap is to group stocks with similar price history together. This is the task of most clustering algorithms. We can do this pretty easily in VisuMap. The following heatmap shows the same dataset after we have applied k-Mean clustering algorithm on the rows and re-shoveled the rows according to cluster assignments:


The k-Mean algorithm has grouped the rows into about 8 clusters, we can see that rows within each group represent rows with, more or less, similar values. However, the clusters are randomly ordered, and as well as the rows within a cluster. The clustering algorithm does not provide ordering information for individual data points.

Can we do better job than the clustering algorithm? To answer this question, let's recall what we actually tried to do above: we want to reorder the rows of the heatmap, so that neighboring rows are similar to each other. This kind task is basically also the task of MDS (Multdimensional Scaling) algorithms, which aim to map high dimensional data to low dimensional spaces while preserving similarity. Particularly in our case, the low dimensional space is just the one-dimensional space, whereas MDS in general have been used to create 2 or 3 dimensional maps.

Thus, to improve our heatmap,  we apply a MDS algorithm to our dataset to map it to an one-dimensional space, then use their coordinates in that one-dimensional space to re-order the rows in the heatmap. For this test, I have adapted the RPM mapping (Relational Perspective Map) algorithm to reduce the dimensionality of the dataset from 50 to 1, then used the one-dimensional RPM map to order the rows in the heatmap. The following picture shows a result heatmap:


We notice this heatmap is much smoother than the previous two.  In fact, we can see that the data rows gradually change from bright to dark color, as they go from the top to the bottom. More importantly, we don't see clear cluster structure among the rows, this is in clear contrary to the k-Mean algorithm that indicates 8 or 10 clusters. In this case, it is clear that k-Mean algorithm delivered the wrong artificial cluster information.

Now, a few words about the implementation. The multidimensional sorting has been implemented with RPM algorithm, since RPM allows gradually shrinking the size of the map by starting with 2D or 3D map. This special feature enables RPM to search for optimal 1D map configuration among a large solution space. We could easily adapt other MDS algorithms to perform multidimensional sorting, but we probably will have to restrict solely on one-dimensional space with those algorithms. A few years ago, I have already blogged about this kind of dimensionality reduction by shrinking dimension with some simple illustrations.

Since high dimensional datasets are generally not sequentially ordered, there are in general not an unique way to order the data. What we want to do is just find a solution that preserves as much as possible similarity information with the order information. Thus, an important question arises here: How do we validate the sorting results? How do we compare two such sorting algorithms?  A simple way to validate the sorting result is illustrated in the following picture:
So, in order to test the sorting algorithm we first take simple photo and convert it to high dimensional dataset that represents the gray-scale levels of the photo. We then randomize the order of the rows in the data table. Then, we apply the sorting algorithm on the randomized data table, and check how good the original photo can be recovered. I have done this test for some normal photos, the multidimensonal sorting was able to recover pretty well the original photo (see also the short attached video), as long as enough time is given to the learning process.

We have described an application of MDS algorithm for heatmap by using it to reduce data dimension to a single dimension. We might go a step further to use MDS to reduce data dimension to 0, so that data points will be mapped to a finite discrete points set. In this way, we would have archived the service of clustering algorithms. If this works, clustering algorithms can be considered a special case of MDS algorithms; and MDS algorithms might lead us to group of new clustering algorithms.

The multidimsional sorting service has been implemented in VisuMap version 4.0.895 as integrated service. The following short video shows how to use this service for the sample datasets mentioned above.



A more practical application of MDS sorting is for microarray data analysis where heatmaps are often used to visualize the expression levels of a collection of gens for a collection of samples. Normally, neither the gens nor  the samples are ordered according to their expression similarity, so that those heatmaps often appear rather random (even after applying additional grouping with hierarchical clustering algorithm.) The following picture shows, for instance, a heatmap for expressions of 12000 genes for about 200 samples:


After applying MDS sorting both on the rows and columns of above heatmap, the heatmap looks  as following:


The above heatmap contains theoretically the same expression information. But, just by re-ordering the rows and columns, it shows more discernible structure and information for human perception.


No comments: