Saturday, September 13, 2008

Feature selection by clustering

I have blogged before that there is a "symmetry" between data clustering and dimensionality reduction: whereas the former reduces the number of rows; the latter reduces the number of columns in a data table. Thus, we can theoretically convert any clustering algorithm to a dimensionality reduction method by transposing the data table, and vise verse.

In this blog I am going to demonstrate how to use the metric sampling algorithm to reduce dimensionality with a concrete dataset. My sample dataset (PharMine171) contains data for 171 pharmacologically interesting compounds. Each data point contains 1025 binary vectors which indicate whether the compound contains particular fragments (ie. molecular sub-structurs). Thus, the dataset is a 171 x 1025 table with binary values and its dimension is 1025.

In order the reduce the dimensionality we first transpose the table to get a 171 dimensional dataset with 1025 data points (each data point now represents a fragment). In order to get feeling about the transposed dataset I have created 2 dimensional map for it (with VisuMap) as be shown in the following picture:

Map of 1025 binary features

Above map shows clearly some clusters among the 1025 data points. In order to find representative columns in our original dataset, we need to find representative rows in the new data table. In order to do so, we apply the metric sampling algorithm (available in VisuMap) that selects representative subset of data points. I have configured the algorithm to select 25 data points. The following picture shows how the 25 data points are distributed among the 1025 data points:

We can then export the 25 data points as a 25 x 171 table, and transpose the table to get an 171 x 25 data table. This table is then our new dataset with reduced dimension.

Now, how can we tell that the reduced dataset approaches the original dataset? We can do this pretty easily with VisuMap by creating maps for both datasets. If the two datasets contain about the same information, their maps should be visually similar to each other. The following picture shows the map of the original dataset (on the left side) and of the reduced dataset (on the right side). They are both created with the SMACOF mapping method. We see that the map of the reduced dataset reveals clearly the three clusters as the original dataset, but it shows less details within the clusters.

Notice that the metric sampling algorithm implemented in VisuMap is a variation of the widely known k-medoid algorithm. The recent release of VisuMap (version 2.6.807) utilized a randomization strategy in the optimization process that made the algorithm much more robust and less prone to local minimum problem; which has been a big issue for the k-medoid algorithm.