Saturday, September 13, 2008

Feature selection by clustering

I have blogged before that there is a "symmetry" between data clustering and dimensionality reduction: the former reduces the number of rows, whereas the latter reduces the number of columns in a data table. Thus, we can in principle convert any clustering algorithm into a dimensionality reduction method by transposing the data table, and vice versa.

In this post I am going to demonstrate, with a concrete dataset, how to use the metric sampling algorithm to reduce dimensionality. My sample dataset (PharMine171) contains data for 171 pharmacologically interesting compounds. Each data point is a vector of 1025 binary flags that indicate whether the compound contains particular fragments (i.e. molecular substructures). Thus, the dataset is a 171 x 1025 table with binary values, and its dimension is 1025.

In order to reduce the dimensionality, we first transpose the table to get a 171-dimensional dataset with 1025 data points (each data point now represents a fragment). To get a feeling for the transposed dataset, I have created a 2-dimensional map of it (with VisuMap), as shown in the following picture:

Map of 1025 binary features

The above map clearly shows some clusters among the 1025 data points. To find representative columns in our original dataset, we need to find representative rows in the transposed table. To do so, we apply the metric sampling algorithm (available in VisuMap), which selects a representative subset of data points. I have configured the algorithm to select 25 data points. The following picture shows how the 25 data points are distributed among the 1025 data points:

We can then export the 25 data points as a 25 x 171 table and transpose it to get a 171 x 25 data table. This table is our new dataset with reduced dimension.
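For readers without VisuMap, the whole procedure can be sketched in a few lines of Python. This is not VisuMap's metric sampling algorithm: it is a plain alternating k-medoids with Hamming distance swapped in, and the function names and the random stand-in table are assumptions made for illustration only.

```python
import numpy as np

def hamming_distances(rows):
    """Pairwise Hamming distances between the rows of a 0/1 matrix."""
    rows = rows.astype(np.int64)
    # Count positions where a=1,b=0 plus positions where a=0,b=1.
    return rows @ (1 - rows).T + (1 - rows) @ rows.T

def k_medoids(dist, k, n_iter=100, seed=0):
    """Plain alternating k-medoids on a precomputed distance matrix."""
    rng = np.random.default_rng(seed)
    n = dist.shape[0]
    medoids = rng.choice(n, size=k, replace=False)
    for _ in range(n_iter):
        # Assign every point to its nearest medoid.
        labels = np.argmin(dist[:, medoids], axis=1)
        new_medoids = medoids.copy()
        for j in range(k):
            members = np.flatnonzero(labels == j)
            if members.size == 0:
                continue  # keep the old medoid for an empty cluster
            # The new medoid minimises total distance to its cluster members.
            within = dist[np.ix_(members, members)].sum(axis=1)
            new_medoids[j] = members[np.argmin(within)]
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    return np.sort(medoids)

def select_features(table, k, seed=0):
    """Pick k representative columns of `table` by clustering its transpose."""
    dist = hamming_distances(table.T)   # rows of table.T are features
    cols = k_medoids(dist, k, seed=seed)
    return cols, table[:, cols]

# Random stand-in with the same shape as the dataset in this post.
rng = np.random.default_rng(1)
table = (rng.random((171, 1025)) < 0.2).astype(np.uint8)
cols, reduced = select_features(table, k=25)
print(reduced.shape)  # (171, 25)
```

Because the medoids are actual rows of the transposed table, `cols` directly names columns of the original table, so the reduced table keeps interpretable fragment flags. Identical columns have distance 0 and can lead to duplicate picks in this simple version, so removing duplicate columns first is advisable on real fingerprints.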

Now, how can we tell whether the reduced dataset approximates the original dataset? We can do this pretty easily with VisuMap by creating maps for both datasets. If the two datasets contain about the same information, their maps should be visually similar to each other. The following picture shows the map of the original dataset (on the left side) and of the reduced dataset (on the right side). Both were created with the SMACOF mapping method. We see that the map of the reduced dataset clearly reveals the same three clusters as the original dataset, but it shows fewer details within the clusters.
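As a numeric complement to the visual comparison, one could also correlate the pairwise distances between compounds in the two tables. The sketch below assumes Euclidean distance and a Pearson correlation as the score; neither is taken from VisuMap, and the toy tables are generated just to make the example runnable.

```python
import numpy as np

def pairwise_euclidean(points):
    """Vector of Euclidean distances between all pairs of rows."""
    points = points.astype(float)
    sq = (points ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * points @ points.T
    d2 = np.maximum(d2, 0.0)  # clip tiny negatives from rounding
    i, j = np.triu_indices(len(points), k=1)
    return np.sqrt(d2[i, j])

def distance_agreement(original, reduced):
    """Pearson correlation between the two sets of pairwise distances."""
    a = pairwise_euclidean(original)
    b = pairwise_euclidean(reduced)
    return np.corrcoef(a, b)[0, 1]

# Toy data: 3 groups of 10 rows, each a noisy copy of a group pattern.
rng = np.random.default_rng(0)
centers = rng.integers(0, 2, size=(3, 60))
labels = np.repeat(np.arange(3), 10)
noise = rng.random((30, 60)) < 0.05
original = np.where(noise, 1 - centers[labels], centers[labels])

reduced = original[:, ::4]  # keep every 4th column as a stand-in reduction
score = distance_agreement(original, reduced)
print(f"distance correlation: {score:.2f}")
```

A score near 1 means the reduced table preserves the relative distances between compounds. It says nothing about fine structure within clusters, which is where the two maps above differ.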

Notice that the metric sampling algorithm implemented in VisuMap is a variation of the widely known k-medoids algorithm. The recent release of VisuMap (version 2.6.807) uses a randomization strategy in the optimization process that makes the algorithm much more robust and less prone to the local minimum problem, which has been a big issue for the k-medoids algorithm.

Thursday, September 4, 2008

On similarity metrics for chemical compounds

Recently, Yap Chun Wei posted a dataset on the pharmine blog. The dataset consists of fingerprints of 171 pharmacologically interesting compounds. To recap, the fingerprint of a compound is here a vector of 1025 binary flags; each flag indicates whether a particular molecular fragment is present in the compound. There are many ways to calculate fingerprints. Depending on the nature of the problem, you can use different algorithms or different collections of fragments. The fingerprints in this dataset, for instance, were calculated with OpenBabel.

The dataset contains 3 different groups of compounds: penicillins, cephalosporins and fluoroquinolones. Using VisuMap, Yap created different maps of these 171 compounds which showed, more or less, the cluster structure of the dataset. I personally find that the PCA map provides the best visualization. The following picture is a PCA map I created with VisuMap for this dataset:

Compound Map of 171 compounds

In the above map, the 3 compound groups are displayed as glyphs in 3 different colors. The coloring was done manually with VisuMap based on known information about these groups, although for this dataset you can get almost exactly these 3 clusters using the k-means algorithm provided by VisuMap. The bar diagram in the picture shows how frequently each of the 1025 fragments is present among the 171 selected compounds; a higher bar indicates that a particular fragment is present in more compounds.
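The bar heights are simply column sums of the binary table. A minimal sketch, with a made-up 4 x 5 fingerprint table and a hypothetical helper name:

```python
import numpy as np

def fragment_frequency(fingerprints):
    """Number of compounds (rows) in which each fragment (column) is present."""
    return np.asarray(fingerprints).sum(axis=0)

# Hypothetical toy fingerprints: 4 compounds x 5 fragments.
fingerprints = np.array([
    [1, 0, 1, 0, 1],
    [1, 0, 1, 0, 0],
    [1, 1, 0, 0, 0],
    [1, 1, 0, 0, 1],
])
freq = fragment_frequency(fingerprints)
print(freq)  # [4 2 2 0 2]
```

In this toy table the first fragment occurs in every compound and the fourth in none, so neither helps to tell the compounds apart.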

The above map visualizes the similarity information between the 171 compounds. That means closely located compounds will have similar fragment collections and therefore similar pharmacological properties. The similarity information is basically encoded in the fingerprints. Thus, the method used to calculate those fingerprints is naturally critical for this kind of data analysis.

To better understand those fingerprints, we can create a map of the 1025 features with VisuMap. To do so, we simply transpose the binary data table (via the menu Edit>Filter Data>Transpose Table in VisuMap), so that each binary feature becomes a new data point and each compound becomes a feature in the transposed dataset. We can then pick a mapping algorithm and a metric to create a feature map. The following picture shows such a map created with the t-SNE algorithm and the Tanimoto dissimilarity metric:

Feature map of 1025 binary features

The above picture shows 4 or 5 clear clusters on the left side, represented by colored glyphs. The rest are more or less randomly distributed. It turned out that the yellow-square features are those fragments that are NOT present in any of the 171 compounds (all bits are zero). Therefore, they carry no direct information about our compounds. Interestingly, these zero vectors together form a homogeneous cluster in the map.
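The behaviour of the zero vectors is easier to see from the metric itself. Below is a small sketch of the Tanimoto dissimilarity (one minus the ratio of shared bits to bits set in either vector). The case of two all-zero vectors is 0/0; treating them as identical is an assumed convention here, and whether VisuMap handles it the same way is not known to me.

```python
import numpy as np

def tanimoto_dissimilarity(rows):
    """Pairwise Tanimoto (Jaccard) dissimilarity between rows of a 0/1 matrix."""
    rows = np.asarray(rows, dtype=np.int64)
    both = rows @ rows.T                            # bits set in both vectors
    ones = rows.sum(axis=1)
    either = ones[:, None] + ones[None, :] - both   # bits set in at least one
    # Assumed convention: two all-zero vectors count as identical (similarity 1).
    similarity = np.ones(both.shape, dtype=float)
    np.divide(both, either, out=similarity, where=either > 0)
    return 1.0 - similarity

# Hypothetical feature rows: two all-zero fragments and two informative ones.
features = np.array([
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
])
d = tanimoto_dissimilarity(features)
print(d)
```

With this convention every zero vector is at distance 0 from every other zero vector and at the maximal distance 1 from any non-zero vector, which is consistent with them showing up as one tight, separate cluster.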

The other clusters in the above map represent groups of fragments that occur with high frequency and help to distinguish the three compound groups in the original dataset. We can verify this with the help of the bar diagram in VisuMap as follows:

We first open the feature map in a separate window (via the menu Tools>Map Snapshot). Then we open the compound map, select all compounds and open the bar view through the context menu "Bar View". By default, the bar view displays the frequencies in the order given in the transposed data table. We then sort the bars through the context menu "Sort Values", so that the bars are displayed in order from low frequency to high frequency, as depicted in the following two pictures:

The sorted bar view shows 3 plateaus which correspond to clusters in the feature map and in the compound maps. To see the correspondence, we select a plateau in the bar view with the mouse; the snapshot window of the feature map will automatically highlight the selected features. As we can see in the following picture, the selected plateau of features clearly corresponds to a particular cluster in the feature map (the marked cluster at the lower left corner).

Correspondence between high frequency features and feature clusters
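Outside VisuMap, the plateau selection can be approximated by sorting the frequencies and collecting runs of (nearly) equal values. This is only a sketch: the helper name, the `min_len` and `tol` parameters and the frequency values are all made up for illustration.

```python
import numpy as np

def find_plateaus(freq, min_len=3, tol=0):
    """Runs of near-equal values in the sorted frequencies.

    Returns a list of (value_at_start, feature_indices) pairs, where
    feature_indices refer to columns of the original table.
    """
    freq = np.asarray(freq)
    order = np.argsort(freq, kind="stable")
    sorted_freq = freq[order]
    # A new run starts wherever the jump to the next bar exceeds `tol`.
    breaks = np.flatnonzero(np.diff(sorted_freq) > tol) + 1
    plateaus = []
    for run in np.split(np.arange(len(freq)), breaks):
        if len(run) >= min_len:
            plateaus.append((int(sorted_freq[run[0]]), order[run]))
    return plateaus

# Hypothetical frequencies for 10 fragments.
freq = np.array([0, 60, 0, 171, 60, 171, 0, 60, 12, 171])
for value, members in find_plateaus(freq):
    print(value, members.tolist())
# 0 [0, 2, 6]
# 60 [1, 4, 7]
# 171 [3, 5, 9]
```

Each returned index list plays the role of a mouse selection in the bar view: it names the features whose positions one would then look up in the feature map.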

We notice that some of the clusters in the feature map show some fine sequential structure that may lead to more hints about the internal structure of the fragment collections.

With the knowledge about informative feature clusters we can, for instance, reduce the number of features significantly without significant loss of information about the clusters in the original dataset. The VisuMap dataset folder PharMine171.xvmz (zipped XML file) includes a reduced dataset with 298 features that carries similarity information similar to that of the original dataset with 1025 features.

The above VisuMap dataset folder also contains feature maps created with other mapping algorithms. It is interesting to notice that for the feature map dataset, the t-SNE algorithm provides the best visualization, whereas the result of the PCA algorithm is rather disappointing.