Friday, April 21, 2017

Deep Data Profile with VisuMap

Data profiling are normally understood as statistical methods to extract numerical features from complex systems for easy exploration. So for instance, GDP, CPI and various kind of indices are often used to profile the state of an economy. Appropriate profiling helps us to compare similar systems; or a system in different development phases. In this note I'll put forward a generic method to profile high dimensional data; the method combines dimensionality reduction algorithms with deep artificial neural networks.

In recent years, many so called  nonlinear dimensionality reduction (NDR) methods have been developed to visualize high dimensional complex data. Those methods often use machine learning algorithm to produce 2D or 3D maps; which provide a kind of graphic profiles about data. For instance the following pictures shows a 2D map made from a data set from flow cytometry study:


Above map was made with the tSNE algorithm from a dataset that contains about 6000 data points each with 12 variables; The colors for the sub-clusters are added with help of the affinity propagation clustering algorithm. The colored map is pretty helpful to discern interesting structure within the data set. Unlike those statistics based profiling, the visual map based profiling does not rely on high level features like GPD ratio; which in general require a good understanding about the data and underlying system.

Nevertheless, those visual map based methods lack the predication capability in the following sense: in the practice, we often have multiple data sets collected about similar systems, or the same system in different phases. For instance, in clinic trials our data sets might be gene expression profiles of patients in different trial stages. In these cases, we are especially interested in differences between the profiles. Most of these NDR methods are however insensitive to small changes, so that it is hard to recognize differences between NDR maps from similar data sets.

To address above mentioned problem, we purpose here the deep data profiling (DDP) procedure as an extension to NDR based profiling as illustrated in the following diagram:

Deep Data Profiling 

The DDP procedure starts with a set similar data sets. As first step we choose a reference data set and apply NDR on the data set to obtain a 2D or 3D map. Many NDR methods are available for this purpose, for this note we recommend the tSNE algorithm, as it is widely available and produces relatively good structured map for a wide range of data. Then, we apply a clustering algorithm on the map to produce labels for the sub-clusters on the map. Those labels are then used as different colors for the sub-clusters, so that we get a colored map as illustrated in above picture.
There are many clustering algorithms suitable for this purpose, for this test we used the affinity propagation algorithm together with some manual adjustment directly performed on the map.

The colored map we obtained from the reference data set represents knowledge we captured from the reference data. As next step we then use a machine learning algorithm to learn such knowledge. In particular, we employ multilayer feed forwards networks to learn the translation from reference data to the colored map. Two networks will be used for this purpose: one to learn the dimensionality reduction function; and the other one to learn the clustering function.

The two trained networks can then be applied to other data sets to produce colored maps as their visual profiles.

A plugin module, called Data Modeling, has been implemented to integrate DDP service with VisuMap version 5.0.928. The plugin module internally uses the TensorFlow engine from Google Inc. to implement feed-forward neural networks. The plugin offers GUI and scripting interface to train and apply models with data directly from VisuMap.  The following screen-cast demonstrates a simple use case of DDP with VisuMap:



Deep data profiling procedure presented here offers a visual profiling technique for high dimensional complex data. Compared to conventional statistics based profiling techniques, DDP is more intuitive and information richer. As an meta-algorithm, DDP can take advantage of new algorithms for NDR, clustering and machine learning. With the rapid development of machine learning technologies, DDR might offer powerful and versatile tools to explore high dimensional complex systems.