Friday, June 28, 2019

Representation Learning with Deep Neural Network

Representation Learning for Algebraic Relationships


This note puts forward a simple framework to learn data representation with deep neural networks. The framework is inspired by mathematical representation theory where we look for vector representations for elements of algebraic structure like groups.

To describe the framework, let's assume that we have the following table for an abstract operation:

Table 1: Binary operation table for abstract items a, b, c and d.

Above table shows that the abstract entities a, b, c and d fulfill an abstract operation ⊕, for instance: aa = 0;  bc = 3; etc..  We now try to find representations for a, b, c and d with two dimensional vectors. In order to do so, we start with a default zero-representation as in following table:

a:(0, 0)
b:(0, 0)
c:(0, 0)
d:(0, 0)
Table 2: Initial representation table for a, b, c, d as 2 dimensional zero vectors.

That means a, b, c and d are all represented by vectors (0, 0). We then construct a feed-forwards neural networks as illustrated as following:


The network has four input nodes to accept a pair of 2-dimensional vectors; and one output node to approximate the operation ⊕.  Like conventional feed-forward neural network, the input and output nodes will get data from the operation and representation table before each optimization step. Unlike the conventional feed-forward network, the input nodes are configured as trainable variables which will be adjusted during the learning process; and their values will be saved back to the representation table after each learning step. More particularly, the learning process goes as follows:
  1. Randomly select a sample from the operation table, say b⊕c = 3.
  2. Lookup the representation for a and b from the representation table. Assign them to input nodes of the network. Lookup the expected output for b⊕c from the operation table, and pass it to the output node as learning target.
  3. Run the optimization algorithm, e.g. the ADAM optimization algorithm, for one step, which will change the state of input nodes to new values b'0, b'1, c'0 and c'1
  4. Save  b'0, b'1, c'0 and c'1 back to the representation table as new representation for b and c.
  5. Continue with step 1.
With an appropriate choices for the network and training algorithm, the above training process will eventually converge to a representation for a,b,c,d, so that the network realizes the abstract operation as described in table 1. A tensorflow based implementation of above training process is available at
Z4GroupLearning, which can be run on the google cloud computing platform from a browser.

Representation Learning for Transformations

In previous example the table 1 actually describes the finite group ℤ4, representation are linked to the input layer.  In general, the framework is not restricted to binary operations and the representation vectors can be linked to any state vectors, i.e. weights, bias and other trainable variables, of the network.

For the demonstration we describe here a network that finds representation for transformations in the 3D space. Let's assume that X is a 1000x3 table consisting of 1000 points from a  sphere in the 3-dimensional space. Let {T0, T1, ...} be a sequences of 3D-to-3D transformations which transform X to {Y0, Y1, ...}. Assuming that X and  {Y0, Y1, ...} are known, we can then find representations for the transformations {T0, T1, ...} with a deep neural network as illustrated as following:

The transformations are linked to some internal part of the weight matrix. Before starting the training process, the transformations are initialized with zero-matrices with the same shape as the linked weight matrix. At each training epoch, a 3-tuple (X, TtYt) is randomly selected; Tt is pushed to the linked weight matrix; then the optimization process is run to learn the map X-to-Yt mapping; After the optimization process has run for several loops for all data points in X and Yt, the linked weight matrix is pulled back in to the representation Tt, stored as T't for future epochs. The training process then continues with other randomly selected 3-tuples, until all  Yt have sufficiently approximated by the network.

As an example, we have selected 30 linear translations and 30 rotations to transform the sphere in the 3D space. The following video clip shows the transformed images Yt:



After the training process has completed after about 100 epochs, we have got 60 matrices as representation for the transformations. We have embedded the 60 matrices into the 3-dimensional space with PCA (principal component analysis) algorithm. The following picture shows the embedding of the 60 transformations:


In above picture, each dot represents a transformation: a red dot corresponds to a linear translation, a yellow dot to rotation. We see that the representations generated by the deep neural network correspond very well the geometry structure of the transformations.

Representation Learning for Dimensionality Reduction

Representation learning (RL) described here can be used to perform non-linear dimensionality reduction in a straightforward way. So for example, in order to reduce a m dimensional dataset to k dimension, we can simply take a feedforward network with k input nodes and m output nodes. As depicted in the following diagram, the input nodes are configured as learning nodes whose state values get values from the training data and adjusted during the learning process.

We notice that autoendcoder (AE) networks has often be used in various ways for dimensionality reduction and unsupervised learning. Comparing to AE, RL basically performs just decoding service. However, the essential difference is that the AE trains a network, whereas RL trains directly the lower dimentional representation.

The following maps shows an example of mapping a 3-dimenional sphere dataset to the 2 dimensional plane. An implementation with tensorflow on colab platform can be found here.


We notice that 2-dimenional map preserved large part of the local topology, while unfolding and flattening the 3D-sphere structure.

If encoding service (like that in AE) is needed, we can extend the above RL network with an encoding network that maps the network output to the lower dimensional data. This encoding network can be then trained independently from the decoding part, or concurrently with the decoding network part.

Application for Data Imputation in Epigenomics

Epigenomic study shows that different cell types of multi-cellular organisms developed from a common genome by selectively express different genes. Many types of assays have been developed to probe those epi-genomic manipulations. The result of such an assay is normally a sequence of numerical values profiling certain biochemical properties along the whole genome strand. With a set of assays and a set of cell types we can then obtain a profile matrix as depicted as following:

Table 3: Profile matrix for cell and essay types.

A large profile matrix about various cell types can help us to understand the behaviors of those cells. However, it is currently rather expensive to conduct essays in large numbers. To alleviate such limitations, researchers have tried to build models based on available data, so that they can impute profiles for news cell types or essay types (ChromImputeAVOCADO, PREDICTD, Encode Imputation Challenge,)

We see that table 3 has similar structure as table 1, thus we can model table 3 similarly as table 1.  We notice here two main difference between the two tables: Firstly, table 3 is only sparsely defined as only limited number of experiments have been done. Secondly, an element in table 3 is a whole sequence (which may be potentially billions of numerical values.), whereas an element in table 1 contains a single value. It is obvious that the training algorithm for table 1 can handle the data sparsity problem without any change. In order to obtain representation for each cell and assay type, we only have to make sure that each row and each column contains some defined profile data.

In order to address the high dimensional data issue, we extend our neural network as depicted as following:


Figure 1: Neural network for cell- and assay-type representation learning.

The training algorithm runs roughly as following:

  1. Randomly select a profile from table 3, say a profile P(A, b) for cell-type A and assay type b. 
  2. Segment the profile into sections of equal length N; and arrange the sections into a matrix PA,b that has L rows and N columns.
  3. Lookup the representation RA for cell type A in present set of representations. If no such representation is found, initialize RA as a zero-matrix of  dimension LxDc, where Dis a preset constant.
  4. Lookup the representation Rb for assay type b in present set of representations. If no such representation is found, initialize Rb as a zero-matrix of dimension LxDa, where Dis a preset constant.
  5. Randomly select an integer k between 0 and L; Push the k -th row of RA and Rb into the input layer of the network; Run the optimization process, e.g. with ADAM optimizer, for one step with k-th of PA,b as the target label. 
  6. Pull back the state of the input layer into the k -th row of RA and Rb respectively. 
  7. Repeat the steps 5 and 6 for certain preset numbers. 
  8. Select an other profile and repeat step 1 to 7 till the training process converges to certain  state with small errors.
Notice that the network depicted above consists mainly of dense feed-forward layers. Only the last layer is a transposed convolution layer. This layer significantly extends the output dimension and therefore speedup the training process for long profile sequences. Another technique to speedup the training, that is not directly visible in above diagram, is the batched learning: the training process updates the weight matrix and input layer states only after 30 to 40 samples have been feed into the algorithm.

Representation Learning versus Model Learning

Most deep learning frameworks are model learning in the sense that they train models to learn data. Compared to the amount of training data, the size of the trained models are relatively small. RL extends model training to learn vector representation for key entities, which subject to certain algebraic relationships. Those key entities can be, element of mathematical groups,  3D image transformation, biological cell and assay types. Since RL exploits modeling power of deep neural network in the same way as model learning does, methods and techniques developed for model learning (like data batching, normalization and regularization,) can be directly used by RL framework.

It is interesting to notice RL incidentally resembles the working mechanism in cellular organism as revealed in molecular biology. In the latter case, the basic machinery of a cell is the RNA ribosome, which is relatively simple and generic compared to the DNA sequences. The ribosome could be considered as a biological model, whereas the DNA sequences as representation (i.e. code) for plethora of organic phenotypes. The learning process of the two models are however different. Whereas the ribosome model learns by recombination and optimizing selection, RL learns by gradient back-propagation. The former seems to be more general, whereas the latter be more specific and, arguably, more efficient. Nevertheless, I do think that nature organism at certain level might host more efficient gradient directed learning procedure.

Conclusion

We have presented a representation learning method with deep neural network. The method basically swaps the data representation with states of neuron nodes during the training process. Intuitively, those learned representation can be considered as target data back-propagated to designated neuron nodes. With examples we have demonstrated that RL algorithm can effectively learn abstract data like algebraic entities, transformations, and can perform services like dimensionality reduction and data imputation.









Tuesday, June 13, 2017

On Strip Overfitting of Multidimensional Regression

Overfitting is a key characteristic for regression style data modeling with deep neural networks. Whereas considerable effort have been devoted to finding methods to reduce overfitting, little have been done to address the question how to measure overfitting. In most cases, overfitting is measured by prediction error with respect to a pre-selected test data set. Such method has led to various, more or less, successful overfitting-reducing techniques.

There are, however, some open questions relating to the rather ad hoc method to measure overfitting. Firstly, the method requires that the test set does not participate in the training process. But most learning process checks repeatedly with the test data set to find optimal models or meta-parameters. In this way, the test data set information "leaks" in to the training process and compromises the separation between training and testing data set. Secondly, there are no clear guidelines to choose the test data set. The method implicitly assumes that the randomly selected test data set will reflect the intrinsic statistics of data model, such assumption can be far from being true in the practice. Thirdly, many problem encountered in the practice have rather limited data, reserving part of the data as test data set might significantly reduce the size of the training data set, and therefore reduce accuracy of regression.model.

In this note I'll present a simple way to quantify overfitting of regression style neural networks. The main idea of the method is that it generates test data by linear interpolation, instead of choosing test data set from training data. The following diagram illustrated in this method:


Strip Overfitting with respect to point A and B.

The above diagram shows a neural network involved in learning a map from high dimensional data set to the 2-dimensional plane. After completed the learning process, the network maps the two data points labeled A and B to A' and B', respectively. In order to measure the overfitting, we choose 3 evenly distributed linear interpolation points between A and B; and calculate their targets in the 2 dimensional plane with the learned network. Whereas A, P1, P2, P3, B form a straight line, the points A', P'1, P'2, P'3, B' normally don't form a straight line. We then  define the strip overfitting w.r.t. A and B by the size of the strip induced by the points A', P'1, P'2, P'3B', as depicted in above diagram as the shadow area.

Based on the strip overfitting for a pair of data points, we can then estimate the overfitting for the whole training data set by the average strip overfitting for a fixed number of randomly selected data point pairs.

It should be noted that the number of interpolation points provides a way to control the scaling of the overfitting: the more interpolation points we choose, the finer the strip and the smaller the strip overfitting. When the target space is a one-dimensional space, the strip will collapses to a straight line and always have size zero.  In order to accommodate such special case, we append an index component to each value in the series A', P'1, P'2, P'3B', so that we get a series of two dimensional vectors (A',0), (P'1,1), (P'2,2), (P'3, 3), (B', 4); and we define its strip size as the strip overfitting for the one-dimensional target space.

It should be noted that the strip overfitting also becomes zero, if the mapping induced by the network preserves  straight lines between the data points. That means, linear regression would be considered as kind of ideal regression with zero overfitting. Since most data models in the practice are non-linear, zero strip overfitting won't be achieved globally for all data point pairs.

The interpolation technique also provides a direct way to visualize data generalization. In order to see how a neural network generalizes, we can choose two data points and calculate a series of interpolation points between them; and then plot their trace mapped in the target space by the network. As samples, the following two pictures show the traces of three pairs produced by two differently trained neural networks: both networks have the same structure and are trained with the same algorithm, the only difference is that the network that produces the left picture has used the dropout regularization technique to reduce overfitting. We see clearly that the three interpolation traces on the left side are closer to straight lines. Numerical calculation also verifies that the network for the map on the left side has smaller strip overfitting. This example indicates that strip overfitting as defined in this note may be suitable to characterize overfitting phenomena.


The next two pictures show more examples of interpolation traces that visualizes the effect of dropout regularization technique. These two pictures are produced by the same two networks used before with the exception that here more data point pairs have been chosen and the training data points are not shown for the sack of clarity. We see that the dropout regularization makes the interpolation traces smoother and more homogeneous.



Discussion

This note has introduced the concept strip overfitting to quantify overfitting in multidimensional regression models. The method basically quantifies overfitting by the "distance" between a regression model and the linear interpolation. We have demonstrated with examples that the strip overfitting may be suitable to measure and compare the effect of regularization techniques like dropout technique. The strip overfitting offers an interesting alternative tool to quantify and visualize overfitting in neural networks for multidimensional data modeling.






Saturday, June 3, 2017

Deep Line Chart for Big Data

Line chart is a popular technique to visualize trend among sequential data. Whereas it has been very a useful tool in general, it has a often neglected problem for large data sets: most conventional software library draw curves sequentially in certain order, so that the early drawn curves will be covered by curves drawn afterwards. This means that the appearance of the line chart depends on the drawing order, that is often random in the practices. The following animated picture illustrates a case where yellow curves get covered by blue curves. This drawback would especially become more apparent for large data set that comprises many high density curves.



In this note I'll present a new line chart method that avoids the above mentioned problem. Before we describe the new drawing method, let's recall a simplified version of the conventional way to draw line chart:

  Input: A set of line segments each with 2 ending points and a color.
  Output: A bitmap P with a color Pxy at each coordinator (x, y).  
  Initialize P with the background color.
  for each line segment L
      Let (r, g, b) be the color for L
      for each pixel coordinator (x,y) on L
          Set Pxy:=(r,g,b)

The above algorithm is a single pass drawing method where each pixel color of a line segment is directly painted on the bitmap.  With this method, painted lines on the bitmap may potentially be painted over by successive lines.

The new method is a two-pass algorithm: it first sum up all pixel colors in an intermediate matrix; then convert the matrix into a bitmap. It works roughly as follows:

  Input: A set of line segments each with 2 ending points and a color.
  Output:  A bitmap P with a color Pxy at each coordinator (x, y).
  Initialize a matrix M with the same dimension as P, but 
    each component of M holds an integer vector Mxy=(r,g,b,n).
  for each line segment L
      Let (r, g, b) be the color for L
      for each pixel coordinator (x,y) on L
          Set Mxy := Mxy + (r, g, b, 1)
  for each coordinator (x,y) in M
      Set Pxy := (r/n, g/n, b/n), where (r,g,b,n) is the vector Mxy

Compared to the previous method, the key change of the new method is the use of an intermediate matrix : the matrix sums up all colors from all lines and counts how many times a pixel has been painted. In the second pass, the matrix is converted to bitmap by taking the "average" color for each pixel. It should be pointed out here that, for an actual implementation, the data type for matrix components should be all 32 bits integers, so that no data overflow occurs when summing up the colors.

The following picture shows a line chart created with the new line chart algorithm for the same data used in previous picture above:


We see in above picture that the group of yellow curves are clearly visible. In general, the new line chart will reveal groups of curve, as long as the group is statistically significant with respect to the complete data set. The conventional line chart, on the other side, only shows curves on the "surface" of the curve bundle.

The new line chart algorithm has another significant advantage over the conventional line chart: since the algorithm does not mandate particular order to update the matrix and bitmaps, it can be efficiently parallelized on modern GPU hardware and achieve very good performance. The implementation included in VisuMap version 5.0.930 is, for a system with a GTX 960 card (that has ca. 2000 GPU cores), more than two orders of magnitude faster than the conventional line chart.





Friday, April 21, 2017

Deep Data Profile with VisuMap

Data profiling are normally understood as statistical methods to extract numerical features from complex systems for easy exploration. So for instance, GDP, CPI and various kind of indices are often used to profile the state of an economy. Appropriate profiling helps us to compare similar systems; or a system in different development phases. In this note I'll put forward a generic method to profile high dimensional data; the method combines dimensionality reduction algorithms with deep artificial neural networks.

In recent years, many so called  nonlinear dimensionality reduction (NDR) methods have been developed to visualize high dimensional complex data. Those methods often use machine learning algorithm to produce 2D or 3D maps; which provide a kind of graphic profiles about data. For instance the following pictures shows a 2D map made from a data set from flow cytometry study:


Above map was made with the tSNE algorithm from a dataset that contains about 6000 data points each with 12 variables; The colors for the sub-clusters are added with help of the affinity propagation clustering algorithm. The colored map is pretty helpful to discern interesting structure within the data set. Unlike those statistics based profiling, the visual map based profiling does not rely on high level features like GPD ratio; which in general require a good understanding about the data and underlying system.

Nevertheless, those visual map based methods lack the predication capability in the following sense: in the practice, we often have multiple data sets collected about similar systems, or the same system in different phases. For instance, in clinic trials our data sets might be gene expression profiles of patients in different trial stages. In these cases, we are especially interested in differences between the profiles. Most of these NDR methods are however insensitive to small changes, so that it is hard to recognize differences between NDR maps from similar data sets.

To address above mentioned problem, we purpose here the deep data profiling (DDP) procedure as an extension to NDR based profiling as illustrated in the following diagram:

Deep Data Profiling 

The DDP procedure starts with a set similar data sets. As first step we choose a reference data set and apply NDR on the data set to obtain a 2D or 3D map. Many NDR methods are available for this purpose, for this note we recommend the tSNE algorithm, as it is widely available and produces relatively good structured map for a wide range of data. Then, we apply a clustering algorithm on the map to produce labels for the sub-clusters on the map. Those labels are then used as different colors for the sub-clusters, so that we get a colored map as illustrated in above picture.
There are many clustering algorithms suitable for this purpose, for this test we used the affinity propagation algorithm together with some manual adjustment directly performed on the map.

The colored map we obtained from the reference data set represents knowledge we captured from the reference data. As next step we then use a machine learning algorithm to learn such knowledge. In particular, we employ multilayer feed forwards networks to learn the translation from reference data to the colored map. Two networks will be used for this purpose: one to learn the dimensionality reduction function; and the other one to learn the clustering function.

The two trained networks can then be applied to other data sets to produce colored maps as their visual profiles.

A plugin module, called Data Modeling, has been implemented to integrate DDP service with VisuMap version 5.0.928. The plugin module internally uses the TensorFlow engine from Google Inc. to implement feed-forward neural networks. The plugin offers GUI and scripting interface to train and apply models with data directly from VisuMap.  The following screen-cast demonstrates a simple use case of DDP with VisuMap:



Deep data profiling procedure presented here offers a visual profiling technique for high dimensional complex data. Compared to conventional statistics based profiling techniques, DDP is more intuitive and information richer. As an meta-algorithm, DDP can take advantage of new algorithms for NDR, clustering and machine learning. With the rapid development of machine learning technologies, DDR might offer powerful and versatile tools to explore high dimensional complex systems.