Tuesday, June 13, 2017

On Strip Overfitting of Multidimensional Regression

Overfitting is a key characteristic for regression style data modeling with deep neural networks. Whereas considerable effort have been devoted to finding methods to reduce overfitting, little have been done to address the question how to measure overfitting. In most cases, overfitting is measured by prediction error with respect to a pre-selected test data set. Such method has led to various, more or less, successful overfitting-reducing techniques.

There are, however, some open questions relating to the rather ad hoc method to measure overfitting. Firstly, the method requires that the test set does not participate in the training process. But most learning process checks repeatedly with the test data set to find optimal models or meta-parameters. In this way, the test data set information "leaks" in to the training process and compromises the separation between training and testing data set. Secondly, there are no clear guidelines to choose the test data set. The method implicitly assumes that the randomly selected test data set will reflect the intrinsic statistics of data model, such assumption can be far from being true in the practice. Thirdly, many problem encountered in the practice have rather limited data, reserving part of the data as test data set might significantly reduce the size of the training data set, and therefore reduce accuracy of regression.model.

In this note I'll present a simple way to quantify overfitting of regression style neural networks. The main idea of the method is that it generates test data by linear interpolation, instead of choosing test data set from training data. The following diagram illustrated in this method:


Strip Overfitting with respect to point A and B.

The above diagram shows a neural network involved in learning a map from high dimensional data set to the 2-dimensional plane. After completed the learning process, the network maps the two data points labeled A and B to A' and B', respectively. In order to measure the overfitting, we choose 3 evenly distributed linear interpolation points between A and B; and calculate their targets in the 2 dimensional plane with the learned network. Whereas A, P1, P2, P3, B form a straight line, the points A', P'1, P'2, P'3, B' normally don't form a straight line. We then  define the strip overfitting w.r.t. A and B by the size of the strip induced by the points A', P'1, P'2, P'3B', as depicted in above diagram as the shadow area.

Based on the strip overfitting for a pair of data points, we can then estimate the overfitting for the whole training data set by the average strip overfitting for a fixed number of randomly selected data point pairs.

It should be noted that the number of interpolation points provides a way to control the scaling of the overfitting: the more interpolation points we choose, the finer the strip and the smaller the strip overfitting. When the target space is a one-dimensional space, the strip will collapses to a straight line and always have size zero.  In order to accommodate such special case, we append an index component to each value in the series A', P'1, P'2, P'3B', so that we get a series of two dimensional vectors (A',0), (P'1,1), (P'2,2), (P'3, 3), (B', 4); and we define its strip size as the strip overfitting for the one-dimensional target space.

It should be noted that the strip overfitting also becomes zero, if the mapping induced by the network preserves  straight lines between the data points. That means, linear regression would be considered as kind of ideal regression with zero overfitting. Since most data models in the practice are non-linear, zero strip overfitting won't be achieved globally for all data point pairs.

The interpolation technique also provides a direct way to visualize data generalization. In order to see how a neural network generalizes, we can choose two data points and calculate a series of interpolation points between them; and then plot their trace mapped in the target space by the network. As samples, the following two pictures show the traces of three pairs produced by two differently trained neural networks: both networks have the same structure and are trained with the same algorithm, the only difference is that the network that produces the left picture has used the dropout regularization technique to reduce overfitting. We see clearly that the three interpolation traces on the left side are closer to straight lines. Numerical calculation also verifies that the network for the map on the left side has smaller strip overfitting. This example indicates that strip overfitting as defined in this note may be suitable to characterize overfitting phenomena.


The next two pictures show more examples of interpolation traces that visualizes the effect of dropout regularization technique. These two pictures are produced by the same two networks used before with the exception that here more data point pairs have been chosen and the training data points are not shown for the sack of clarity. We see that the dropout regularization makes the interpolation traces smoother and more homogeneous.



Discussion

This note has introduced the concept strip overfitting to quantify overfitting in multidimensional regression models. The method basically quantifies overfitting by the "distance" between a regression model and the linear interpolation. We have demonstrated with examples that the strip overfitting may be suitable to measure and compare the effect of regularization techniques like dropout technique. The strip overfitting offers an interesting alternative tool to quantify and visualize overfitting in neural networks for multidimensional data modeling.






Saturday, June 3, 2017

Deep Line Chart for Big Data

Line chart is a popular technique to visualize trend among sequential data. Whereas it has been very a useful tool in general, it has a often neglected problem for large data sets: most conventional software library draw curves sequentially in certain order, so that the early drawn curves will be covered by curves drawn afterwards. This means that the appearance of the line chart depends on the drawing order, that is often random in the practices. The following animated picture illustrates a case where yellow curves get covered by blue curves. This drawback would especially become more apparent for large data set that comprises many high density curves.



In this note I'll present a new line chart method that avoids the above mentioned problem. Before we describe the new drawing method, let's recall a simplified version of the conventional way to draw line chart:

  Input: A set of line segments each with 2 ending points and a color.
  Output: A bitmap P with a color Pxy at each coordinator (x, y).  
  Initialize P with the background color.
  for each line segment L
      Let (r, g, b) be the color for L
      for each pixel coordinator (x,y) on L
          Set Pxy:=(r,g,b)

The above algorithm is a single pass drawing method where each pixel color of a line segment is directly painted on the bitmap.  With this method, painted lines on the bitmap may potentially be painted over by successive lines.

The new method is a two-pass algorithm: it first sum up all pixel colors in an intermediate matrix; then convert the matrix into a bitmap. It works roughly as follows:

  Input: A set of line segments each with 2 ending points and a color.
  Output:  A bitmap P with a color Pxy at each coordinator (x, y).
  Initialize a matrix M with the same dimension as P, but 
    each component of M holds an integer vector Mxy=(r,g,b,n).
  for each line segment L
      Let (r, g, b) be the color for L
      for each pixel coordinator (x,y) on L
          Set Mxy := Mxy + (r, g, b, 1)
  for each coordinator (x,y) in M
      Set Pxy := (r/n, g/n, b/n), where (r,g,b,n) is the vector Mxy

Compared to the previous method, the key change of the new method is the use of an intermediate matrix : the matrix sums up all colors from all lines and counts how many times a pixel has been painted. In the second pass, the matrix is converted to bitmap by taking the "average" color for each pixel. It should be pointed out here that, for an actual implementation, the data type for matrix components should be all 32 bits integers, so that no data overflow occurs when summing up the colors.

The following picture shows a line chart created with the new line chart algorithm for the same data used in previous picture above:


We see in above picture that the group of yellow curves are clearly visible. In general, the new line chart will reveal groups of curve, as long as the group is statistically significant with respect to the complete data set. The conventional line chart, on the other side, only shows curves on the "surface" of the curve bundle.

The new line chart algorithm has another significant advantage over the conventional line chart: since the algorithm does not mandate particular order to update the matrix and bitmaps, it can be efficiently parallelized on modern GPU hardware and achieve very good performance. The implementation included in VisuMap version 5.0.930 is, for a system with a GTX 960 card (that has ca. 2000 GPU cores), more than two orders of magnitude faster than the conventional line chart.