tag:blogger.com,1999:blog-7103837088726701212024-03-12T19:01:25.941-07:00Visualizing High Dimensional DataJames Lihttp://www.blogger.com/profile/17491271097854752129noreply@blogger.comBlogger76125tag:blogger.com,1999:blog-710383708872670121.post-51408989841781170742021-06-17T09:26:00.001-07:002021-06-17T09:26:56.362-07:00Multidimensional Sorting with High Fidelity t-SNE<p style="text-align: center;"><span style="font-size: large;"> Multidimensional Sorting with High Fidelity t-SNE</span></p><p style="text-align: left;">Multidimensional data often has a latent sequential structure that determines the basic shape of the data. Such sequential structures might correlate with ordered quantities like time, stage, and size. Capturing and visualizing these sequential structures can be very helpful for exploratory data analysis. </p><p style="text-align: left;">A frequently used method for exploring high dimensional data is principal component analysis (PCA). By projecting data onto the main principal components, we can gain intuitive information about the high-level, large-scale shape of the data. In particular, the projection onto the first principal component offers a way to visualize the main backbone linear structure.</p><p style="text-align: left;">A main shortcoming of PCA is that it only captures structure from a linear perspective. It fails for data whose latent sequential structure deviates significantly from linearity. 
For instance, let's consider the 2-dimensional dataset depicted by the following map:</p><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-MGYA3QXLL84/YMkbPFpcFLI/AAAAAAABOi0/Wr07HfJswcMXUtcGMD0azhfnJ2X43XEcQCLcBGAsYHQ/s516/MDSSample2D.jpg" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="488" data-original-width="516" height="379" src="https://1.bp.blogspot.com/-MGYA3QXLL84/YMkbPFpcFLI/AAAAAAABOi0/Wr07HfJswcMXUtcGMD0azhfnJ2X43XEcQCLcBGAsYHQ/w400-h379/MDSSample2D.jpg" width="400" /></a></div><p style="text-align: left;">This dataset consists of multiple clusters, each of which possesses, more or less, a sequential structure. PCA would not be able to aggregate these partial sequential structures into a global sequential structure. </p><p style="text-align: left;">The t-SNE algorithm with the <a href="https://jamesxli.blogspot.com/2021/06/high-fidelity-t-sne.html">high fidelity modification</a> offers a simple yet effective way to discover global sequential structure in high-dimensional data: we simply apply t-SNE to reduce a given dataset to a 1-dimensional space. The coordinates of the data points in the 1-dimensional space then provide the keys for ordering the data points. 
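As a minimal sketch of this sorting idea, using scikit-learn's TSNE as a stand-in for the high fidelity implementation described here (the dataset, perplexity, and seed are illustrative assumptions):

```python
# Multidimensional sorting sketch: embed into 1-D with t-SNE and use the
# 1-D coordinates as sort keys. scikit-learn's TSNE stands in here for
# the high fidelity variant described in this post.
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))  # toy high-dimensional dataset (assumption)

coords = TSNE(n_components=1, perplexity=30, method="exact",
              init="random", random_state=0).fit_transform(X)
order = np.argsort(coords[:, 0])  # sequential order of the data points
X_sorted = X[order]
```

The resulting `order` can then be used, for example, to rearrange the rows of a heatmap.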
The following video clip illustrates how t-SNE works on the dataset mentioned above:</p><div class="separator" style="clear: both; text-align: center;"><iframe allowfullscreen='allowfullscreen' webkitallowfullscreen='webkitallowfullscreen' mozallowfullscreen='mozallowfullscreen' width='447' height='371' src='https://www.blogger.com/video.g?token=AD6v5dyeV2B19Yfqxzj0DOGqsfHTpEWO1xk0Xy4yEuST-jaEhflLpxdKfQlditBaoEEQrtGWuE0IborMeRzYyy7gGg' class='b-hbp-video b-uploaded' frameborder='0'></iframe></div><p style="text-align: left;">We notice in the above video that t-SNE discovered, more or less, the sequential structure underlying the sub-clusters; furthermore, those partial sequential structures are aligned and ordered in a coherent way.</p><p style="text-align: left;"><b><span style="font-size: medium;">Dual Sorting of Heatmaps</span></b></p><p style="text-align: left;">A heatmap is a popular method to visualize a matrix of numerical data. Heatmaps are often used together with dendrograms, which <i>rearrange the rows and columns</i> using a hierarchical clustering algorithm. The following picture shows a typical heatmap together with two dendrograms:</p><div class="separator" style="clear: both; text-align: center;"><a href="https://upload.wikimedia.org/wikipedia/commons/1/16/Heatmap_RNAseqV2_1.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="514" data-original-width="800" height="258" src="https://upload.wikimedia.org/wikipedia/commons/1/16/Heatmap_RNAseqV2_1.png" width="400" /></a><br />Heatmap augmented with dendrograms (wikipedia)</div><p style="text-align: left;">A heatmap constructed with dendrograms offers an intuitive visualization, but it has difficulty scaling up to large datasets. 
For large matrices like those from single-cell RNA sequencing (scRNA-seq) studies, dendrograms become quite inefficient and unstable, in the sense that small changes in the data might lead to large variations in the clustering structure.</p><p style="text-align: left;">The t-SNE based sorting algorithm described in this note provides a simple alternative for arranging rows and columns in a heatmap: <i>we simply sort the rows and columns with t-SNE</i>. As an example, I took a dataset from an scRNA-seq analysis; it comprises the gene expression counts of about 11000 cells with respect to 28000 genes. The following picture shows its t-SNE embedding and its heatmap:</p><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-HzSiSYG7h-k/YMoYzMYrg9I/AAAAAAABOjM/eg8tFqolA4wYgGVPp2BwypwB3jI0fNwPQCLcBGAsYHQ/s739/DualSortingA.jpg" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="341" data-original-width="739" height="185" src="https://1.bp.blogspot.com/-HzSiSYG7h-k/YMoYzMYrg9I/AAAAAAABOjM/eg8tFqolA4wYgGVPp2BwypwB3jI0fNwPQCLcBGAsYHQ/w400-h185/DualSortingA.jpg" width="400" /></a></div><div>In the above heatmap, each row represents the expression profile of a cell with respect to all selected genes, and each column represents the expression profile of a gene with respect to the selected cells. Dark or red elements in the heatmap represent highly expressed cell/gene pairs. The t-SNE map is an embedding of all cells. We can see that several cell clusters are clearly visible as dark horizontal strips. However, no clear gene clusters are visible in the heatmap.</div><div><br /></div>The following picture shows the heatmap after sorting its rows and columns with t-SNE. 
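A sketch of this dual sorting, again with scikit-learn's TSNE standing in for the implementation used in the post, and a toy count matrix in place of the scRNA-seq data:

```python
# Dual sorting sketch: sort rows (cells) and columns (genes) of a matrix
# independently, each by a 1-D t-SNE embedding.
import numpy as np
from sklearn.manifold import TSNE

def tsne_order(M, perplexity=30, seed=0):
    """Order the rows of M by their 1-D t-SNE coordinates."""
    emb = TSNE(n_components=1, perplexity=perplexity, method="exact",
               init="random", random_state=seed).fit_transform(M)
    return np.argsort(emb[:, 0])

rng = np.random.default_rng(1)
M = rng.poisson(2.0, size=(300, 120)).astype(float)  # toy expression matrix

row_order = tsne_order(M)      # sort the cells
col_order = tsne_order(M.T)    # sort the genes
M_sorted = M[np.ix_(row_order, col_order)]
```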
We now see much more clearly the clusters among cells and genes and, more importantly, the association between the cell and gene clusters.<div><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-z2YSSHn4xlU/YMofoX99RcI/AAAAAAABOjU/GCkeQ1iXBqM8jXjcQF12d-eCf5lBLNg5wCLcBGAsYHQ/s415/DualSortingB.jpg" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="374" data-original-width="415" height="360" src="https://1.bp.blogspot.com/-z2YSSHn4xlU/YMofoX99RcI/AAAAAAABOjU/GCkeQ1iXBqM8jXjcQF12d-eCf5lBLNg5wCLcBGAsYHQ/w400-h360/DualSortingB.jpg" width="400" /></a></div><div class="separator" style="clear: both; text-align: left;">The following should be noted regarding the experiment above:</div><div class="separator" style="clear: both; text-align: left;"><ul style="text-align: left;"><li>The perplexity of the t-SNE is set to about 2500. A smaller perplexity often leads to instability and random artifacts.</li><li>The t-SNE training process ran for 5000 epochs, which is much longer than a normal t-SNE training process. Training for fewer epochs likely results in more random artifacts.</li><li>In the above example, t-SNE used linear correlation as the distance metric instead of the normally used Euclidean distance. Various experiments indicate that correlation seems more appropriate for measuring similarity between genes. Because of this, we cannot use PCA to pre-process the data table down to much lower dimensions to speed up the training, since PCA in general doesn't preserve correlations.</li></ul></div><div style="text-align: left;"><b><span style="font-size: medium;">Unfolding of t-SNE Embedding</span></b></div><div style="text-align: left;"><b><span style="font-size: medium;"><br /></span></b></div><div style="text-align: left;"><span>We have previously <a href="https://jamesxli.blogspot.com/2021/06/high-fidelity-t-sne.html" target="_blank">blogged</a> about doing "knockout" tests with t-SNE. 
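The correlation-as-distance choice noted in the bullet points above can be sketched by passing a precomputed 1 − correlation matrix to t-SNE (toy data; the perplexity and dataset of the actual experiment are much larger):

```python
# Using linear correlation instead of Euclidean distance: feed t-SNE a
# precomputed distance matrix D = 1 - corr.
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(2)
X = rng.normal(size=(150, 40))          # toy data (assumption)

D = 1.0 - np.corrcoef(X)                # pairwise 1 - Pearson correlation
np.fill_diagonal(D, 0.0)
D = np.clip(D, 0.0, None)               # guard against rounding below zero

emb = TSNE(n_components=1, metric="precomputed", init="random",
           method="exact", perplexity=30, random_state=0).fit_transform(D)
order = np.argsort(emb[:, 0])
```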
By comparing the t-SNE maps produced by different sets of genes, we can explore the impact of the disabled genes.</span><span> Going one step further, with t-SNE as a multidimensional sorting algorithm, we can perform the following experiment: we first sort all the participating genes into a sequence; then pick a particular direction for the sequence and split the sequence into small clusters; then create t-SNE cell maps with progressively larger sets of gene clusters. The following diagram illustrates these steps:</span></div><div style="text-align: left;"><span style="font-size: medium;"><br /></span></div><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-JmYFZcspMko/YMqD377jH-I/AAAAAAABOjg/HrVR2rOsqa83BKB8PAhFIEAJ921-Nm8_wCLcBGAsYHQ/s913/UnfoldingA.jpg" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="913" data-original-width="773" height="640" src="https://1.bp.blogspot.com/-JmYFZcspMko/YMqD377jH-I/AAAAAAABOjg/HrVR2rOsqa83BKB8PAhFIEAJ921-Nm8_wCLcBGAsYHQ/w542-h640/UnfoldingA.jpg" width="542" /></a></div><div style="text-align: left;"><br /></div><div style="text-align: left;"><span>It should be noticed that the sequence of t-SNE maps is generated independently by separate training processes, each of which started from a different random initialization. In other words, the final maps M1, M2, M3 are independent embeddings of the ordered gene clusters C1, {C1, C2}, {C1, C2, C3}, etc.</span><span> If we are lucky, the order underlying the gene clusters might indicate certain latent variables, like time. 
The following short video clip shows an animation through the resulting t-SNE maps for the dataset in the previous section.</span></div><div style="text-align: left;"><br /></div><div class="separator" style="clear: both; text-align: center;"><iframe allowfullscreen="" class="BLOG_video_class" height="403" src="https://www.youtube.com/embed/_xADn2NzmD8" width="485" youtube-src-id="_xADn2NzmD8"></iframe></div><br /><div style="text-align: left;"><span>It is interesting to notice that the animation is reminiscent of a group of cells unfolding, generation by generation, into their final shape and structure.</span></div><div style="text-align: left;"><br /></div></div>James Lihttp://www.blogger.com/profile/17491271097854752129noreply@blogger.com0tag:blogger.com,1999:blog-710383708872670121.post-81913006884957127142021-06-04T11:27:00.003-07:002021-06-04T11:27:44.505-07:00 High Fidelity t-SNE <p style="text-align: center;"><span style="font-size: medium;"><b> High Fidelity t-SNE with Gradual Diminishing Exaggeration</b></span></p><p>When we repeat t-SNE on the same dataset with the same parameters, we may get different maps with noticeable random variations. This kind of randomness limits the accuracy of the t-SNE algorithm as a visualization tool.</p><p>In this note we describe a simple yet effective method to improve the accuracy of t-SNE: instead of switching the exaggeration rate at a fixed time during the training, we propose starting the training process with a high exaggeration, say 12.0, then gradually reducing it to 1.0. We will demonstrate the effectiveness of this method with an example from a single-cell RNA sequencing study.</p><p><b>Post-Training Normalization</b></p><p>Since the cost function optimized by t-SNE is invariant under rigid transformations (i.e. rotations and translations), t-SNE normally produces maps with random rotations and shifts. This kind of randomness can easily be removed by post-training normalization. 
For this study, we propose the following steps to normalize t-SNE maps, as illustrated in the following diagram with 4 sample t-SNE maps:</p><p></p><div class="separator" style="clear: both; text-align: center;"><div class="separator" style="clear: both; text-align: center;"><a href="https://lh3.googleusercontent.com/-54EkpPlS4gU/YLZ8p5FyQ-I/AAAAAAABOXs/xd7DHgaKFIQBT3y2ONAG-LGfmFdBfbqKgCLcBGAsYHQ/MapNormalizingB.jpg" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="846" data-original-width="759" height="640" src="https://lh3.googleusercontent.com/-54EkpPlS4gU/YLZ8p5FyQ-I/AAAAAAABOXs/xd7DHgaKFIQBT3y2ONAG-LGfmFdBfbqKgCLcBGAsYHQ/w573-h640/MapNormalizingB.jpg" width="573" /></a></div><br /><div style="text-align: left;">As shown in the above diagram, a t-SNE map is normalized in 3 steps:</div><div style="text-align: left;"><ol style="text-align: left;"><li><b>Centralizing</b>: centralize the map so that each of the 2D or 3D coordinates has zero mean.</li><li><b>PCA based Alignment</b>: rotate the map so that the x-axis aligns with the first eigenvector of the map, the y-axis with the second eigenvector, and potentially the z-axis with the third eigenvector.</li><li><b>Moment based Flipping</b>: calculate the 0.5-th moment of the map along the x-axis according to the following formula: <br /><div class="separator" style="clear: both; text-align: center;"><a href="https://lh3.googleusercontent.com/-xGykPuwV3TA/YLaNZx5wEkI/AAAAAAABOX0/iCFOM9ljOvU3uQ4AJeYIzxVQBS1jhEUlQCLcBGAsYHQ/CodeCogsEqn.png" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="64" data-original-width="228" height="56" src="https://lh3.googleusercontent.com/-xGykPuwV3TA/YLaNZx5wEkI/AAAAAAABOX0/iCFOM9ljOvU3uQ4AJeYIzxVQBS1jhEUlQCLcBGAsYHQ/w200-h56/CodeCogsEqn.png" width="200" /></a></div><br />If the moment is negative, flip the map along the x-axis by multiplying all x components by -1.0;<br />do the same with the y-axis and potentially 
z-axis.</li></ol></div></div><p></p><p>After normalization we can then measure the discrepancy between two t-SNE maps by summing up the Euclidean distances between corresponding data points in the two maps. More conveniently, we can simply count the number of data points whose image points are farther apart than a certain value, say 40 pixels.</p><p>It should be pointed out that there are cases where the suggested method might fail. For instance, when the t-SNE maps are rotationally symmetric and no clear orientation is present, this method will have difficulty finding a common orientation. In practice, we often encounter this case when the perplexity parameter of t-SNE is too small, so that the resulting t-SNE map resembles a disc or a ball. Fortunately, this kind of issue can easily be fixed by increasing the perplexity.</p><p>Finally, it is interesting to notice that there is no need to scale the maps, since t-SNE with fixed parameter settings normally produces maps of roughly the same size.</p><p><b>Training with Gradual Diminishing Exaggeration</b></p><p>In order to improve the training speed and efficiency, the original t-SNE algorithm proposed starting the training with a high exaggeration factor, training the map for a certain number of epochs, then switching the exaggeration factor to 1.0 to practically terminate the exaggeration phase. 
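The three normalization steps and the pixel-distance discrepancy count described above can be sketched as follows. The exact 0.5-th moment formula appears only as an image in the post, so the form sum(sign(x) · |x|^0.5) used here is an assumed reading:

```python
# Sketch of post-training map normalization and map discrepancy.
import numpy as np

def normalize_map(Y):
    Y = Y - Y.mean(axis=0)                    # 1. centralizing
    _, vecs = np.linalg.eigh(np.cov(Y.T))     # 2. PCA based alignment:
    Y = Y @ vecs[:, ::-1]                     #    largest eigenvector -> x-axis
    for k in range(Y.shape[1]):               # 3. moment based flipping
        # 0.5-th moment along axis k (assumed form, see lead-in)
        if np.sum(np.sign(Y[:, k]) * np.sqrt(np.abs(Y[:, k]))) < 0:
            Y[:, k] = -Y[:, k]
    return Y

def discrepancy(Y1, Y2, threshold=40.0):
    """Count points whose images in the two normalized maps are farther
    apart than `threshold` (e.g. 40 pixels)."""
    return int(np.sum(np.linalg.norm(Y1 - Y2, axis=1) > threshold))
```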
The following diagram shows the cost minimized during a typical t-SNE training process; the diagram has been overlaid with the exaggeration factor.</p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://lh3.googleusercontent.com/-zRGlpLg75SU/YLbQ2mCvUSI/AAAAAAABOYE/VAo4jICKKRQNnQdc7OHEjbQ0xMNU8ampQCLcBGAsYHQ/ExaggCost.jpg" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="397" data-original-width="555" height="286" src="https://lh3.googleusercontent.com/-zRGlpLg75SU/YLbQ2mCvUSI/AAAAAAABOYE/VAo4jICKKRQNnQdc7OHEjbQ0xMNU8ampQCLcBGAsYHQ/w400-h286/ExaggCost.jpg" width="400" /></a></div><p></p><p>We see that the objective cost is reduced rapidly for a short period after the exaggeration phase has ended, but then stays basically unchanged during the rest of the training process. This behavior suggests that <i>real learning only takes place when the exaggeration factor changes</i>. We thus propose the following function to gradually decrease the exaggeration factor:</p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://lh3.googleusercontent.com/-cLAAuRsAmV8/YLfBxg9VrqI/AAAAAAABOYU/1sATepbBLZ0faFdyY3v5aeTbQXds8-21ACLcBGAsYHQ/ExaggFF.jpg" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="216" data-original-width="690" height="200" src="https://lh3.googleusercontent.com/-cLAAuRsAmV8/YLfBxg9VrqI/AAAAAAABOYU/1sATepbBLZ0faFdyY3v5aeTbQXds8-21ACLcBGAsYHQ/w640-h200/ExaggFF.jpg" width="640" /></a></div>In the above formula, <i>f</i>(<i>n</i>) is the exaggeration factor at the <i>n</i>-th training epoch; C is the initial exaggeration factor; N is the total number of training epochs; L is the length of the exaggeration phase, which is normally set to 0.9N. <br /><p></p><p><b>Example </b></p><p>As an example I picked a dataset from a single-cell RNA sequencing (scRNA-seq) study. The dataset contains the expression counts of about 22000 cells with respect to 28000 genes. 
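Since the decay formula above is given only as an image, the sketch below uses one plausible reading: a linear decay from C down to 1.0 over the exaggeration phase of length L = 0.9N, constant 1.0 afterwards.

```python
def exaggeration(n, C=12.0, N=5000, L=None):
    """Gradually diminishing exaggeration factor f(n).
    n: current epoch; C: initial exaggeration; N: total epochs;
    L: length of the exaggeration phase (normally 0.9 * N).
    The linear decay is an assumed reading of the post's formula."""
    if L is None:
        L = 0.9 * N
    if n >= L:
        return 1.0
    return C - (C - 1.0) * (n / L)
```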
The input data for the t-SNE is thus a 22000<span style="font-family: verdana; font-size: x-small;">x</span>28000 matrix. The following picture shows a heatmap and a t-SNE map of this dataset:</p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://lh3.googleusercontent.com/-XaABRjhz6So/YLf2G6redDI/AAAAAAABOYc/JM6qajsbPHspEsS2LrmUGhkq3QhinoPZQCLcBGAsYHQ/ExampleA.jpg" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="295" data-original-width="722" height="262" src="https://lh3.googleusercontent.com/-XaABRjhz6So/YLf2G6redDI/AAAAAAABOYc/JM6qajsbPHspEsS2LrmUGhkq3QhinoPZQCLcBGAsYHQ/w640-h262/ExampleA.jpg" width="640" /></a></div>For the test, t-SNE has been run on the dataset 4 times <span style="color: #04ff00;">with </span>and <span style="color: red;">without </span>the new exaggeration method enabled. The following picture shows the cost reduction during training for the four repeats:<p></p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://lh3.googleusercontent.com/-6qfr5kGrvOk/YLf5P6ya6pI/AAAAAAABOYk/YZhwpDUvJ9AJeD4RNoHoaRwBnGe8mfaoQCLcBGAsYHQ/ComparisonA.jpg" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="377" data-original-width="871" height="277" src="https://lh3.googleusercontent.com/-6qfr5kGrvOk/YLf5P6ya6pI/AAAAAAABOYk/YZhwpDUvJ9AJeD4RNoHoaRwBnGe8mfaoQCLcBGAsYHQ/w640-h277/ComparisonA.jpg" width="640" /></a></div><br /><br /><p></p><p>We see that, with the new method enabled, the final t-SNE maps have consistently lower cost. We also notice that, with the new method, the final maps have almost equal costs, whereas with the old method, the costs of the final maps vary slightly. Those small variations are more visible in the final maps. 
The following two pictures show the maps created with and without the new method:</p><div class="separator" style="clear: both; text-align: center;"><a href="https://lh3.googleusercontent.com/-MIm698h6T-Y/YLf9lVFfweI/AAAAAAABOYs/RAj5KxKbEJs5ClZb5EK9QQyLnKs7fqgzQCLcBGAsYHQ/tSNE4MapsA.jpg" style="margin-left: 1em; margin-right: 1em; text-align: right;"><img alt="" data-original-height="425" data-original-width="622" height="274" src="https://lh3.googleusercontent.com/-MIm698h6T-Y/YLf9lVFfweI/AAAAAAABOYs/RAj5KxKbEJs5ClZb5EK9QQyLnKs7fqgzQCLcBGAsYHQ/w400-h274/tSNE4MapsA.jpg" width="400" /></a></div><p></p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://lh3.googleusercontent.com/-b4pNn_paCpc/YLf9mj-AYLI/AAAAAAABOYw/TKt4Gnh8fpQoI0R7E0GL5s-FeuZroE4LACLcBGAsYHQ/tSNE4MapsB.jpg" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="433" data-original-width="617" height="281" src="https://lh3.googleusercontent.com/-b4pNn_paCpc/YLf9mj-AYLI/AAAAAAABOYw/TKt4Gnh8fpQoI0R7E0GL5s-FeuZroE4LACLcBGAsYHQ/w400-h281/tSNE4MapsB.jpg" width="400" /></a></div><br /><div class="separator" style="clear: both; text-align: left;">With some effort we can discern relatively large variations among the maps created with the old exaggeration strategy. 
We can visualize those variations more clearly with an animation interpolating between those maps, as shown in the following video clip:</div><div class="separator" style="clear: both; text-align: left;"><br /></div><div class="separator" style="clear: both; text-align: left;"><br /><div class="separator" style="clear: both; text-align: center;"><iframe allowfullscreen='allowfullscreen' webkitallowfullscreen='webkitallowfullscreen' mozallowfullscreen='mozallowfullscreen' width='548' height='373' src='https://www.blogger.com/video.g?token=AD6v5dyKynCr5obrN2Lu9nK7TStwUmuRTqfAnSXpBBoZrrQzJv9W8AiVc_Yc_6tY8jFD9E-bczH-k3QUmIf-yu4HHw' class='b-hbp-video b-uploaded' frameborder='0'></iframe></div></div><p><b><br /></b></p><p><b>Knockout Test with High Fidelity t-SNE</b></p><p>Knockout tests have been used in microbiology to probe gene functions during the development of organisms. To do so, certain genes are disabled, or knocked out, and the changes induced in the organism give indications about the roles of those knocked-out genes. In general, knockout tests help us trace the correlation between genotypes and phenotypes.</p><p>High fidelity t-SNE enables us to probe the effect of individual features on the output t-SNE maps. If we consider the t-SNE maps as <i>phenotypes </i>derived from genotypes (i.e. gene expressions), we can perform a kind of "knockout" test with t-SNE: we first create a t-SNE map with all features; then create a t-SNE map without certain features. 
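The knockout procedure can be sketched as follows (toy data; the knocked-out gene indices are purely illustrative, and the two maps would still need the normalization described above before being compared):

```python
# "Knockout" test sketch: embed cells with all genes, then with some
# genes removed, and compare the resulting maps.
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 50))        # toy cells x genes matrix
knocked_out = {0, 1, 2, 3, 4}         # hypothetical gene indices

kept = [g for g in range(X.shape[1]) if g not in knocked_out]

def embed(M):
    return TSNE(n_components=2, perplexity=30, init="random",
                random_state=0).fit_transform(M)

map_full = embed(X)                   # map with all features
map_ko = embed(X[:, kept])            # map without the knocked-out features
```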
We can then expect the knocked-out features to be responsible for the variation between the t-SNE maps.</p><p>The following short video demonstrates how a "knockout" test might be done to study the correlation between gene and cell clusters:</p><div class="separator" style="clear: both; text-align: center;"><iframe allowfullscreen='allowfullscreen' webkitallowfullscreen='webkitallowfullscreen' mozallowfullscreen='mozallowfullscreen' width='540' height='431' src='https://www.blogger.com/video.g?token=AD6v5dxvM1tga2mD4KjITEzdxnVRiryTlNiFEz5vtENelNPnScflZD_QbyD3jv6tKVc5Dp5EvhNB8vN3uvfCzLIBfQ' class='b-hbp-video b-uploaded' frameborder='0'></iframe></div><br /><p><b>Conclusion</b></p><p>With a small change in its algorithm, t-SNE achieves much higher accuracy and thereby opens doors for new applications. </p><p></p>James Lihttp://www.blogger.com/profile/17491271097854752129noreply@blogger.com0tag:blogger.com,1999:blog-710383708872670121.post-60478367683434685812020-06-12T09:16:00.000-07:002020-06-12T09:16:17.760-07:00Dual Embedding of scRNA-seq Data<div dir="ltr" style="text-align: left;" trbidi="on">
<div dir="ltr" style="text-align: left;" trbidi="on">
<div dir="ltr" style="text-align: left;" trbidi="on">
Single-cell RNA sequencing (scRNA-seq) is a powerful tool to profile cells under a given condition with respect to selected genes. As a contingency matrix, a scRNA-seq dataset records the <i>expression level</i> of individual cells with respect to a set of genes or related transcriptional signatures. A widely used method to study such scRNA-seq data is to embed the expression profiles of cells, i.e. the rows of the expression matrix, into a low dimensional space, and visualize them as 2D or 3D maps. Those maps offer an intuitive way to trace the phenotype clusters among cells. Yet, maps of this kind do not show which cells are regulated by which genes, even though the expression matrix contains direct information about the regulating relationship between cells and genes.<br />
<br />
This note puts forward a method to embed the expression profiles of cells and genes, i.e. the rows and columns of the expression matrix, simultaneously into a single low dimensional space. The resulting map visualizes both cells and genes and, more interestingly, also the associations between their clusters. The following diagram illustrates the relationship between the <i>dual embedding</i> and conventional embeddings:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://1.bp.blogspot.com/-heUlBEQpbp0/XuFmOiGlS6I/AAAAAAABJus/kR7_hpziuzczKXxKAhjBuFx3BZh1oiIOwCLcBGAsYHQ/s1600/DualTsneEmbedding2.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="599" data-original-width="772" height="496" src="https://1.bp.blogspot.com/-heUlBEQpbp0/XuFmOiGlS6I/AAAAAAABJus/kR7_hpziuzczKXxKAhjBuFx3BZh1oiIOwCLcBGAsYHQ/s640/DualTsneEmbedding2.jpg" width="640" /></a></div>
<br />
<div class="separator" style="clear: both; text-align: center;">
</div>
The dual embedding method uses the <a href="https://en.wikipedia.org/wiki/T-distributed_stochastic_neighbor_embedding">tSNE </a>algorithm for the embedding. In particular, let C and G denote the sets of cells and genes, and <i>d</i><sub>c</sub> and <i>d</i><sub>g</sub> be distance functions over C and G respectively. With tSNE we can then obtain a <i>cell map</i> and a <i>gene map</i> from (C, <i>d</i><sub>c</sub>) and (G, <i>d</i><sub>g</sub>) respectively, as indicated in the above diagram. As an extension of these methods, the dual embedding method embeds both C and G into a low dimensional space with tSNE. In order to do so, we define a distance function <i>d</i><sub>cg</sub> over C<span style="font-size: x-small;">⋃</span>G as follows:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://1.bp.blogspot.com/--iWSAeixm8Y/Xtp0Jr4L8cI/AAAAAAABJnM/T3KzbhIfY5A4SEl_Fn1hNASHLbFNq18cwCLcBGAsYHQ/s1600/DualEmbeddingMetric.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="159" data-original-width="442" height="143" src="https://1.bp.blogspot.com/--iWSAeixm8Y/Xtp0Jr4L8cI/AAAAAAABJnM/T3KzbhIfY5A4SEl_Fn1hNASHLbFNq18cwCLcBGAsYHQ/s400/DualEmbeddingMetric.jpg" width="400" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
</div>
In the above formula, <i>M</i> is the expression matrix of the scRNA-seq dataset; <i>M</i>[x,y] stands for the expression level of the x-th cell w.r.t. the y-th gene; <i>M</i><sub>max</sub> is the maximal element of <i>M</i>; <i>h</i> is a positive bias factor that additionally adjusts the distances between cells and genes. The last two terms in the formula define the distance between C and G in the sense that a higher expression level between a cell and a gene leads to a lower distance. Additionally, a larger bias factor <i>h</i> leads to a larger separation between cells and genes.<br />
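Since the d<sub>cg</sub> formula above is given as an image, the sketch below reproduces only its stated properties: the within-group blocks use ordinary distances, while the cell-gene block, assumed here to be h + (M<sub>max</sub> − M[x,y]), makes higher expression mean lower distance and a larger h mean larger cell-gene separation.

```python
# Combined distance matrix over cells ∪ genes for dual embedding.
# The exact form of the cell-gene block is an assumption (see lead-in).
import numpy as np
from scipy.spatial.distance import cdist

def dual_distance(M, h=1.0):
    d_cc = cdist(M, M)            # cell-cell distances (rows of M)
    d_gg = cdist(M.T, M.T)        # gene-gene distances (columns of M)
    d_cg = h + (M.max() - M)      # cell-gene block (assumed form)
    return np.vstack([np.hstack([d_cc, d_cg]),
                      np.hstack([d_cg.T, d_gg])])
```

The resulting square matrix over all cells and genes can then be handed to an embedding algorithm that accepts a precomputed distance matrix.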
<br />
As an example, I used a dataset with 12000 cells and 8000 genes from the data used in the publication <a href="https://www.nature.com/articles/s41586-018-0654-5?spm=smpc.content.content.1.154725120011217D5jli">tasic2018</a>. The following image shows a <span style="background-color: white;"><span style="color: lime;">cell</span>-<span style="color: red;">gene</span> </span>map of this dataset. The expression matrix has been log-transformed; the perplexity for the tSNE algorithm is about 1000.<br />
As a feature of embedding maps, we can expect that two closely located genes/cells have similar expression profiles. Additionally, a gene (red dot) should have a high expression level in closely located cells (green dots). As the cells and genes overlap each other significantly in the map, the following image uses a GIF animation to blend the two groups successively into the foreground.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://1.bp.blogspot.com/-f3oML-3J_lM/XtqaRIbxyWI/AAAAAAABJnY/tngJQ0XxreQ25Eh7Jt2dyzfWQG8DUINowCLcBGAsYHQ/s1600/DualEmbedding.gif" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="480" data-original-width="640" height="480" src="https://1.bp.blogspot.com/-f3oML-3J_lM/XtqaRIbxyWI/AAAAAAABJnY/tngJQ0XxreQ25Eh7Jt2dyzfWQG8DUINowCLcBGAsYHQ/s640/DualEmbedding.gif" width="640" /></a></div>
<div style="text-align: center;">
<b>Figure 1</b>: Dual Embedding of <span style="color: #38761d;">cells </span>and <span style="color: red;">genes </span>with tSNE Algorithm. </div>
<div style="text-align: center;">
<br /></div>
We notice here that the cells show a clear cluster structure, whereas the genes are rather less structured, except for a large gene cluster located on the far right side; those genes might be less relevant for the formation of these cells.<br />
<br />
We also notice that, as shown in the following image, there are 4 or 5 cell clusters in the upper right corner whose nearby genes form a kind of sequential path. Those genes might play a special role in regulating those target cell clusters.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://1.bp.blogspot.com/-RQ2Y2lNKAUw/XtqjzC-vYTI/AAAAAAABJno/oi9UAUIYpVQK4mHiu6c-LsOcnqTVGdpyQCLcBGAsYHQ/s1600/DualCellGeneMap.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="223" data-original-width="271" src="https://1.bp.blogspot.com/-RQ2Y2lNKAUw/XtqjzC-vYTI/AAAAAAABJno/oi9UAUIYpVQK4mHiu6c-LsOcnqTVGdpyQCLcBGAsYHQ/s1600/DualCellGeneMap.jpg" /></a></div>
<br />
<h3>
Embedding with Correlation Affinity</h3>
Notice that scRNA-seq datasets are basically contingency tables of counts of cell/gene co-expression events; thus the nature of an expression matrix is more probabilistic than spatial. For this reason, we propose here a <i>correlation </i><i>distance</i> for the embedding algorithm to replace the Euclidean distance.<br />
<br />
More particularly, let M' be the row-normalized version of M as defined by<br />
<br />
<div style="text-align: center;">
<i>M</i>'[x,y] := (<i>M</i>[x,y] - <span style="text-decoration: overline;"><i>M</i>[x, *]</span>)/||<i>M</i>[x,*]||</div>
<br />
where <span style="text-decoration: overline;"><i>M</i>[x, *]</span> and ||<i>M</i>[x,*]|| are the mean value and the norm of the x-th row of <i>M</i> respectively. Similarly, let <i>M</i>" be the column-normalized version of <i>M</i> as defined by:</div>
</div>
<span style="text-align: center;"><br /></span>
<br />
<div style="text-align: center;">
<i>M</i>"[x,y] := (<i>M</i>[x,y] - <span style="text-decoration-line: overline;"><i>M</i>[*,y]</span>)/||<i>M</i>[*,y]||</div>
<br />
After the normalization, the <i>correlation coefficient</i> between two rows or two columns of <i>M</i> is just the scalar product of the corresponding rows or columns in <i>M</i>' or <i>M</i>''. We now define our correlation distance <i>p</i><sub>cg</sub> over C<span style="font-size: x-small;">⋃</span>G as follows:<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://1.bp.blogspot.com/-chgBm9AgL8I/Xt6XM0E4LHI/AAAAAAABJoU/S9tpoV4QNWI25bcVUFa4kA8-oHH3y9EdACLcBGAsYHQ/s1600/CorrelationDistance.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="104" data-original-width="515" height="80" src="https://1.bp.blogspot.com/-chgBm9AgL8I/Xt6XM0E4LHI/AAAAAAABJoU/S9tpoV4QNWI25bcVUFa4kA8-oHH3y9EdACLcBGAsYHQ/s400/CorrelationDistance.jpg" width="400" /></a></div>
The operator ⦁ in the above formula denotes the scalar product of two row or column vectors; <i>h</i> is, as before, a positive bias factor that additionally adjusts the distances between cells and genes. The default value for <i>h</i> is 1.0. Notice also that the matrix <i>M</i> doesn't need to be log-transformed as in the case of the Euclidean distance function, since the normalization scales the numbers into common ranges.<br />
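The normalizations M' and M'' and the dot-product form of the correlation coefficient can be sketched as follows. One reading assumption: the divisor is taken as the norm of the centered row/column, which is the interpretation under which the dot product equals the Pearson correlation.

```python
# Row and column normalization of the expression matrix, so that
# correlations become plain dot products.
import numpy as np

def row_normalize(M):
    """M'[x, y] := (M[x, y] - mean(M[x, *])) / ||centered row x||."""
    centered = M - M.mean(axis=1, keepdims=True)
    return centered / np.linalg.norm(centered, axis=1, keepdims=True)

def col_normalize(M):
    """M''[x, y] := (M[x, y] - mean(M[*, y])) / ||centered column y||."""
    centered = M - M.mean(axis=0, keepdims=True)
    return centered / np.linalg.norm(centered, axis=0, keepdims=True)

# The correlation between two cells (rows) is then a plain dot product:
Mp = row_normalize(np.random.default_rng(4).normal(size=(20, 30)))
corr_01 = Mp[0] @ Mp[1]
```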
<br />
As an example, the following picture shows the dual embedding of the data used in previous example by applying tSNE with the correlation distance <i>p</i><sub>cg</sub>:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://1.bp.blogspot.com/-cT5JkOMC1sk/Xt7zw-LYHlI/AAAAAAABJpI/a0iy5kX2bzglkT0u5kdwuMiKcFkZeLd1ACLcBGAsYHQ/s1600/DualEmbedding2.gif" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="564" data-original-width="604" height="595" src="https://1.bp.blogspot.com/-cT5JkOMC1sk/Xt7zw-LYHlI/AAAAAAABJpI/a0iy5kX2bzglkT0u5kdwuMiKcFkZeLd1ACLcBGAsYHQ/s640/DualEmbedding2.gif" width="640" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div style="text-align: center;">
<b><br /></b></div>
<div style="text-align: center;">
<b>Figure 2</b>: Dual Embedding of cells and genes with tSNE and correlation distance <span style="text-align: left;"> </span><i style="text-align: left;">p</i><sub style="text-align: left;">cg.</sub> </div>
<div style="text-align: left;">
<br /></div>
<div style="text-align: left;">
Compared to Figure 1, we notice that in this dual embedding the genes show a clearer and simpler cluster structure, and there is much less overlap between the clusters, although quite a few clusters intertwine with each other. In general, a dual embedding map has the property that <i>genes are highly expressed by nearby cells</i>. As shown in Figure 2, the highest-density gene clusters are rather distant from the major cell clusters, which indicates that the majority of those genes probably don't participate in the cell regulation.</div>
<h3 style="text-align: left;">
Dual Embedding with Eigengenes</h3>
<div>
Mathematically, eigengenes are the top <i>principal components </i>of the expression matrix, those with the largest eigenvalues. Eigengenes are normally obtained with the SVD or PCA algorithm; they are widely used as a way to reduce the dimensionality of a dataset, and thereby the complexity of calculations. </div>
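As an aside, eigengenes are easy to compute directly; the following is a minimal numpy sketch using SVD (the orientation, rows as cells and columns as genes, and the centering step are my assumptions):

```python
import numpy as np

def eigengenes(M, k):
    """Top-k eigengenes of an expression matrix M (rows = cells,
    columns = genes): the right singular vectors of the centered
    matrix, i.e. the principal components with the largest
    eigenvalues of the gene-gene covariance."""
    C = M - M.mean(axis=0, keepdims=True)      # center each gene
    _, _, Vt = np.linalg.svd(C, full_matrices=False)
    return Vt[:k]                              # shape (k, n_genes)

# Example: 100 cells x 20 genes, reduced to 5 eigengenes.
M = np.random.default_rng(0).random((100, 20))
E = eigengenes(M, 5)
print(E.shape)        # (5, 20)
```

Projecting the centered cells onto these eigengenes then yields the reduced representation of the dataset.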
<div>
<br /></div>
<div>
We can apply the dual embedding method to eigengenes to visualize their relationship with the cell clusters. The following picture shows a dual embedding of the cells of the above example together with their 50 eigengenes, represented in <span style="background-color: white;"><span style="color: magenta;">purple </span></span>color. We notice that the cells in this map form a similar cluster structure to that in Figure 2, whereas the genes are replaced by eigengenes in the neighborhood of the cell clusters. We can clearly see that most cell clusters are regulated by 2 or 3 nearby eigengenes. </div>
<div>
<br /></div>
<div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://1.bp.blogspot.com/-_6OnSSQapIw/Xt_svgJSbfI/AAAAAAABJpU/iSZHcq5eOHgLu6S5AOXJUGYeDFj0ViZqgCLcBGAsYHQ/s1600/CellEigengene.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="464" data-original-width="584" height="507" src="https://1.bp.blogspot.com/-_6OnSSQapIw/Xt_svgJSbfI/AAAAAAABJpU/iSZHcq5eOHgLu6S5AOXJUGYeDFj0ViZqgCLcBGAsYHQ/s640/CellEigengene.jpg" width="640" /></a></div>
<br />
<div>
These visual characteristics indirectly validate the correlation based distance as a proper choice for the dual embedding algorithm.</div>
<div>
<br />
<b>Conclusion</b><br />
<br />
We have introduced a correlation based distance function for cell and gene expression profiles. With the help of an embedding algorithm like tSNE, we can obtain a low dimensional map of both cells and genes. These dual embedding maps show not only the clusters among cells and genes, but also their correlations, through their proximity in the maps.<br />
<br />
As demonstrated with real examples, correlation in the statistical sense seems to be a better metric for visualizing entities of different natures, like the cells and genes in scRNA-seq datasets and, more generally, large contingency tables of counts. The dual embedding provides an intuitive way to visualize and explore such data.<br />
<br />
All simulations and examples in this note are done with the software package <a href="http://visumap.com/">VisuMap </a>and a custom plug-in module named Single Cell Analysis, which implements, in particular, the two distance functions described in this note. The plug-in can be downloaded and installed from VisuMap online.<br />
<br />
<br />
<br />
<b><br /></b></div>
</div>
</div>
James Lihttp://www.blogger.com/profile/17491271097854752129noreply@blogger.com0tag:blogger.com,1999:blog-710383708872670121.post-29830703459659571862019-06-28T14:09:00.000-07:002019-06-28T14:09:03.203-07:00Representation Learning with Deep Neural Network<div dir="ltr" style="text-align: left;" trbidi="on">
<div dir="ltr" style="text-align: left;" trbidi="on">
<b>Representation Learning for Algebraic Relationships</b><br />
<b><br /></b>
<br />
<div dir="ltr" style="text-align: left;" trbidi="on">
This note puts forward a simple framework to learn data representations with deep neural networks. The framework is inspired by mathematical representation theory, where we look for vector representations of elements of algebraic structures like groups.<br />
<br />
To describe the framework, let's assume that we have the following table for an abstract operation:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://4.bp.blogspot.com/-sNwFUPNiwcM/XNBTi8ozuNI/AAAAAAABCIU/ne0BlJpoc_QjzBbZOI98G2mkXZ4YrIs1QCLcBGAs/s1600/TableA.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="127" data-original-width="142" src="https://4.bp.blogspot.com/-sNwFUPNiwcM/XNBTi8ozuNI/AAAAAAABCIU/ne0BlJpoc_QjzBbZOI98G2mkXZ4YrIs1QCLcBGAs/s1600/TableA.jpg" /></a></div>
<center>
Table 1: Binary <i>operation table</i> for abstract items a, b, c and d.</center>
<br />
The above table shows that the abstract entities a, b, c and d fulfill an abstract operation ⊕, for instance <i>a</i>⊕<i>a</i> = 0 and <i>b</i>⊕<i>c</i> = 3. We now try to find representations for a, b, c and d as two dimensional vectors. In order to do so, we start with a default zero-representation as in the following table:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
</div>
<center>
<table>
<tbody>
<tr><td>a:</td><td>(0, 0)</td></tr>
<tr><td>b:</td><td>(0, 0)</td></tr>
<tr><td>c:</td><td>(0, 0)</td></tr>
<tr><td>d:</td><td>(0, 0)</td></tr>
</tbody></table>
</center>
<center>
Table 2: Initial <i>representation table</i> for a, b, c, d as 2 dimensional zero vectors.</center>
</div>
</div>
<br />
That means a, b, c and d are all represented by the vector (0, 0). We then construct a feed-forward neural network as illustrated below:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://1.bp.blogspot.com/-nz0F6W_fYOU/XNMOGitNRLI/AAAAAAABCJU/ngUsR00eOFIduvCGl68ZENjff0YMkfSPACLcBGAs/s1600/NetworkA.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="224" data-original-width="381" height="188" src="https://1.bp.blogspot.com/-nz0F6W_fYOU/XNMOGitNRLI/AAAAAAABCJU/ngUsR00eOFIduvCGl68ZENjff0YMkfSPACLcBGAs/s320/NetworkA.jpg" width="320" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<br />
The network has four input nodes to accept a pair of 2-dimensional vectors, and one output node to approximate the operation ⊕. As in a conventional feed-forward neural network, the input and output nodes get their data from the operation and representation tables before each optimization step. Unlike in a conventional feed-forward network, however, the input nodes are configured as trainable variables which are adjusted during the learning process, and their values are saved back to the representation table after each learning step. More particularly, the learning process goes as follows:<br />
<ol style="text-align: left;">
<li>Randomly select a sample from the operation table, say b⊕c = 3.</li>
<li>Look up the representations for b and c in the representation table and assign them to the input nodes of the network. Look up the expected output for b⊕c in the operation table and pass it to the output node as the learning target.</li>
<li>Run the optimization algorithm, e.g. the ADAM optimization algorithm, for one step, which will change the state of input nodes to new values b'<sub>0</sub>, b'<sub>1</sub>, c'<sub>0</sub> and c'<sub>1</sub>. </li>
<li>Save b'<sub>0</sub>, b'<sub>1</sub>, c'<sub>0</sub> and c'<sub>1</sub> back to the representation table as new representation for b and c.</li>
<li>Continue with step 1.</li>
</ol>
With appropriate choices for the network and training algorithm, the above training process will eventually converge to a representation for a, b, c and d such that the network realizes the abstract operation described in Table 1. A tensorflow based implementation of the above training process is available at
<a href="https://github.com/VisuMap/DeepImputation/blob/master/Z4Group.ipynb">Z4GroupLearning</a>, which can be run on the google cloud computing platform from a browser.<br />
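The linked notebook is tensorflow based; purely to make the push/pull mechanics concrete, here is a self-contained numpy sketch of the same training loop for the ℤ<sub>4</sub> table (the network size, learning rate and step count are my own choices, not those of the notebook):

```python
import numpy as np

rng = np.random.default_rng(0)

# Operation table for the group Z4 (Table 1): x ⊕ y = (x + y) mod 4,
# with the items a, b, c, d encoded as the indices 0..3.
pairs = np.array([(x, y) for x in range(4) for y in range(4)])
targets = np.array([(x + y) % 4 for x, y in pairs], dtype=float)

# Representation table: one trainable 2-D vector per item, zero-initialized.
reps = np.zeros((4, 2))

# Small feed-forward network: 4 inputs (two 2-D representations) -> 16 -> 1.
W1 = 0.5 * rng.standard_normal((4, 16)); b1 = np.zeros(16)
W2 = 0.5 * rng.standard_normal((16, 1)); b2 = np.zeros(1)

lr, losses = 0.05, []
for step in range(5000):
    v = np.hstack([reps[pairs[:, 0]], reps[pairs[:, 1]]])  # input nodes
    h = np.tanh(v @ W1 + b1)
    y = (h @ W2 + b2).ravel()
    err = y - targets
    losses.append(np.mean(err ** 2))
    # Backpropagate through the network AND into the input nodes.
    gy = 2 * err[:, None] / len(targets)
    gW2, gb2 = h.T @ gy, gy.sum(0)
    gh = (gy @ W2.T) * (1 - h ** 2)
    gW1, gb1 = v.T @ gh, gh.sum(0)
    gv = gh @ W1.T                       # gradient w.r.t. the input nodes
    W1 -= lr * gW1; b1 -= lr * gb1
    W2 -= lr * gW2; b2 -= lr * gb2
    # Save the adjusted input-node states back to the representation table.
    np.add.at(reps, pairs[:, 0], -lr * gv[:, :2])
    np.add.at(reps, pairs[:, 1], -lr * gv[:, 2:])

print(f"loss: {losses[0]:.3f} -> {losses[-1]:.4f}")
```

Each step adjusts not only the weights but also the input-node states, and the updated states are written back to the representation table, which is the defining feature of this framework.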
<br />
<b>Representation Learning for Transformations</b><br />
<b><br /></b>
In the previous example, Table 1 actually describes the finite group ℤ<sub>4</sub>, and the representations are linked to the input layer. In general, the framework is not restricted to binary operations, and the representation vectors can be linked to any state vectors, i.e. weights, biases and other trainable variables, of the network.<br />
<br />
As a demonstration, we describe here a network that finds representations for transformations in 3D space. Let's assume that X is a 1000<span style="font-family: "verdana" , sans-serif; font-size: x-small;">x</span>3 table consisting of 1000 points from a sphere in 3-dimensional space. Let {T<sub>0</sub>, T<sub>1</sub>, ...} be a sequence of 3D-to-3D transformations which transform X to {Y<sub>0</sub>, Y<sub>1</sub>, ...}. Assuming that X and {Y<sub>0</sub>, Y<sub>1</sub>, ...} are known, we can then find representations for the transformations {T<sub>0</sub>, T<sub>1</sub>, ...} with a deep neural network as illustrated below:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://3.bp.blogspot.com/-G7tJFMkULb8/XNXvwMC4HKI/AAAAAAABCJo/t9_J-lBwtQUpJeka-67EN-YPEtNzpIdVwCLcBGAs/s1600/RepresentationLearning.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="316" data-original-width="581" src="https://3.bp.blogspot.com/-G7tJFMkULb8/XNXvwMC4HKI/AAAAAAABCJo/t9_J-lBwtQUpJeka-67EN-YPEtNzpIdVwCLcBGAs/s1600/RepresentationLearning.jpg" /></a></div>
The transformations are linked to some internal part of the weight matrix. Before the training process starts, the transformations are initialized as zero-matrices with the same shape as the linked weight matrix. At each training epoch, a 3-tuple (<span style="font-family: "arial" , "helvetica" , sans-serif;">X</span>, T<sub>t</sub>, <span style="font-family: "arial" , "helvetica" , sans-serif;">Y</span><sub>t</sub>) is randomly selected; T<sub>t</sub> is<i> pushed </i>to the linked weight matrix; then the optimization process is run to learn the X-to-Y<sub>t</sub> mapping. After the optimization process has run for several loops over all data points in X and Y<sub>t</sub>, the linked weight matrix is<i> pulled</i> back into the representation T<sub>t</sub> and stored as T'<sub>t</sub> for future epochs. The training process then continues with other randomly selected 3-tuples, until all <span style="font-family: "arial" , "helvetica" , sans-serif;">Y</span><sub>t</sub> have been sufficiently approximated by the network.<br />
<br />
As an example, we have selected 30 linear translations and 30 rotations to transform the sphere in the 3D space. The following video clip shows the transformed images Y<sub>t</sub>:<br />
<br />
<center>
<iframe allowfullscreen="allowfullscreen" frameborder="0" height="136" src="https://www.youtube.com/embed/3jQCc0vX7BM?hd=1&vq=hd720" width="640"></iframe>
</center>
<br />
<br />
After the training process completed, in about 100 epochs, we obtained 60 matrices as representations of the transformations. We then embedded the 60 matrices into 3-dimensional space with the PCA (principal component analysis) algorithm. The following picture shows the embedding of the 60 transformations:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://1.bp.blogspot.com/-1tmjprwOW-Q/XNYD0ZsGgSI/AAAAAAABCJ0/d5gz_HpG85QmNPMC-6G3eqlah9HUZlmxwCLcBGAs/s1600/TransRep.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="127" data-original-width="514" src="https://1.bp.blogspot.com/-1tmjprwOW-Q/XNYD0ZsGgSI/AAAAAAABCJ0/d5gz_HpG85QmNPMC-6G3eqlah9HUZlmxwCLcBGAs/s1600/TransRep.jpg" /></a></div>
<br />
<div>
In the above picture, each dot represents a transformation: a red dot corresponds to a linear translation, a yellow dot to a rotation. We see that the representations generated by the deep neural network correspond very well to the geometric structure of the transformations.<br />
<br />
<b>Representation Learning for Dimensionality Reduction</b><br />
<b><br /></b>Representation learning (RL) as described here can be used to perform non-linear dimensionality reduction in a straightforward way. For example, in order to reduce an <i>m</i>-dimensional dataset to <i>k</i> dimensions, we can simply take a feedforward network with <i>k</i> input nodes and <i>m</i> output nodes. As depicted in the following diagram, the input nodes are configured as learning nodes whose state values are adjusted during the learning process so that the output nodes reproduce the training data.<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://1.bp.blogspot.com/-1fvT38JQF2w/XRV99qZ-SuI/AAAAAAABC7A/9tmmz6nUyLQYeO6LWfyxN8cPgUsImCAwACLcBGAs/s1600/RLDimReduction.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="229" data-original-width="445" src="https://1.bp.blogspot.com/-1fvT38JQF2w/XRV99qZ-SuI/AAAAAAABC7A/9tmmz6nUyLQYeO6LWfyxN8cPgUsImCAwACLcBGAs/s1600/RLDimReduction.jpg" /></a></div>
<br />
<div class="separator" style="clear: both; text-align: center;">
</div>
We notice that autoencoder (AE) networks have often been used in various ways for dimensionality reduction and unsupervised learning. Compared to an AE, RL basically performs just the decoding service. The essential difference, however, is that an AE trains a network, whereas RL directly trains the lower dimensional representation.<br />
<br />
The following maps show an example of mapping a 3-dimensional sphere dataset to the 2-dimensional plane. An implementation with tensorflow on the colab platform can be found <a href="https://github.com/VisuMap/DeepImputation/blob/master/RepDR.ipynb">here</a>.<br />
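Again, the linked notebook uses tensorflow; purely to illustrate the mechanism, here is a minimal numpy sketch of RL-style dimensionality reduction (<i>k</i> = 2 inputs, <i>m</i> = 3 outputs; the network size and learning rates are my assumptions, not those of the notebook):

```python
import numpy as np

rng = np.random.default_rng(0)

# Data: 200 points on a sphere in 3D -- the m = 3 dimensional targets.
n = 200
X = rng.standard_normal((n, 3))
X /= np.linalg.norm(X, axis=1, keepdims=True)

# Trainable k = 2 dimensional representation: one code vector per point.
Z = 0.1 * rng.standard_normal((n, 2))

# Decoding network: 2 -> 32 -> 3.
W1 = 0.5 * rng.standard_normal((2, 32)); b1 = np.zeros(32)
W2 = 0.3 * rng.standard_normal((32, 3)); b2 = np.zeros(3)

lr_w, lr_z = 0.05, 0.5   # codes get a larger step, as their gradients
loss0 = None             # are averaged over the whole batch
for step in range(3000):
    h = np.tanh(Z @ W1 + b1)
    err = h @ W2 + b2 - X
    loss = np.mean(np.sum(err ** 2, axis=1))
    if step == 0:
        loss0 = loss
    g = 2 * err / n
    gh = (g @ W2.T) * (1 - h ** 2)
    gZ = gh @ W1.T                    # gradient w.r.t. the input nodes
    W2 -= lr_w * (h.T @ g); b2 -= lr_w * g.sum(0)
    W1 -= lr_w * (Z.T @ gh); b1 -= lr_w * gh.sum(0)
    Z -= lr_z * gZ                    # train the representation itself

print(f"reconstruction loss: {loss0:.3f} -> {loss:.3f}")
```

After training, Z holds the learned 2-dimensional map of the 3-dimensional points, which is exactly what the diagram above describes: the network only decodes, while the low dimensional representation itself is trained.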
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://1.bp.blogspot.com/-aI08pYFM7ZY/XRYqOxD9QxI/AAAAAAABC7c/B70yXJTWA1MghSOpPtX_0hNEbtOZywE6gCLcBGAs/s1600/RepDRSphere.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="274" data-original-width="722" height="150" src="https://1.bp.blogspot.com/-aI08pYFM7ZY/XRYqOxD9QxI/AAAAAAABC7c/B70yXJTWA1MghSOpPtX_0hNEbtOZywE6gCLcBGAs/s400/RepDRSphere.jpg" width="400" /></a></div>
<br />
We notice that the 2-dimensional map preserves a large part of the <i>local topology</i>, while unfolding and flattening the 3D sphere structure.<br />
<br />
If an encoding service (like that in an AE) is needed, we can extend the above RL network with an encoding network that maps the network output to the lower dimensional data. This encoding network can then be trained independently from the decoding part, or concurrently with it.<br />
<br />
<b>Application for Data Imputation in Epigenomics</b><br />
<b><br /></b>Epigenomic studies show that the different <i>cell types</i> of multi-cellular organisms develop from a common genome by selectively expressing different genes. Many types of <i>assays </i>have been developed to probe those epigenomic manipulations. The result of such an assay is normally a sequence of numerical values profiling certain biochemical properties along the whole genome strand. With a set of assays and a set of cell types we can then obtain a profile matrix as depicted below:<br />
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://3.bp.blogspot.com/-DymfKeUhAZA/XNck0EwF5NI/AAAAAAABCKU/nzvx-r1qlZ0vicjWM420N0IF__iCjKKiwCLcBGAs/s1600/EpigenomicsMatrix.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="288" data-original-width="484" src="https://3.bp.blogspot.com/-DymfKeUhAZA/XNck0EwF5NI/AAAAAAABCKU/nzvx-r1qlZ0vicjWM420N0IF__iCjKKiwCLcBGAs/s1600/EpigenomicsMatrix.jpg" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
Table 3: Profile matrix for cell and assay types.</div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
A large profile matrix covering various cell types can help us to understand the behaviors of those cells. However, it is currently rather expensive to conduct assays in large numbers. To alleviate this limitation, researchers have tried to build models based on available data, so that they can impute profiles for new cell types or assay types (<a href="https://www.nature.com/articles/nbt.3157">ChromImpute</a>, <a href="https://noble.gs.washington.edu/proj/avocado/">AVOCADO</a>, <a href="https://www.nature.com/articles/s41467-018-03635-9">PREDICTD</a>, <a href="https://www.synapse.org/#!Synapse:syn17083203/wiki/587213">Encode Imputation Challenge</a>).</div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
We see that Table 3 has a similar structure to Table 1, thus we can model Table 3 similarly to Table 1. We notice two main differences between the two tables: Firstly, Table 3 is only sparsely defined, as only a limited number of experiments have been done. Secondly, an element in Table 3 is a whole sequence (which may potentially comprise billions of numerical values), whereas an element in Table 1 contains a single value. The training algorithm for Table 1 can obviously handle the data sparsity problem without any change: in order to obtain a representation for each cell and assay type, we only have to make sure that each row and each column contains some defined profile data.</div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
In order to address the high dimensional data issue, we extend our neural network as depicted below:</div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://1.bp.blogspot.com/-4LZV82ledMM/XOv5E9kJIdI/AAAAAAABCf4/KYut-k-ZH2UljOu0iKZD7uQn3FOE1QrIQCLcBGAs/s1600/EpiImputation.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="562" data-original-width="714" height="502" src="https://1.bp.blogspot.com/-4LZV82ledMM/XOv5E9kJIdI/AAAAAAABCf4/KYut-k-ZH2UljOu0iKZD7uQn3FOE1QrIQCLcBGAs/s640/EpiImputation.png" width="640" /></a></div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
Figure 1: Neural network for cell- and assay-type representation learning.</div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
The training algorithm runs roughly as follows:</div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
</div>
<ol style="text-align: left;">
<li>Randomly select a profile from table 3, say a profile P(A, b) for cell-type A and assay type b. </li>
<li>Segment the profile into sections of equal length N; and arrange the sections into a matrix P<sub>A,b</sub> that has L rows and N columns.</li>
<li>Lookup the representation <span style="font-family: "trebuchet ms" , sans-serif;">R</span><sub>A</sub> for cell type A in present set of representations. If no such representation is found, initialize R<sub>A</sub> as a zero-matrix of dimension L<span style="font-family: "courier new" , "courier" , monospace;">x</span>D<sub>c</sub>, where D<sub>c </sub>is a preset constant.</li>
<li>Lookup the representation R<sub>b</sub> for assay type b in present set of representations. If no such representation is found, initialize R<sub>b</sub> as a zero-matrix of dimension L<span style="font-family: "courier new" , "courier" , monospace;">x</span>D<sub>a</sub>, where D<sub>a </sub>is a preset constant.</li>
<li>Randomly select an integer <i>k</i> between 0 and L; <i>Push</i> the <i>k</i>-th rows of <span style="font-family: "trebuchet ms" , sans-serif;">R</span><sub>A</sub> and R<sub>b</sub> into the input layer of the network; Run the optimization process, e.g. with the ADAM optimizer, for one step with the <i>k</i>-th row of P<sub>A,b</sub> as the target label. </li>
<li><i>Pull </i>the state of the input layer back into the <i>k</i>-th rows of <span style="font-family: "trebuchet ms" , sans-serif;">R</span><sub>A</sub> and R<sub>b</sub> respectively. </li>
<li>Repeat steps 5 and 6 a preset number of times. </li>
<li>Select another profile and repeat steps 1 to 7 until the training process converges to a state with small errors.</li>
</ol>
<div>
Notice that the network depicted above consists mainly of dense feed-forward layers. Only the last layer is a<i> transposed convolution layer</i>. This layer significantly extends the output dimension and therefore speeds up the training process for long profile sequences. Another technique to speed up the training, not directly visible in the above diagram, is batched learning: the training process updates the weight matrix and input layer states only after 30 to 40 samples have been fed into the algorithm.<br />
<br /></div>
<div>
<b>Representation Learning versus Model Learning</b><br />
<br /></div>
<div>
Most deep learning frameworks are model learning in the sense that they train models to learn data. Compared to the amount of training data, the size of the trained model is relatively small. RL extends model training to learn vector representations for <i>key entities</i> which are subject to certain algebraic relationships. Those key entities can be elements of mathematical groups, 3D image transformations, or biological cell and assay types. Since RL exploits the modeling power of deep neural networks in the same way as model learning does, methods and techniques developed for model learning (like data batching, normalization and regularization) can be directly used by the RL framework.<br />
<br />
It is interesting to notice that RL incidentally resembles the working mechanism of cellular organisms as revealed by molecular biology. In the latter case, the basic machinery of a cell is the ribosome, which is relatively simple and generic compared to the DNA sequences. The ribosome can be considered as a biological model, and the DNA sequences as representations (i.e. code) for a plethora of organic phenotypes. The learning processes of the two models are, however, different: whereas the ribosome model learns by recombination and optimizing selection, RL learns by gradient back-propagation. The former seems to be more general, the latter more specific and, arguably, more efficient. Nevertheless, I do think that natural organisms might, at some level, host more efficient gradient-directed learning procedures.<br />
<br />
<b>Conclusion</b><br />
<br />
We have presented a representation learning method with deep neural networks. The method basically swaps the data representations with the states of neuron nodes during the training process. Intuitively, those learned representations can be considered as target data back-propagated to designated neuron nodes. With examples we have demonstrated that the RL algorithm can effectively learn abstract data like algebraic entities and transformations, and can perform services like dimensionality reduction and data imputation.</div>
<br />
<br />
<br />
<br />
<b><br /></b>
<br />
<b><br /></b>
<b><br /></b></div>
<br /></div>
James Lihttp://www.blogger.com/profile/17491271097854752129noreply@blogger.com0tag:blogger.com,1999:blog-710383708872670121.post-40385856972860182432017-06-13T13:38:00.000-07:002017-06-13T13:38:25.893-07:00On Strip Overfitting of Multidimensional Regression<div dir="ltr" style="text-align: left;" trbidi="on">
Overfitting is a key characteristic of regression style data modeling with deep neural networks. Whereas considerable effort has been devoted to finding methods to reduce overfitting, little has been done to address the question of how to measure overfitting. In most cases, overfitting is measured by the <i>prediction error</i> with respect to a pre-selected test data set. This method has led to various, more or less successful, overfitting-reducing techniques.<br />
<br />
There are, however, some open questions relating to this rather ad hoc method of measuring overfitting. Firstly, the method requires that the test set does not participate in the training process. But most learning processes check repeatedly against the test data set to find optimal models or meta-parameters. In this way, information from the test data set "leaks" into the training process and compromises the separation between training and test data sets. Secondly, there are no clear guidelines for choosing the test data set. The method implicitly assumes that a randomly selected test data set will reflect the intrinsic statistics of the data model; this assumption can be far from true in practice. Thirdly, many problems encountered in practice have rather limited data; reserving part of the data as a test data set might significantly reduce the size of the training data set, and therefore the accuracy of the regression model.<br />
<br />
In this note I'll present a simple way to quantify overfitting of regression style neural networks. The main idea of the method is to generate test data by <i>linear interpolation</i>, instead of choosing a test data set from the training data. The following diagram illustrates the method:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://2.bp.blogspot.com/-fAmibeg9Z3E/WT_7E_K97gI/AAAAAAAAw3o/bT9vXgt0mLscReYWRDJOJzabBXM_WVqdQCLcB/s1600/StripOverfitting.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="448" data-original-width="372" src="https://2.bp.blogspot.com/-fAmibeg9Z3E/WT_7E_K97gI/AAAAAAAAw3o/bT9vXgt0mLscReYWRDJOJzabBXM_WVqdQCLcB/s1600/StripOverfitting.png" /></a></div>
<br />
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
<b>Strip Overfitting with respect to point A and B.</b></div>
<br />
The above diagram shows a neural network that learns a map from a high dimensional data set to the 2-dimensional plane. After completing the learning process, the network maps the two data points labeled A and B to A' and B', respectively. In order to measure the overfitting, we choose 3 evenly distributed linear interpolation points between A and B, and calculate their targets in the 2-dimensional plane with the learned network. Whereas A, P<sub>1</sub>, P<sub>2</sub>, P<sub>3</sub>, B form a straight line, the points A', P'<sub>1</sub>, P'<sub>2</sub>, P'<sub>3</sub>, B' normally don't form a straight line. We then define the <i>strip overfitting</i> w.r.t. A and B as the size of the strip induced by the points A', P'<sub>1</sub>, P'<sub>2</sub>, P'<sub>3</sub><i>, </i>B', depicted in the above diagram as the shaded area.<br />
<br />
Based on the strip overfitting for a pair of data points, we can then estimate the overfitting for the whole training data set as the average strip overfitting over a fixed number of randomly selected data point pairs.<br />
<br />
It should be noted that the number of interpolation points provides a way to control the scale of the overfitting measure: the more interpolation points we choose, the finer the strip and the smaller the strip overfitting. When the target space is one-dimensional, the strip collapses to a straight line and always has size zero. In order to accommodate this special case, we append an index component to each value in the series A', P'<sub>1</sub>, P'<sub>2</sub>, P'<sub>3</sub><i>, </i>B', so that we get a series of two dimensional vectors (A',0), (P'<sub>1</sub>,1), (P'<sub>2</sub>,2), (P'<sub>3</sub><i>,</i> 3), (B', 4); and we define its strip size as the strip overfitting for the one-dimensional target space.<br />
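The strip size is defined above only pictorially; one plausible concrete reading, assumed in the following numpy sketch, is the shoelace area of the polygon obtained by closing the mapped polyline A', P'<sub>1</sub>, ..., B' back on itself:

```python
import numpy as np

def strip_overfitting(f, A, B, n_interp=3):
    """Strip overfitting of the learned map f w.r.t. data points A and B:
    map n_interp evenly spaced interpolation points (plus the endpoints)
    through f and measure how far the image deviates from a straight
    line, here as the shoelace area of the resulting closed polygon."""
    ts = np.linspace(0.0, 1.0, n_interp + 2)
    pts = np.array([f((1 - t) * A + t * B) for t in ts])
    if pts.shape[1] == 1:          # 1-D target: append the index component
        pts = np.column_stack([pts[:, 0], np.arange(len(pts))])
    x, y = pts[:, 0], pts[:, 1]
    # Shoelace formula for the closed polygon A', P'1, ..., B'.
    return 0.5 * abs(np.dot(x, np.roll(y, -1)) - np.dot(y, np.roll(x, -1)))

# A linear map preserves straight lines, so its strip overfitting is ~0:
W = np.array([[1.0, 2.0], [0.5, -1.0], [3.0, 0.0]])   # 3-D -> 2-D
A, B = np.array([1.0, 0.0, 2.0]), np.array([-1.0, 3.0, 1.0])
print(strip_overfitting(lambda v: v @ W, A, B))       # ~0.0
```

Averaging this quantity over randomly selected data point pairs gives the data set level estimate described above.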
<br />
It should be noted that the strip overfitting also becomes zero if the mapping induced by the network preserves straight lines between the data points. That means linear regression would be considered a kind of ideal regression with zero overfitting. Since most data models in practice are non-linear, zero strip overfitting won't be achieved globally for all data point pairs.<br />
<br />
The interpolation technique also provides a direct way to visualize data generalization. In order to see how a neural network generalizes, we can choose two data points, calculate a series of interpolation points between them, and then plot their trace mapped into the target space by the network. As samples, the following two pictures show the traces of three pairs produced by two differently trained neural networks: both networks have the same structure and are trained with the same algorithm; the only difference is that the network that produced the left picture used the <a href="https://en.wikipedia.org/wiki/Dropout_(neural_networks)">dropout regularization</a> technique to reduce overfitting. We see clearly that the three interpolation traces on the left side are closer to straight lines. Numerical calculation also verifies that the network for the map on the left side has smaller strip overfitting. This example indicates that strip overfitting, as defined in this note, may be suitable for characterizing overfitting phenomena.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://4.bp.blogspot.com/-iyVgazqrBvk/WT9QwCppXAI/AAAAAAAAw3U/Zti0Ay9Jxm4l9ZQXlDB698snbGF-2SJTACLcB/s1600/OverfittingA.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="226" data-original-width="783" height="184" src="https://4.bp.blogspot.com/-iyVgazqrBvk/WT9QwCppXAI/AAAAAAAAw3U/Zti0Ay9Jxm4l9ZQXlDB698snbGF-2SJTACLcB/s640/OverfittingA.jpg" width="640" /></a></div>
<br />
The next two pictures show more examples of interpolation traces that visualize the effect of the dropout regularization technique. These two pictures were produced by the same two networks as before, except that here more data point pairs have been chosen and the training data points are not shown for the sake of clarity. We see that dropout regularization makes the interpolation traces smoother and more homogeneous.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://1.bp.blogspot.com/-Lv0z1d6rqEE/WT9Qx4Omb4I/AAAAAAAAw3Y/4ZRsc8PeTQofhrz0w-5YpQ-WI_gIwbUGACLcB/s1600/OverfittingB.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="272" data-original-width="794" height="218" src="https://1.bp.blogspot.com/-Lv0z1d6rqEE/WT9Qx4Omb4I/AAAAAAAAw3Y/4ZRsc8PeTQofhrz0w-5YpQ-WI_gIwbUGACLcB/s640/OverfittingB.jpg" width="640" /></a></div>
<br />
<br />
<b>Discussion</b><br />
<b><br /></b>
This note has introduced the concept of <i>strip overfitting</i> to quantify overfitting in multidimensional regression models. The method essentially quantifies overfitting by the "distance" between a regression model and the linear interpolation. We have demonstrated with examples that strip overfitting may be suitable for measuring and comparing the effect of regularization techniques such as dropout. Strip overfitting thus offers an interesting alternative tool to quantify and visualize overfitting in neural networks for multidimensional data modeling.<br />
<br />
<br />
<br />
<br />
<br />
<br /></div>
James Lihttp://www.blogger.com/profile/17491271097854752129noreply@blogger.com0tag:blogger.com,1999:blog-710383708872670121.post-54458902224129758732017-06-03T09:34:00.002-07:002017-06-03T09:34:37.151-07:00Deep Line Chart for Big Data<div dir="ltr" style="text-align: left;" trbidi="on">
The line chart is a popular technique for visualizing trends in sequential data. While it is a very useful tool in general, it has an often neglected problem for large data sets: most conventional software libraries draw curves sequentially in a certain order, so that curves drawn early are covered by curves drawn afterwards. This means the appearance of the line chart depends on the drawing order, which in practice is often arbitrary. The following animated picture illustrates a case where yellow curves get covered by blue curves. This drawback becomes especially apparent for large data sets comprising many high-density curves.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://1.bp.blogspot.com/-h9xovRgFwYo/WTG5SEotZjI/AAAAAAAAwz4/4_HU3MTSqFg68ymtaxOAd508gNO-s0BMACLcB/s1600/DrawLineChart.gif" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="471" data-original-width="500" height="301" src="https://1.bp.blogspot.com/-h9xovRgFwYo/WTG5SEotZjI/AAAAAAAAwz4/4_HU3MTSqFg68ymtaxOAd508gNO-s0BMACLcB/s320/DrawLineChart.gif" width="320" /></a></div>
<br />
<br />
In this note I'll present a new line chart method that avoids the above problem. Before describing the new method, let's recall a simplified version of the conventional way to draw a line chart:<br />
<br />
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;"><b> Input</b>: A set of line segments each with 2 ending points and a color.</span><br />
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;"><b> Output</b>: A bitmap P with a color P<sub>xy</sub> at each coordinate (x, y).</span><span style="font-family: "courier new", courier, monospace; font-size: x-small;"> </span><br />
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;"> Initialize P with the background color.</span><br />
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;"> for each line segment L</span><br />
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;"> Let (r, g, b) be the color for L</span><br />
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;"> for each pixel coordinate (x,y) on L</span><br />
<span style="font-size: x-small;"><span style="font-family: "courier new" , "courier" , monospace;"> Set P<sub>xy</sub>:=(r,g,b)</span></span><br />
<br />
The above algorithm is a single-pass drawing method where each pixel color of a line segment is painted directly onto the bitmap. With this method, lines already painted on the bitmap may be painted over by subsequent lines.<br />
<br />
The new method is a two-pass algorithm: it first sums up all pixel colors in an intermediate matrix, then converts the matrix into a bitmap. It works roughly as follows:<br />
<br />
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;"><b> Input</b>: A set of line segments each with 2 ending points and a color.</span><br />
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;"><b> Output</b>: A bitmap P with a color P<sub>xy</sub> at each coordinate (x, y).</span><br />
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;"> Initialize a matrix M with the same dimension as P, but </span><br />
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;"> each component of M holds an integer vector M<sub>xy</sub>=(r,g,b,n).</span><br />
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;"> for each line segment L</span><br />
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;"> Let (r, g, b) be the color for L</span><br />
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;"> for each pixel coordinate (x,y) on L</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;"><span style="font-size: xx-small;"><span style="font-family: "courier new" , "courier" , monospace;"> Set M<sub>xy</sub> := </span></span><span style="font-family: "courier new", courier, monospace; font-size: xx-small;">M</span><sub style="font-family: "courier new", courier, monospace;">xy</sub> + (r, g, b, 1)</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;"> for each coordinate (x,y) in M</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;"> Set </span><span style="font-family: "courier new", courier, monospace; font-size: x-small;">P</span><sub style="font-family: "courier new", courier, monospace;">xy </sub><span style="font-family: "Courier New", Courier, monospace; font-size: x-small;">:= (r/n, g/n, b/n), where (r,g,b,n) is the vector </span><span style="font-family: "Courier New", Courier, monospace; font-size: x-small;">M</span><sub style="font-family: "courier new", courier, monospace;">xy</sub><br />
<br />
Compared to the previous method, the key change in the new method is the use of an intermediate matrix: the matrix sums up the colors from all lines and counts how many times each pixel has been painted. In the second pass, the matrix is converted to a bitmap by taking the "average" color of each pixel. It should be pointed out that, in an actual implementation, the matrix components should be 32-bit integers, so that no overflow occurs when summing up the colors.<br />
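The two-pass algorithm can be sketched in Python as below. This is a minimal CPU reference with a naive rasterizer, not the GPU implementation described later; the function name and segment format are illustrative:

```python
import numpy as np

def draw_deep_linechart(segments, width, height, background=(255, 255, 255)):
    """Two-pass line chart: accumulate per-pixel color sums and hit
    counts in an integer matrix M, then average into a bitmap P."""
    # M holds (r_sum, g_sum, b_sum, n) per pixel; wide integers avoid overflow.
    M = np.zeros((height, width, 4), dtype=np.int64)
    for (x0, y0), (x1, y1), color in segments:
        # Naive DDA-style rasterization of the segment's pixels.
        n = max(abs(x1 - x0), abs(y1 - y0)) + 1
        xs = np.linspace(x0, x1, n).round().astype(int)
        ys = np.linspace(y0, y1, n).round().astype(int)
        for x, y in zip(xs, ys):
            M[y, x, :3] += color
            M[y, x, 3] += 1
    # Second pass: average the accumulated colors.
    P = np.empty((height, width, 3), dtype=np.uint8)
    P[:] = background
    hit = M[..., 3] > 0
    P[hit] = (M[hit, :3] // M[hit, 3:4]).astype(np.uint8)
    return P

# Two crossing segments, yellow and blue: at the shared pixel the colors
# average to gray, independent of the drawing order.
img = draw_deep_linechart([((0, 0), (2, 0), (255, 255, 0)),
                           ((0, 0), (0, 2), (0, 0, 255))], 3, 3)
```

Because the result depends only on the accumulated sums, swapping the order of the two segments produces exactly the same bitmap.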
<br />
The following picture shows a line chart created with the new algorithm from the same data used in the previous picture:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://1.bp.blogspot.com/-aXM_uIokjiE/WTLdS72ZkWI/AAAAAAAAw1I/K3DUV-4kN5Q2mRg4K0udcb4TaxlkM8vRgCLcB/s1600/DeepLineChart.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="474" data-original-width="508" height="298" src="https://1.bp.blogspot.com/-aXM_uIokjiE/WTLdS72ZkWI/AAAAAAAAw1I/K3DUV-4kN5Q2mRg4K0udcb4TaxlkM8vRgCLcB/s320/DeepLineChart.jpg" width="320" /></a></div>
<br />
We see in the above picture that the group of yellow curves is clearly visible. In general, the new line chart will reveal a group of curves as long as the group is statistically significant with respect to the complete data set. The conventional line chart, on the other hand, only shows the curves on the "surface" of the curve bundle.<br />
<br />
The new line chart algorithm has another significant advantage over the conventional one: since the algorithm does not mandate a particular order for updating the matrix and the bitmap, it can be efficiently parallelized on modern GPU hardware to achieve very good performance. The implementation included in VisuMap version 5.0.930 is, on a system with a GTX 960 card (which has ca. 2000 GPU cores), more than two orders of magnitude faster than the conventional line chart.<br />
<br />
<br />
<br />
<br />
<br /></div>
James Lihttp://www.blogger.com/profile/17491271097854752129noreply@blogger.com0tag:blogger.com,1999:blog-710383708872670121.post-23806567551855335702017-04-21T10:24:00.001-07:002017-04-25T08:32:36.817-07:00Deep Data Profile with VisuMap <div dir="ltr" style="text-align: left;" trbidi="on">
Data profiling is normally understood as a set of statistical methods for extracting numerical features from complex systems for easy exploration. For instance, GDP, CPI and various other indices are often used to profile the state of an economy. Appropriate profiling helps us compare similar systems, or a single system in different development phases. In this note I'll put forward a generic method to profile high dimensional data; the method combines dimensionality reduction algorithms with deep artificial neural networks.<br />
<br />
In recent years, many so-called <i><a href="https://en.wikipedia.org/wiki/Nonlinear_dimensionality_reduction">nonlinear dimensionality reduction</a></i> (NDR) methods have been developed to visualize high dimensional complex data. Those methods often use machine learning algorithms to produce 2D or 3D maps, which provide a kind of graphical profile of the data. For instance, the following picture shows a 2D map made from a data set from a flow cytometry study:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://2.bp.blogspot.com/-QQXQMMjSdMw/WPeN6r7nDXI/AAAAAAAAwaY/Mt376iSvv8opkYXYWmPKHBicHbUy02yGACLcB/s1600/MapA.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="148" src="https://2.bp.blogspot.com/-QQXQMMjSdMw/WPeN6r7nDXI/AAAAAAAAwaY/Mt376iSvv8opkYXYWmPKHBicHbUy02yGACLcB/s640/MapA.jpg" width="640" /></a></div>
<br />
The above map was made with the <a href="https://en.wikipedia.org/wiki/T-distributed_stochastic_neighbor_embedding">tSNE algorithm</a> from a dataset that contains about 6000 data points, each with 12 variables; the colors for the sub-clusters were added with the help of the <a href="https://en.wikipedia.org/wiki/Affinity_propagation">affinity propagation clustering algorithm</a>. The colored map is quite helpful for discerning interesting structure within the data set. Unlike statistics based profiling, visual map based profiling does not rely on high level features like GDP, which in general require a good understanding of the data and the underlying system.<br />
<br />
Nevertheless, these visual map based methods lack prediction capability in the following sense: in practice, we often have multiple data sets collected from similar systems, or from the same system in different phases. For instance, in clinical trials our data sets might be gene expression profiles of patients at different trial stages. In these cases, we are especially interested in the differences between the profiles. Most NDR methods are, however, <i>insensitive </i>to small changes, so that it is hard to recognize differences between NDR maps made from similar data sets.<br />
<br />
To address the above problem, we propose here the <i>deep data profiling </i>(DDP) procedure as an extension of NDR based profiling, as illustrated in the following diagram:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://4.bp.blogspot.com/-ADHyrJsEI8M/WPe3-h4pJtI/AAAAAAAAwaw/pmMmsn924hQBl1v3H5ylS3dAulkOmS3QgCLcB/s1600/MapB.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="433" src="https://4.bp.blogspot.com/-ADHyrJsEI8M/WPe3-h4pJtI/AAAAAAAAwaw/pmMmsn924hQBl1v3H5ylS3dAulkOmS3QgCLcB/s640/MapB.jpg" width="640" /></a></div>
<div style="text-align: center;">
<b>Deep Data Profiling</b> </div>
<br />
The DDP procedure starts with a set of similar data sets. As a first step, we choose a reference data set and apply NDR to it to obtain a 2D or 3D map. Many NDR methods are available for this purpose; for this note we recommend the tSNE algorithm, as it is widely available and produces relatively well-structured maps for a wide range of data. Then we apply a clustering algorithm to the map to produce labels for the sub-clusters. Those labels are rendered as different colors for the sub-clusters, so that we get a colored map as illustrated in the picture above. Many clustering algorithms are suitable for this purpose; for this test we used the <i>affinity propagation </i>algorithm together with some manual adjustment performed directly on the map.<br />
<br />
The colored map we obtained from the reference data set represents the knowledge we captured from the reference data. As a next step, we use machine learning algorithms to learn this knowledge. In particular, we employ <i>multilayer feed-forward networks </i>to learn the translation from the reference data to the colored map. Two networks are used for this purpose: one to learn the dimensionality reduction function, and the other to learn the clustering function.<br />
<br />
The two trained networks can then be applied to other data sets to produce colored maps as their visual profiles.<br />
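The two-network setup can be sketched as follows. This is an illustrative stand-in using scikit-learn's MLP models rather than the TensorFlow networks used by the plugin; the synthetic data, network sizes and labels are all assumptions:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor, MLPClassifier

rng = np.random.RandomState(0)
# Stand-ins for the reference data set, its 2D NDR map, and cluster labels.
X_ref = rng.rand(500, 12)                        # 12-dimensional data
Y_map = X_ref[:, :2] + 0.05 * rng.randn(500, 2)  # pretend NDR output
labels = (X_ref[:, 0] > 0.5).astype(int)         # pretend cluster labels

# Network 1 learns the dimensionality reduction function;
# network 2 learns the clustering (coloring) function.
net_map = MLPRegressor(hidden_layer_sizes=(64, 32), max_iter=2000,
                       random_state=0).fit(X_ref, Y_map)
net_cluster = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=2000,
                            random_state=0).fit(X_ref, labels)

# Apply both trained networks to a new, similar data set to obtain
# its colored map, i.e. its visual profile.
X_new = rng.rand(200, 12)
profile_xy = net_map.predict(X_new)          # 2D map positions
profile_color = net_cluster.predict(X_new)   # cluster colors
```

The point of training on the reference map is that small differences between data sets now show up as shifted positions or changed colors in the predicted profiles.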
<br />
A plugin module, called Data Modeling, has been implemented to integrate the DDP service with VisuMap version 5.0.928. The plugin internally uses the <a href="https://www.tensorflow.org/">TensorFlow</a> engine from Google Inc. to implement the feed-forward neural networks. The plugin offers a GUI and a scripting interface to train and apply models with data directly from VisuMap. The following screen-cast demonstrates a simple use case of DDP with VisuMap:<br />
<br />
<iframe allowfullscreen="" frameborder="0" height="480" src="https://www.youtube.com/embed/bMG5Yy_YfcI" width="854"></iframe>
<br />
<br />
The deep data profiling procedure presented here offers a visual profiling technique for high dimensional complex data. Compared to conventional statistics based profiling techniques, DDP is more intuitive and richer in information. As a meta-algorithm, DDP can take advantage of new algorithms for NDR, clustering and machine learning. With the rapid development of machine learning technologies, DDP might offer a powerful and versatile tool to explore high dimensional complex systems.<br />
<br />
<br /></div>
James Lihttp://www.blogger.com/profile/17491271097854752129noreply@blogger.com0tag:blogger.com,1999:blog-710383708872670121.post-46172346919308348752017-02-06T09:15:00.000-08:002017-02-06T09:15:53.965-08:00On the shape of nucleotide motifs<div dir="ltr" style="text-align: left;" trbidi="on">
Genomes of multi-cellular organisms often contain <i>short sequences</i> with many duplicates, which have been created and preserved with minor mutations during their evolutionary processes. Those nucleotide sequences, generically called <i>motifs</i> here, carry important information about the structure and history of those genomes. A better understanding of those motifs could help us explore the function and evolutionary path of the genetic code.<br />
<br />
In this note I'll put forward a method to visualize motifs as 3D dotted maps, where each motif is represented by a dot in 3D space; sub-groups of motifs may take different shapes, which I'll try to interpret in terms of evolutionary variations.<br />
<br />
I put forward some methods to visualize DNA sequences a <a href="http://jamesxli.blogspot.ca/2014/09/on-geometric-modeling-of-sequential-data.html">while ago</a>. In those early attempts, different <i>scanning machines</i> were used to extract high dimensional vectors from DNA sequences, which were then mapped into 3D space with a <i>multidimensional scaling</i> algorithm. A main problem of those methods is that they are often not very sensitive to sequential variations, so the resulting maps lack the capability to reveal structural information. The method presented here basically follows the same framework, except that we don't use a scanning machine but feed the motif sequences as data points directly into the multidimensional scaling algorithm, which uses the <i>Needleman-Wunsch </i>algorithm to calculate the <i>edit distance</i> between the motifs.<br />
<br />
In the following, I'll use chromosome I of Caenorhabditis elegans as an example to detail the steps of the visualization process.<br />
<br />
First, I downloaded the sequence of chromosome I of C. elegans from the NCBI site (<a href="https://www.ncbi.nlm.nih.gov/nuccore/NC_003279.8">NC_003279</a>). I then created a BLAST database for the chromosome with the <span style="font-family: "courier new" , "courier" , monospace;">makeblastdb</span> program, and used the <span style="font-family: "courier new" , "courier" , monospace;">blastn</span> program to query all sub-sequences of length 30 nucleotide bases. A sub-sequence is recorded as a motif if it has more than 30 duplicates on either strand of the chromosome, where fuzzy matching allowing one or two mutations or deletions/insertions is employed. In this way 4578 motifs were found in chromosome I, which has a length of about 15 million nucleotide bases. The following picture shows the distribution of these motifs on the chromosome:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://4.bp.blogspot.com/-kMnxKpvLKfs/WIpp-rvhnrI/AAAAAAAAvGA/s521v9UqfV4K4wcbGj8YzY4a41WNvpXLACLcB/s1600/ChrIMotifs.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="433" src="https://4.bp.blogspot.com/-kMnxKpvLKfs/WIpp-rvhnrI/AAAAAAAAvGA/s521v9UqfV4K4wcbGj8YzY4a41WNvpXLACLcB/s640/ChrIMotifs.png" width="640" /></a></div>
<br />
In the above picture, nucleotide types A, C, G, T are shown in yellow, magenta, red and cyan respectively; the locations of motifs are marked by blue bars. We notice that many motifs overlap each other. Those overlapping motifs are merged to form 4578 disjoint sequences of different lengths, as shown in the following table:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://4.bp.blogspot.com/-B90JgXWapzM/WIqQT_YAHlI/AAAAAAAAvGc/nh9h9DUfPkEFGcnP77O4F4Son-UumeO0wCLcB/s1600/CElegansMotifDataset.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="240" src="https://4.bp.blogspot.com/-B90JgXWapzM/WIqQT_YAHlI/AAAAAAAAvGc/nh9h9DUfPkEFGcnP77O4F4Son-UumeO0wCLcB/s400/CElegansMotifDataset.png" width="400" /></a></div>
<br />
<br />
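As a much-simplified illustration of the motif search step (exact k-mer counting on one strand only; the actual pipeline uses blastn with fuzzy matching on both strands), one could write:

```python
from collections import Counter

def find_exact_motifs(sequence, k=30, min_duplicates=30):
    """Simplified motif search: count every length-k window of the
    sequence and keep the k-mers occurring at least min_duplicates
    times. Exact matching only; no reverse strand, no fuzzy matching."""
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    return {kmer: n for kmer, n in counts.items() if n >= min_duplicates}

# Tiny example with k=3 and a threshold of 2 duplicates:
motifs = find_exact_motifs("ACGACGACG", k=3, min_duplicates=2)
```

On a real chromosome this exact-matching variant would miss motifs whose copies differ by a mutation or indel, which is why fuzzy matching is needed in practice.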
I then applied the <a href="https://en.wikipedia.org/wiki/T-distributed_stochastic_neighbor_embedding">t-SNE</a> algorithm to these 4578 motifs to create a 3D map, using the <a href="https://en.wikipedia.org/wiki/Needleman%E2%80%93Wunsch_algorithm">Needleman-Wunsch</a> algorithm to calculate the minimal edit distance between the motif sequences. The Needleman-Wunsch algorithm uses a simple penalty schema of (2, 0, 1) for gap, match and mismatch respectively. The following picture shows the resulting 3D dotted map:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://3.bp.blogspot.com/-BBQxCJXaN6Y/WIqRxN0R37I/AAAAAAAAvGo/5zX3jw7jNTgmqblnWN0IDuRWoAtgmueLACLcB/s1600/tSNEMotifs.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="326" src="https://3.bp.blogspot.com/-BBQxCJXaN6Y/WIqRxN0R37I/AAAAAAAAvGo/5zX3jw7jNTgmqblnWN0IDuRWoAtgmueLACLcB/s400/tSNEMotifs.png" width="400" /></a></div>
<br />
<br />
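The edit distance used above can be sketched as follows: a minimal Python implementation of Needleman-Wunsch with the (2, 0, 1) penalty schema, where lower scores mean more similar sequences (a real pipeline would use an optimized implementation):

```python
def needleman_wunsch(a, b, gap=2, match=0, mismatch=1):
    """Global-alignment edit distance with the penalty schema from the
    note: gap = 2, match = 0, mismatch = 1."""
    n, m = len(a), len(b)
    # dp[i][j] = minimal penalty to align a[:i] with b[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * gap
    for j in range(1, m + 1):
        dp[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            dp[i][j] = min(dp[i - 1][j - 1] + sub,   # (mis)match
                           dp[i - 1][j] + gap,       # gap in b
                           dp[i][j - 1] + gap)       # gap in a
    return dp[n][m]
```

For example, `needleman_wunsch("ACGT", "ACCT")` gives 1 (one mismatch), while `needleman_wunsch("ACGT", "ACG")` gives 2 (one gap). The pairwise distance matrix over all motifs is what the MDS/t-SNE step consumes.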
Then a clustering algorithm (a <a href="http://jamesxli.blogspot.ca/2012/03/on-mean-shift-and-k-means-clustering.html">combination of the k-means and mean-shift algorithms</a>) was applied to the 3D map, coloring the major clusters with different colors. With some minor manual color adjustment, we obtain a colored 3D dotted map as shown in the following animation:<br />
<br />
<center>
<iframe allowfullscreen="allowfullscreen" frameborder="0" height="480" src="https://www.youtube.com/embed/uJC9v2fHoyk?hd=1&vq=hd720" width="545"></iframe>
</center>
<br />
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
From the above map we can roughly identify clusters of the following shapes:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://3.bp.blogspot.com/-0y0WzpPNg1U/WJPVJc1GPFI/AAAAAAAAvdQ/0mECa0FjpmEdtzSmERZyWF-cWrxZ8VeDgCLcB/s1600/ShapeList.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="168" src="https://3.bp.blogspot.com/-0y0WzpPNg1U/WJPVJc1GPFI/AAAAAAAAvdQ/0mECa0FjpmEdtzSmERZyWF-cWrxZ8VeDgCLcB/s640/ShapeList.png" width="640" /></a></div>
<br />
Assuming that all motifs are created by duplication and random single nucleotide mutations, we can interpret the above shapes as follows. Motifs that form small blobs originated from a common ancestor and have undergone only minor mutations: either they are recent duplicates, so they haven't had time to accumulate a large number of mutations; or, if they have existed for a long time, their heavily mutated copies did not survive the evolutionary selection process.<br />
<br />
Motifs that form curves indicate that those motifs, originating from a common ancestor, have gradually mutated away from their initial copy. With the help of the VisuMap software we can locate those motifs in the chromosome and trace their evolutionary process along the curve (though we cannot say which end of the curve is the starting end). The following pictures show a case with a curved motif collection:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://3.bp.blogspot.com/-ZhLERs6o4mo/WJTacCciSQI/AAAAAAAAvds/-33abVbxbsA1kOIUTQ6rRtIhKl8PtHXwQCLcB/s1600/CurveCluster.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="400" src="https://3.bp.blogspot.com/-ZhLERs6o4mo/WJTacCciSQI/AAAAAAAAvds/-33abVbxbsA1kOIUTQ6rRtIhKl8PtHXwQCLcB/s400/CurveCluster.png" width="306" /></a></div>
In the above figure, the top picture shows a section of the 3D motif map; the second picture shows the corresponding locations of those motifs on the chromosome. With VisuMap you can interactively move the selection of motifs in the window displaying the first picture, and the second picture will mark the locations of the selected motifs. The last picture shows the nucleotide sequences of these motifs, one motif per row, in order along the curve from left to right. We can see that these motifs gradually shifted their phases.<br />
<br />
We notice that some curved motif groups have small blobs attached to them. These blobs might indicate short bursts of different mutations.<br />
<br />
Interestingly, we see two motif clusters in the 3D map that form two roughly parallel rings. Those motifs are distributed across the whole chromosome; they may have some internal relationship, but I'm not sure at the moment how to interpret them in terms of the evolutionary process.<br />
<br />
In addition to C. elegans, I also applied the visualization algorithm to other organisms. The following short video clip shows several motif maps of the rice and human genomes:<br />
<br />
<center>
<iframe allowfullscreen="allowfullscreen" frameborder="0" height="480" src="https://www.youtube.com/embed/YNM8S1Um1Rk?hd=1&vq=hd480" width="545"></iframe>
</center>
<br />
It should be noted that for the human chromosomes, calculating the motif map with the current implementation for all motifs would be computationally prohibitive. So I used a simplified version of blastn and selected three small sub-sets, each with about 20 000 motifs, for the above motif maps. The three sub-sets have increasing duplicate frequencies: for instance, the motifs of the third set all have over 10 000 duplicates, whereas the first set has motifs with 300 to 500 duplicates.<br />
<br />
From the above maps we can see that the motifs of the rice genome are much less structured: almost all of them are just small blobs. Those motifs are, unlike in the case of C. elegans, mostly located in coding areas. On the other hand, the motif map of the human genome shows diverse structures depending on the duplicate frequencies; the third motif map even shows some kind of 2-dimensional structure.<br />
<br />
<b>Discussion</b><br />
<b><br /></b>
The visualization method presented here provides a simple way to extract geometrical models from purely discrete genetic sequence information. These geometrical shapes provide intuitive information about the evolutionary path of the genetic code, and they might be helpful for profiling and exploring whole genome structures.<br />
<br />
The method presented here has three key components: the way motifs are found, the way the edit distance is calculated, and the multidimensional scaling (MDS) algorithm. Each of these components has a significant impact on the results, and each also poses significant limitations. For instance, the current implementations of these components are quite slow, so it is difficult to visualize large motif sets. I have experimented with other edit distances, like the Smith-Waterman and Levenshtein methods; they seem to perform similarly, with Needleman-Wunsch producing just slightly better visualizations. For future investigation we might work on finding more efficient MDS algorithms or more efficient distance metrics, to help us attack much larger and different groups of genetic code.<br />
<br />
<br />
<br /></div>
James Lihttp://www.blogger.com/profile/17491271097854752129noreply@blogger.com0tag:blogger.com,1999:blog-710383708872670121.post-44347464678224752552016-06-21T13:33:00.000-07:002016-06-21T13:33:46.162-07:00VisuMap version 4.7 with Whole Genome Browser Released.<div dir="ltr" style="text-align: left;" trbidi="on">
I have just released VisuMap version 4.7 on our <a href="http://www.visumap.com/index.aspx?p=Products">web site</a>. Apart from many GPU related optimizations for 3D data visualization, this release includes a heat-map style data view to visualize long discrete sequences. A whole genome with billions of nucleotides can be aggregated and displayed in a single 2D map. The following screen snapshot shows, for example, the exome distribution of the first human chromosome:<br />
<br /><br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://4.bp.blogspot.com/-NeGxWFJrm3c/V2mhTPJVDyI/AAAAAAAArvk/s_V1H85293Ma6XzePTfar0rM-U_DMJBHQCLcB/s1600/Chromsome1Human.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="176" src="https://4.bp.blogspot.com/-NeGxWFJrm3c/V2mhTPJVDyI/AAAAAAAArvk/s_V1H85293Ma6XzePTfar0rM-U_DMJBHQCLcB/s640/Chromsome1Human.png" width="640" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
We notice that most genome browsers currently available are depth oriented: their exploration is normally focused on relatively small sections of the genome. The whole genome browser in VisuMap offers a new way to help people explore genome-wide patterns and profiles. </div>
</div>
James Lihttp://www.blogger.com/profile/17491271097854752129noreply@blogger.com0tag:blogger.com,1999:blog-710383708872670121.post-72118496684975408302015-12-09T11:09:00.002-08:002015-12-10T13:21:33.539-08:00Visualizing Gene Sequences with complementary GMS<div dir="ltr" style="text-align: left;" trbidi="on">
We now know that there are about 20 000 protein coding genes in the whole human genome. How do they relate to each other? Is there any geometrical structure among them? What shape do they form? Aiming at these questions, I'll put forward here a method to generate scatter plots from gene sequences that show the shape of a gene collection.<br />
<br />
In recent blogs, I have described the <a href="http://jamesxli.blogspot.ca/2015/09/curvature-as-secondary-profiling-of-gms.html">GMS framework</a> for profiling gene sequences, which translates a discrete sequence into a high dimensional curvature vector. We could use this method to convert a collection of gene sequences to a set of high dimensional vectors and then apply an MDS algorithm to embed them in a low dimensional space. However, if we restrict our focus to protein coding sequences (CDS), the length of a CDS varies from a few hundred to tens of thousands of nucleotides. That means the profiling curvature vectors for CDS sequences are of very different dimensions, which in general causes difficulties for most MDS algorithms, as they normally accept only vectors of a common dimension.<br />
<br />
In order to obtain profiling vectors of the same length from sequences of variable length, we extend our GMS framework as follows: we first select a short <i>reference sequence</i>; then, for each CDS sequence, we mix the reference sequence with the CDS sequence and pass them together to the GMS framework. After the GMS framework has embedded the two sequences into the low dimensional space, we calculate the curvature profile of the curve of the reference sequence and use it as the profiling vector for the CDS sequence. The modified framework is depicted as follows:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://4.bp.blogspot.com/--uCNtI46Rn4/VlkBN2dlumI/AAAAAAAAns0/CTVt5D0X7bw/s1600/cGMSProfiling.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"></a><a href="http://4.bp.blogspot.com/-82KZThee1bE/Vmd9GK9-Q2I/AAAAAAAAnuc/gdjU01NKLF0/s1600/cGMSProfilingB.png" imageanchor="1"><img border="0" height="472" src="http://4.bp.blogspot.com/-82KZThee1bE/Vmd9GK9-Q2I/AAAAAAAAnuc/gdjU01NKLF0/s640/cGMSProfilingB.png" width="640" /></a></div>
We notice that the reference sequence and the CDS sequence are embedded together in the same space, so that their curves occupy <i>complementary areas </i>of the low dimensional space. That means the curvature profile of the reference sequence carries information about the <i>complementary </i>space of the CDS sequence. More generally, we can expect that different CDS sequences will produce different <i>complementary profiles</i>, as they have different impacts on the reference sequence.<br />
<br />
For this note I downloaded the CDS sequences of the first chromosome of the human genome from the <a href="http://ensembl.org/">Ensembl.org</a> web site; there are about 2000 protein coding genes whose lengths vary from a few hundred to more than 26 thousand nucleotides. As the reference sequence, I picked a short CDS sequence from the collection (the gene ENSG00000187170) that has only 300 nucleotides. The following picture shows the complementary curvature profiles of the CDS sequences: each row of the heatmap shows the curvature of the reference sequence mixed with one CDS sequence; bright colors indicate high curvatures. The line chart under the heatmap shows the complementary profiles of 4 selected CDS sequences marked in the heatmap.<br />
<a href="http://4.bp.blogspot.com/-Ae9MVfBpbTA/Vmd5sx7jE-I/AAAAAAAAnuQ/OSvHkGhyXfk/s1600/cGMSProfilingA.png" imageanchor="1"><img border="0" src="http://4.bp.blogspot.com/-Ae9MVfBpbTA/Vmd5sx7jE-I/AAAAAAAAnuQ/OSvHkGhyXfk/s1600/cGMSProfilingA.png" /></a><br />
<br />
We see that these profiles by and large have high curvatures in about half a dozen regions, but the height and exact location of those high curvature peaks depend on the corresponding CDS sequence. It is important to notice that I demonstrated in the previous blog that the curvature profile is a characteristic that is invariant to the random nature of the mapping algorithm. Thus, those small variations are not due to the randomness of the mapping algorithm; they ought to be attributed to different interactions (or "affinities") between the CDS sequences and the reference sequence.<br />
<br />
Having obtained one complementary curvature vector for each CDS sequence, we can then apply an MDS algorithm to these high dimensional vectors to embed them in a low dimensional space. For this purpose, I have used the <a href="https://en.wikipedia.org/wiki/T-distributed_stochastic_neighbor_embedding">tSNE </a>algorithm. The following picture shows a 3D map of these CDS sequences:<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://youtu.be/h-UEB6ZgJhc"><img border="0" src="http://4.bp.blogspot.com/-Akx05NPZaFw/VmhlmUJwQ4I/AAAAAAAAnvU/i7VF4ipkbHA/s1600/cGMSProfilingC.png" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;"><a href="https://youtu.be/h-UEB6ZgJhc">Curvature Profile Map of 2045 CDS sequences created with tSNE algorithm</a></td></tr>
</tbody></table>
The above map is linked to a short video; you can click on the map to see the 3D map in animation. Also, for easy exploration, I have colored the different regions on the map with the k-means clustering algorithm.<br />
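The embedding-and-coloring step can be sketched with off-the-shelf tools. The snippet below is a hypothetical stand-in using scikit-learn's TSNE and KMeans on random placeholder data rather than the actual curvature profiles (the original maps were created with VisuMap, and the matrix sizes here are reduced for illustration):

```python
import numpy as np
from sklearn.manifold import TSNE
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.random((100, 50))        # placeholder: one curvature profile per row

# embed the profiles in 3D with t-SNE, then color regions with k-means
emb = TSNE(n_components=3, perplexity=20, random_state=0).fit_transform(X)
labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(emb)
```

With the real data, `X` would hold the 2045 complementary curvature vectors, and `labels` would supply the region colors shown on the map.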
<br />
It should be noted that the selection of the reference sequence has an essential impact on the curvature profile and the final map. To see this, I picked another sequence, the CDS sequence of the gene with id ENSG00000158481 (with about 1000 nucleotides), as reference sequence and repeated the whole procedure. The following picture shows the corresponding complementary curvature profile and the tSNE map of the 2045 CDS sequences:<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://youtu.be/AdDONb7Kmvw" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="640" src="http://4.bp.blogspot.com/-IC6jzsIQmr0/Vmhpl92ys7I/AAAAAAAAnvg/oHgUilmxWlE/s640/cGMSProfilingD.png" width="606" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;"><a href="https://youtu.be/AdDONb7Kmvw">Curvature Profile and tSNE map of the 2045 CDS sequences with ENSG00000158481 as reference sequence</a></td></tr>
</tbody></table>
We see that both the curvature profile and the final map are very different from those created previously with a different reference sequence.<br />
<br />
It is not yet clear how the reference sequence impacts the complementary curvatures. Intuitively, the reference sequence plays a role like a projection plane that we often use to project high dimensional data to 2-dimensional space for visualization purposes. Thus, selecting an appropriate reference sequence might help us to see interesting geometrical structures among large collections of gene sequences.<br />
<br />
<b>About the sample data</b><br />
<br />
The sample genome data was downloaded from the <a href="http://ensembl.org/">Ensembl.org</a> web site in FASTA format. The FASTA data file is imported into VisuMap with a special plugin. The zipped file <a href="http://www.visumap.com/Resources/Datasets/CGmsProfiling.zip">here </a>contains the VisuMap sample file, the plugin for importing FASTA files and a script "CurvatureProfiling.js" to run the simulations. Before running the simulation, the plugin module GeneticAnalysis.dll needs to be installed into VisuMap.<br />
<br />
<br /></div>
James Lihttp://www.blogger.com/profile/17491271097854752129noreply@blogger.com0tag:blogger.com,1999:blog-710383708872670121.post-10647547835726789742015-09-19T11:14:00.000-07:002015-10-06T07:51:57.560-07:00Curvature as secondary profiling of GMS maps.<div dir="ltr" style="text-align: left;" trbidi="on">
As a profiling method, <a href="http://jamesxli.blogspot.ca/2015/05/gms-for-dna-profiling.html">GMS translates </a>discrete sequences to space curves, which help us to explore sequence features (traits) with the help of geometrical shapes. Yet, space curves still pose a challenge for direct pattern recognition. This is especially so when we try to explore patterns among large collections of sequences.<br />
<br />
In this note I'll put forward a method that uses curvature to capture characteristics of GMS space curves. Some experiments with sample data will show that curvature profiles are suitable for pattern analysis of large collections of GMS space curves.<br />
<br />
<b>Curvature of space curves</b><br />
<b></b><br />
Curvature is a concept introduced in mathematics to measure how far a segment of a geometrical shape deviates from being <i>flat. </i>Intuitively, for space curves, the curvature of a curve at a point measures how strongly the curve is bent at that point.<br />
<br />
More formally, let P<sub>1</sub>, P<sub>2</sub>, ..., be a series of points in 3-dimensional space representing a space curve; then, for the purpose of this note, we use the following simple formula to estimate the curvature, <span style="font-size: large;"><i>κ</i></span><sub>k</sub>, at point P<sub>k</sub>:<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://3.bp.blogspot.com/-nP0OB2qK5BU/Vd-L2Lkn_pI/AAAAAAAAm7g/qjSL7k89qoc/s1600/CurvatureDef2.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://3.bp.blogspot.com/-nP0OB2qK5BU/Vd-L2Lkn_pI/AAAAAAAAm7g/qjSL7k89qoc/s1600/CurvatureDef2.png" /></a></div>
As shown in above diagram, <i><span style="font-family: "Courier New",Courier,monospace;">α</span></i> is the angle between the two segments P<sub>k</sub>P<sub>k+1</sub> and P<sub>k+1</sub>P<sub>k+2</sub>; <i>s</i> is the length of the segment P<sub>k</sub>P<sub>k+2</sub>. If P<sub>k</sub> are the 3-dimensional coordinate vectors of the points, then the above formula becomes:<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://4.bp.blogspot.com/-2CSUiAGgso0/Vd-PKdZgUoI/AAAAAAAAm7o/OWIDTNOj5a0/s1600/CurvatureDef3.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://4.bp.blogspot.com/-2CSUiAGgso0/Vd-PKdZgUoI/AAAAAAAAm7o/OWIDTNOj5a0/s1600/CurvatureDef3.png" /> </a></div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: left;">
Thus, the curvature profile of a space curve is a series of real values, <span style="font-size: large;"><i>κ</i></span><sub>k</sub>, that measure how strongly the curve is bent at each point. In practice, when we calculate the curvature profile of a GMS curve, we don't need to calculate the curvature at every point, but just at points sparsely and evenly sampled from the GMS curve. For the sake of simplicity, we simply select one point for each input sequence node. That means, for a sequence with <i>n</i> nodes, we calculate an <i>n</i>-dimensional curvature profile. The following diagram shows the whole work-flow to calculate the curvature profile from a discrete sequence:</div>
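The curvature estimate described above can be sketched in a few lines of NumPy. This is a hypothetical implementation (the function name is mine, and the exact constant factor in the note's formula image is assumed to be 2α/s, which approximates 1/r on a circle of radius r):

```python
import numpy as np

def curvature_profile(P):
    """Estimate the discrete curvature at each interior point of a polyline.

    P is an (n, 3) array of 3D points; returns n-2 values using the
    formula kappa_k = 2*alpha/s, where alpha is the angle between the
    segments P_k P_{k+1} and P_{k+1} P_{k+2}, and s = |P_k P_{k+2}|.
    """
    P = np.asarray(P, dtype=float)
    u = P[1:-1] - P[:-2]                      # segment P_k -> P_{k+1}
    v = P[2:] - P[1:-1]                       # segment P_{k+1} -> P_{k+2}
    cos_a = np.einsum('ij,ij->i', u, v) / (
        np.linalg.norm(u, axis=1) * np.linalg.norm(v, axis=1))
    alpha = np.arccos(np.clip(cos_a, -1.0, 1.0))   # turning angle
    s = np.linalg.norm(P[2:] - P[:-2], axis=1)     # chord length
    return 2.0 * alpha / s
```

On points sampled from a circle of radius r, this estimate converges to the true curvature 1/r; on a straight line it is exactly zero.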
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="http://1.bp.blogspot.com/-lOz9BsnT3s8/VeSbBfxykCI/AAAAAAAAm8I/huzVDyNZUrQ/s1600/GmsCurvatureProfile.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="510" src="http://1.bp.blogspot.com/-lOz9BsnT3s8/VeSbBfxykCI/AAAAAAAAm8I/huzVDyNZUrQ/s640/GmsCurvatureProfile.png" width="640" /></a></div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
The above work-flow is mainly an extension of the one in a <a href="http://jamesxli.blogspot.ca/2015/05/gms-for-dna-profiling.html">previous note</a>, where a more detailed explanation is provided. The only modification to the work-flow is an extra step that calculates the curvature profile from the GMS curve of the first clone. As will be discussed later in this note, it turned out in the simulation experiments that the clones normally have about the same curvature profile.</div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
<b>Characteristics of curvature profile</b></div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
The curvature profile has several properties that make it suitable for characterizing GMS space curves. First, just from its definition we know that the curvature profile is rotation and shift invariant, so rotation and shift information are automatically removed from the curvature profile. Because of this, running the GMS algorithm multiple times on a sequence will result in the same curvature profile, even when the corresponding space curves are rotated and shifted differently due to some randomness within the GMS algorithm.</div>
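This invariance is easy to verify numerically. The sketch below (my own check, not VisuMap code) applies a random rigid motion to a random space curve and confirms that the discrete curvature profile, computed as turning angle over chord length, is unchanged:

```python
import numpy as np

def kappa(P):
    # discrete curvature: 2 * (turning angle) / (chord length P_k P_{k+2})
    u, v = P[1:-1] - P[:-2], P[2:] - P[1:-1]
    cos_a = np.einsum('ij,ij->i', u, v) / (
        np.linalg.norm(u, axis=1) * np.linalg.norm(v, axis=1))
    return (2.0 * np.arccos(np.clip(cos_a, -1.0, 1.0))
            / np.linalg.norm(P[2:] - P[:-2], axis=1))

rng = np.random.default_rng(0)
P = rng.normal(size=(50, 3)).cumsum(axis=0)     # a random space curve

Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))    # random orthogonal matrix
P2 = P @ Q.T + rng.normal(size=3)               # rotated and shifted copy

assert np.allclose(kappa(P), kappa(P2))
```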
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
Going one step further, we can change the number of clones in the GMS algorithm. The resulting space curves for the clones normally vary systematically in shape and size; yet their corresponding curvature profiles are more or less similar to each other. The following diagram shows the curvature profiles of the 5 clones of a sequence with 5-fold cloning:</div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="http://2.bp.blogspot.com/-0KNYIjWEMkI/VenbonaunKI/AAAAAAAAm9U/aOySjYJGB0M/s1600/Curvature5Fold.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://2.bp.blogspot.com/-0KNYIjWEMkI/VenbonaunKI/AAAAAAAAm9U/aOySjYJGB0M/s1600/Curvature5Fold.jpg" /></a></div>
<div class="separator" style="clear: both; text-align: left;">
We notice in the above picture that the curvature profiles of the middle, or inner, clones have noticeably larger peaks than those of the outer clones; and there is symmetry along the middle (the third) profile. These curvature profiles show that all 5 space curves bend at more or less common positions, and the inner curves bend slightly more than the outer curves. On the right side of the picture, the 5 curvature profiles are displayed as a heatmap. As will be seen below, heatmaps offer a better way to visualize large collections of curvature profiles.</div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
Another interesting property of curvature profiles is that they are in general independent of the sampling frequency of the GMS algorithm. Higher-frequency sampling normally results in smoother space curves, but their shapes remain mostly unchanged. The following picture shows the curvature profiles of another sequence with sampling frequencies of 6, 8, 10, 12 and 14 scans-per-node. The 5 curvature profiles are almost identical.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://3.bp.blogspot.com/-qCA-JWwSzSI/Veniy9PJWXI/AAAAAAAAm9k/mabsmDdQYaM/s1600/CurvatureSampling.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://3.bp.blogspot.com/-qCA-JWwSzSI/Veniy9PJWXI/AAAAAAAAm9k/mabsmDdQYaM/s1600/CurvatureSampling.jpg" /></a></div>
<br />
<br />
<b>Profiling Transmutation</b><br />
<br />
In addition to these invariance characteristics, the curvature profiles of similar sequences are in general similar. This property enables us to study systematic variation among large collections of sequences by visualizing their corresponding curvature profiles. We demonstrate this method here with an example that simulates a transmutation process. As shown in the following diagram, transmutation is the process by which one sequence gradually mutates into another sequence.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://2.bp.blogspot.com/-fgl61hBCvgc/VfcIrDYsGmI/AAAAAAAAnPA/C2k-KhwilRg/s1600/Transmutation.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://2.bp.blogspot.com/-fgl61hBCvgc/VfcIrDYsGmI/AAAAAAAAnPA/C2k-KhwilRg/s1600/Transmutation.png" /></a></div>
For the sake of simplicity we assume that both sequences, A and B, have the same length N. A simple transmutation process is one in which an increasingly longer sub-sequence of A replaces the corresponding sub-section of B. More precisely, the simple transmutation is realized by the sequences T<sub><i>k</i></sub>(A, B) := A[1, <i>k</i>]*B[<i>k</i>+1, N], where <i>k=</i>1, 2, ..., N; A[1,<i>k</i>] is the sequence of the first <i>k</i> nodes of A; B[<i>k</i>+1, N] is the sequence of the last N-<i>k </i>nodes of B; and * is the concatenation operation.<br />
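The transmutation operator is straightforward to express in code. A minimal sketch (the function name is mine), working on any indexable sequence:

```python
def transmute(A, B, k):
    """T_k(A, B) := A[1, k] * B[k+1, N] -- the first k nodes of A
    concatenated with the last N-k nodes of B, where len(A) == len(B) == N.
    """
    assert len(A) == len(B)
    return A[:k] + B[k:]

# As k runs from 0 to N, T_k gradually turns B into A:
steps = [transmute("ACGT", "TTTT", k) for k in range(5)]
```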
<br />
With the help of the GMS algorithm we can calculate the space curves of T<sub><i>k</i></sub>, <i>k</i>=1,2,..., N, and then calculate their curvature profiles as N-dimensional vectors. For this example I have used two sequences of 1377 nodes (one of them is the CCDS of CD4). The following picture shows their 1377 curvature profiles displayed as a single heat map (the three curve charts on the right side are profiles of the transmutation sequence at 3 particular times):<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://1.bp.blogspot.com/-MAfqvwh81b0/VfcNMzUWn7I/AAAAAAAAnPM/EqlpXXV_p7w/s1600/Transmutation2.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="319" src="http://1.bp.blogspot.com/-MAfqvwh81b0/VfcNMzUWn7I/AAAAAAAAnPM/EqlpXXV_p7w/s640/Transmutation2.jpg" width="640" /></a></div>
In the above heat map we can notice a <i>variation region</i> that runs from the top left corner to the bottom right corner. This feature indicates that during the transmutation process, only the part of the space curve that is close to the mutated node undergoes significant change. In other words, features of the space curve can be traced back to individual nodes in the transmutation sequence.<br />
<br />
As the curvature profiles are vectors of the same dimension, we can use an MDS (multidimensional scaling) algorithm to map them to a low dimensional space, so that we can see their shape. The following picture is a 2-dimensional map generated with the<a href="https://en.wikipedia.org/wiki/T-distributed_stochastic_neighbor_embedding"> t-SNE</a> algorithm for the curvature profiles in the above example:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://4.bp.blogspot.com/-7G4kO-fnsuI/VfcVIkAKjuI/AAAAAAAAnPc/nny_emigrNs/s1600/Transmutation3.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="335" src="http://4.bp.blogspot.com/-7G4kO-fnsuI/VfcVIkAKjuI/AAAAAAAAnPc/nny_emigrNs/s400/Transmutation3.png" width="400" /></a></div>
<br />
In the above map, each dot represents the curvature profile of a transmutation sequence. As a property of MDS maps, closely located dots normally represent transmutations with similar effects on curvature. Thus, this map can help us locate sequence regions that cause large changes in the space curve.<br />
<br />
<b>Discussion </b> <br />
<br />
The curvature profile proposed in this note is a secondary characteristic of discrete sequences; it can help us to study patterns and structures among large collections of discrete sequences. Whereas the GMS algorithm provides a means to geometricalize individual sequences, the curvature profile offers a tool to geometricalize a set of sequences as a whole. The various invariance properties and visualization experiments discussed in this note seem to validate the curvature profile as a suitable characteristic for discrete sequences.<br />
<br />
During the search for such characteristics I experimented with several other quantities, such as quantities derived from the rotation speed, torsion and velocity. The curvature profile as proposed above seems to have the best properties among all the tested quantities. It should also be pointed out that the curve curvature proposed in this note differs slightly from the <a href="https://en.wikipedia.org/wiki/Curvature">standard definition</a>. The reason for this discrepancy is that the standard mathematical definition is unstable when the sampling points are located in tiny regions with noise. For evenly distributed curve sampling points, the proposed curvature formula approximates the standard definition pretty well.<br />
<br />
In general, the curvature profile together with an MDS algorithm could provide an interesting tool to explore similarities and patterns among sequential structures and shapes encountered in microbiology.<br />
<br />
<br /></div>
James Lihttp://www.blogger.com/profile/17491271097854752129noreply@blogger.com0tag:blogger.com,1999:blog-710383708872670121.post-87675271343520507142015-05-25T16:56:00.000-07:002015-05-25T16:56:27.453-07:00GMS for DNA profiling<div dir="ltr" style="text-align: left;" trbidi="on">
In this note I am going to describe a GMS based algorithm to convert DNA sequences to geometrical shapes with visually identifiable features. I'll apply this algorithm to real genetic sequences to demonstrate its profiling capability.<br />
<br />
The main steps of the <i>DNA profiling algorithm</i> are illustrated as follows:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://2.bp.blogspot.com/-PN3Xw4TLeMk/VWNqMDAwbUI/AAAAAAAAk2o/T2mbGoEs97Y/s1600/GmsProfiling.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="518" src="http://2.bp.blogspot.com/-PN3Xw4TLeMk/VWNqMDAwbUI/AAAAAAAAk2o/T2mbGoEs97Y/s640/GmsProfiling.png" width="640" /></a></div>
<br />
As shown in the above diagram, a single strand of a duplex nucleotide sequence is taken as the input to the algorithm. The first step of the algorithm makes three identical copies of the sequence, which are then scanned in parallel by three identical <i>GMS scanning machines</i>, producing a set of high dimensional vectors. As described in a <a href="http://jamesxli.blogspot.ca/2015/05/gms-with-multiple-scanning-and.html">previous note</a>, the scanning machine works like the <a href="http://en.wikipedia.org/wiki/Ribosome">ribosomal machinery</a>: instead of proteins, it produces high dimensional vectors. As indicated in the diagram, a scanning machine in our algorithm is configured by three parameters: the scanning size K, the moving step size <i>r</i>, and the affinity decay speed λ.<br />
<br />
Then, as the third step, the <a href="http://jamesxli.blogspot.ca/2014/09/on-affinity-embedding.html"><i>affinity embedding algorithm</i></a> is applied to the high dimensional vectors to produce a 3D dotted plot. The resulting map will usually contain three clusters corresponding to the three duplicated sequences, and the middle cluster is usually pressed into a quasi-2-dimensional disk. So, as the last step, the middle slice of the 3D map is extracted, rotated and displayed as a 2D map.<br />
<br />
In general, to qualify as a DNA profiling method, a method should ideally satisfy the following requirements:<br />
<ol style="text-align: left;">
<li>The same sequence or similar sequence should result in similar maps.</li>
<li>Significant changes in a sequence should lead to noticeable changes in the resulting maps. </li>
<li>The resulting maps should have structures that can be identified by visual examination.</li>
<li>Be able to associate phenotype traits with geometrical patterns on the result maps.</li>
</ol>
As a first example I applied the above algorithm to the <a href="http://www.ncbi.nlm.nih.gov/nuccore/KM233097.1">VP24 gene of Zaire ebola virus</a>, which consists of 1633 base pairs. The following pictures show 2 maps created by running the algorithm twice with different random initializations:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://3.bp.blogspot.com/-9Z5-GfbtPXA/VWN8e8PqlWI/AAAAAAAAk24/3Nb1_ImjhRk/s1600/EbolaVP24.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="288" src="http://3.bp.blogspot.com/-9Z5-GfbtPXA/VWN8e8PqlWI/AAAAAAAAk24/3Nb1_ImjhRk/s640/EbolaVP24.jpg" width="640" /></a></div>
<br />
We can see that the above two pictures are very similar in terms of the topological features of the curves. The following picture shows two maps of the <a href="http://www.ncbi.nlm.nih.gov/CCDS/CcdsBrowse.cgi?REQUEST=GV&DATA=12464&BUILDS=ALLBUILDS">BRCA1 gene</a>, which contains 4875 base pairs. Again, these two maps are topologically quite similar up to fine details.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://2.bp.blogspot.com/-xHN-cRkXImg/VWODv8KRZyI/AAAAAAAAk3I/bAENlOISNSo/s1600/Brac1Gene.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="278" src="http://2.bp.blogspot.com/-xHN-cRkXImg/VWODv8KRZyI/AAAAAAAAk3I/bAENlOISNSo/s640/Brac1Gene.jpg" width="640" /></a></div>
<br />
As a next example we consider how the GMS map changes when we delete, duplicate, or invert a segment of the nucleotide sequence. For this example the exons of the <a href="http://www.ncbi.nlm.nih.gov/CCDS/CcdsBrowse.cgi?REQUEST=CCDS&DATA=CCDS8562">gene CD4</a> have been chosen as input. This sequence has 1377 base pairs. I randomly selected a segment of 70 base pairs as a reference segment for deletion, duplication and inversion. The following pictures show the GMS maps of the original sequence and of the sequence under deletion, duplication and inversion:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://4.bp.blogspot.com/-RxL0X0-mlrc/VWOTI0H2AjI/AAAAAAAAk3Y/B7v-6Kr1GGU/s1600/CD4Mutations.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="523" src="http://4.bp.blogspot.com/-RxL0X0-mlrc/VWOTI0H2AjI/AAAAAAAAk3Y/B7v-6Kr1GGU/s640/CD4Mutations.jpg" width="640" /></a></div>
In the above picture, the highlighted regions correspond to the reference segment under alteration. We can clearly see how these three types of alterations manifest themselves in the GMS maps.<br />
<br />
The above examples seem to indicate that our algorithm satisfies, more or less, the first 3 requirements listed above, whereas the last requirement remains open for future study. Since a geometrical model can capture a much larger amount of information than conventional statistics/correlations, one might hope that some interesting phenotype traits will manifest themselves in those models in a yet-to-be-found way.<br />
<br />
<br /></div>
James Lihttp://www.blogger.com/profile/17491271097854752129noreply@blogger.com0tag:blogger.com,1999:blog-710383708872670121.post-89471391583105592442015-05-12T14:59:00.000-07:002015-05-12T15:03:49.789-07:00GMS with multiple scanning and aggregated affinitity<div dir="ltr" style="text-align: left;" trbidi="on">
As stated in the title, this note puts forward two variations of the GMS model. Both variations aim to create better visualizations of discrete sequences.<br />
<br />
For the first variation: we have seen in a previous note that the <a href="http://jamesxli.blogspot.ca/2015/02/on-loopy-gms-geometrical-modeling-of.html">loopy GMS</a> can produce simpler geometrical shapes when the scanning machine runs multiple rounds over a loop sequence. The reason for the simplification is likely the <i>competition for space</i> between the duplicated scanning vectors. This kind of competition can easily be extended to the GMS model for serial (non-loopy) sequences by cloning the sequence and the scanning machine, as illustrated in the following diagram:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://1.bp.blogspot.com/-XJok6r7JujI/VVJBHpK49vI/AAAAAAAAkzw/k_2p5EdSKk8/s1600/GMSWithCloning.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://1.bp.blogspot.com/-XJok6r7JujI/VVJBHpK49vI/AAAAAAAAkzw/k_2p5EdSKk8/s1600/GMSWithCloning.jpg" /></a></div>
<br />
<br />
So, with multiple scanning, GMS first clones the sequence into multiple identical copies, then scans each sequence as before to produce scanning vectors with time-stamps. In addition to the time-stamp, an index component <i>p </i>that identifies the clone sequence is added to the scanning vectors. This index <i>p</i> is used like the time-stamp to reduce the affinities between scanning vectors from different clone sequences. More precisely, the decay factor for the affinity between two scanning vectors produced by the p-th and p'-th clones at times t and t' is changed (see the initial specification of the <a href="http://jamesxli.blogspot.ca/2014/11/geometric-modeling-of-sequential-data.html#AffinityFunction">affinity function</a>) from<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://2.bp.blogspot.com/-piBX7ZN0ke0/VVJDh928spI/AAAAAAAAkz8/zYHxdSW1dY8/s1600/DecayFactor1.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://2.bp.blogspot.com/-piBX7ZN0ke0/VVJDh928spI/AAAAAAAAkz8/zYHxdSW1dY8/s1600/DecayFactor1.jpg" /></a></div>
to
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://3.bp.blogspot.com/-LU4omWR7_14/VVJDrzwcD4I/AAAAAAAAk0E/MenWB52q5bc/s1600/DecayFactor2.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://3.bp.blogspot.com/-LU4omWR7_14/VVJDrzwcD4I/AAAAAAAAk0E/MenWB52q5bc/s1600/DecayFactor2.jpg" /></a></div>
For the second variation, we notice that the scanning vectors proposed in the initial <a href="http://jamesxli.blogspot.ca/2014/11/geometric-modeling-of-sequential-data.html">note</a> are colored vectors. That means that when calculating the affinity between two scanning vectors, only components with matching colors contribute non-zero affinity to the total affinity. So, as discussed in the <a href="http://jamesxli.blogspot.ca/2014/09/on-geometric-modeling-of-sequential-data.html">initial note</a>, a K-dimensional colored vector is mathematically a K×s-dimensional spatial vector, where K is the scanning size and s is the number of colors. Because of such sparsity, the affinity between two scanning vectors is normally very small, and often too small to carry meaningful information over to the GMS maps.<br />
<br />
In order to increase the affinity between scanning vectors, we aggregate the K-dimensional colored vector into an <i>s</i>-dimensional vector by adding all components of the same color to form a new uncolored vector component. In particular, as depicted in the following diagram, let C = { 1, 2, ..., s } be the set of colors; <i>s<sub>1</sub></i>, <i>s<sub>2</sub></i>,..., <i>s<sub>k</sub></i> ∈ C the colors of the <i>k</i> nodes in the scanning machine; and <i>α<sub>1</sub>,α<sub>2</sub>, ..., α<sub>k</sub></i> the corresponding amplitudes for the nodes. Then the scanning vector V is an <i>s-</i>dimensional vector (<i>v</i><sub>1</sub>, <i>v</i><sub>2</sub>,..., <i>v<sub>s</sub></i>) with: <br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://3.bp.blogspot.com/-6WqRP_Ew2jE/VVJjF6kevfI/AAAAAAAAk0k/J9dQByCjp6s/s1600/AggregatedAffEq.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://3.bp.blogspot.com/-6WqRP_Ew2jE/VVJjF6kevfI/AAAAAAAAk0k/J9dQByCjp6s/s1600/AggregatedAffEq.jpg" /></a></div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://4.bp.blogspot.com/-O964WUBLUZ0/VVJd-a-4W4I/AAAAAAAAk0Y/E1-HuhKZWOg/s1600/AggregatedAffinity.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://4.bp.blogspot.com/-O964WUBLUZ0/VVJd-a-4W4I/AAAAAAAAk0Y/E1-HuhKZWOg/s1600/AggregatedAffinity.jpg" /></a></div>
<br />
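The aggregation rule can be written directly from the definition. A minimal sketch (the function name and the 1..s integer color encoding are my assumptions):

```python
def aggregate(colors, amps, s):
    """Collapse a K-dimensional colored scanning vector into an
    s-dimensional uncolored vector: v_j sums the amplitudes alpha_i
    of all nodes whose color s_i equals j (colors run from 1 to s).
    """
    v = [0.0] * s
    for c, a in zip(colors, amps):
        v[c - 1] += a
    return v
```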
The two variations discussed here have been implemented in VisuMap v4.2.912 with a new affinity metric named Sequence2 Affinity. The following short video shows some GMS maps created with the new affinity metric for short sequences with 40 to 60 nodes:<br />
<br />
<center>
<iframe allowfullscreen="allowfullscreen" frameborder="0" height="360" src="http://www.youtube.com/embed/6TUCyYC-gNY?hd=1&vq=hd720" width="460"></iframe></center>
</div>
James Lihttp://www.blogger.com/profile/17491271097854752129noreply@blogger.com0tag:blogger.com,1999:blog-710383708872670121.post-55923352818138480522015-03-24T13:25:00.001-07:002015-03-24T13:27:07.841-07:00On the origin of the helix structure from the view point of GMS<div dir="ltr" style="text-align: left;" trbidi="on">
The helix structure is prevalent in the biological world. It can be found at small scales, like the folding of DNA and protein sequences, and at large scales, like plants.<br />
<br />
<center>
<a href="http://upload.wikimedia.org/wikipedia/commons/d/db/DNA_orbit_animated_static_thumb.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://upload.wikimedia.org/wikipedia/commons/d/db/DNA_orbit_animated_static_thumb.png" height="320" width="172" /></a>
<a href="http://upload.wikimedia.org/wikipedia/commons/thumb/3/3d/DirkvdM_natural_spiral.jpg/360px-DirkvdM_natural_spiral.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://upload.wikimedia.org/wikipedia/commons/thumb/3/3d/DirkvdM_natural_spiral.jpg/360px-DirkvdM_natural_spiral.jpg" height="320" width="240" /></a>
</center>
<br />
Whereas the mathematical description of the helix structure is clear, the mechanism that gives rise to such structures is not so obvious. Do all those structures share a common mechanism? Why don't they show up in the inorganic world? This note tries to demonstrate with GMS samples that the helix structure comes about through some simple dynamics rooted in the discrete sequence structure.<br />
<br />
Recall that GMS produces a sequence of high dimensional vectors from a discrete sequence, and the high dimensional vectors are embedded into low dimensional space according to their <i>affinity</i>, defined by the following <i>affinity function</i>:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://1.bp.blogspot.com/-Fm0qLwpEubA/VF0JL2IRFgI/AAAAAAAAaVs/gYGceAQgBzI/s1600/GmsAffinity.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://1.bp.blogspot.com/-Fm0qLwpEubA/VF0JL2IRFgI/AAAAAAAAaVs/gYGceAQgBzI/s1600/GmsAffinity.jpg" height="145" width="400" /></a></div>
<br />
where t > t' are the timestamps of two vectors produced at times t and t'. We notice that the affinity function consists of two parts: <br />
<center>
<img border="0" src="http://1.bp.blogspot.com/-OQbRhB9Bk_M/VRCKQLg05NI/AAAAAAAAjwg/2FJgs6ipuX4/s1600/GMSAffinityA.png" />
and
<img border="0" src="http://4.bp.blogspot.com/-7jTxTCsbVyg/VRCKTEaxojI/AAAAAAAAjwo/q6TbA258vDI/s1600/GMSAffinityB.png" />
</center>
<br />
<br />
The first part depends only on the timestamps: it reduces the affinity calculated by the second part exponentially, depending on the time separating the two vectors. It is the second part that accounts for variations in the sequence, s<sub><i>i</i></sub>.<br />
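As a hedged sketch, the time-dependent part alone can be written as follows (the function name is mine, and the form exp(-λ·Δt) is my reading of the exponential decay described above; the sequence-dependent second part is taken to be the constant 1.0, as in the modified experiment of this note):

```python
import math

def affinity(t, t_prime, lam):
    """Time-decay factor of the affinity between two scanning vectors
    with timestamps t and t', at decay speed lambda; the sequence
    dependent part is replaced by the constant 1.0 here."""
    return math.exp(-lam * abs(t - t_prime)) * 1.0
```

With this stripped-down affinity, the embedding sees only the temporal ordering of the scanning vectors, which is exactly the setting of the helix experiment.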
<br />
In order to see the effect of decay in the time dimension, I modified the <a href="http://jamesxli.blogspot.ca/2015/02/on-loopy-gms-geometrical-modeling-of.html">loopy GMS algorithm</a> so that the affinity function contains only the first part, while the second part (the sequence-dependent part) is set to the constant 1.0. I ran the algorithm with the following parameters: n=25000; K=10; the sequence is the constant sequence of 100 copies of the letter 'A'; the decay speed λ is successively set to 0.125, 0.25, 0.5 and 1.0. The following screen cast shows the corresponding GMS maps in 3D space:<br />
<center>
<iframe allowfullscreen="allowfullscreen" frameborder="0" height="360" src="http://www.youtube.com/embed/mLnXhGDDMHE?hd=1&vq=hd720" width="640"></iframe>
</center>
<br />
These maps clearly show the helix-like structure, and the winding number goes up as the decay speed λ goes higher. <br />
<br />
To verify that it is the exponential decay that leads to the helix-like structure, I replaced the decay part of the affinity function with three different "decaying" functions: Δt<sup>-2</sup>, Δt<sup>-1</sup> and Δt<sup>-0.5</sup>, with Δt := t-t'. The following picture shows the corresponding GMS maps: <br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://3.bp.blogspot.com/-J2hy9fs8Dig/VRGtEWb7szI/AAAAAAAAjw4/YJlEooYFYD8/s1600/HelixNo.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://3.bp.blogspot.com/-J2hy9fs8Dig/VRGtEWb7szI/AAAAAAAAjw4/YJlEooYFYD8/s1600/HelixNo.jpg" /></a></div>
<br />
We can see clearly that these decay functions result in structures that are totally different from the helix structure. Thus, these simulations indicate that the exponential decay of affinity plays a significant role in forming the helix structure.<br />
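A quick numerical check (a sketch of mine, not part of the simulations above) highlights how the tested power laws differ from the exponential: only the exponential loses the same fraction of affinity per unit time regardless of how much time has already elapsed.

```python
def step_ratios(decay, dts):
    """Fraction of affinity kept across one additional time unit,
    evaluated at several elapsed times dt."""
    return [decay(dt + 1.0) / decay(dt) for dt in dts]

exp_decay = lambda dt: 0.5 ** dt       # exponential decay (half-life 1)
power_2   = lambda dt: dt ** -2.0      # the three alternatives tested above
power_1   = lambda dt: dt ** -1.0
power_05  = lambda dt: dt ** -0.5

dts = [1.0, 2.0, 4.0, 8.0]
# the exponential keeps exactly half per step at every elapsed time,
# while the power laws keep an ever larger fraction as dt grows
```

This scale-free, constant per-step ratio is one property that distinguishes the exponential from the power laws; whether it is the property responsible for the uniform winding is left open here.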
<br />
We notice that if the scanning size K is sufficiently large and the sequence is random, the affinity contribution of the second (sequence-dependent) part will, more or less, become constant. Thus, the helix structure may serve as a model for completely random sequences. From this point of view, we might call the helix structure the no-information base model. Additional information in a sequence should manifest itself as a discrepancy between its GMS map and the helix structure.<br />
<br />
Seeing some π symbols in a formula, the great physicist Richard Feynman once asked: where is the circle? In terms of modern genetics, his remark basically assumes that π is the "gene" for the circle as a phenotypic feature. Many such analytical "genes" are carried forward and spread around in various fields; they manifest in different forms, but always keep their intrinsic nature unchanged. Here, this note has tried to demonstrate that the "gene" for the helix structure is the exponential decay of affinity.<br />
<br />
<br />
<br />
<br /></div>
James Li<br />On Loopy GMS (Geometrical Modeling of Sequences) (2015-02-02)<div dir="ltr" style="text-align: left;" trbidi="on">
This note is going to apply the <a href="http://jamesxli.blogspot.ca/2014/11/geometric-modeling-of-sequential-data.html">GMS model</a> to loop-structured sequences. The adaptation from the previous serial structure to the loopy structure is straightforward: the only change is, as illustrated in the following diagram, that the two ends of the serial sequence are now connected to each other.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://1.bp.blogspot.com/-yduECphfRRU/VMl4KGhbrCI/AAAAAAAAihc/JA2xaZVApXs/s1600/LoopyGMS.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://1.bp.blogspot.com/-yduECphfRRU/VMl4KGhbrCI/AAAAAAAAihc/JA2xaZVApXs/s1600/LoopyGMS.png" height="207" width="400" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<br />
When applying the GMS model, the scanning machine (or the <i>scanner</i>) runs over the loopy sequence and produces a series of high dimensional vectors which are augmented with timestamps. Those high dimensional vectors are then embedded into low dimensional space with the <a href="http://jamesxli.blogspot.ca/2014/09/on-affinity-embedding.html">affinity embedding</a> algorithm.<br />
<br />
The first question arising in this new scenario is whether the GMS model will produce loop-like geometrical shapes when the scanner runs a whole loop back to its initial location. The answer to this question is yes, but under two conditions. First, the decay speed parameter λ used to calculate the affinity between vectors must be zero, so that the effect of the timestamps is nullified. Otherwise, if λ is not zero, different vectors will be produced when the scanner comes back to its initial position.<br />
<br />
Secondly, the total number of nodes, L, in the loop must be a multiple of the dimension of the output vector (without the timestamp component), i.e. the parameter K denoted in the <a href="http://jamesxli.blogspot.ca/2014/11/geometric-modeling-of-sequential-data.html">previous note</a>. This requirement is necessary because of the <i>circular shifting</i> of the scanned vectors. If L is not a multiple of K, after a whole loop the scanner will produce the same set of values, but circularly shifted into a different order, so that they will normally differ as vectors in the high dimensional space.<br />
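The multiplicity condition can be checked with a small sketch. Modeling the scanner output as just the window of K node letters, circularly shifted by one position per node entered (amplitudes and timestamps are ignored here for simplicity), the vector produced after a full loop matches the initial one exactly when L is a multiple of K, and normally differs otherwise:

```python
def shifted_window(seq, p, K):
    """Letters of the K nodes in the scanner after it has advanced p nodes
    around the loop, with the circular shift incremented once per node."""
    L = len(seq)
    w = [seq[(p + i) % L] for i in range(K)]
    s = p % K                      # shifting length after p node entries
    return w[s:] + w[:s]

loop24 = "CCCTGTGGAGCCGGAGCCACAAGT"   # L = 24, a multiple of K = 6
loop25 = loop24 + "G"                 # L = 25, not a multiple of 6

# after one full loop (p = L) the scanner is back at its starting position,
# but the shifted vector repeats only when L % K == 0
```

The two test sequences here are the ones from Figure 1 below: the 24-node loop closes, while the loop extended by one extra 'G' does not.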
<br />
The following picture shows the resulting maps produced by GMS for a short loopy sequence in the different cases discussed above.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://2.bp.blogspot.com/-ZtSdqudwJpY/VMpjCoY0YZI/AAAAAAAAihs/zOdUKayQkSk/s1600/LoopGMS2.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://2.bp.blogspot.com/-ZtSdqudwJpY/VMpjCoY0YZI/AAAAAAAAihs/zOdUKayQkSk/s1600/LoopGMS2.png" height="496" width="640" /></a></div>
<div style="text-align: left;">
<b>Figure 1: </b>Conditions for loopy output maps from a loopy sequence. The input sequence is "CCC TGT GGA GCC GGA GCC ACA AGT", K=6. (A) The decay speed λ=0.1; the resulting map is a broken loop in the 3D space. (B) The sequence is extended with an extra node 'G', so that L is not a multiple of K; the resulting map is a broken loop. (C) λ = 0 and L is a multiple of K; the resulting map is a loop in the 3D space.</div>
<br />
With a loopy sequence, the scanner can in general run multiple rounds around the loop. In this way, the scanner produces multiple sets of vectors that differ just by their timestamps. These vector sets form repeating patterns when embedded into low dimensional space. The following short video shows the resulting maps for the above sequence for repeat numbers <i>n = 1, 2, 5 </i>and<i> 10</i>. The decay speed λ is set to 0.1. We notice that in these maps, the repeated structure gradually becomes flattened as the repeat number grows. Consequently, the output map changes from a simple repeated structure to a tube-like shape. We notice that both repeated structures and tubes are pretty frequent structures in biological systems as phenotypic traits.<br />
<br />
<div align="center">
<iframe allowfullscreen="allowfullscreen" frameborder="0" height="360" src="http://www.youtube.com/embed/BuYN39QpDVY?hd=1&vq=hd720" width="490"></iframe>
</div>
<br />
We notice that running the scanner multiple loops is effectively the same as scanning multiple copies of a sequence sequentially. Thus, loopy scanning might be considered a manifestation (or extension) of sequential scanning where the sequence is the concatenation of a variable number of copies of a reference sequence. <br />
<br />
Another extension of the loopy GMS model is, as depicted in the following diagram, to use two scanners to scan the loop in opposite orientations, then concatenate their outputs and a timestamp to form the final high dimensional vectors.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://2.bp.blogspot.com/-5tbnOeKQ3Ag/VM_HHeEQaXI/AAAAAAAAjB4/-v3Hi7vh7nw/s1600/DualScanner.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://2.bp.blogspot.com/-5tbnOeKQ3Ag/VM_HHeEQaXI/AAAAAAAAjB4/-v3Hi7vh7nw/s1600/DualScanner.png" height="343" width="400" /></a></div>
The loopy GMS model with dual scanners normally produces symmetrical shapes. The reason for this is that for each <i>configuration of the scanners</i> (i.e. their specific positions on the loop during the scanning) there is always a "dual" configuration in which the two scanners have simply swapped their positions. Because of this, the whole collection of output vectors may be split into two sets which differ from each other just by the different timestamps. The following short video shows various symmetrical shapes produced by loopy GMS with dual scanners for various sequences:<br />
<br />
<div align="center">
<iframe allowfullscreen="allowfullscreen" frameborder="0" height="360" src="http://www.youtube.com/embed/YX4eduERhTA?hd=1&vq=hd720" width="450"></iframe><br />
</div>
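The swap argument above can be verified with a tiny simulation (a sketch; positions advance one node per step, and the starting positions are arbitrary):

```python
def configurations(L, p0, q0):
    """Scanner positions over one full loop of L nodes: the first scanner
    advances while the second moves in the opposite orientation."""
    return [((p0 + t) % L, (q0 - t) % L) for t in range(L)]

confs = configurations(24, 0, 5)
# for every configuration (p, q), the dual configuration (q, p) with the
# two scanners' positions swapped also occurs somewhere during the run
all_duals_present = all((q, p) in confs for (p, q) in confs)
```

The check succeeds for any loop length and starting positions, which is the combinatorial reason the output vectors split into two timestamp-shifted sets.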
<br />
<b>Discussion</b><br />
<br />
This note has experimentally demonstrated that the GMS model with simple extensions may produce interesting macroscopic shapes, like repeating patterns, tubes and symmetrical structures. Understanding how discrete sequences give rise to geometrical patterns might help us investigate how genetic code determines the phenotypic traits of biological organisms. <br />
<br />
<br /></div>
James Li<br />Geometric Modeling of Sequential Data (continued) (2014-11-08)<div dir="ltr" style="text-align: left;" trbidi="on">
In a previous note, <a href="https://www.blogger.com/">geometrical modeling of sequential data</a> (GMS), I described a framework to convert discrete sequential data into 3D geometrical shapes. This note is going to extend GMS to a general form that allows more efficient sampling of higher dimensional data from sequence data. <br />
<br />
<b>The GMS framework</b><br />
<br />
Recall that GMS consists of three basic steps: 1. scanning a sequence to produce a collection of high dimensional vectors; 2. reordering the components of these vectors to harmonize the sampled vectors; 3. applying a <i>dimensionality reduction</i> algorithm to embed these vectors into low dimensional space. For the first step, we start with a sequence <i>s<sub>k</sub></i>∈S; k=0, 1, 2, ..., K; where S is a finite alphabet. Imagine that the sequence has been put through a <i>scanning machine</i> that quickly takes a series of snapshots of the part that is just in the machine:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://3.bp.blogspot.com/-oeMecRVW5R4/VF0If2tzohI/AAAAAAAAaVI/eyLi8DogZKA/s1600/GmsScanning.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="206" src="http://3.bp.blogspot.com/-oeMecRVW5R4/VF0If2tzohI/AAAAAAAAaVI/eyLi8DogZKA/s1600/GmsScanning.jpg" width="400" /></a></div>
<br />
In the above diagram, the illustrated scanning machine is sized to hold 3 nodes of the sequence. As output the machine produces a series of sampling vectors <b>V</b><sub><i>t</i></sub> for a series of discrete time points. A sampling vector <b>V</b><sub><i>t</i></sub> is actually a 2<i>n+</i>1 dimensional vector (α<sub>0</sub>, ..., α<sub><i>n</i>-1</sub>; s<sub>k</sub>, ..., s<sub><i>k+n</i>-1</sub>; <i>t</i>). The above diagram illustrates the case <i>n</i>=3. Here, s<sub><i>k</i></sub>, ..., s<sub><i>k+n</i>-1</sub> are the types of the <i>n</i> nodes currently passing the machine; α<sub>0</sub>, ..., α<sub><i>n</i>-1</sub> are the coefficients, called <i>amplitudes</i>, for the corresponding nodes. For the sake of compactness we denote the pair (α<sub><i>k</i></sub>, s<sub>k</sub>) simply as α<sub><i>k</i></sub>s<sub>k</sub> in the above diagram. <br />
<br />
The amplitude α<i><sub>k</sub></i> for the k-th node at time <i>t</i> is calculated as follows:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://4.bp.blogspot.com/-BlqdVlMLHEY/VF0IvQa_WKI/AAAAAAAAaVQ/se7UBtrfRi8/s1600/GmsAmplitude.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="34" src="http://4.bp.blogspot.com/-BlqdVlMLHEY/VF0IvQa_WKI/AAAAAAAAaVQ/se7UBtrfRi8/s1600/GmsAmplitude.jpg" width="200" /></a></div>
<br />
The right side of the above formula is called the amplifying function. This amplifying function implies that the node sequence passes the scanning machine at the speed of one node per time unit. In general, any continuous function that is zero at the points 0 and <i>n</i> could be used as an amplifying function.<br />
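Here is a sketch of one admissible amplifying function. The post's exact formula appears only as an image, so the sine form below is an assumed choice that merely satisfies the stated boundary conditions (zero at 0 and at <i>n</i>):

```python
import math

def amplitude(t, k, n):
    """Amplitude of node k at time t, assuming the node occupies the
    scanning machine for t in [k, k + n] (one node per time unit)."""
    x = t - k                          # position of node k inside the machine
    if x < 0.0 or x > n:
        return 0.0                     # node k is not in the machine
    return math.sin(math.pi * x / n)   # vanishes at the entry (0) and exit (n)
```

A node thus enters with amplitude zero, peaks in the middle of the machine, and fades out again before it exits.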
<br />
We notice that the above formula specifies a continuous function with respect to <i>t</i> as long as no new node enters the machine to replace an old one. In case a new node, say s<sub>k-1</sub>, enters the machine to replace an old one, say s<sub>k+2</sub>, during a small interval between t and t', the sampling vector will undergo the following change:<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://1.bp.blogspot.com/-8RQpz--VUCU/VF0I3oZ3oZI/AAAAAAAAaVY/OLKtbnrCO0A/s1600/GmsNewNode.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="158" src="http://1.bp.blogspot.com/-8RQpz--VUCU/VF0I3oZ3oZI/AAAAAAAAaVY/OLKtbnrCO0A/s1600/GmsNewNode.jpg" width="400" /></a></div>
<br />
Since the amplifying function vanishes at the entry point 0 and the exit point <i>n</i>, the amplitudes α<sub>2</sub> and α'<sub>0</sub> will be close to zero; and since the amplifying function is continuous, we'll have α<sub>0</sub>≈α'<sub>1</sub> and α<sub>1</sub>≈α'<sub>2</sub>. Thus, when we apply a circular shift to <b>V</b><sub>t'</sub>, as illustrated in the following diagram, the shifted vector <b>V̂</b><sub>t'</sub> will be close to <b>V</b><sub>t</sub>:
<div class="separator" style="clear: both; text-align: center;">
<a href="http://2.bp.blogspot.com/-zF20eB9lRsI/VF0JELWzQBI/AAAAAAAAaVg/-pmYYqZNGvE/s1600/GmsCircularShifting.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="155" src="http://2.bp.blogspot.com/-zF20eB9lRsI/VF0JELWzQBI/AAAAAAAAaVg/-pmYYqZNGvE/s1600/GmsCircularShifting.jpg" width="320" /></a></div>
<br />
Notice that the third components of <b>V̂</b><sub>t'</sub> and <b>V</b><sub>t</sub> may be values for different node types (e.g. s<sub><i>k</i>-1</sub> ≠ s<sub><i>k</i>+2</sub>), but the amplitudes α<sub>2</sub> and α'<sub>0</sub> are close to zero, so that <b>V̂</b><sub>t'</sub> will be close to <b>V</b><sub>t</sub> as a high dimensional vector. Thus, after applying the circular shifting operation, the scanning machine produces a series of gradually changing vectors in high dimensional space. <br />
<br />
More generally, for the implementation of the second step, the scanning machine has a circular <i>shifting operator</i> to circularly shift the sampled vectors; the shifting operator increments its shifting length by 1 every time a new node enters the machine.<br />
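The shifting operator itself is one line of code. In the example below, the amplitudes are made-up illustrative numbers chosen to mimic the situation in the diagrams above, where the exiting and the newly entered nodes carry near-zero amplitude:

```python
def circular_shift(v, s):
    """Circularly shift a sampled vector of (amplitude, node_type) pairs by s."""
    s %= len(v)
    return v[s:] + v[:s]

# V_t just before a node replacement (n = 3): the exiting node's amplitude ~ 0
v_t      = [(0.95, 'A'), (0.60, 'C'), (0.05, 'G')]
# V_t' just after: the newly entered node's amplitude ~ 0
v_tprime = [(0.02, 'T'), (0.93, 'A'), (0.62, 'C')]

# one new node has entered, so the shifting length grows by 1;
# after shifting, matching components line up and the two vectors are close
v_hat = circular_shift(v_tprime, 1)
```

After the shift, `v_hat` differs from `v_t` only by small amplitude changes, even though its last component names a different node type.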
<br />
As the third step, a <i>dimensionality reduction</i> algorithm is applied to the vectors to embed them into a low dimensional space. For this study I picked the <a href="http://jamesxli.blogspot.ca/2014/09/on-affinity-embedding.html"><i>affinity embedding</i></a> (AE) algorithm. I used the <a href="http://en.wikipedia.org/wiki/T-distributed_stochastic_neighbor_embedding">t-SNE</a> algorithm in the <a href="http://jamesxli.blogspot.ca/2014/09/on-geometric-modeling-of-sequential-data.html">previous note</a>, but in my experiments AE normally worked better for this purpose, as it runs much faster for large datasets. To apply the AE algorithm we need either a <i>distance metric</i> to measure dissimilarity between data points, or an <i>affinity function</i> that measures the similarity (a kind of attraction) between data points. For this study, we use the following affinity function:<br />
<a href="https://www.blogger.com/null" name="AffinityFunction" id="AffinityFunction"></a>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://3.bp.blogspot.com/-Fm0qLwpEubA/VF0JL2IRFgI/AAAAAAAAaVo/ETLFaUOqes4/s1600/GmsAffinity.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="145" src="http://3.bp.blogspot.com/-Fm0qLwpEubA/VF0JL2IRFgI/AAAAAAAAaVo/ETLFaUOqes4/s1600/GmsAffinity.jpg" width="400" /></a></div>
<br />
where the summation runs over all <i>k</i> between 0 and <i>n</i>-1 such that s<sub><i>k</i></sub> = s'<sub><i>k</i></sub>; λ is an algorithmic constant that can take any positive value. In plain language: if we define the <i>affinity</i> between two nodes as the product of their amplitudes, then the affinity between two sampling vectors is the sum of the affinities between all matching nodes, decayed by the time elapsed between the two samplings. The constant λ, called the <i>decay speed</i>, controls how fast that affinity diminishes with the time elapsed between the two samplings. More particularly, the affinity will reduce to half of its value if the two sampling vectors are separated by λ<i>n</i> nodes.<br />
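In code, the affinity of two sampling vectors might look like the sketch below. The exact decay constant in the exponent is fixed by the formula image above, so the sketch keeps the half-life as an explicit parameter rather than deriving it from λ and <i>n</i>:

```python
def sampling_affinity(v, vp, dt, half_life):
    """Affinity of two sampling vectors, each a list of (amplitude, node_type)
    pairs aligned by component; dt is the time elapsed between the samplings."""
    # sum of amplitude products over components with matching node types
    seq_part = sum(a * ap for (a, s), (ap, sp) in zip(v, vp) if s == sp)
    # exponential decay: the affinity halves every half_life time units
    return 0.5 ** (dt / half_life) * seq_part
```

With `half_life` equal to the node separation at which the affinity should halve, the two properties described above hold: only matching node types contribute, and the contribution halves per half-life of elapsed time.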
<br />
<b>The effect of the scanning size <i>n</i></b><br />
<br />
The scanning size <i>n</i> is a key parameter for the scanning machine: it determines how many consecutive nodes of the sequence are read (or scanned) to construct an output vector. For a larger scanning size, the affinities between the data points are an aggregation over a larger number of consecutive nodes. Thus, the scanning size controls a kind of granularity of the scanning machine.<br />
<br />
In order to see the effect of the scanning size, I downloaded the DNA sequence of a relatively short gene, the CD8 gene, which consists of 744 base pairs. This sequence is processed by the GMS framework to create 25000 data points, which are then embedded into 3D space. The following video shows the resulting 3D maps created with different scanning sizes:
<iframe allowfullscreen="allowfullscreen" frameborder="0" height="360" src="http://www.youtube.com/embed/mTjzDQ4jJ3o?hd=1&vq=hd720" width="640"></iframe>
<br />
<br />
We see clearly that the 3D curve gradually becomes simpler and smoother as the scanning size grows from 3 to 24.<br />
<br />
<b>The effect of the decay speed </b>λ<br />
<br />
The decay speed λ provides a way to differentiate samplings created at different times. A larger λ means that the affinity between similar node patterns diminishes faster as the elapsed time between them grows. To demonstrate the effect of the decay speed I created a series of maps for the CD8 gene sequence with increasing λ. The following short video shows those maps:<br />
<iframe allowfullscreen="allowfullscreen" frameborder="0" height="360" src="http://www.youtube.com/embed/NZUMIt1Nt-0?hd=1&vq=hd720" width="640"></iframe>
<br />
<br />
We can see clearly that the map stretches gradually as λ grows from 0 to 0.08.
<br />
<br />
<b>Implementation GMS in VisuMap</b><br />
<br />
VisuMap version 4.2.905 offers support for the GMS framework through two new metrics, "Sequence Metric" and "Sequence Affinity". The former is a distance metric that can be used by most mapping algorithms; the latter is an affinity metric that can only be used by the <i>affinity embedding</i> algorithm.<br />
<br />
In order to create a model for a sequence with VisuMap, we first create a data table with a single column and 5 to 10 thousand rows. The content of the table is not relevant for the modeling; only the number of rows is used by the scanning machine as the number of samplings. (The link <a href="http://www.visumap.com/rc1.aspx?f=SeqVis.xvmz&g=res">SeqVis</a> provides such a sample dataset together with some sample maps and supporting scripts.) We then open the map configuration window, select "Sequence Affinity" as the metric, and specify a new filter as the settings for the scanning machine. The following picture shows the settings for the CD8 sequence used in the previous examples.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://2.bp.blogspot.com/-JtejWAdteA4/VFvq2aMHsaI/AAAAAAAAaUU/hWEHKwYTtxQ/s1600/ScanningSettings.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="204" src="http://2.bp.blogspot.com/-JtejWAdteA4/VFvq2aMHsaI/AAAAAAAAaUU/hWEHKwYTtxQ/s1600/ScanningSettings.png" width="400" /></a></div>
<br />
Notice that the field labeled "Stretch Factor" sets the value of the <i>decay speed</i>; in normal cases this parameter determines how far the resulting maps will be stretched along the sequence direction.<br />
<br />
Also notice that spaces and newline characters in the sequence given in the Sequence Definition window are ignored by the scanning machine. With this filter definition window we can easily generate 3D maps for arbitrary sequences. The following pictures show some 3D maps together with their corresponding sequences:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://4.bp.blogspot.com/-t5eSzjVTsUE/VF0E-mh7A0I/AAAAAAAAaUk/XiFmNflAxCk/s1600/Loop1.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="273" src="http://4.bp.blogspot.com/-t5eSzjVTsUE/VF0E-mh7A0I/AAAAAAAAaUk/XiFmNflAxCk/s1600/Loop1.jpg" width="400" /></a></div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://2.bp.blogspot.com/-fo6EA3MTIN0/VF0E_0HJ8GI/AAAAAAAAaUs/DsAWivFyi4k/s1600/Loop2.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="272" src="http://2.bp.blogspot.com/-fo6EA3MTIN0/VF0E_0HJ8GI/AAAAAAAAaUs/DsAWivFyi4k/s1600/Loop2.jpg" width="400" /></a></div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://3.bp.blogspot.com/-3d6qHE_DFFk/VF0FBYpkooI/AAAAAAAAaU0/ZGHy3KcbD_I/s1600/Loop3.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="275" src="http://3.bp.blogspot.com/-3d6qHE_DFFk/VF0FBYpkooI/AAAAAAAAaU0/ZGHy3KcbD_I/s1600/Loop3.jpg" width="400" /></a></div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://1.bp.blogspot.com/-NSKaQmaxCKM/VF0FCiFPvGI/AAAAAAAAaU8/yDq4-mheK4k/s1600/Loop4.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="271" src="http://1.bp.blogspot.com/-NSKaQmaxCKM/VF0FCiFPvGI/AAAAAAAAaU8/yDq4-mheK4k/s1600/Loop4.jpg" width="400" /></a></div>
<br />
<br />
<b>Discussions</b><br />
<br />
We have extended the GMS framework with larger scanning sizes and a more efficient dimensionality reduction algorithm. With these extensions we can model much larger sequences from different perspectives, and therefore capture more information from more realistic sequences.<br />
<br />
We notice that the proposed framework has a certain similarity to the ribosome, the machine that translates RNA sequences into proteins. Just as biologists believe all macroscopic patterns and features are ultimately encoded in DNA sequences, I believe that GMS can capture a large class of relevant patterns in discrete sequences with 3-dimensional geometrical models. GMS thus offers us a toy ribosome to simulate the translation of sequential information into geometrical models.<br />
<br />
<br />
<br />
<br />
<br />
<br /></div>
James Li<br />On geometric modeling of sequential data (2014-09-23)<div dir="ltr" style="text-align: left;" trbidi="on">
The biological cellular system appears to be a universal machine that is capable of translating discrete sequential data, i.e. <i>genetic code</i>, into all kinds of organisms with diverse <i>phenotypic features</i>. In this note I am going to put forward a computational framework that translates discrete sequential data into lower dimensional <i>geometrical shapes</i>. Inspired by the biological cellular system, this framework first samples high dimensional vectors from the discrete data; then it applies <i>dimensionality reduction</i> methods to embed those vectors into low dimensional space. The resulting map normally forms shapes with geometrical properties that reflect information in the initial sequence.<br />
<br />
Let's start with an input sequence of nucleic acids <i>s<sub>k</sub></i>; k=0, 1, 2, ..., K with <i>s<sub>k</sub></i> ∈ S:={C, G, A, T}. To illustrate the sampling process, we imagine that the sequence is a series of marbles of 4 different imaginary colors. Mathematically, these "colors", i.e. the set S, can be represented by 4 dimensional vectors, each with just a single non-zero component, as follows:<br />
<br />
<div class="separator" style="clear: both; text-align: left;">
<a href="http://3.bp.blogspot.com/-3J1WIM42RPY/VBoAGU2DxMI/AAAAAAAAaGs/r_g4H32EHJk/s1600/CGAT2Vector.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://3.bp.blogspot.com/-3J1WIM42RPY/VBoAGU2DxMI/AAAAAAAAaGs/r_g4H32EHJk/s1600/CGAT2Vector.jpg" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
</div>
Now, imagine we have a camera that moves slowly from one end of the sequence to the other; during the movement we quickly take a series of pictures of the two closest marbles. These pictures record the different colors and intensities of the two closest marbles. As depicted in the following schema, we code those "pictures" as 8 dimensional vectors:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://3.bp.blogspot.com/-EJZK8pQY5es/VBpP23Kn5_I/AAAAAAAAaHQ/Q-03WZP9rd0/s1600/SeqentialSampling.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://3.bp.blogspot.com/-EJZK8pQY5es/VBpP23Kn5_I/AAAAAAAAaHQ/Q-03WZP9rd0/s1600/SeqentialSampling.jpg" /></a></div>
In the above schema, α and β are the intensity coefficients for the two closest marbles (they can be multiplied directly into the first and second halves of the 8-vectors, respectively). α and β are calculated as follows: we assume that the camera moves evenly, one marble per time unit. So at a given time, say <i>t<sub>1</sub></i>, the camera is located between the k-th and (k+1)-th marbles; the <i>phase</i> of the camera is then calculated as <i>p<sub>1</sub></i> := <i>t<sub>1</sub></i> - k, where k is the largest integer not exceeding <i>t<sub>1</sub></i>, i.e. k=⌊<i>t<sub>1</sub></i>⌋. Then, α and β are calculated as α := cos(<i>p<sub>1</sub></i>π/2) and β := sin(<i>p<sub>1</sub></i>π/2). We notice here that as the camera moves smoothly from k to k+1-ε for any infinitesimally small number ε, the pair (α, β) changes smoothly from (1, 0) to (0, 1).<br />
<br />
However, at the time point k+1, the pair (α, β) takes the value (1, 0), since the camera is now considered to be between the (k+1)-th and (k+2)-th marbles. That means (α, β), as a time dependent 2-dimensional variable, is not continuous at the integer time points; consequently, the corresponding 8-vectors in general won't change smoothly in the 8-dimensional space.<br />
<br />
In order to avoid this discontinuity, we swap the two 4-vectors of an 8-vector for those vectors produced when ⌊<i>t</i>⌋ is an odd integer. That means, in the above schema, at time <i>t<sub>2</sub></i> the 8-vector produced is swapped from v<sub>2</sub> to v'<sub>2</sub> as shown in the following diagram:<br />
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
<a href="http://1.bp.blogspot.com/-joAHjnyIbrs/VBpP5k5GurI/AAAAAAAAaHY/25K_RLlPqt4/s1600/Swap4Vectors.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://1.bp.blogspot.com/-joAHjnyIbrs/VBpP5k5GurI/AAAAAAAAaHY/25K_RLlPqt4/s1600/Swap4Vectors.jpg" /></a></div>
With this swapping operation, we can easily verify that the 8-dimensional vector sequence actually changes smoothly as the camera smoothly progresses.<br />
<br />
In addition to these 8 components, the sampling process at each step also adds a timestamp <i>ct</i> as the 9-th component of the output vector, where <i>c</i> is a small constant chosen as an algorithm parameter.<br />
<br />
Summing up the above steps, the sampling process can be denoted as an operator that operates on discrete sequences <i>s</i> as follows:<br />
<div class="separator" style="clear: both; text-align: left;">
<a href="http://2.bp.blogspot.com/-3NZie0DacTQ/VBslB-tJGNI/AAAAAAAAaHw/3HW5zotBw5g/s1600/SamplingOperator.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://2.bp.blogspot.com/-3NZie0DacTQ/VBslB-tJGNI/AAAAAAAAaHw/3HW5zotBw5g/s1600/SamplingOperator.png" height="208" width="320" /> </a></div>
<div class="separator" style="clear: both; text-align: left;">
where t ∈ [0, K] is the time parameter; <i>c</i> is an algorithmic parameter; k := ⌊<i>t</i>⌋ is the sequential index; and α := cos(<i>p</i>π/2) and β := sin(<i>p</i>π/2), with p := t - k, are the intensity coefficients for the two closest marbles.</div>
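The whole operator fits in a few lines of code. This is a sketch: the base-to-index mapping and the function names are my own choices, but the cos/sin intensities, the odd-⌊t⌋ swap, and the timestamp component follow the description above.

```python
import math

BASE_INDEX = {'C': 0, 'G': 1, 'A': 2, 'T': 3}

def one_hot(base):
    v = [0.0] * 4
    v[BASE_INDEX[base]] = 1.0
    return v

def sample(seq, t, c):
    """9-dimensional sampling vector for the sequence seq at time t in [0, K)."""
    k = int(t)                           # k = floor(t)
    p = t - k                            # camera phase between marbles k and k+1
    alpha = math.cos(p * math.pi / 2)
    beta  = math.sin(p * math.pi / 2)
    first  = [alpha * x for x in one_hot(seq[k])]
    second = [beta  * x for x in one_hot(seq[k + 1])]
    if k % 2 == 1:                       # swap the two 4-vectors at odd floor(t)
        first, second = second, first
    return first + second + [c * t]      # timestamp ct as the 9th component
```

One can check numerically that the swap removes the jump at integer times: the vectors sampled just before and exactly at t = 1 differ only slightly in every component.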
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
After we have sampled a set of 9-dimensional vectors, we then use a dimensionality reduction (DR) algorithm to embed those vectors into 2- or 3-dimensional space. For our framework, we need a DR algorithm that is strong in preserving local information, since the "camera" basically samples only local information. Algorithms like <a href="http://en.wikipedia.org/wiki/Nonlinear_dimensionality_reduction#Relational_perspective_map">RPM</a>, <a href="http://en.wikipedia.org/wiki/Nonlinear_dimensionality_reduction#Curvilinear_distance_analysis">CCA</a>, <a href="http://jamesxli.blogspot.ca/2014/09/on-affinity-embedding.html">Affinity Embedding</a> and <a href="http://en.wikipedia.org/wiki/T-distributed_stochastic_neighbor_embedding">t-SNE</a> appear to be good candidates for this purpose. For this note I picked the t-SNE algorithm implemented in VisuMap software. </div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
As a first example we use our algorithm to model a short nucleic acid sequence, "CCCTGTGGAGCCACACCCTAG". Notice that the timestamp coefficient <i>c</i> in the above description controls the only component that grows with time without bound, whereas the other 8 components are confined to the range between 0 and 1.0. Thus, a larger <i>c</i> will likely stretch the model further apart. In order to show this property, we created four different data sets with c=0, 0.5/N, 2.5/N, and 10/N, where N is the number of samplings, which is 15000 in this note. The following four clips show the resulting geometric models of these data sets. For the sake of clarity, I have colored the data points with increasing brightness as time progresses; and when the scanning camera is closest to a node, the corresponding sampling data point is represented with an icon that indicates the type of the nucleic acid.</div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<iframe allowfullscreen="allowfullscreen" frameborder="0" height="360" src="http://www.youtube.com/embed/efGBVv5LbV8?hd=1&vq=hd720" width="640"></iframe>
<iframe allowfullscreen="allowfullscreen" frameborder="0" height="360" src="http://www.youtube.com/embed/gc8QLgLpinQ?hd=1&vq=hd720" width="640"></iframe>
<iframe allowfullscreen="allowfullscreen" frameborder="0" height="360" src="http://www.youtube.com/embed/bfnnjPCd68w?hd=1&vq=hd720" width="640"></iframe>
<iframe allowfullscreen="allowfullscreen" frameborder="0" height="360" src="http://www.youtube.com/embed/qxTFjBJbZXA?hd=1&vq=hd720" width="640"></iframe>
<br />
<div class="separator" style="clear: both; text-align: left;">
The above models show clearly that the coefficient <i>c</i> controls how far the model gets stretched. The other 8 components contribute information to the folding structure.</div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
The second example compares models created from reversed sequences. The following clip compares the model of the previous sequence with the model of its reversed sequence. We can see that the two models are geometrically more or less unchanged, except that the gradient direction is reversed. This means that the described modeling method is invariant under sequence inversion.</div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<iframe allowfullscreen="allowfullscreen" frameborder="0" height="360" src="http://www.youtube.com/embed/aejhglxavZo?hd=1&vq=hd720" width="640"></iframe>
<br />
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
The third example shows how the model changes when the sequence is duplicated multiple times. The following video clip shows the model of our sample sequence and the model of 5-fold duplicates of that sequence. We can see that the model of the 5-fold duplicates resembles five overlapping copies of the single-sequence model.</div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<iframe allowfullscreen="allowfullscreen" frameborder="0" height="360" src="http://www.youtube.com/embed/t82YgOGIjlc?hd=1&vq=hd720" width="640"></iframe>
<br />
<div class="separator" style="clear: both; text-align: left;">
As a next example, we created a model for the protein sequence MELNSSFWTLIKTKMKSRNDNNKLLDTWLDPIEYVSTTGSADRP. The slightly more complicated model is shown in the following video clip: </div>
<br />
<iframe allowfullscreen="allowfullscreen" frameborder="0" height="360" src="http://www.youtube.com/embed/DlJSNHtjtOE?hd=1&vq=hd720" width="640"></iframe>
<br />
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
We notice that the sampled vectors are all sparse. In practical implementations, we can calculate the Euclidean distance between those vectors directly from their sparse representations, without explicitly constructing the high dimensional vectors. In this way, we can model sequences over any base set. As a final example,</div>
<div class="separator" style="clear: both; text-align: left;">
I have created a 3D model for the text sequence "One fish two fish red fish blue fish", as shown in the following video clip:</div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<iframe allowfullscreen="allowfullscreen" frameborder="0" height="360" src="http://www.youtube.com/embed/Sw1GMzhE30E?hd=1&vq=hd720" width="640"></iframe>
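The sparse-distance shortcut mentioned above can be sketched in a few lines. This is only an illustration, not VisuMap's implementation; the dictionary encoding of a sample vector (dimension index → value) is a hypothetical choice:

```python
import math

def sparse_euclidean(u, v):
    # u, v: dicts mapping dimension index -> value; absent keys are 0.
    # Only dimensions active in at least one vector contribute to the distance,
    # so the full high-dimensional vectors never need to be constructed.
    s = 0.0
    for k in u.keys() | v.keys():
        d = u.get(k, 0.0) - v.get(k, 0.0)
        s += d * d
    return math.sqrt(s)

a = {0: 1.0, 7: 0.5}
b = {0: 0.2, 3: 1.0}
print(sparse_euclidean(a, b))  # ≈ 1.3748
```

Because only non-zero entries contribute, the cost per distance is proportional to the number of active dimensions, independent of the nominal dimensionality.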
<br />
<br />
<b>Discussion </b><br />
<br />
The framework described here demonstrates a new way to derive geometric models from discrete data. The framework is abstract in the sense that it is independent of the physical and chemical properties of the sequence. The experiments demonstrate that a high dimensional feature space might be an effective intermediate stage for deriving 3-dimensional geometric structure.<br />
<br />
We notice that the models created above are all basically 1-dimensional curves folded in different ways. For future study, it would be interesting to develop a 2-dimensional sampling method that produces 2- or 3-dimensional models. More speculatively: can we develop an evolutionary process to find a sequence that gives rise to models with certain properties?<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br /></div>
James Lihttp://www.blogger.com/profile/17491271097854752129noreply@blogger.com0tag:blogger.com,1999:blog-710383708872670121.post-78179937207237793822014-09-15T11:30:00.000-07:002014-09-15T11:30:10.149-07:00On affinity embedding <div dir="ltr" style="text-align: left;" trbidi="on">
We have just released <a href="http://visumap.com/index.aspx?p=Products">VisuMap version 4.2.902</a>. A new mapping algorithm called Affinity Embedding (AE) has been introduced in this release. I'll briefly describe AE as a multidimensional scaling (MDS) method; or, in more recent terminology, as a dimensionality reduction method.<br />
<br />
Like many MDS methods, AE tries to create a low dimensional representation of high dimensional data points by minimizing a stress function. AE's stress function is as follows:<br />
<div class="separator" style="clear: both; text-align: left;">
<a href="http://1.bp.blogspot.com/-Sc2q32fXY-A/VBcZJg_HgYI/AAAAAAAAaFM/KFQa5fV-_jo/s1600/AEStress.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://1.bp.blogspot.com/-Sc2q32fXY-A/VBcZJg_HgYI/AAAAAAAAaFM/KFQa5fV-_jo/s1600/AEStress.png" /></a></div>
where <i>a<sub>ij</sub></i> is a positive number representing the affinity between data points i and j, and <i>d<sub>ij</sub></i> is the Euclidean distance between those two points in the low dimensional space. <i>Affinity</i> as a generic concept quantifies some kind of similarity between data entities. We notice that most MDS methods try to minimize the difference (e.g. squared error) between a dissimilarity matrix and <i>d<sub>ij</sub></i>, whereas AE tries to approximate the similarity matrix.<br />
<br />
Since many data sets in practice are characterized by a distance matrix <i>δ<sub>ij</sub></i>, VisuMap uses the following exponential <i>affinity function</i> to convert a distance matrix to an affinity matrix: <br />
<div class="separator" style="clear: both; text-align: left;">
<a href="http://3.bp.blogspot.com/-LxtjVn1U9sY/VBce0p0HYyI/AAAAAAAAaFY/kWkmAahSg6o/s1600/ExpAffinityFunction.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://3.bp.blogspot.com/-LxtjVn1U9sY/VBce0p0HYyI/AAAAAAAAaFY/kWkmAahSg6o/s1600/ExpAffinityFunction.png" /></a></div>
where <i>c<sub>ij</sub></i> are constants estimated from the densities around data points in the high dimensional space. This affinity function is just one simple way to convert dissimilarity to similarity; in fact, any monotonically decreasing function may serve as an affinity function. For more general applications, VisuMap offers a plugin interface that allows users to implement custom affinities.<br />
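The exact affinity function is given in the figure above. As a rough sketch only, assuming the common form <i>a<sub>ij</sub></i> = exp(-<i>δ<sub>ij</sub></i>/<i>c<sub>ij</sub></i>) and a simple k-nearest-neighbor stand-in for the density-based estimate of <i>c<sub>ij</sub></i> (both are assumptions, not VisuMap's actual formulas), the conversion could look like:

```python
import numpy as np

def affinity_from_distance(delta, k=7):
    """Convert a distance matrix into an affinity matrix.

    Sketch only: assumes the exponential form a_ij = exp(-delta_ij / c_ij),
    with the local scale c taken as each point's distance to its k-th
    nearest neighbor (a stand-in for VisuMap's density-based estimate).
    """
    c = np.sort(delta, axis=1)[:, k]      # per-point local scale
    C = np.sqrt(np.outer(c, c))           # symmetrize the pairwise scales
    A = np.exp(-delta / C)
    np.fill_diagonal(A, 0.0)              # no self-affinity
    return A

rng = np.random.default_rng(0)
X = rng.random((30, 5))
delta = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
A = affinity_from_distance(delta)
```

Large distances map to near-zero affinities, which is exactly the property the optimization exploits to focus on local structure.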
<br />
AE uses a simple version of the gradient descent method to minimize the stress function. The computational complexity of each iterative step is O(<i>kN</i>), where N is the number of data points and <i>k</i> is a constant that is typically around 100. Moreover, the memory complexity of AE for each step is O(N). Thus, AE can be applied to relatively large data sets, i.e. with a million or more data points.<br />
<br />
The main characteristic of AE is its focus on local information (i.e. data with high affinity or low dissimilarity). The reason for this is that low affinities are represented by close-to-zero values, which the optimization algorithm can effectively ignore; whereas other algorithms like CCA or the Sammon map need to suppress the effect of large values (e.g. large distances or dissimilarities), which is normally harder to do.<br />
<br />
AE has been largely inspired by the Stochastic Neighbor Embedding (SNE) algorithm and its more recent variant t-SNE. Both SNE algorithms work pretty well in focusing on local information, and both use a kind of exponential affinity function in their stress functions. However, because of their probabilistic nature, they in general require quadratic computational and memory complexity, which makes it difficult to attack large data sets. <br />
<br />
Like SNE, AE works pretty well in unfolding closed manifolds. The following pictures compare AE and t-SNE in mapping three spheres of different sizes from 3D space to the 2D plane. Notice that AE somehow preserves the sizes of the three spheres, whereas t-SNE basically ignores the size information:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://3.bp.blogspot.com/-8XrrUGq_jiE/VBctqqYV9iI/AAAAAAAAaFk/vmULxxnl6u4/s1600/ThreeSpheres.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://3.bp.blogspot.com/-8XrrUGq_jiE/VBctqqYV9iI/AAAAAAAAaFk/vmULxxnl6u4/s1600/ThreeSpheres.jpg" height="261" width="320" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
Data set forming three spheres in 3D space.</div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://3.bp.blogspot.com/-u4m4XgG3YFI/VBcttEwjxxI/AAAAAAAAaFs/g_zpxieF05c/s1600/ThreeSpSNE.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://3.bp.blogspot.com/-u4m4XgG3YFI/VBcttEwjxxI/AAAAAAAAaFs/g_zpxieF05c/s1600/ThreeSpSNE.jpg" height="310" width="320" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
The t-SNE map for the three sphere data set.</div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://3.bp.blogspot.com/-LqsgJsP-v3M/VBctubwIeTI/AAAAAAAAaF0/Td9092RQ6l8/s1600/ThreeSpAE.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://3.bp.blogspot.com/-LqsgJsP-v3M/VBctubwIeTI/AAAAAAAAaF0/Td9092RQ6l8/s1600/ThreeSpAE.jpg" height="296" width="320" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
The AE map for the three sphere data set. </div>
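The t-SNE half of this three-sphere experiment is easy to reproduce with scikit-learn (AE itself is only available inside VisuMap). The radii, centers and sample counts below are illustrative guesses, not the values used for the figures:

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)

def sphere(n, radius, center):
    # Sample n points uniformly on a sphere surface via normalized Gaussians.
    v = rng.normal(size=(n, 3))
    v /= np.linalg.norm(v, axis=1, keepdims=True)
    return radius * v + center

# Three spheres of different sizes in 3D space (parameters are assumptions).
X = np.vstack([
    sphere(300, 1.0, np.array([0.0, 0.0, 0.0])),
    sphere(300, 2.0, np.array([8.0, 0.0, 0.0])),
    sphere(300, 4.0, np.array([0.0, 12.0, 0.0])),
])

Y = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)
print(Y.shape)  # (900, 2)
```

Plotting `Y` shows the three clusters unfolded into the plane; as noted above, their relative sizes in the t-SNE map no longer reflect the original radii.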
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br /></div>
James Lihttp://www.blogger.com/profile/17491271097854752129noreply@blogger.com0tag:blogger.com,1999:blog-710383708872670121.post-21871769808886989482014-01-15T12:17:00.002-08:002014-01-15T12:17:37.411-08:00VisuMap 4.2.896 Released<div dir="ltr" style="text-align: left;" trbidi="on">
I have just released our software package <a href="http://www.visumap.com/index.aspx?p=Products">VisuMap version 4.2.896</a>. In this release we have upgraded the DirectX graphics library from version 9 to version 11. Due to this upgrade, some 3D animation data views have improved performance by more than 10-fold. VisuMap is now capable of interactively exploring more than one million data points in the 3D view on a normal desktop computer.<br />
<br />
Also with this upgrade, VisuMap will require Windows 7 or higher. Windows XP will continue to be supported, but without new features that take advantage of the DirectX 11 library. In the future, we'll likely implement more performance enhancements by tapping the computational power of the GPU.<br />
<br />
<br />
<br /></div>
James Lihttp://www.blogger.com/profile/17491271097854752129noreply@blogger.com0tag:blogger.com,1999:blog-710383708872670121.post-54440263089067210122013-11-04T15:39:00.002-08:002014-09-24T11:02:26.981-07:00On Multidimensional Sorting<div dir="ltr" style="text-align: left;" trbidi="on">
<div dir="ltr" style="text-align: left;" trbidi="on">
In a <a href="http://jamesxli.blogspot.ca/2013/10/on-shape-of-pictures.html">previous blog post</a> I talked about a method to convert photos to high dimensional datasets for analysis with MDS methods. This post addresses the opposite problem: given a high dimensional dataset, can we convert it to a photo-like 2D image for easy feature recognition by humans?<br />
<br />
Let's consider our sample dataset S&P 500. A direct approach to create a 2D map is to simply convert the matrix of real values to a matrix of gray-scale pixels. This can be easily done in VisuMap with the heatmap view and the gray-scale spectrum. The following picture shows the S&P 500 dataset in normal line diagram view on the left side and the corresponding heatmap on the right side:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://3.bp.blogspot.com/-BLS-zzBhl_E/UnKT5Lz3orI/AAAAAAAAJyU/kU8TgQMgth4/s1600/SP500e.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="284" src="http://3.bp.blogspot.com/-BLS-zzBhl_E/UnKT5Lz3orI/AAAAAAAAJyU/kU8TgQMgth4/s640/SP500e.jpg" width="640" /></a></div>
<br />
Each curve in the left picture corresponds to the price history of an S&P-500 component stock in a year. We notice heavy overlaps among the curves. On the right side, the heatmap represents the stock prices as row strips of varying brightness. In the heatmap, we can roughly spot the two systematic downturns at the two slightly darkened vertical strips. There is, however, no discernible pattern between the rows: the rows are just ordered by their stock tickers, so neighboring rows don't indicate any relationship between their price histories.<br />
<br />
One simple way to improve the heatmap is to group stocks with similar price histories together. This is the task of most clustering algorithms, and we can do it pretty easily in VisuMap. The following heatmap shows the same dataset after we have applied the k-Mean clustering algorithm on the rows and re-shuffled the rows according to the cluster assignments:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://1.bp.blogspot.com/-eQPfnNZCGcQ/UnKamdDYyXI/AAAAAAAAJyk/uewnab76P2A/s1600/SP500b.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="400" src="http://1.bp.blogspot.com/-eQPfnNZCGcQ/UnKamdDYyXI/AAAAAAAAJyk/uewnab76P2A/s400/SP500b.png" width="390" /></a></div>
<br />
The k-Mean algorithm has grouped the rows into about 8 clusters; we can see that rows within each cluster have, more or less, similar values. However, the clusters are randomly ordered, as are the rows within each cluster. The clustering algorithm does not provide ordering information for individual data points.<br />
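The cluster-then-reorder step can be sketched with scikit-learn's k-means on a hypothetical stand-in matrix (the real S&P 500 data is not reproduced here):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
# Hypothetical stand-in for the normalized price matrix:
# one row per stock, one column per trading period.
X = 1.0 + 0.02 * rng.normal(size=(100, 50)).cumsum(axis=1)

labels = KMeans(n_clusters=8, n_init=10, random_state=0).fit_predict(X)
order = np.argsort(labels, kind="stable")   # group rows by cluster label
X_grouped = X[order]                        # heatmap rows, regrouped by cluster
```

Note that `order` groups similar rows together but, as discussed above, says nothing about how the 8 groups (or the rows inside each group) should be ordered relative to each other.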
<br />
Can we do a better job than the clustering algorithm? To answer this question, let's recall what we actually tried to do above: we want to reorder the rows of the heatmap so that neighboring rows are similar to each other. This kind of task is basically also the task of MDS (Multidimensional Scaling) algorithms, which aim to map high dimensional data to low dimensional spaces while preserving similarity. Particularly in our case, the low dimensional space is just the one-dimensional space, whereas MDS has generally been used to create 2- or 3-dimensional maps.<br />
<br />
Thus, to improve our heatmap, we apply an MDS algorithm to map the dataset to a <i>one-dimensional space</i>, then use the coordinates in that one-dimensional space to re-order the rows in the heatmap. For this test, I have adapted the RPM (<a href="http://www.visumap.com/index.aspx?p=Resources/RpmOverview">Relational Perspective Map</a>) algorithm to reduce the dimensionality of the dataset from 50 to 1, then used the one-dimensional RPM map to order the rows in the heatmap. The following picture shows the resulting heatmap:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://1.bp.blogspot.com/-sWYE6A8sDkE/UnKforXeoHI/AAAAAAAAJy0/3gzBhG9K4LM/s1600/SP500c.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="320" src="http://1.bp.blogspot.com/-sWYE6A8sDkE/UnKforXeoHI/AAAAAAAAJy0/3gzBhG9K4LM/s320/SP500c.png" width="305" /></a></div>
<br />
We notice this heatmap is much smoother than the previous two. In fact, we can see that the data rows gradually change from bright to dark as they go from top to bottom. More importantly, we don't see clear cluster structure among the rows; this is in clear contrast to the k-Mean algorithm, which indicated 8 or 10 clusters. In this case, the k-Mean algorithm clearly delivered <i>wrong</i>, artificial cluster information.<br />
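RPM is only available inside VisuMap, but the same one-dimensional sorting idea can be sketched with an off-the-shelf embedding. Here a 1-D t-SNE stands in for RPM, on a hypothetical stand-in for the price matrix:

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(2)
# Hypothetical stand-in for the 50-column price matrix.
X = 1.0 + 0.02 * rng.normal(size=(100, 50)).cumsum(axis=1)

# Embed each row into a single coordinate, then sort rows by that coordinate.
# The exact solver is used here because it supports 1-D output directly.
coords = TSNE(n_components=1, perplexity=20, method="exact",
              random_state=0).fit_transform(X)
order = np.argsort(coords[:, 0])
X_sorted = X[order]   # heatmap rows now change gradually from top to bottom
```

Any dimensionality reduction method that accepts a 1-D target space can be substituted for the t-SNE call; the sorting step itself is just the final `argsort`.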
<br />
Now, a few words about the implementation. The multidimensional sorting has been implemented with the RPM algorithm, since RPM allows gradually shrinking the size of the map, starting from a 2D or 3D map. This special feature enables RPM to search for an optimal 1D map configuration over a large solution space. We could easily adapt other MDS algorithms to perform multidimensional sorting, but with those algorithms we would probably have to restrict ourselves to the one-dimensional space from the start. A few years ago, I blogged about this kind of <a href="http://jamesxli.blogspot.ca/2007/07/rpm-curled-space-and-dimensionality.html">dimensionality reduction by shrinking dimensions, with some simple illustrations</a>.<br />
<br />
Since high dimensional datasets are generally not sequentially ordered, there is in general no unique way to order the data. What we want is just a solution that preserves as much similarity information as possible in the ordering. Thus, an important question arises here: How do we validate the sorting results? How do we compare two such sorting algorithms? A simple way to validate the sorting result is illustrated in the following picture:<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://2.bp.blogspot.com/-gVEwbS5do00/UnQ1xyYPOiI/AAAAAAAAJzE/KoQWBKoEKzc/s1600/MDSSortinge.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="304" src="http://2.bp.blogspot.com/-gVEwbS5do00/UnQ1xyYPOiI/AAAAAAAAJzE/KoQWBKoEKzc/s640/MDSSortinge.jpg" width="640" /></a></div>
<div class="separator" style="clear: both; text-align: left;">
So, in order to test the sorting algorithm, we first take a simple photo and convert it to a high dimensional dataset that represents the gray-scale levels of the photo. We then randomize the order of the rows in the data table. Then, we apply the sorting algorithm to the randomized data table and check how well the original photo can be recovered. I have done this test with some normal photos; the multidimensional sorting was able to recover the original photo pretty well (see also the short attached video), as long as enough time is given to the learning process.</div>
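This validation loop can be sketched with a synthetic gradient "photo" and a 1-D metric MDS standing in for RPM (both are stand-ins; the original test used real photos and RPM):

```python
import numpy as np
from sklearn.manifold import MDS

rng = np.random.default_rng(3)

# Synthetic "photo": row brightness increases smoothly from top to bottom.
photo = np.linspace(0.0, 8.0, 80)[:, None] + 0.01 * rng.normal(size=(80, 60))

shuffled = photo[rng.permutation(80)]       # destroy the row order

# Recover an ordering from a 1-D embedding of the shuffled rows.
coords = MDS(n_components=1, random_state=0).fit_transform(shuffled)
recovered = shuffled[np.argsort(coords[:, 0])]
```

Up to a possible global flip (the 1-D embedding has no preferred direction), `recovered` should resemble the original gradient image.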
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
We have described an application of MDS algorithms for heatmaps by using them to reduce the data dimension to a single dimension. We might go a step further and use MDS to reduce the data dimension to 0, so that data points are mapped to a finite discrete point set. In this way, we would have achieved the service of clustering algorithms. If this works, clustering algorithms can be considered a special case of MDS algorithms; and MDS algorithms might lead us to a group of new clustering algorithms.</div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
The multidimensional sorting service has been implemented in VisuMap version 4.0.895 as an integrated service. The following short video shows how to use this service with the sample datasets mentioned above.</div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
</div>
<iframe allowfullscreen="allowfullscreen" frameborder="0" height="360" src="http://www.youtube.com/embed/88xX4cpEJEw?hd=1&vq=hd720" width="640"></iframe>
<br />
<br />
A more practical application of MDS sorting is microarray data analysis, where heatmaps are often used to visualize the expression levels of a collection of <i>genes</i> for a collection of <i>samples</i>. Normally, neither the genes nor the samples are ordered according to their expression similarity, so those heatmaps often appear rather random (even after applying additional grouping with a hierarchical clustering algorithm.) The following picture shows, for instance, a heatmap for the expressions of 12000 genes over about 200 samples:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://2.bp.blogspot.com/-Pp89tDWyVSs/UpDBEkd_l-I/AAAAAAAAJ0s/6OcYlJ9ZrT0/s1600/MicroarrayHm.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="472" src="http://2.bp.blogspot.com/-Pp89tDWyVSs/UpDBEkd_l-I/AAAAAAAAJ0s/6OcYlJ9ZrT0/s640/MicroarrayHm.png" width="640" /></a></div>
<br />
After applying MDS sorting to both the rows and columns of the above heatmap, the heatmap looks as follows:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://1.bp.blogspot.com/-CW_OrOgZT0M/UpDBI94nheI/AAAAAAAAJ04/MNU8AqMoakY/s1600/MicroarrayHmSorted.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="475" src="http://1.bp.blogspot.com/-CW_OrOgZT0M/UpDBI94nheI/AAAAAAAAJ04/MNU8AqMoakY/s640/MicroarrayHmSorted.png" width="640" /></a></div>
<br />
The above heatmap theoretically contains the same expression information. But, just by re-ordering the rows and columns, it shows more discernible structure and information for human perception.<br />
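Sorting both axes is just the row sort applied twice, once to the matrix and once to its transpose. A sketch with a 1-D metric MDS in place of RPM, on a hypothetical expression matrix:

```python
import numpy as np
from sklearn.manifold import MDS

def order_1d(M):
    # Order the rows of M by their coordinate in a 1-D MDS embedding.
    coords = MDS(n_components=1, random_state=0).fit_transform(M)
    return np.argsort(coords[:, 0])

rng = np.random.default_rng(4)
E = rng.random((60, 40))          # hypothetical expression matrix: genes x samples

row_order = order_1d(E)           # sort genes by similarity of expression profiles
col_order = order_1d(E.T)         # sort samples by similarity across genes
E_sorted = E[row_order][:, col_order]
```

The reordered matrix `E_sorted` carries exactly the same values as `E`; only the presentation changes, which is the point made above.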
<br />
<br /></div>
James Lihttp://www.blogger.com/profile/17491271097854752129noreply@blogger.com0tag:blogger.com,1999:blog-710383708872670121.post-89242936544422122702013-11-01T07:09:00.001-07:002013-11-01T07:09:23.442-07:00VisuMap version 4.0.895 released<div dir="ltr" style="text-align: left;" trbidi="on">
We have just released VisuMap version 4.0.895. The following is a list of major changes:<div>
<ul>
<li>Implemented the multidimensional sorting and on-the-fly mapping & clustering services.</li>
<li>Extended the RPM algorithm with a dark-pressure mechanism.</li>
<li>Reduced the memory usage of the application significantly, so that it can load much larger datasets.</li>
<li>Various defect fixes and enhancements.</li>
</ul>
<div>
<br /></div>
</div>
</div>
James Lihttp://www.blogger.com/profile/17491271097854752129noreply@blogger.com0tag:blogger.com,1999:blog-710383708872670121.post-29663463203171954062013-10-19T11:07:00.001-07:002014-09-24T11:02:57.020-07:00On the shape of pictures<div dir="ltr" style="text-align: left;" trbidi="on">
<div dir="ltr" style="text-align: left;" trbidi="on">
This blog post talks about a simple way to convert normal 2D pictures or photos to high dimensional datasets, and then use MDS tools to analyze those data. At the end, I'll mention some potential uses of this type of analysis for data from practical cases.<br />
<br />
Assume that we have a picture represented by an M<span style="font-family: Arial, Helvetica, sans-serif; font-size: x-small;">x</span>N matrix of pixels. The pixel matrix can be converted to a matrix of real numbers by replacing each pixel with its intensity (or brightness) using the formula a = 0.299*R + 0.587*G + 0.114*B, where R, G, B are the red, green and blue intensities of the pixel. We consider this M<span style="font-family: Arial, Helvetica, sans-serif; font-size: x-small;">x</span>N matrix as an N-dimensional dataset with M data points, where each data point is a row vector of the matrix.<br />
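This conversion is straightforward to sketch; the code below uses the standard ITU-R BT.601 luma weights and a random array as a stand-in for an actual photo:

```python
import numpy as np

def picture_to_dataset(rgb):
    """Convert an M x N x 3 RGB array into an M-point, N-dimensional dataset.

    Each pixel is replaced by its intensity (ITU-R BT.601 luma weights),
    and each image row becomes one N-dimensional data point.
    """
    R, G, B = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    gray = 0.299 * R + 0.587 * G + 0.114 * B
    return gray  # shape (M, N): M data points, each N-dimensional

rng = np.random.default_rng(5)
X = picture_to_dataset(rng.random((480, 640, 3)))  # stand-in for a real photo
print(X.shape)  # (480, 640)
```

The resulting matrix `X` can be fed to any MDS method directly, treating each of the 480 rows as one point in a 640-dimensional space.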
<br />
Having obtained such a high dimensional dataset for our picture, we can visualize it with various MDS methods. The following picture shows visualizations of a sample photo with three different MDS methods, namely PCA, tSNE and RPM:<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://3.bp.blogspot.com/-AS8M38u6iZA/Ul7sQwnZzZI/AAAAAAAAJw4/ZJk4c7hUlVM/s1600/PicMDS1.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="640" src="http://3.bp.blogspot.com/-AS8M38u6iZA/Ul7sQwnZzZI/AAAAAAAAJw4/ZJk4c7hUlVM/s640/PicMDS1.jpg" width="612" /></a></div>
We see that all three methods reveal, more or less, a serial structure among the data points (i.e. rows of the picture.) When using MDS methods to visualize high dimensional data, the first question we ask is often: Are the maps reliable? Asked differently: do those geometrical shapes in the maps reflect some nature of the initial dataset, or are they just random results of those algorithms? To answer this question, I have run RPM and tSNE multiple times; all runs produced, more or less, similar maps (the PCA algorithm is deterministic and will therefore always produce the same 2D map.) The following picture shows two more maps produced by the RPM and tSNE algorithms:<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://3.bp.blogspot.com/-gKd9QBmTxtg/Ul7xJeIp9-I/AAAAAAAAJxI/7b6tUy7rHyM/s1600/PicMDS2.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="382" src="http://3.bp.blogspot.com/-gKd9QBmTxtg/Ul7xJeIp9-I/AAAAAAAAJxI/7b6tUy7rHyM/s640/PicMDS2.jpg" width="640" /></a></div>
Comparing the above two pictures, we see that both the RPM and tSNE algorithms produced repeatable results across different runs despite their non-deterministic nature. Going one step further, we could ask what kind of maps we'll get when we alter the sample dataset slightly. To answer this question, I have created a RPM map and a tSNE map with half of the rows and half of the columns, by choosing every other row and column of the matrix (i.e. only 1/4 of the data matrix). The following picture shows the resulting maps:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://3.bp.blogspot.com/-Nr5hlx5BxQ4/Ul_w7qgsPSI/AAAAAAAAJx4/SO7zc2lYjD8/s1600/PicMDS2b.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="328" src="http://3.bp.blogspot.com/-Nr5hlx5BxQ4/Ul_w7qgsPSI/AAAAAAAAJx4/SO7zc2lYjD8/s1600/PicMDS2b.jpg" width="640" /></a></div>
<br />
The above maps show apparent similarity with the maps in the previous picture. Thus, we have strong evidence that the geometrical characteristics in these maps correlate with characteristics of the initial sample picture. We can see such correlation more clearly when we examine how regions of the map represent sections of the sample picture. The following picture shows how some sections of the tSNE map correspond to sections of the sample picture:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://4.bp.blogspot.com/-XKOF5A5orCs/Ul72tunAjII/AAAAAAAAJxY/Gkrd44KwUYI/s1600/PicMDS3.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="298" src="http://4.bp.blogspot.com/-XKOF5A5orCs/Ul72tunAjII/AAAAAAAAJxY/Gkrd44KwUYI/s640/PicMDS3.jpg" width="640" /></a></div>
<br />
We notice that MDS maps topologically have a simple sequential structure, so no information is embedded in the topological structure. MDS maps rather carry information through geometrical characteristics: for instance, a section of smooth curves; a large arch; a section of points with lower density; etc. Also, two distant sections winding together in a particular way may indicate a special feature of the underlying data. Furthermore, when we use 3D MDS maps, those <i>secondary structures</i> over the sequential structure may reveal much richer information about the high dimensional data. Thus, MDS maps may offer new ways to describe and analyse the initial picture.<br />
<br />
We might ask here: why are the MDS maps of the picture sequential? To answer this question, let's imagine that we scan the picture from top to bottom, line by line, and store each line in a vector; this vector then gradually changes from one line to the next. Thus, the line vector manifests a kind of <i>random walk</i> in the high dimensional space, mostly with relatively small steps. Such random-walk-like datasets exist in many areas of data analysis. For instance, a while ago I blogged about <a href="http://jamesxli.blogspot.ca/2010/10/principal-components-analysis-pca-of.html">one such case</a> where the state of the whole stock market was considered a random-walk process. We notice also that the MDS maps shown above resemble, more or less, the maps in the blog post <a href="http://jamesxli.blogspot.ca/2011/04/self-similarity-of-high-dimensional.html">self-similarity of high dimensional random-walk</a>.<br />
<h3 class="post-title entry-title" style="background-color: white; color: #333333; font-family: Verdana, Arial, sans-serif; font-size: 16px; line-height: 1.1em; margin: 0px; padding: 0px;">
</h3>
<div>
The following picture shows the price history of S&P 500 components for a year, and the corresponding RPM map where each dot represents the state of the stock market at a day:</div>
<div>
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="http://1.bp.blogspot.com/-d0GvIipY9qk/Ul8qZ1ZaFtI/AAAAAAAAJxo/DSPWZVpc43k/s1600/MDSPic4.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="307" src="http://1.bp.blogspot.com/-d0GvIipY9qk/Ul8qZ1ZaFtI/AAAAAAAAJxo/DSPWZVpc43k/s640/MDSPic4.jpg" width="640" /></a></div>
<div>
<br /></div>
We notice the apparent similarity of the above RPM map to the one produced for our sample picture previously. Thus, notions and methods developed for analyzing pictures may apply to a broader scope of data.<br />
<br />
One interesting question arising in view of the above picture is: can we produce a 2D picture (maybe not as nice as the photo of Lena, but with easily recognizable features) out of a high dimensional dataset like the S&P 500 dataset? The answer is pretty much positive, and I'll blog about this in another post where I will talk about <i>multidimensional sorting</i>.<br />
<br />
Finally, the sample picture and maps in this blog post were produced with VisuMap version 4.0.895, and the sample dataset can be downloaded from the <a href="http://www.visumap.com/rc1.aspx?f=ShapeOfPictures.xvmz&g=res">link here</a>. The following short video shows some basic steps to apply the mentioned services:<br />
<br /></div>
<iframe allowfullscreen="allowfullscreen" frameborder="0" height="360" src="http://www.youtube.com/embed/kubK87Zqy-c?hd=1&vq=hd720" width="640"></iframe>
<br />
<br />
<br /></div>
James Lihttp://www.blogger.com/profile/17491271097854752129noreply@blogger.com0tag:blogger.com,1999:blog-710383708872670121.post-80592929181082185162013-05-30T10:20:00.000-07:002014-09-24T11:03:47.538-07:00Visual Data Cleansing with VisuMap<div dir="ltr" style="text-align: left;" trbidi="on">
<div dir="ltr" style="text-align: left;" trbidi="on">
I have just uploaded a short video (7min) that demonstrates how to do data cleansing visually with VisuMap. The sample dataset used in the video can be downloaded from <a href="http://www.visumap.net/index.aspx?p=Resources/VisuMapDatasets">here</a>.<br />
<br />
<br /></div>
<iframe allowfullscreen="allowfullscreen" frameborder="0" height="360" src="http://www.youtube.com/embed/Urt2JPgrxUQ?hd=1&vq=hd720" width="640"></iframe>
</div>
James Lihttp://www.blogger.com/profile/17491271097854752129noreply@blogger.com0tag:blogger.com,1999:blog-710383708872670121.post-52375659136578384642013-05-07T09:48:00.001-07:002014-09-24T11:04:16.628-07:00Sorting high dimensional data with RPM maps<div dir="ltr" style="text-align: left;" trbidi="on">
I have recently <a href="http://jamesxli.blogspot.ca/2013/05/daily-price-of-s-companies-in-last-5.html">published a dataset</a> that contains the daily prices of the S&P 500 component companies over a period of 5 years. The <i>mountain view</i> of VisuMap provides a pretty good overview of the whole dataset in 3D style. The following short video clip shows the normalized values (so that all price histories start from 1.0) of the dataset:<br />
<br />
<iframe allowfullscreen="allowfullscreen" frameborder="0" height="360" src="http://www.youtube.com/embed/Ih9EmUgXEYA?hd=1&vq=hd720" width="640"></iframe>
<br />
<br />
We notice that the curves are colored using the k-Mean algorithm, so that stocks with similar development have common colors. However, these curves are not ordered according to their group, but according to the alphabetic order of their stock tickers, so that curves of different colors are all mixed with each other.<br />
<br />
A simple way to enhance the 3D view is to re-order the curves so that curves of the same group are located together. We can do this easily with the heat-map view of VisuMap. However, this method does not order the groups themselves in a meaningful way: groups placed next to each other are not necessarily similar.<br />
<br />
The RPM map provides a better way to sort high dimensional data by similarity. To do this with VisuMap, we first create a 3D RPM map for the dataset with a very large width (e.g. 1200 pixels) but a small height and depth (e.g. 50 pixels), so that the RPM map geometrically resembles a ring. We then sort the data points by their x-coordinates in the 3D RPM map, which determine their positions on the ring. The nature of the RPM algorithm ensures that data points located close together on the RPM map are similar to each other.<br />
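The sorting step itself is straightforward once the map coordinates are available. Assuming the 3D RPM coordinates have been exported as (x, y, z) triples (the tickers and coordinates below are made up for illustration, since the RPM map itself is computed inside VisuMap), the re-ordering is just a sort on the x column:

```python
# Hypothetical exported 3D RPM coordinates, one (x, y, z) triple per data point.
# The map is laid out as a long thin ring (width ~1200, height/depth ~50),
# so the x-coordinate effectively encodes each point's position along the ring.
rpm_coords = {
    "MSFT": (812.4, 21.0, 17.5),
    "AAPL": (103.9, 34.2, 12.1),
    "XOM":  (455.0, 11.8, 40.3),
    "CVX":  (448.7, 29.5, 38.9),
}

# Sort the data points by their x-coordinate; neighbours in this ordering
# are close on the ring, hence similar in the original high-dimensional space.
ordering = sorted(rpm_coords, key=lambda name: rpm_coords[name][0])
```

In this made-up example XOM and CVX receive nearby x-coordinates and therefore land next to each other in the sorted order, which is exactly the effect the ring-shaped RPM map is meant to produce.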
<br />
The following short video shows how to do this with VisuMap. Notice that the 3D RPM map has already been created. To reorder the data points, we opened a table of the XYZ coordinates and sorted it on the X-coordinate column. The mountain view has been configured to automatically re-order its content when a re-ordering event occurs.<br />
<br />
<br />
<iframe allowfullscreen="allowfullscreen" frameborder="0" height="360" src="http://www.youtube.com/embed/hsK8tkz1F1k?hd=1&vq=hd720" width="640"></iframe>
<br />
<br /></div>
James Lihttp://www.blogger.com/profile/17491271097854752129noreply@blogger.com0tag:blogger.com,1999:blog-710383708872670121.post-62660417074363005552013-05-03T17:05:00.002-07:002014-09-24T11:04:53.843-07:00Daily Price of S&P Companies in last 5 years.<div dir="ltr" style="text-align: left;" trbidi="on">
I have just released a new VisuMap dataset that contains the daily prices of the S&P 500 companies for the years 2008 to 2012. This dataset, available at <a href="http://www.visumap.com/index.aspx?p=Resources/VisuMapDatasets">our web site</a>, provides convenient access to the stock prices for that period. The dataset also contains scripts that automatically download the historical data from Google's servers.<br />
<br />
The following is a short video displaying the data in the 3D mountain view:<br />
<br />
<iframe allowfullscreen="allowfullscreen" frameborder="0" height="360" src="http://www.youtube.com/embed/u6qN-9JDm-A?hd=1&vq=hd720" width="640"></iframe>
</div>
James Lihttp://www.blogger.com/profile/17491271097854752129noreply@blogger.com0