Saturday, November 8, 2014

Geometric Modeling of Sequential Data (continued)

In a previous note, Geometrical Modeling of Sequential Data (GMS), I described a framework to convert discrete sequential data into 3D geometrical shapes. This note extends GMS to a more general form that allows more efficient sampling of higher dimensional vectors from sequence data.

The GMS framework

Recall that GMS consists of three basic steps: 1. Scanning a sequence to produce a collection of high dimensional vectors; 2. Reordering the components of these vectors to harmonize the sampled vectors; 3. Applying a dimensionality reduction algorithm to embed these vectors into a low dimensional space. For the first step, we start with a sequence sk∈S; k=0, 1, 2, ..., K; where S is a finite alphabet. Imagine that the sequence has been put through a scanning machine that quickly takes a series of snapshots of the part that is currently in the machine:


In the above diagram, the illustrated scanning machine is sized to hold 3 nodes of the sequence. As output, the machine produces a series of sampling vectors Vt for a series of discrete time points. A sampling vector Vt is actually a 2n+1 dimensional vector (α0, ..., αn-1; sk, ..., sk+n-1; t). The above diagram illustrates the case n=3. Here, sk, ..., sk+n-1 are the types of the n nodes currently passing through the machine; α0, ..., αn-1 are the coefficients, called amplitudes, of the corresponding nodes. For the sake of compactness we denote the pair (αk, sk) simply as αksk in the above diagram.

The amplitude αk for the k-th node at time t is calculated as follows:


The right side of the above formula is called the amplifying function. This amplifying function implies that the node sequence passes through the scanning machine at a speed of one node per time unit. In general, any continuous function that is zero at the points 0 and n could be used as an amplifying function.
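As a concrete sketch of the first step, the snippet below samples a vector Vt from a sequence. Since the amplifying formula is shown only in the figure above, the sine form used here is my own assumption; it is merely one continuous function that vanishes at the points 0 and n:

```python
import math

def amplify(x, n):
    # A hypothetical amplifying function: continuous, zero at 0 and at n.
    return math.sin(math.pi * x / n)

def sample(seq, n, t):
    """Return the sampling vector Vt = (amplitudes; node types; t) for the
    n nodes inside a scanning machine of size n, assuming the sequence
    passes through at one node per time unit."""
    k0 = int(math.floor(t))                  # newest node in the machine
    ks = range(k0 - n + 1, k0 + 1)           # nodes currently inside
    amps = [amplify(t - k, n) for k in ks]   # entering/exiting nodes get
    nodes = [seq[k] for k in ks]             # amplitudes close to zero
    return amps, nodes, t

amps, nodes, t = sample("ACGTACGTACGT", n=3, t=5.5)
# amps -> [0.5, 1.0, 0.5]; nodes -> ['T', 'A', 'C']
```

The node in the middle of the machine gets the largest amplitude, while nodes near the entry and exit points get amplitudes close to zero, as required by the continuity argument below.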

We notice that the above formula specifies a continuous function with respect to t as long as no new node enters the machine to replace an old one. In the case that a new node, say sk-1, enters the machine to replace an old one, say sk+2, within a small interval between times t and t', the sampling vector will undergo the following change:

Since the amplifying function vanishes at the entry point 0 and the exit point n, the amplitudes α2 and α'0 will be close to zero; and since the amplifying function is continuous, we will have α0≈α'1 and α1≈α'2. Thus, when we apply a circular shift to Vt' as illustrated in the following diagram, the shifted vector Vt' will be close to Vt:

Notice that the third components of Vt' and Vt may be values for different node types (e.g. sk-1 ≠ sk+2), but the amplitudes α2 and α'0 are close to zero, so that Vt' will be close to Vt as high dimensional vectors. Thus, after applying the circular shifting operation, the scanning machine produces a series of gradually changing vectors in high dimensional space.

More generally, for the implementation of the second step, the scanning machine will have a circular shifting operator to circularly shift the sampled vectors; the shifting operator will increment its shifting length by 1 every time a new node enters the machine.
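The second step can be sketched in code as follows. The sine amplifying function is a hypothetical choice of my own; the point to verify is that, with the circular shift applied, the sampled vectors change only gradually when a node is replaced:

```python
import math

def amplify(x, n):
    # Hypothetical amplifying function: continuous, zero at 0 and at n.
    return math.sin(math.pi * x / n)

def harmonized_sample(seq, n, t):
    """Sampling vector with the circular shift of the second step applied.
    The shift length grows by one (mod n) each time a new node enters, so
    a node keeps the same component index for as long as it stays inside
    the machine, and consecutive vectors change only gradually."""
    k0 = int(math.floor(t))                      # newest node in the machine
    ks = list(range(k0 - n + 1, k0 + 1))
    amps = [amplify(t - k, n) for k in ks]
    nodes = [seq[k] for k in ks]
    d = k0 % n                                   # current shift length
    rot = lambda v: v[-d:] + v[:-d] if d else v  # index i -> (i + d) mod n
    return rot(amps), rot(nodes), t

# A new node enters the machine between t=2.9 and t=3.1, yet the
# harmonized vectors stay close: the replaced component ('A' -> 'D')
# has a near-zero amplitude on both sides of the transition.
a1, n1, _ = harmonized_sample("ABCDEFG", 3, 2.9)
a2, n2, _ = harmonized_sample("ABCDEFG", 3, 3.1)
```

Without the shift, every node replacement would jump all amplitudes to neighboring components at once, producing a discontinuous trajectory in the high dimensional space.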

As the third step, a dimensionality reduction algorithm is applied to the vectors to embed them into a low dimensional space. For this study I picked the affinity embedding (AE) algorithm. I used the t-SNE algorithm in the previous note, but in my experiments AE normally worked better for this purpose, as it runs much faster on large datasets. To apply the AE algorithm we need either a distance metric that measures the dissimilarity between data points, or an affinity function that measures the similarity (a kind of attraction) between data points. For this study, we use the following affinity function:


where the summation runs over all k between 0 and n-1 such that sk = s'k; λ is an algorithmic constant that can be any positive value. In plain words, if we define the affinity between two nodes as the product of their amplitudes, then the affinity between two sampling vectors is the sum of the affinities between all matching nodes, decayed by the time elapsed between the two samplings. The constant λ, called the decay speed, controls how fast the affinity diminishes as the time elapsed between the two samplings grows. More precisely, the affinity will reduce to half of its value if the two sampling vectors are separated by n/λ nodes.
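A minimal sketch of this affinity function follows. Since the exact formula is shown only in the figure above, the decay factor 2^(−λ|t−t'|/n) used here is an assumption; under it the affinity halves when the two samplings are n/λ nodes apart, and λ=0 means no decay at all:

```python
def affinity(v1, v2, lam, n):
    """Affinity between two sampling vectors (amplitudes, node types, time).
    Matching nodes contribute the product of their amplitudes; the sum is
    then decayed by the elapsed time. The decay form 2**(-lam*dt/n) is an
    assumption that halves the affinity every n/lam time units."""
    amps1, nodes1, t1 = v1
    amps2, nodes2, t2 = v2
    s = sum(a1 * a2
            for a1, s1, a2, s2 in zip(amps1, nodes1, amps2, nodes2)
            if s1 == s2)                   # only matching node types count
    return s * 2.0 ** (-lam * abs(t1 - t2) / n)

# Two vectors sampled n/lam = 2 time units apart: only the first components
# match ('A' == 'A'), contributing 1.0 * 0.5 = 0.5, which decays to 0.25.
a = affinity(([1.0, 0.5], ['A', 'C'], 0.0),
             ([0.5, 0.5], ['A', 'G'], 2.0), lam=1.0, n=2)
# a -> 0.25
```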

The effect of the scanning size n

The scanning size n is a key parameter of the scanning machine: it determines how many consecutive nodes of the sequence are read (or scanned) to construct an output vector. For a larger scanning size, the affinities between data points aggregate over a larger number of consecutive nodes. Thus, the scanning size controls a kind of granularity of the scanning machine.

In order to see the effect of the scanning size, I downloaded the DNA sequence of a relatively short gene, the CD8 gene, which consists of 744 base pairs. This sequence was then processed by the GMS framework to create 25000 data points, which were then embedded into 3D space. The following video shows the resulting 3D maps created with different scanning sizes:

We see clearly that the 3D curve gradually becomes simpler and smoother as the scanning size grows from 3 to 24.

The effect of the decay speed λ

The decay speed λ provides a way to differentiate samplings created at different times. A larger λ means that the affinity between similar node patterns will diminish faster as the elapsed time between them grows. To demonstrate the effect of the decay speed I created a series of maps for the CD8 gene sequence with increasing λ. The following short video shows those maps:


We can see clearly that the map stretches gradually as λ grows from 0 to 0.08.

Implementing GMS in VisuMap

VisuMap version 4.2.905 supports the GMS framework through two new metrics, "Sequence Metric" and "Sequence Affinity". The former is a distance metric that can be used by most mapping algorithms; the latter is an affinity metric that can only be used by the affinity embedding algorithm.

In order to create a model for a sequence with VisuMap, we first create a data table with a single column and with 5 to 10 thousand rows. The content of the table is not relevant for the modeling; only the number of rows is used by the scanning machine as the number of samplings. (The link SeqVis provides such a sample dataset together with some sample maps and supporting scripts.) We then open the map configuration window, select "Sequence Affinity" as the metric, and specify a new filter as the settings for the scanning machine. The following picture shows the settings for the CD8 sequence used in the previous examples:


Notice that the field labeled "Stretch Factor" sets the value of the decay speed, since in normal cases this parameter determines how far the resulting maps will be stretched along the sequence direction.

Also notice that spaces and newline characters in the sequence entered in the Sequence Definition window will be ignored by the scanning machine. With this filter definition window we can easily generate 3D maps for arbitrary sequences. The following pictures show some 3D maps together with their corresponding sequences:






Discussions

We have extended the GMS framework with larger scanning sizes and a more efficient dimensionality reduction algorithm. With these extensions we can model much larger sequences from different perspectives, and therefore capture more information from more realistic sequences.

We notice that the proposed framework has a certain similarity to the ribosome, the machine that translates RNA sequences into proteins. Just as biologists believe that all macroscopic patterns and features are ultimately encoded in DNA sequences, I believe that GMS can capture a large class of relevant patterns in discrete sequences with 3-dimensional geometrical models. GMS thus offers us a toy ribosome machine to simulate the translation of sequential information into geometrical models.