## Tuesday, March 24, 2015

### On the origin of the helix structure from the view point of GMS

Helix structure has been prevalent in biological world. It can be fund in small scale like the folding of DNA and protein sequences, and in large scale like plants.

Whereas the mathematical description of the helix structure is clear; the mechanism that gives rise to such structure is not so obvious. Do all those structure share a common mechanism? Why don't they show up in inorganic world?  This note tries to demonstrate with GMS samples that helix structure comes about by some simple dynamics rooted in the discrete sequence structure;

Recall that GMS produces a sequence of high dimensional vectors from a discrete sequences; and the high dimensional vectors are embedded in to low dimensional space according their affinity defined in the following affinity function:

Where t>t' are the timestamps of two vectors produced at type t and t'. We notice that the affinity function consists of two parts:
and

The first part is only dependent on the timestamp: it reduces the affinity calculated by the second part exponentially depending on time separating these two vectors. It is the second part that accounts for variations in the sequence, si.

In order the see the effect of decay in time dimension, I modified the loopy GMS algorithm so that the affinity function only contains the first part while the second part (the sequence dependent part) is set to the constant 1.0. I ran the algorithm with following parameters: n=25000; K=10; the sequence is the constant sequence with 100 copies of the letter 'A'; the decay speed λ is successively set to 0.125, 0.25, 0.5 and 1.0. The following screen cast shows the resulting corresponding GMS maps in 3D space:

These maps show clearly the helix alike structure; and the winding number goes up as the as decay speed λ goes higher.

To verify that it is the exponential decaying that leads to helix alike structure, I have replaced the affinity function with three different "decaying" functions: Δt-2, Δt-1 and Δt-0.5 with Δt := t-t'. The following pictures shows the corresponding GMS maps:

We can see clearly that these decay functions result in structures that are totally different from the helix structure. Thus, these simulations indicate that the exponential decay of affinity plays a significant role in forming the helix structure.

We notice that if the scanning size K is sufficiently large and the sequence is random, the affinity contribution of the second (sequence dependent) part will, more or less, become constant. Thus, the helix structure may serve as model for completely random sequences. From this point of view, we might call the helix structure the no-information base model. Additional information in sequences should manifest in discrepancy between their GMS maps and the helix structure.

Seeing some π symbols in some formula, the great physicist Richard Fynman once asked: Where is the circle?  In terms of modern genetics, his remark basically assumes that π  is the "gene" for the circle as a phenotype feature. Many such analytical "genes" are carried forward and spread around in various fields, they manifest in different forms, but always keep their intrinsic unchanged.  Here, this note tried to demonstrate that the "gene" for helix structure is the exponential decay of affinity.

## Monday, February 2, 2015

### On Loopy GMS (Geometrical Modeling of Sequences)

This note is going to apply the GMS model to loop structured sequences. The adaptation from previous serial structure to loopy structure is straightforward: the only change is, as illustrated in the following diagram, that the two ends of the serial sequence are now connected to each other.

When applying the GMS model, the scanning machine ( or the scanner ) runs over the loopy sequence and produces a series of high dimensional vectors which are argumented with timestamps. Those high dimensional vectors will then be embedded into low dimensional space with the affinity embedding algorithm.

The first question arise in this new scenario is whether the GMS model will produce loop alike geometrical shapes, when the scanner runs a whole loop back to its initial location. The answer to this question is yes, but under two conditions. First, the decay speed parameter λ used to calculate the affinity between vectors must be zero, so that the effect of timestamps will be nullified. Otherwise, if λ is not zero, different vectors will be produced when the scanner comes back to its initial position.

Secondly, the total number of nodes, L, in the loop must be a multiple of the dimension of the output vector (without the timestampe component), i.e. the parameter K denoted in previous note. This requirement is necessary because of the circular shifting of the scanned vectors. If L is not a multiple of K, after a whole loop the scanner will produce the same set of  values, but circularly shifted to different order, so that they will normally be different as vectors in the high dimensional space.

The following pictures shows the resulting maps produced by GMS for a short loopy sequence for different cases discussed above.

Figure 1: Conditions for loopy output maps from loopy sequence. The input sequence is "CCC TGT GGA GCC GGA GCC ACA AGT", K=6. (A) The decay speed λ=0.1, the resulting map is a broken loop in the 3D space;  (B) The sequence is extended with an extra node 'G', so that K is not a multiple of L. The resulting map is a broken loop.  (C) λ = 0 and L is a multiple of K, the resulting map is a loop in the 3D space.

With a loopy sequence, the scanner can in general run multiple rounds around the loop sequence. In this way, the scanner can produce multiple sets of vectors offset just by different timestamps. These vectors sets form repeating patterns when embedded in to low dimensional space. The following short video shows the resulting maps for above sequence for repeat number n = 1, 2, 5 and 10. The decay speed λ is set to 0.1. We notice that in these maps, the repeated structure gradually become flattened when the repeat number increases to high number. Consequently, the output map changes from simple repeated structure to tube alike shape. We notice that both repeated structures  and tubes are pretty frequent structures in biological systems as phenotypic traits.

We notice that running the scanner multiple loops is effectively the same as scannng multiple copies of a sequence sequentially. Thus, the loopy scanning might be considered as a manifestation (or extension ) of sequential scanning where the sequence is a the concatenation of variable number of copies of a reference sequence.

Another extension of the loopy GMS model is, as depicted in the following diagram, using two scanners to scan the loop in the opposite orientations, then concatenate their outputs and a timestampe to form final high dimensional vectors.

The loopy GMS model with dual scanners normally produces symmetrical shapes. The reason for this is that for each configuration of the scanner (i.e. their specific positions on the loop during the scanning) there is always a "dual" configuration in which the two scanners simply swapped their positions. Because of this, the whole collection of output vectors may be split into two sets which differ from each other just by the different timestamps. The following short video shows various symmetrical shapes produced by loopy GMS with dual scanners for various sequences:

Discussion

This note has experimentally demonstrated that GMS model with simple extension may produce interesting macroscopical shapes, like repeating pattern, tubes and symmetrical structures. Understanding how those discrete sequences give rise to geometrical pattern might help us to investigate how genetic code determines phenotypical traits biological organism.

## Saturday, November 8, 2014

### Geometric Modeling of Sequential Data (continued)

In a previous note geometrical modeling of sequential data (GMS) I have described a framework to convert discrete sequential data to 3D geometrical shapes. This note is going to extend GMS to a general form that allows more efficient sampling of higher dimensional data from sequence data.

The GMS framework

Recall that GMS consists of three basic steps: 1. Scanning a sequence to produces a collection of high dimensional vectors; 2: Reordering the components of  these vectors to harmonize sampled vectors; 3: Applying a dimensionality algorithm to embed these vectors into low dimensional space. For the first step, we start with a sequence sk∈S;  k=0, 1, 2, ..., K; where S is a finite set of alphabets. Imaging that the sequence has been put through a scanning machine that quickly takes series of snapshots of the part that is just in the machine:

In above that diagram, the illustrated scanning machine is sized to hold 3 nodes of the sequence. As output the machines produces a series of  sampling vectors Vt for a series of discrete time points.  A sampling vector Vt is actually a 2n+1 dimensional vectors (α0,  ..., αn-1; sk, ..., sk+n-1; t). The above diagram illustrates the case n=3. Here, sk, ..., sk+n-1 are the type of the n nodes currently passing the machine; α0,  ..., αn-1 are the coefficients, called amplitudes, for the corresponding nodes. For the sack of compactness we denote the pair (αk, sk) simply as αksk in above diagram.

The amplitude αk for the k-th node at time t is calculated as following:

The right side of above formula is called the amplifying function. This amplifying function implies that the node sequence passes the scanning machine by the speed of one node per time unit.  In general, any continues function that is zero at point 0 and could be used as an amplifying function.

We notice that the above formula specifies a continues function with respect t as long as no new node entered the machine to replace an old one. In case that a new node, say sk-1, entered the machine to replace an old one, say sk+2, at a time between a small interval t and t';  the sampling vector will undergo the following change:

Since the amplifying function vanishes at the entry 0 and exit points n, the amplitude α2 and α'0 will be close to zero; and since the amplifying function is continues, we'll have α0≈α'1 and α1≈α'2. Thus,
when we apply a circular shifting on the Vt' as illustrated in the following diagram, the shifted vector t' will be close the Vt:

Notice that the third components of t' and Vt  may be values for different node types (e.g. sk-1 ≠ sk+2), but the amplitudes α2 and α'0 are close to zero, so that t' will be close the Vt as high dimensional vectors. Thus, after applied the circular shifting operation, the scanning machine will produces a series of gradually changing vectors in high dimensional space.

More generally for the implementation of the second step, the scanning machine will have a circular shifting operator to cicularly shift the sampled vectors; and the shifting operator will increment its shifting length by 1 every time when a new node entered the machine.

As third step, a dimensionality reduction algorithm will be applied on the vectors to embed them into a low dimension space. For this study I picked the affinity embedding (AE) algorithm. I have used the t-SNE algorithm in the previous note, but by my experiments, AE worked normally better for this purpose, as it runs much faster for large dataset. To apply AE algorithm we need either to define a metric distances to measure dissimilarity between data points; or an affinity function that measures the similarities (or kind of attraction ) between data points. For this study, we uses the following affinity function:

where the summation runs over all between 0 and n-1, such that  sk= s'k ; λ is an algorithmic constant that can be any positive values. In clear text, if we define the affinity between two nodes as the product of their amplitudes, then the affinity between two sampling vectors is then the sum of affinities between all matching nodes decayed by the time elapsed between the two samplings; The constant λ, called the decay speed, controls how fast that affinity diminishes depending on the time elapsed between the two samplings. More particularly, the real affinity will reduce to half of its value if the two sampling vectors are separated by λn nodes.

The effect of the scanning size n

The scanning size n is a key parameter for the scanning machine, it determines how many consecutive nodes of the sequence will be read (or scanned) to construct an output vector. For larger scanning size, the affinities between these data point will be an aggregation of larger number of consecutive nodes. Thus, the scanning size controls a kind of granularity of the scanning machine.

In order to see effect of the scanning size, I have downloaded the DNA sequence of a relative short gene, the CD8 gene that consists of 744 base pairs. This sequence will then be processed by the GMS framework to create 25000 data points; and then embedded in to the 3D space. The following video shows the resulting 3D maps created with different scanning size:

We see clearly that the 3D curve gradually become simpler and smoother as the scanning size grows from 3 to 24.

The effect of the decay speed λ

The decay speed λ provides a way to differentiate samplings created at different time. A larger λ means that the affinity between similar node pattern will diminish faster as the elapsed time between them grows longer. To demonstrate the effect of decay speed I have created a series of maps for the CD8 gene sequence with increasing λ. The following short video shows those maps:

We can see clearly that the map stretches gradually as λ grows from 0 to 0.08.

Implementation GMS in VisuMap

VisuMap version 4.2.905 offers support for GMS framework through two new metrics "Sequence Metric" and "Sequence Affinity". The former is a distance metric that can be used by most mapping algorithm, the latter is an affinity metric that can only be used by the affinity embedding algorithm.

In order to create a model for a sequence with VisuMap, we first create data table with a single columns and with 5 to 10 thousands rows.  The content of the table is not relevant for the modelling, only the number of rows will be used by the scanning machining as the number of samplings (The link SeqVis provides a such sample dataset together with some sample maps and supporting scripts.) We then open the map configuration window to select "Sequence Affinity" as the metric; and specify a new filter as settings for the scanning machine. The following pictures shows the settings for CD8 sequence used in previous examples.

Notice that the field labeled "Stretch Factor" set value for the decay speed, since in normal cases this parameter defines how far the resulting maps will be stretched along the sequences direction.

Also notice that spaces and newline characters in the sequence given in Sequence Definition window will be ignored by the scanning machine. With this filter definition window we can easily generate 3D maps for arbitrary sequences. The following pictures show some 3D maps together with their corresponding sequences:

Discussions

We have extended the GMS framework with large scanning size and an more efficient dimensionality reduction algorithm. With these extensions we can model much large large sequences from different perspectives; and therefore capture more information from more realistic sequences.

We notice that the proposed framework has certain similarity with the ribosomes machine that translates RNA sequences to proteins. Just like the biology scientists believe all  macroscopic  pattern and features are ultimately encoded in the DNA sequences, I believe that GMS can capture a large class of relevant patterns in discrete sequences with 3-dimensional geometrical models. GMS thus offers us a toy ribosomes machine to simulate the translation of sequential information to geometrical models.

## Tuesday, September 23, 2014

### On geometric modeling of sequential data

The biological cellular system appears to be a universal machine that is capable to translate discrete sequential data, i.e. genetic code, to all kinds of organisms with diverse phonetic features. In this note I am going to put forward a computational framework that translates discrete sequential data to lower dimensional geometrical shapes. Inspired by biological cellular system, this framework first samples high dimensional vectors from the discrete data; then it applies dimensionality reduction methods to embed those vectors into low dimensional space. The resulting map normally forms shapes with geometrical properties that reflect information in initial sequence.

Let's start with an input sequence of nucleic acids sk;  k=0, 1, 2, ..., K  with sk  ∈ S:={C, G, A, T}. To illustrate the sampling process, we image that the sequence is a series of marbles of 4 different imaginary colors. Mathematically,  these "colors", i.e. the set S, can be represented by 4 dimensional vectors each with just a single non-zero component as follows:

Now, image we have a camera that moves slowly from one end to the other end of the sequence; and during the movement we quickly take a series of pictures of the two closest marbles. These pictures record different colors and intensities of the two closest marbles. As depicted in the following schema, we code those "pictures" as 8 dimensional vectors:

In above schema, α and β are the intensities coefficients for the two closest marbles (they can be multiplied directly into the first half and second of the 8-vectors correspondingly). α and β are calculated as follows: We assume that the camera moves evenly one marble per time unit. So at given time, say t1,  the camera is located between k-th and k+1-th marble; then the phase of the camera is calculated as:   p1 :=  t1 - k ;  where k the largest integer smaller than  t1, i.e. k=⌊t1⌋. Then,  α and β are calculated as: α := cos(p1π/2);  β:=sin(p1π/2).  We notice here that when the camera moves smoothly from k to k+1-ε  for any infinitesimal small number ε,  the pair (α, β)  changes smooth from (1, 0) to (0, 1).

However, at the time point k+1, the pair (α, β) takes the value (1, 0), since the camera is now considered between the k+1-th and k+2-th marble.  That means, (α, β) as a time dependent 2-dimensional variable is not continues at the integer time points; Consequently, the corresponding  8-vectors in general won't change smoothly in the 8-dimensional space.

In order to avoid this discontinuity, we swap the two 4-vectors of a 8-vectors for those  vectors produces when ⌊t1⌋ is an odd integer. That means in above schema, at the time  t2, the 8-vector produced is swapped from v2 to v'2 as shown in the following diagram:
With this swapping operation, we can easily verify that the 8-dimensinal vectors sequence actually changes smoothly as the camera smoothly processes.

Additionally to these 8 components, the sampling process at each step also adds a timestamp ct as the 9-th components to the output vectors, where c is a small constant chosen as an algorithm parameter.

Summing-up above steps, the sampling process can be denoted as an operator that operates on discrete sequences s as following:
where t ∈[0, K] is the the time parameter; c is an algorithmic parameter; k:= ⌊t⌋ is the sequential index; α := cos(pπ/2) and β:=sin(pπ/2) with p:= t- k as intensity coefficients for the two closest marbles.

After we have sampled a set of 9-dimensional vectors, we then use a dimensionality reduction (DR) algorithm to embed those vectors into 2- or 3-dimensional space. For our framework, we need a DR algorithm that is strong in preserving local information, since the "camera" basically samples only local information. Algorithms like RPM, CCAAffinity Embedding and t-SNE appear to be good candidates for this purpose. For this note I picked the t-SNE algorithm implemented in VisuMap software.

As first example we use our algorithm to model a short nucleic acid sequence "CCCTGTGGAGCCACACCCTAG". Notice that the timestamp coefficient in  above description is the only component that grows with time unlimitedly; whereas the other 8 components are confined to the range between 0 and 1.0.  Thus, larger c will likely stretch the model further apart. In order to show this property, we created four different data sets with c=0, 0.5/N, 2.5/N, and 10/N;  where N is the number of samplings, which is in this note 15000. The following four clips show the resulting geometric models of these data sets. For the sack of clarity, I have colored the data points with increasing brightness as time progresses; and when the scanning camera is at closest to a node, the corresponding sampling data point will be represented with an icon that indicates the type of the nucleic acid.

Above models showed clearly that the coefficient c  controls how far the model get stretched. The other 8 components contributes information to the folding structure.

The second example compares models created by reversed sequence. The following clip compares the model of previous sequence with the model of its reversed sequence. We can see that the two models are geometrically more or less unchanged, except that the gradient direction is reversed. This means that the described modeling method is invariant under the sequence inversion.

The third example shows how the model changes when the sequence is duplicated multiple times. The following video clip shows the model of our sample sequence and the model of 5-time duplicates of that sequences. We can see that model of the 5-time duplicates resembles 5 time overlaps of the single sequence model.

As next example, we created a model for the protein sequence MELNSSFWTLIKTKMKSRNDNNKLLDTWLDPIEYVSTTGSADRP. The slightly more complicated model is shown in the following video clip:

We notice that the sampled vectors are all sparse vectors. In practical implementations, we can directly calculate the Euclidean distance between those vectors, without explicitly calculating those high dimensional vectors. In this way, we can model sequences of any base set. As final example
I have created a 3D model for the text sequence "One fish two fish red fish blue fish" as show in the following video clip:

Discussion

The framework described here demonstrates a new way to derive geometrical models from discrete data. The framework is abstract in the sense that it is independent of the physical and chemical properties of the sequence.The experiments demonstrates that high dimensional feature space might be an effective way, as an intermediate stage, to derive 3 dimensional phonetic structure.

We notice that the models created above are all basically 1-dimensional curves folded in different ways. For the future study, it would be interesting to develop 2-dimensional sampling method that produces 2- or 3-dimensional models.  More speculatively for future study, can we develop an evolutionary process to find a sequence that gives rise to models with certain properties?