## Tuesday, September 23, 2014

### On geometric modeling of sequential data

The biological cellular system appears to be a universal machine that is capable to translate discrete sequential data, i.e. genetic code, to all kinds of organisms with diverse phonetic features. In this note I am going to put forward a computational framework that translates discrete sequential data to lower dimensional geometrical shapes. Inspired by biological cellular system, this framework first samples high dimensional vectors from the discrete data; then it applies dimensionality reduction methods to embed those vectors into low dimensional space. The resulting map normally forms shapes with geometrical properties that reflect information in initial sequence.

Let's start with an input sequence of nucleic acids sk;  k=0, 1, 2, ..., K  with sk  ∈ S:={C, G, A, T}. To illustrate the sampling process, we image that the sequence is a series of marbles of 4 different imaginary colors. Mathematically,  these "colors", i.e. the set S, can be represented by 4 dimensional vectors each with just a single non-zero component as follows:

Now, image we have a camera that moves slowly from one end to the other end of the sequence; and during the movement we quickly take a series of pictures of the two closest marbles. These pictures record different colors and intensities of the two closest marbles. As depicted in the following schema, we code those "pictures" as 8 dimensional vectors:

In above schema, α and β are the intensities coefficients for the two closest marbles (they can be multiplied directly into the first half and second of the 8-vectors correspondingly). α and β are calculated as follows: We assume that the camera moves evenly one marble per time unit. So at given time, say t1,  the camera is located between k-th and k+1-th marble; then the phase of the camera is calculated as:   p1 :=  t1 - k ;  where k the largest integer smaller than  t1, i.e. k=⌊t1⌋. Then,  α and β are calculated as: α := cos(p1π/2);  β:=sin(p1π/2).  We notice here that when the camera moves smoothly from k to k+1-ε  for any infinitesimal small number ε,  the pair (α, β)  changes smooth from (1, 0) to (0, 1).

However, at the time point k+1, the pair (α, β) takes the value (1, 0), since the camera is now considered between the k+1-th and k+2-th marble.  That means, (α, β) as a time dependent 2-dimensional variable is not continues at the integer time points; Consequently, the corresponding  8-vectors in general won't change smoothly in the 8-dimensional space.

In order to avoid this discontinuity, we swap the two 4-vectors of a 8-vectors for those  vectors produces when ⌊t1⌋ is an odd integer. That means in above schema, at the time  t2, the 8-vector produced is swapped from v2 to v'2 as shown in the following diagram:
With this swapping operation, we can easily verify that the 8-dimensinal vectors sequence actually changes smoothly as the camera smoothly processes.

Additionally to these 8 components, the sampling process at each step also adds a timestamp ct as the 9-th components to the output vectors, where c is a small constant chosen as an algorithm parameter.

Summing-up above steps, the sampling process can be denoted as an operator that operates on discrete sequences s as following:
where t ∈[0, K] is the the time parameter; c is an algorithmic parameter; k:= ⌊t⌋ is the sequential index; α := cos(pπ/2) and β:=sin(pπ/2) with p:= t- k as intensity coefficients for the two closest marbles.

After we have sampled a set of 9-dimensional vectors, we then use a dimensionality reduction (DR) algorithm to embed those vectors into 2- or 3-dimensional space. For our framework, we need a DR algorithm that is strong in preserving local information, since the "camera" basically samples only local information. Algorithms like RPM, CCAAffinity Embedding and t-SNE appear to be good candidates for this purpose. For this note I picked the t-SNE algorithm implemented in VisuMap software.

As first example we use our algorithm to model a short nucleic acid sequence "CCCTGTGGAGCCACACCCTAG". Notice that the timestamp coefficient in  above description is the only component that grows with time unlimitedly; whereas the other 8 components are confined to the range between 0 and 1.0.  Thus, larger c will likely stretch the model further apart. In order to show this property, we created four different data sets with c=0, 0.5/N, 2.5/N, and 10/N;  where N is the number of samplings, which is in this note 15000. The following four clips show the resulting geometric models of these data sets. For the sack of clarity, I have colored the data points with increasing brightness as time progresses; and when the scanning camera is at closest to a node, the corresponding sampling data point will be represented with an icon that indicates the type of the nucleic acid.

Above models showed clearly that the coefficient c  controls how far the model get stretched. The other 8 components contributes information to the folding structure.

The second example compares models created by reversed sequence. The following clip compares the model of previous sequence with the model of its reversed sequence. We can see that the two models are geometrically more or less unchanged, except that the gradient direction is reversed. This means that the described modeling method is invariant under the sequence inversion.

The third example shows how the model changes when the sequence is duplicated multiple times. The following video clip shows the model of our sample sequence and the model of 5-time duplicates of that sequences. We can see that model of the 5-time duplicates resembles 5 time overlaps of the single sequence model.

As next example, we created a model for the protein sequence MELNSSFWTLIKTKMKSRNDNNKLLDTWLDPIEYVSTTGSADRP. The slightly more complicated model is shown in the following video clip:

We notice that the sampled vectors are all sparse vectors. In practical implementations, we can directly calculate the Euclidean distance between those vectors, without explicitly calculating those high dimensional vectors. In this way, we can model sequences of any base set. As final example
I have created a 3D model for the text sequence "One fish two fish red fish blue fish" as show in the following video clip:

Discussion

The framework described here demonstrates a new way to derive geometrical models from discrete data. The framework is abstract in the sense that it is independent of the physical and chemical properties of the sequence.The experiments demonstrates that high dimensional feature space might be an effective way, as an intermediate stage, to derive 3 dimensional phonetic structure.

We notice that the models created above are all basically 1-dimensional curves folded in different ways. For the future study, it would be interesting to develop 2-dimensional sampling method that produces 2- or 3-dimensional models.  More speculatively for future study, can we develop an evolutionary process to find a sequence that gives rise to models with certain properties?