Wednesday, December 9, 2015

Visualizing Gene Sequences with complementary GMS

We now know that there are about 20 000 protein coding genes in the whole human genome. How do they relate to each other?  Is there any geometrical structure among them? What  shape do they form?  Aiming at these questions,  I'll put forward here a method to generate scatter plot from gene sequences that shows the shape of a gene collection.

In recently blogs, I have described GMS framework to profile gene sequences which translates discrete sequence to high dimensional curvature vectors. We could use this method to convert a collection of gene sequences to a set of high dimensional vectors and then apply MDS algorithm to embed them in a low dimensional space. However, if we restrict our focus on protein coding sequence (CDS), the length of CDS varies from few hundreds to tens of thousands nucleotides. That means, the profiling curvature vectors for CDS sequences are of very different dimensions. This in general causes difficulties for most MDS algorithms as they normally accept only vectors of a common dimension.

In order to obtain profiling vectors of the same length from sequences of variable length, we extended our GMS framework as follows: we first select a short reference sequence; Then for each CDS sequence, we mix the reference sequence with the CDS sequence and pass them together to the GMS framework. After the GMS framework embedded the two sequence into the low dimensional space, we then calculate the curvature profile of the curve of the reference sequence, and use it as a profiling vector for the CDS sequence. The modified framework is depicted as follows:

We notice that the reference sequence and the CDS sequence are embedded together in the same space, so that their curves occupy complementary area of the low dimensional space. That means the curvature profile of the reference sequence carries information about the complementary space of the CDS sequence. More generally, we can expect that different CDS sequences will produce different complementary profile as they have different impact on the reference sequence.

For this note I have downloaded the CDS sequences of the first chromosome of the human genome from the web site; There are about 2000 protein coding genes whose length very from few hundreds to more than 26 thousands nucleotides. As reference sequence, I picked a short CDS sequence of CDS collection (the gene ENSG00000187170) that has only 300 nucleotides. The following picture shows the complementary curvature profile the CDS sequences: each row of the heatmap shows the curvature of the reference sequence mixed with one CDS sequence; bright colors indicate high curvatures. The line chart under the heatmap shows the complementary profile of 4 selected CDS sequences marked in the heamap

We see that these profiles at large have high curvatures at about half dozens regions, but the height and exact location of those high curvature peaks depend on the corresponding CDS sequence. It is important to notice that I have in the previous blog demonstrated that the curvature profile is a characteristics that is invariant from the random nature of the mapping algorithm. Thus, those small variation are not due to the random nature of the mapping algorithm used, they ought to be attributed to different interaction (or "affinity") between the CDS sequence and the reference sequence.

Having obtained one complementary curvature vector for each CDS sequence, we can then apply a MDS algorithm on these high dimensional vectors to embed them in low dimensional space. For this purpose, I have used the tSNE algorithm. The following picture shows a 3D map for these CDS sequences:
Curvature Profile Map of 2045 CDS sequences created with tSNE algorithm

The above map is linked to a short video, you can click on the map to see the 3D map in animation. Also, for the purpose of easy exploration I have added different colors to the different regions on the map with the k-Mean clustering algorithm.

It should be noticed that the selection of the reference sequence has essential impact on the curvature profile and the final map. To see this, I have picked another sequence, the CDS sequence of gene with id ENSG00000158481 (with about 1000 nucleotides,) as reference sequence and repeated the whole procedure. The following picture shows the corresponding complementary curvature profile and the tSNE map of the 2045 CDS sequences:

Curvature Profile and tSNE map of the 2045 CDS sequences with ENSG00000158481 as reference sequence
We see that both the curvature profile and the final map are very different from those created previously with a different reference sequence.

It is not yet clear how the reference sequence impacts the complementary curvatures. Intuitively, the reference sequence plays a role like the a projection plane that we often use to project high dimensional data to 2-dimensional space for visualization purpose. Thus, selecting an appropriate reference sequence might help us to see interesting geometrical structure among large collection of gene sequences.

About the sample data

The sample genome data is downloaded from web site in fasta format. The fasta data file is imported into VisuMap with a special plugin. The zipped file here contains the VisuMap sample file, the plugin for importing fasta files and a script "CurvatureProfiling.js" to run the simulations. Before running the simulation, the plugin module GeneticAnalysis.dll need to be installed into VisuMap.

Saturday, September 19, 2015

Curvature as secondary profiling of GMS maps.

As a profiling method, GMS translates discrete sequences to space curves which help us to explore sequence features (treats) with help of geometrical shapes. Yet, space curves still pose a challenge for direct pattern recognition. This is especially so, when we try to explore patterns among large collection of  sequences.

In this note I'll put forward a method to use curvature to capture characteristic of GMS space curves. Some experiments with sample data will show that curvature profile are suitable for pattern analysis of large collection of GMS space curves.

Curvature of  space curves

Curvature is a concept introduced in mathematics to measure how far a segment of a geometrical shape deviates from being flat. Intuitively for space curves, the curvature of a curve at a point measures how strong the curve is bent at the point.

More formally, Let P1, P2, ..., be a series of points in the 3 dimensional space representing a space curve, then for the purpose of this note, we use the following simple formula to estimate the curvature, κk, at point Pk:
As shown in above diagram, α is the angle between the two segments PkPk+1 and Pk+1Pk+2; s is the length of the segment PkPk+2. If Pk are the 3-dimensional coordinate vectors of the points, then the above formula becomes:
Thus, the curvature  profile of space curve is a series real number values, κk, that measure how strong the curve is bent at each point. In practice, when we calculate curvature profile of a GMS curve, we don't need to calculate the curvature at each point, but just at points sparsely and evenly sampled from the the GMS curves. For the sake of simplicity, we simply select one point from for each input sequence node. That means, for a sequence with n nodes, we calculate a n-dimensional curvature profile. The following diagram shows the whole work-flow to calculate the curvature profile from a discrete sequence:

Above work-flow is mainly an extension of the one in a previous note, where more detailed explanation is provided. The only modification to the work-flow is an extra step that calculates  curvature profile from the GMS curve of the first clone. As will be discussed latter in this note, it has been turned out in the simulation experiments that the clones normally have about the same curvature profile.

Characteristics of curvature profile

Curvature profile has several properties which make it suitable to characterize GMS space curves. First, just from its definition we know that the curvature profile is rotation and shifting invariant, so that rotation and shifting information are automatically removed from curvature profile. Because of this running GMS algorithm multiple times on a sequence will results in the same curvature profile, even when the corresponding space curve are rotated and shifted differently due to some random nature within the GMS algorithm.

Going one step further we can change the number of clones in the GMS algorithm. The resulting spaces curves for the clones normally vary systematically in shape and size; Yet their corresponding curvature profiles are more or less similar to each other. The following diagram shows the curvature profiles of  5 clones of sequence with 5 fold cloning:

We notice in above picture that the curvature profiles for the middle or inner clones have noticeably larger peaks than those of the outer clones; and there is symmetry alone the middle (the third) profile. These curvature profiles shows that all 5 space curves bend more or less at common positions and the inner curves bend slightly more than those outer curves. On the right side of above picture are the 5 curvature profiles displayed as a heatmap. As will be seen below, heatmap offers a better way to visualize large collection of curvature profiles.

Another interesting property of the curvature profile is that they are in general independent on the sampling frequency of the GMS algorithm. High frequency sampling normally results in smoother space curves, but their shape remain mostly unchanged. The following picture shows the curvature profile of another sequence with sampling frequency of 6, 8, 10, 12 and 14 scans-per-node. The 5 curvature profile are almost identical.

Profiling Transmutation

In addition to the invariant characteristics, the curvature profiles of similar sequences are in general similar. This property enables us to study systematical variation among large collections of sequences by visualizing their corresponding curvature profiles.  We demonstrate this method here with an example that simulates a transmutation process. As the be shown in the following diagram, transmutation is the process that one sequence gradually mutates to another sequence.

For the sake of simplicity we assume that both sequences, A and B, have the same length N. A simple transmutation process is that an increasingly longer sub-sequence A replaces a corresponding sub-section in B. More particularly, the simple transmutation is realized by the sequence Tk(A, B) := A[1, k]*B[k+1, N]; where k=1, 2, ..., N; A[1,k] is the sequence of the first k nodes of A; B[K+1, N] is the sequence of the last N-k nodes of B; * is the concatenation operation.

With help of GMS algortihm we can calculate the space curves of Tk; k=1,2,..., N; and then calculate their curvature profiles as N-dimensional vectors. For this example we have used two sequences of 1377 nodes (one of them is the CCDS of CD4); The following picture shows their 1377 curvature profiles displayed as an single heat map (the three curve charts on the right side are profiles of the transmutation sequence at 3 particular times) :

In above heat map we can notice a variation region that runs from top left corner to the right bottom corner, this feature indicates that during the transmutation process, only the part of the space curve that is close to the mutated node undergoes significant change. In other words, features of space curve can be traced back to individual nodes in the transmutation sequence.

As the curvature profiles are vectors of the same dimension we can use a MDS (multidimensional scaling algorithm) algorithm to map them to a low dimensional space, so that we can see their sharp. The following pictures is a 2-dimensional map generated with the t-SNE algorithm for those curvature profiles in above example:

In above map, each dot represents the curvature profile of a transmutation sequence. As a property of MDS maps,  closely located dots normally represent transmutations with smaller effects on curvatures. Thus, this map can help us to locate sequence regions which cause large change in space curve.


Curvature profile purposed in this note is a secondary characteristics of discrete sequences, it can help us to study patterns and structures among large collection of discrete sequences. Whereas GMS algorithm provides a mean to geometricalize individual sequences; curvature profile offers a tool to geometricalize a set of sequences as a whole. Various invariant properties and visualization experiments discussed in this note seem to validate curvature profile as suitable characteristics for discrete sequences.

During the searching for such characteristics I have experimented with several other quantities, like quantities derived from the rotation speed, torsion and velocities. Curvature profile as purposed above seems to have the best properties among all the tested quantities. It should also be pointed out that the curve curvature purposed in this note differs slightly from the standard defintion. The reason for this discrepancy is that the standard mathematical definition is unstable when the sampling points are located in tiny regions with noise. For evenly distributed curve sampling points, the purposed curvature formula approximates the standard definition pretty well.

In general, curvature profile together with certain MDS algorithm could provide an interesting tool to explore similarities and patterns among sequential structures and shapes encountered in microbiology.

Monday, May 25, 2015

GMS for DNA profiling

In this note I am going to describe a GMS based algorithm to convert DNA sequences to geometrical shapes with visually identifiable features. I'll apply this algorithm to real genetic sequences to demonstrate its profiling capability.

The  main steps of the DNA profiling algorithm are illustrated as follows:

As shown in above diagram, a single strand of a duplex nucleotide sequence is taken as the input for the algorithm. The first step of the algorithm is making three identical copies of the sequence, which will then be scanned in parallel by three identical GMS scanning machines which will produce a set of high dimensional vectors. As described in a previous node, the scanning machine works like the ribosomal machinery: just instead of proteins it produces high dimensional vectors. As indicated in the diagram, a scanning machine in our algorithm is configured by three parameters: the scanning size K; the moving step size r; and the affinity decay speed λ.

Then, as the third step, the affinity embedding algorithm will be applied to the high dimensional vectors to produce a 3D dotted plot. That resulting map will usually contain three clusters corresponding to the three duplicated sequences; and the middle cluster is usually pressed to a quasi 2-dimensional disk. So, as the last step, the middle slice of the 3D map will be extracted, rotated and displayed as a 2D map.

In general to qualify as a DNA profiling method, a method should ideally satisfy the following the following requirements:
  1. The same sequence or similar sequence should result in similar maps.
  2. Significant changes in a sequence should lead to noticeably changes in result maps. 
  3. The resulting maps should have structures that can identified by visual examination.
  4. Be able to associate phenotype traits with geometrical patterns on the result maps.
As first example I applied the above algorithm to the VP24 gene of  zarie ebola virus that consists of  1633 base pairs. The following pictures show 2 maps created by running the algorithm twice with different random initializations:

We can see that above two pictures are very similar in terms of topological features of the curves. The following picture shows two maps of the BRAC1 gene that contains 4875 base pairs. Again, these two maps are topologically quite similar up to fine details.

As next example we consider how GMS map changes when we delete, duplicate, or invert a segment of the nucleotide sequence. For this example exons of the gene CD4 has been chosen as input. This sequence has 1377 base pairs. I randomly selected a segment of  70 base pairs as a reference segment for deletion, duplication and inversion. The following pictures show the GMS maps of this sequence and the sequence under deletion, duplication and inversion:

In the above picture, the highlighted region correspond to the reference segment under alterations. We can clearly see how these three types of alterations manifested themselves in their GMS maps.

Above examples seem to indicate that our algorithm satisfies, more or less, the first 3 requirements listed above; whereas the last requirement remains open for the future study. Since a geometrical model can capture much larger amount of information than conventional statistics/correlations, one might hope some interesting phenotype traits may manifest themselves in those models in a yet-to-find way.

Tuesday, May 12, 2015

GMS with multiple scanning and aggregated affinitity

As said in the title, this note is going to put forward two variations to the GMS model. Both variations aim to create better visualization for discrete sequences.

For the first variation, we have seen in a previous note that the loopy GMS can produce simpler geometrical shapes when the scanning machine runs multiple rounds over a loop sequence. The reason for the simplification is likely due to the competition for space between the duplicated scanning vectors. This kind of competition can be easily extended to GMS model for serial (no-loopy) sequences by cloning the sequences and scanning machine as illustrated in the following diagram:

So with multiple scanning, GMS first clones the sequences into multiple identical copies, then scans each sequence as before to produce scanning vectors with time-stamps. In addition to the time-stamp, an index component p that identifies the different clone sequences is added to the scanning vectors. This index p will be used like the time-stamp to reduce the affinities between scanning vectors from different clone sequences. More precisely, the decay factor for the affinity between two scanning vector produced by p-th and p'-th clone at time t and t' will be changed (see the initial specification for the affinity function) from

For the second variation, we notice that the scanning vectors purposed in the initial  note are colored vectors. That means, when calculating the affinity between two scanning vectors, only components with matching color will contribute non-zero affinity to the total affinity. So, as discussed in the initial note,  a K dimensional colored vector is mathematically a K×s dimensional spatial vectors, where K is the scanning size, s is the number of colors. Because of such sparsity, the affinity between two scanning vectors are normally very small, and often too small to carry meaningful information over to the GMS maps.

In order to increase the affinity between scanning vectors, we aggregate the K-dimensional colored vector to a s-dimensional vector by adding all components of the same color to form a new uncolored vector component. In particular, as depicted in the following diagram, let C = { 1, 2, ..., s } be the set of colors, s1, s2,..., sk∈ C be the colors of the k nodes in the scanning machine; and let α12, ..., αk be the corresponding amplitudes for the nodes, then the scanning vector V is a s-dimensional vector (v1, v2,..., vs) with:

These two variations discussed here have been implemented in VisuMap v4.2.912 with a new affinity metric named Sequence2 Affinity. The following short video shows some GMS maps created with the new affinity metric for some short sequences with 40 to 60 nodes:

Tuesday, March 24, 2015

On the origin of the helix structure from the view point of GMS

Helix structure has been prevalent in biological world. It can be fund in small scale like the folding of DNA and protein sequences, and in large scale like plants.

Whereas the mathematical description of the helix structure is clear; the mechanism that gives rise to such structure is not so obvious. Do all those structure share a common mechanism? Why don't they show up in inorganic world?  This note tries to demonstrate with GMS samples that helix structure comes about by some simple dynamics rooted in the discrete sequence structure;

Recall that GMS produces a sequence of high dimensional vectors from a discrete sequences; and the high dimensional vectors are embedded in to low dimensional space according their affinity defined in the following affinity function:

Where t>t' are the timestamps of two vectors produced at type t and t'. We notice that the affinity function consists of two parts:

The first part is only dependent on the timestamp: it reduces the affinity calculated by the second part exponentially depending on time separating these two vectors. It is the second part that accounts for variations in the sequence, si.

In order the see the effect of decay in time dimension, I modified the loopy GMS algorithm so that the affinity function only contains the first part while the second part (the sequence dependent part) is set to the constant 1.0. I ran the algorithm with following parameters: n=25000; K=10; the sequence is the constant sequence with 100 copies of the letter 'A'; the decay speed λ is successively set to 0.125, 0.25, 0.5 and 1.0. The following screen cast shows the resulting corresponding GMS maps in 3D space:

These maps show clearly the helix alike structure; and the winding number goes up as the as decay speed λ goes higher.

To verify that it is the exponential decaying that leads to helix alike structure, I have replaced the affinity function with three different "decaying" functions: Δt-2, Δt-1 and Δt-0.5 with Δt := t-t'. The following pictures shows the corresponding GMS maps:

We can see clearly that these decay functions result in structures that are totally different from the helix structure. Thus, these simulations indicate that the exponential decay of affinity plays a significant role in forming the helix structure.

We notice that if the scanning size K is sufficiently large and the sequence is random, the affinity contribution of the second (sequence dependent) part will, more or less, become constant. Thus, the helix structure may serve as model for completely random sequences. From this point of view, we might call the helix structure the no-information base model. Additional information in sequences should manifest in discrepancy between their GMS maps and the helix structure.

Seeing some π symbols in some formula, the great physicist Richard Fynman once asked: Where is the circle?  In terms of modern genetics, his remark basically assumes that π  is the "gene" for the circle as a phenotype feature. Many such analytical "genes" are carried forward and spread around in various fields, they manifest in different forms, but always keep their intrinsic unchanged.  Here, this note tried to demonstrate that the "gene" for helix structure is the exponential decay of affinity.

Monday, February 2, 2015

On Loopy GMS (Geometrical Modeling of Sequences)

This note is going to apply the GMS model to loop structured sequences. The adaptation from previous serial structure to loopy structure is straightforward: the only change is, as illustrated in the following diagram, that the two ends of the serial sequence are now connected to each other.

When applying the GMS model, the scanning machine ( or the scanner ) runs over the loopy sequence and produces a series of high dimensional vectors which are argumented with timestamps. Those high dimensional vectors will then be embedded into low dimensional space with the affinity embedding algorithm.

The first question arise in this new scenario is whether the GMS model will produce loop alike geometrical shapes, when the scanner runs a whole loop back to its initial location. The answer to this question is yes, but under two conditions. First, the decay speed parameter λ used to calculate the affinity between vectors must be zero, so that the effect of timestamps will be nullified. Otherwise, if λ is not zero, different vectors will be produced when the scanner comes back to its initial position.

Secondly, the total number of nodes, L, in the loop must be a multiple of the dimension of the output vector (without the timestampe component), i.e. the parameter K denoted in previous note. This requirement is necessary because of the circular shifting of the scanned vectors. If L is not a multiple of K, after a whole loop the scanner will produce the same set of  values, but circularly shifted to different order, so that they will normally be different as vectors in the high dimensional space.

The following pictures shows the resulting maps produced by GMS for a short loopy sequence for different cases discussed above.

 Figure 1: Conditions for loopy output maps from loopy sequence. The input sequence is "CCC TGT GGA GCC GGA GCC ACA AGT", K=6. (A) The decay speed λ=0.1, the resulting map is a broken loop in the 3D space;  (B) The sequence is extended with an extra node 'G', so that K is not a multiple of L. The resulting map is a broken loop.  (C) λ = 0 and L is a multiple of K, the resulting map is a loop in the 3D space.

With a loopy sequence, the scanner can in general run multiple rounds around the loop sequence. In this way, the scanner can produce multiple sets of vectors offset just by different timestamps. These vectors sets form repeating patterns when embedded in to low dimensional space. The following short video shows the resulting maps for above sequence for repeat number n = 1, 2, 5 and 10. The decay speed λ is set to 0.1. We notice that in these maps, the repeated structure gradually become flattened when the repeat number increases to high number. Consequently, the output map changes from simple repeated structure to tube alike shape. We notice that both repeated structures  and tubes are pretty frequent structures in biological systems as phenotypic traits.

We notice that running the scanner multiple loops is effectively the same as scannng multiple copies of a sequence sequentially. Thus, the loopy scanning might be considered as a manifestation (or extension ) of sequential scanning where the sequence is a the concatenation of variable number of copies of a reference sequence.

Another extension of the loopy GMS model is, as depicted in the following diagram, using two scanners to scan the loop in the opposite orientations, then concatenate their outputs and a timestampe to form final high dimensional vectors.

The loopy GMS model with dual scanners normally produces symmetrical shapes. The reason for this is that for each configuration of the scanner (i.e. their specific positions on the loop during the scanning) there is always a "dual" configuration in which the two scanners simply swapped their positions. Because of this, the whole collection of output vectors may be split into two sets which differ from each other just by the different timestamps. The following short video shows various symmetrical shapes produced by loopy GMS with dual scanners for various sequences:


This note has experimentally demonstrated that GMS model with simple extension may produce interesting macroscopical shapes, like repeating pattern, tubes and symmetrical structures. Understanding how those discrete sequences give rise to geometrical pattern might help us to investigate how genetic code determines phenotypical traits biological organism.