Saturday, September 19, 2015

Curvature as secondary profiling of GMS maps.

As a profiling method, GMS translates discrete sequences to space curves which help us to explore sequence features (treats) with help of geometrical shapes. Yet, space curves still pose a challenge for direct pattern recognition. This is especially so, when we try to explore patterns among large collection of  sequences.

In this note I'll put forward a method to use curvature to capture characteristic of GMS space curves. Some experiments with sample data will show that curvature profile are suitable for pattern analysis of large collection of GMS space curves.

Curvature of  space curves

Curvature is a concept introduced in mathematics to measure how far a segment of a geometrical shape deviates from being flat. Intuitively for space curves, the curvature of a curve at a point measures how strong the curve is bent at the point.

More formally, Let P1, P2, ..., be a series of points in the 3 dimensional space representing a space curve, then for the purpose of this note, we use the following simple formula to estimate the curvature, κk, at point Pk:
As shown in above diagram, α is the angle between the two segments PkPk+1 and Pk+1Pk+2; s is the length of the segment PkPk+2. If Pk are the 3-dimensional coordinate vectors of the points, then the above formula becomes:
Thus, the curvature  profile of space curve is a series real number values, κk, that measure how strong the curve is bent at each point. In practice, when we calculate curvature profile of a GMS curve, we don't need to calculate the curvature at each point, but just at points sparsely and evenly sampled from the the GMS curves. For the sake of simplicity, we simply select one point from for each input sequence node. That means, for a sequence with n nodes, we calculate a n-dimensional curvature profile. The following diagram shows the whole work-flow to calculate the curvature profile from a discrete sequence:



Above work-flow is mainly an extension of the one in a previous note, where more detailed explanation is provided. The only modification to the work-flow is an extra step that calculates  curvature profile from the GMS curve of the first clone. As will be discussed latter in this note, it has been turned out in the simulation experiments that the clones normally have about the same curvature profile.

Characteristics of curvature profile

Curvature profile has several properties which make it suitable to characterize GMS space curves. First, just from its definition we know that the curvature profile is rotation and shifting invariant, so that rotation and shifting information are automatically removed from curvature profile. Because of this running GMS algorithm multiple times on a sequence will results in the same curvature profile, even when the corresponding space curve are rotated and shifted differently due to some random nature within the GMS algorithm.

Going one step further we can change the number of clones in the GMS algorithm. The resulting spaces curves for the clones normally vary systematically in shape and size; Yet their corresponding curvature profiles are more or less similar to each other. The following diagram shows the curvature profiles of  5 clones of sequence with 5 fold cloning:

We notice in above picture that the curvature profiles for the middle or inner clones have noticeably larger peaks than those of the outer clones; and there is symmetry alone the middle (the third) profile. These curvature profiles shows that all 5 space curves bend more or less at common positions and the inner curves bend slightly more than those outer curves. On the right side of above picture are the 5 curvature profiles displayed as a heatmap. As will be seen below, heatmap offers a better way to visualize large collection of curvature profiles.

Another interesting property of the curvature profile is that they are in general independent on the sampling frequency of the GMS algorithm. High frequency sampling normally results in smoother space curves, but their shape remain mostly unchanged. The following picture shows the curvature profile of another sequence with sampling frequency of 6, 8, 10, 12 and 14 scans-per-node. The 5 curvature profile are almost identical.



Profiling Transmutation

In addition to the invariant characteristics, the curvature profiles of similar sequences are in general similar. This property enables us to study systematical variation among large collections of sequences by visualizing their corresponding curvature profiles.  We demonstrate this method here with an example that simulates a transmutation process. As the be shown in the following diagram, transmutation is the process that one sequence gradually mutates to another sequence.

For the sake of simplicity we assume that both sequences, A and B, have the same length N. A simple transmutation process is that an increasingly longer sub-sequence A replaces a corresponding sub-section in B. More particularly, the simple transmutation is realized by the sequence Tk(A, B) := A[1, k]*B[k+1, N]; where k=1, 2, ..., N; A[1,k] is the sequence of the first k nodes of A; B[K+1, N] is the sequence of the last N-k nodes of B; * is the concatenation operation.

With help of GMS algortihm we can calculate the space curves of Tk; k=1,2,..., N; and then calculate their curvature profiles as N-dimensional vectors. For this example we have used two sequences of 1377 nodes (one of them is the CCDS of CD4); The following picture shows their 1377 curvature profiles displayed as an single heat map (the three curve charts on the right side are profiles of the transmutation sequence at 3 particular times) :

In above heat map we can notice a variation region that runs from top left corner to the right bottom corner, this feature indicates that during the transmutation process, only the part of the space curve that is close to the mutated node undergoes significant change. In other words, features of space curve can be traced back to individual nodes in the transmutation sequence.

As the curvature profiles are vectors of the same dimension we can use a MDS (multidimensional scaling algorithm) algorithm to map them to a low dimensional space, so that we can see their sharp. The following pictures is a 2-dimensional map generated with the t-SNE algorithm for those curvature profiles in above example:


In above map, each dot represents the curvature profile of a transmutation sequence. As a property of MDS maps,  closely located dots normally represent transmutations with smaller effects on curvatures. Thus, this map can help us to locate sequence regions which cause large change in space curve.

Discussion 

Curvature profile purposed in this note is a secondary characteristics of discrete sequences, it can help us to study patterns and structures among large collection of discrete sequences. Whereas GMS algorithm provides a mean to geometricalize individual sequences; curvature profile offers a tool to geometricalize a set of sequences as a whole. Various invariant properties and visualization experiments discussed in this note seem to validate curvature profile as suitable characteristics for discrete sequences.

During the searching for such characteristics I have experimented with several other quantities, like quantities derived from the rotation speed, torsion and velocities. Curvature profile as purposed above seems to have the best properties among all the tested quantities. It should also be pointed out that the curve curvature purposed in this note differs slightly from the standard defintion. The reason for this discrepancy is that the standard mathematical definition is unstable when the sampling points are located in tiny regions with noise. For evenly distributed curve sampling points, the purposed curvature formula approximates the standard definition pretty well.

In general, curvature profile together with certain MDS algorithm could provide an interesting tool to explore similarities and patterns among sequential structures and shapes encountered in microbiology.