Visualizing High Dimensional Data: GMS for DNA profiling

In this note I describe a GMS-based algorithm that converts DNA sequences to geometrical shapes with visually identifiable features. I'll apply this algorithm to real genetic sequences to demonstrate its profiling capability.

The main steps of the DNA profiling algorithm are illustrated as follows:

As shown in the above diagram, a single strand of a duplex nucleotide sequence is taken as the input to the algorithm. The first step is to make three identical copies of the sequence, which are then scanned in parallel by three identical GMS scanning machines to produce a set of high dimensional vectors. As described in a previous note, the scanning machine works like the ribosomal machinery, except that it produces high dimensional vectors instead of proteins. As indicated in the diagram, a scanning machine in our algorithm is configured by three parameters: the scanning size K, the moving step size r, and the affinity decay speed λ.
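To make the roles of K and r concrete, here is a minimal Python sketch of a sliding-window scanner. It is not the actual GMS scanning machine: the one-hot encoding of bases and the use of λ as an exponential down-weighting of positions inside the window are my own illustrative assumptions, and the names scan and scan_three_copies are hypothetical.

```python
import math

BASES = "ACGT"

def scan(seq, K, r, lam):
    """Slide a window of K bases along seq in steps of r.

    Each window becomes one vector of length 4*K: every base is one-hot
    encoded and weighted by exp(-lam * offset). This weighting is an
    assumed stand-in for the affinity decay, not the real GMS rule.
    """
    vectors = []
    for start in range(0, len(seq) - K + 1, r):
        vec = []
        for offset, base in enumerate(seq[start:start + K]):
            weight = math.exp(-lam * offset)
            vec.extend(weight if base == b else 0.0 for b in BASES)
        vectors.append(vec)
    return vectors

def scan_three_copies(seq, K, r, lam):
    """Scan three identical copies and tag each vector with its copy index."""
    vectors, labels = [], []
    for copy_index in range(3):
        copy_vectors = scan(seq, K, r, lam)
        vectors.extend(copy_vectors)
        labels.extend([copy_index] * len(copy_vectors))
    return vectors, labels
```

For a sequence of length n this yields floor((n - K) / r) + 1 vectors per copy, so a smaller step size r gives a denser set of points in the final map.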

Then, as the third step, the affinity embedding algorithm is applied to the high dimensional vectors to produce a 3D dot plot. The resulting map usually contains three clusters corresponding to the three duplicated sequences, and the middle cluster is usually pressed into a quasi 2-dimensional disk. So, as the last step, the middle slice of the 3D map is extracted, rotated and displayed as a 2D map.
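The last step can be sketched roughly as follows. This is a simplification under stated assumptions: I assume each 3D point carries a label telling which of the three copies it came from, and instead of a proper rotation I simply drop the coordinate axis along which the middle cluster is flattest. The function name middle_slice_2d is hypothetical.

```python
def middle_slice_2d(points, labels, middle_label=1):
    """Keep the points of the middle cluster and flatten them to 2D.

    points: list of (x, y, z) tuples from the 3D embedding.
    labels: cluster/copy index for each point (assumed to be available).
    The axis with the smallest variance is dropped; a full implementation
    would rotate the disk into the viewing plane first.
    """
    middle = [p for p, lab in zip(points, labels) if lab == middle_label]
    if not middle:
        raise ValueError("no points carry the middle label")
    n = len(middle)
    means = [sum(p[i] for p in middle) / n for i in range(3)]
    variances = [sum((p[i] - means[i]) ** 2 for p in middle) / n
                 for i in range(3)]
    flat_axis = variances.index(min(variances))
    keep = [i for i in range(3) if i != flat_axis]
    return [(p[keep[0]], p[keep[1]]) for p in middle]
```

Dropping an axis only works well when the disk happens to lie close to a coordinate plane; for a tilted disk the rotation step described above is needed.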

In general, to qualify as a DNA profiling method, a method should ideally satisfy the following requirements:

The same sequence or similar sequences should result in similar maps.
Significant changes in a sequence should lead to noticeable changes in the resulting maps.
The resulting maps should have structures that can be identified by visual examination.
It should be possible to associate phenotype traits with geometrical patterns on the resulting maps.

As a first example, I applied the above algorithm to the 1633-base-pair VP24 gene of Zaire ebolavirus. The following pictures show two maps created by running the algorithm twice with different random initializations:

We can see that the above two pictures are very similar in terms of the topological features of the curves. The following picture shows two maps of the BRCA1 gene, which contains 4875 base pairs. Again, these two maps are topologically quite similar, down to fine details.

As the next example, we consider how the GMS map changes when we delete, duplicate, or invert a segment of the nucleotide sequence. For this example the exons of the CD4 gene have been chosen as input. This sequence has 1377 base pairs. I randomly selected a segment of 70 base pairs as the reference segment for deletion, duplication and inversion. The following pictures show the GMS maps of the original sequence and of the sequence under deletion, duplication and inversion:

In the above picture, the highlighted regions correspond to the reference segment under alteration. We can clearly see how these three types of alterations manifest themselves in the GMS maps.
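For readers who want to reproduce this kind of experiment, the three alterations can be sketched in a few lines of Python. The function names are hypothetical, and the inversion here simply reverses the segment on the single input strand; whether the reverse complement should be used instead is a choice this sketch leaves open.

```python
import random

def delete_segment(seq, start, length):
    """Remove `length` bases beginning at index `start`."""
    return seq[:start] + seq[start + length:]

def duplicate_segment(seq, start, length):
    """Insert a second copy of the segment right after the original."""
    end = start + length
    return seq[:end] + seq[start:end] + seq[end:]

def invert_segment(seq, start, length):
    """Reverse the order of the bases inside the segment."""
    end = start + length
    return seq[:start] + seq[start:end][::-1] + seq[end:]

def random_segment_start(seq, length, rng=random):
    """Pick a random start index so the segment fits inside seq."""
    return rng.randrange(len(seq) - length + 1)
```

With a 1377-base sequence and a 70-base segment, random_segment_start(seq, 70) returns a start index between 0 and 1307, and the three functions then give sequences of 1307, 1447 and 1377 bases respectively.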

The above examples seem to indicate that our algorithm satisfies, more or less, the first three requirements listed above, whereas the last requirement remains open for future study. Since a geometrical model can capture a much larger amount of information than conventional statistics or correlations, one might hope that some interesting phenotype traits will manifest themselves in those models in a yet-to-be-found way.

Monday, May 25, 2015
