We now know that there are about 20 000 protein coding genes in the whole human genome. How do they relate to each other? Is there any geometrical structure among them? What shape do they form? Aiming at these questions, I'll put forward here a method to generate scatter plot from gene sequences that shows the shape of a gene collection.
In recently blogs, I have described GMS framework to profile gene sequences which translates discrete sequence to high dimensional curvature vectors. We could use this method to convert a collection of gene sequences to a set of high dimensional vectors and then apply MDS algorithm to embed them in a low dimensional space. However, if we restrict our focus on protein coding sequence (CDS), the length of CDS varies from few hundreds to tens of thousands nucleotides. That means, the profiling curvature vectors for CDS sequences are of very different dimensions. This in general causes difficulties for most MDS algorithms as they normally accept only vectors of a common dimension.
In order to obtain profiling vectors of the same length from sequences of variable length, we extended our GMS framework as follows: we first select a short reference sequence; Then for each CDS sequence, we mix the reference sequence with the CDS sequence and pass them together to the GMS framework. After the GMS framework embedded the two sequence into the low dimensional space, we then calculate the curvature profile of the curve of the reference sequence, and use it as a profiling vector for the CDS sequence. The modified framework is depicted as follows:
We notice that the reference sequence and the CDS sequence are embedded together in the same space, so that their curves occupy complementary area of the low dimensional space. That means the curvature profile of the reference sequence carries information about the complementary space of the CDS sequence. More generally, we can expect that different CDS sequences will produce different complementary profile as they have different impact on the reference sequence.
For this note I have downloaded the CDS sequences of the first chromosome of the human genome from the Ensembl.org web site; There are about 2000 protein coding genes whose length very from few hundreds to more than 26 thousands nucleotides. As reference sequence, I picked a short CDS sequence of CDS collection (the gene ENSG00000187170) that has only 300 nucleotides. The following picture shows the complementary curvature profile the CDS sequences: each row of the heatmap shows the curvature of the reference sequence mixed with one CDS sequence; bright colors indicate high curvatures. The line chart under the heatmap shows the complementary profile of 4 selected CDS sequences marked in the heamap
We see that these profiles at large have high curvatures at about half dozens regions, but the height and exact location of those high curvature peaks depend on the corresponding CDS sequence. It is important to notice that I have in the previous blog demonstrated that the curvature profile is a characteristics that is invariant from the random nature of the mapping algorithm. Thus, those small variation are not due to the random nature of the mapping algorithm used, they ought to be attributed to different interaction (or "affinity") between the CDS sequence and the reference sequence.
Having obtained one complementary curvature vector for each CDS sequence, we can then apply a MDS algorithm on these high dimensional vectors to embed them in low dimensional space. For this purpose, I have used the tSNE algorithm. The following picture shows a 3D map for these CDS sequences:
The above map is linked to a short video, you can click on the map to see the 3D map in animation. Also, for the purpose of easy exploration I have added different colors to the different regions on the map with the k-Mean clustering algorithm.
It should be noticed that the selection of the reference sequence has essential impact on the curvature profile and the final map. To see this, I have picked another sequence, the CDS sequence of gene with id ENSG00000158481 (with about 1000 nucleotides,) as reference sequence and repeated the whole procedure. The following picture shows the corresponding complementary curvature profile and the tSNE map of the 2045 CDS sequences:
We see that both the curvature profile and the final map are very different from those created previously with a different reference sequence.
It is not yet clear how the reference sequence impacts the complementary curvatures. Intuitively, the reference sequence plays a role like the a projection plane that we often use to project high dimensional data to 2-dimensional space for visualization purpose. Thus, selecting an appropriate reference sequence might help us to see interesting geometrical structure among large collection of gene sequences.
About the sample data
The sample genome data is downloaded from Ensembl.org web site in fasta format. The fasta data file is imported into VisuMap with a special plugin. The zipped file here contains the VisuMap sample file, the plugin for importing fasta files and a script "CurvatureProfiling.js" to run the simulations. Before running the simulation, the plugin module GeneticAnalysis.dll need to be installed into VisuMap.
In recently blogs, I have described GMS framework to profile gene sequences which translates discrete sequence to high dimensional curvature vectors. We could use this method to convert a collection of gene sequences to a set of high dimensional vectors and then apply MDS algorithm to embed them in a low dimensional space. However, if we restrict our focus on protein coding sequence (CDS), the length of CDS varies from few hundreds to tens of thousands nucleotides. That means, the profiling curvature vectors for CDS sequences are of very different dimensions. This in general causes difficulties for most MDS algorithms as they normally accept only vectors of a common dimension.
In order to obtain profiling vectors of the same length from sequences of variable length, we extended our GMS framework as follows: we first select a short reference sequence; Then for each CDS sequence, we mix the reference sequence with the CDS sequence and pass them together to the GMS framework. After the GMS framework embedded the two sequence into the low dimensional space, we then calculate the curvature profile of the curve of the reference sequence, and use it as a profiling vector for the CDS sequence. The modified framework is depicted as follows:
We notice that the reference sequence and the CDS sequence are embedded together in the same space, so that their curves occupy complementary area of the low dimensional space. That means the curvature profile of the reference sequence carries information about the complementary space of the CDS sequence. More generally, we can expect that different CDS sequences will produce different complementary profile as they have different impact on the reference sequence.
For this note I have downloaded the CDS sequences of the first chromosome of the human genome from the Ensembl.org web site; There are about 2000 protein coding genes whose length very from few hundreds to more than 26 thousands nucleotides. As reference sequence, I picked a short CDS sequence of CDS collection (the gene ENSG00000187170) that has only 300 nucleotides. The following picture shows the complementary curvature profile the CDS sequences: each row of the heatmap shows the curvature of the reference sequence mixed with one CDS sequence; bright colors indicate high curvatures. The line chart under the heatmap shows the complementary profile of 4 selected CDS sequences marked in the heamap
We see that these profiles at large have high curvatures at about half dozens regions, but the height and exact location of those high curvature peaks depend on the corresponding CDS sequence. It is important to notice that I have in the previous blog demonstrated that the curvature profile is a characteristics that is invariant from the random nature of the mapping algorithm. Thus, those small variation are not due to the random nature of the mapping algorithm used, they ought to be attributed to different interaction (or "affinity") between the CDS sequence and the reference sequence.
Having obtained one complementary curvature vector for each CDS sequence, we can then apply a MDS algorithm on these high dimensional vectors to embed them in low dimensional space. For this purpose, I have used the tSNE algorithm. The following picture shows a 3D map for these CDS sequences:
Curvature Profile Map of 2045 CDS sequences created with tSNE algorithm |
It should be noticed that the selection of the reference sequence has essential impact on the curvature profile and the final map. To see this, I have picked another sequence, the CDS sequence of gene with id ENSG00000158481 (with about 1000 nucleotides,) as reference sequence and repeated the whole procedure. The following picture shows the corresponding complementary curvature profile and the tSNE map of the 2045 CDS sequences:
Curvature Profile and tSNE map of the 2045 CDS sequences with ENSG00000158481 as reference sequence |
It is not yet clear how the reference sequence impacts the complementary curvatures. Intuitively, the reference sequence plays a role like the a projection plane that we often use to project high dimensional data to 2-dimensional space for visualization purpose. Thus, selecting an appropriate reference sequence might help us to see interesting geometrical structure among large collection of gene sequences.
About the sample data
The sample genome data is downloaded from Ensembl.org web site in fasta format. The fasta data file is imported into VisuMap with a special plugin. The zipped file here contains the VisuMap sample file, the plugin for importing fasta files and a script "CurvatureProfiling.js" to run the simulations. Before running the simulation, the plugin module GeneticAnalysis.dll need to be installed into VisuMap.