Visualizing High Dimensional Data: July 2010

SNP and Linkage Disequilibrium

I have just released a sample dataset, Haplotype Analysis, that shows how to use VisuMap to visualize clusters among SNPs (single nucleotide polymorphisms). SNPs are the roughly 0.1% of base-pair locations in DNA sequences that vary from population to population and from individual to individual. It has been said that differences in many phenotypic traits of human beings, such as height and eye color, can be attributed to variations in SNPs.

Haplotype analysis aims to find correlations between SNPs and phenotypic traits. One often-used concept in haplotype analysis is linkage disequilibrium (LD), which measures the correlation between pairs of SNPs with respect to a population. In abstract terms, LD induces an information structure over SNPs. Such a relational structure may offer a helpful framework for studying the correlation between SNPs and phenotypic variations.

The numerical calculation of LD is actually pretty straightforward, but all the descriptions of LD I could find are too vague for a step-by-step calculation. For the sake of reference, I will briefly describe here how LD is calculated. Assume that we have obtained the genotype data for two SNPs, SNP_a and SNP_b, for a group of individuals I₁, I₂, I₃, ..., I_k, as in the following table:

	I₁	I₂	I₃	...	I_k
SNP_a	T/T	G/T	T/T	...	G/G
SNP_b	C/T	C/C	C/T	...	T/T

Notice that each element in the above table is a pair of nucleotides A, C, G or T. For some reason (which I don't know), virtually all SNPs are bi-allelic, which means each row in the above table will contain at most two different nucleotides. So, in the above sample, SNP_a has G and T, whereas SNP_b has C and T. In order to calculate the LD between the two SNPs, we first select a nucleotide arbitrarily from each row, and then count how many times that nucleotide appears in each cell of that row. For instance, if we have selected T and C for SNP_a and SNP_b respectively, we get the following frequency table:

	I₁	I₂	I₃	...	I_k
SNP_a	2	1	2	...	0
SNP_b	1	2	1	...	0
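The counting step can be sketched in a few lines of Python. This is only an illustration: the function name `allele_counts` is made up for this example, and the data are just the four columns that are written out explicitly in the tables above (I₁, I₂, I₃ and I_k), not a real dataset.

```python
def allele_counts(genotypes, allele):
    """Count how often `allele` occurs in each genotype string such as "G/T"."""
    return [g.split("/").count(allele) for g in genotypes]

# Only the four individuals shown explicitly in the tables (I1, I2, I3, Ik).
snp_a = ["T/T", "G/T", "T/T", "G/G"]
snp_b = ["C/T", "C/C", "C/T", "T/T"]

freq_a = allele_counts(snp_a, "T")  # T was selected for SNP_a
freq_b = allele_counts(snp_b, "C")  # C was selected for SNP_b
print(freq_a)
print(freq_b)
```

For these four columns the two lists match the corresponding entries of the frequency table.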

The LD between SNP_a and SNP_b (more accurately, the R-Squared LD) is then defined as the square of the linear correlation coefficient between the two frequency rows. Notice that since the SNPs are bi-allelic, the final LD value doesn't depend on which nucleotide we selected from each row: switching to the other nucleotide replaces each count x with 2 − x, which only flips the sign of the correlation coefficient, and the sign disappears after squaring. More formally, let {a₁, a₂, a₃, ..., a_k} and {b₁, b₂, b₃, ..., b_k} be the two frequency row vectors in the above table, and let ā and b̄ be their mean values; then the R-Squared LD between SNP_a and SNP_b is defined as follows:

$$
r^2 = \frac{\left[\sum_{i=1}^{k} (a_i - \bar{a})(b_i - \bar{b})\right]^2}{\sum_{i=1}^{k} (a_i - \bar{a})^2 \; \sum_{i=1}^{k} (b_i - \bar{b})^2}
$$
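As a rough sketch, the squared correlation described above can be computed in plain Python as follows. The function name `ld_r_squared` is made up for this example, and the input is again only the four columns shown explicitly in the frequency table, so the resulting number is a toy value. The last lines check that selecting the other nucleotide for SNP_a (G instead of T) leaves the result unchanged.

```python
def ld_r_squared(a, b):
    """Squared linear (Pearson) correlation between two allele-count vectors."""
    k = len(a)
    mean_a = sum(a) / k
    mean_b = sum(b) / k
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        # A SNP with identical counts for everyone has no defined correlation.
        raise ValueError("LD is undefined for a SNP without variation")
    return cov * cov / (var_a * var_b)

a = [2, 1, 2, 0]  # counts of T at SNP_a
b = [1, 2, 1, 0]  # counts of C at SNP_b
r2 = ld_r_squared(a, b)
print(r2)  # 2/11, about 0.18, for these four toy columns

# Selecting G instead of T turns every count x into 2 - x.
a_other = [2 - x for x in a]
assert abs(ld_r_squared(a_other, b) - r2) < 1e-12
```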

The Sample Dataset

The uploaded sample dataset was obtained from the HapMap Project website, which provides access to a large collection of genotype datasets. More interestingly, on that website we can select genotype data for specific SNPs and populations. To generate the sample dataset, I selected all SNPs in the first 2 million base pairs of chromosome 9 for the CEU population, which consists of 169 individuals. There are altogether 1900 SNPs in the selected region. Our sample dataset is thus basically a table of 1900x169 nucleotide pairs.

In order to import the downloaded data into VisuMap, we have implemented a plugin module, HaploExplorer, that enables users to import data downloaded from the HapMap website (files must have the extension .hmap). The HaploExplorer plugin also implements a special metric named "LD R-Squared", so that users can simply select this metric to generate maps that visualize LD values. The following picture shows a sphere map of the 1900 SNPs:

In the above map, each glyph represents a SNP from the dataset. The size of a glyph indicates the chromosome location of the SNP, i.e. smaller glyphs represent SNPs located nearer the beginning of the chromosome. Most importantly, the distances between the glyphs indicate the LD between the SNPs: closely located glyphs correspond to SNPs with high LD. We can clearly see various types of clusters in the above picture.
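To make the link between LD and map distance concrete, here is a small sketch that builds a pairwise distance matrix from allele-count rows. How the plugin actually converts LD into a distance is not described here, so the choice of 1 − r² below is purely an assumption for illustration (high LD gives a small distance). The helper names and the three toy rows are also made up for this example.

```python
from itertools import combinations

def r_squared(a, b):
    """Squared Pearson correlation of two equally long count vectors."""
    k = len(a)
    ma, mb = sum(a) / k, sum(b) / k
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov * cov / (va * vb)

def ld_distance_matrix(counts):
    """Symmetric matrix of 1 - r^2 (an assumed conversion) between SNP rows."""
    n = len(counts)
    dist = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        dist[i][j] = dist[j][i] = 1.0 - r_squared(counts[i], counts[j])
    return dist

# Three toy SNPs (rows) over four individuals (columns).
counts = [
    [2, 1, 2, 0],
    [0, 1, 0, 2],  # mirror image of the first row, so r^2 = 1
    [1, 2, 1, 0],
]
dist = ld_distance_matrix(counts)
for row in dist:
    print(row)
```

In this toy example the first two SNPs are in complete LD and therefore end up at distance 0, which is the situation where their glyphs would be expected to sit close together on a map.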

We can also visualize the chromosome locations more directly with the help of a spectrum view. The following picture shows, for instance, a 2D-RPM map of the dataset together with a spectrum view of the chromosome locations:

When interactively exploring the above views, the user can select a group of SNPs in the lower window, and the upper spectrum window will automatically show the chromosome locations of the selected SNPs.

For comparison, the HapMap project recommends a program called Haploview that allows users to directly visualize LD values as a triangular matrix, where the LD values are represented by different shades of gray. The following picture shows such a map generated by Haploview:

In the above picture, all 1900 SNPs are lined up sequentially along the upper edge, so that the matrix is basically a 1900x1900 triangular matrix. In order to support the exploration of such a large matrix, Haploview implements an overview window (displayed in the lower left corner) that allows users to select a small section of the data for closer investigation.

Comparing the above maps, we can see that VisuMap provides a more direct and intuitive way to visualize patterns among SNPs. More importantly, as an integrated software package, VisuMap offers a simple framework for investigating different types of LD relationships between SNPs. We can, for instance, easily experiment with any other comparable distance metric available in VisuMap, and we can use any of the clustering algorithms in VisuMap to cluster the SNPs.

Last but not least, after we have imported the data into VisuMap, we can also visualize patterns among populations by transposing the dataset table. In order to study relationships between individuals or populations, a special distance metric called the IBS (identity-by-state) distance has been suggested by some researchers. The IBS distance metric is also implemented in the HaploExplorer plugin. After we have transposed a SNP dataset, we can select the IBS distance metric to produce population maps.
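For readers unfamiliar with IBS, the following sketch shows one common way to define such a distance: for each SNP, count how many alleles (0, 1 or 2) two individuals share, average over all SNPs, and subtract the shared fraction from 1. This is an assumed textbook-style definition, not necessarily the exact formula used by the plugin, and the function names and the tiny two-SNP columns are made up for this example.

```python
from collections import Counter

def shared_alleles(g1, g2):
    """Number of alleles (0, 1 or 2) two unordered genotypes like "G/T" share."""
    c1 = Counter(g1.split("/"))
    c2 = Counter(g2.split("/"))
    return sum((c1 & c2).values())  # size of the multiset intersection

def ibs_distance(ind1, ind2):
    """1 minus the average fraction of shared alleles over all SNPs."""
    shared = sum(shared_alleles(x, y) for x, y in zip(ind1, ind2))
    return 1.0 - shared / (2 * len(ind1))

# Columns of the transposed table: one genotype per SNP (SNP_a, SNP_b).
i1 = ["T/T", "C/T"]
i2 = ["G/T", "C/C"]
ik = ["G/G", "T/T"]

print(ibs_distance(i1, i2))
print(ibs_distance(i1, ik))
```

With this definition, two individuals with identical genotypes have distance 0, and two individuals that share no allele at any SNP have distance 1.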

Sunday, July 18, 2010

Visualizing linkage disequilibrium clusters of genotype SNPs
