Friday, June 12, 2020

Dual Embedding of scRNA-seq Data

Single-cell RNA sequencing (scRNA-seq) is a powerful tool to profile cells under given conditions with respect to selected genes. As a contingency matrix, a scRNA-seq dataset records the expression level of individual cells with respect to a set of genes or related transcriptional signatures. A widely used method to study such scRNA-seq data is to embed the expression profiles of cells, i.e. the rows of the expression matrix, into a low-dimensional space and visualize them as 2D or 3D maps. Those maps offer an intuitive way to trace the phenotype clusters among cells. Yet this kind of map does not show which cells are regulated by which genes, even though the expression matrix carries direct information about the regulating relationship between cells and genes.

This note puts forward a method to embed the expression profiles of cells and genes, i.e. the rows and columns of the expression matrix, simultaneously into a single low-dimensional space. The resulting map visualizes both cells and genes and, more interestingly, also the association between their clusters. The following diagram illustrates the relationship between the dual embedding and the conventional embedding:


The dual embedding method uses the tSNE algorithm for the embedding. In particular, let C and G denote the sets of cells and genes, and let dc and dg be distance functions over C and G respectively. With tSNE we can then obtain a cell map and a gene map from (C, dc) and (G, dg) respectively, as indicated in the above diagram. As an extension of these methods, the dual embedding method embeds both C and G into a single low-dimensional space with tSNE. To do so, we define a distance function dcg over C∪G as follows:

In the above formula, M is the expression matrix of the scRNA-seq dataset; M[x,y] stands for the expression level of the x-th cell w.r.t. the y-th gene; Mmax is the maximal element of M; h is a positive bias factor that additionally adjusts the distances between cells and genes. The last two terms in the above formula define the distance between C and G in the sense that a higher expression level of a gene in a cell leads to a lower distance between them. Additionally, a larger bias factor h leads to a larger separation between cells and genes.
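As a rough illustration of how such a combined distance matrix could be assembled, here is a minimal NumPy sketch. It assumes Euclidean distances for dc and dg, and it assumes the cell-gene distance has the form h + Mmax - M[x,y], which is only one form consistent with the description above (high expression gives a small distance, larger h gives more separation) and not necessarily the exact formula used. The function name dual_distance_matrix is hypothetical.

```python
import numpy as np

def dual_distance_matrix(M, h=1.0):
    """Build an (n+m) x (n+m) distance matrix over n cells and m genes.

    M is an n_cells x n_genes expression matrix (rows are cells).
    ASSUMPTION: the cell-gene block is h + Mmax - M[x, y]; this is an
    illustrative form, not a confirmed formula.
    """
    M = np.asarray(M, dtype=float)
    n, m = M.shape

    def pairwise_euclidean(X):
        # ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b, clipped at 0 for round-off
        sq = np.sum(X * X, axis=1)
        d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
        d = np.sqrt(np.maximum(d2, 0.0))
        np.fill_diagonal(d, 0.0)
        return d

    D = np.empty((n + m, n + m))
    D[:n, :n] = pairwise_euclidean(M)    # dc: cell-cell distances
    D[n:, n:] = pairwise_euclidean(M.T)  # dg: gene-gene distances
    cross = h + M.max() - M              # assumed cell-gene distances
    D[:n, n:] = cross
    D[n:, :n] = cross.T                  # keep the matrix symmetric
    return D

# Tiny example: 2 cells, 3 genes
M = np.array([[5.0, 0.0, 1.0],
              [4.0, 1.0, 0.0]])
D = dual_distance_matrix(M, h=1.0)
```

A matrix like D can be handed to any tSNE implementation that accepts precomputed distances, for example scikit-learn's TSNE with metric="precomputed" and init="random"; the first n output points are then cells and the remaining m are genes.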

As an example, I used a dataset with 12000 cells and 8000 genes from the data used in the publication tasic2018. The following image shows a cell-gene map of this dataset. The expression matrix has been log-transformed; the perplexity for the tSNE algorithm is about 1000.
As a feature of embedding maps, we can expect that two closely located genes or cells have similar expression profiles. Additionally, a gene (red dot) should have a high expression level in closely located cells (green dots). As the cells and genes overlap each other significantly in the map, the following image uses a GIF animation to bring the two groups successively into the foreground.

Figure 1: Dual Embedding of cells and genes with tSNE Algorithm. 

We notice here that the cells show a clear cluster structure, whereas the genes are rather less structured, except for a large gene cluster located on the far right side; those genes might be less relevant for the formation of these cells.

We also notice that, as shown in the following image, there are 4 or 5 cell clusters in the upper right corner whose nearby genes form a kind of sequential path. Those genes might play a special role in regulating those cell clusters.


Embedding with Correlation Affinity

Notice that scRNA-seq datasets are basically contingency tables of counts for cell/gene co-expression events; thus the nature of an expression matrix is more probabilistic than spatial. For this reason, we propose here a correlation distance for the embedding algorithm to replace the Euclidean distance.

More specifically, let M' be the row-normalized version of M, defined by

M'[x,y] := (M[x,y] - mean(M[x,*])) / ||M[x,*] - mean(M[x,*])||

where mean(M[x,*]) is the mean value of the x-th row of M, and the denominator is the norm of that row after its mean has been subtracted. Similarly, let M" be the column-normalized version of M, defined by:


M"[x,y] := (M[x,y] - mean(M[*,y])) / ||M[*,y] - mean(M[*,y])||

After the normalization, the correlation coefficient between two rows or two columns of M is just the scalar product of the corresponding rows of M' or columns of M". We now define our correlation distance pcg over C∪G as follows:
The operator ⦁ in the above formula denotes the scalar product of two row or column vectors; h is, as before, a positive bias factor that additionally adjusts the distances between cells and genes. The default value for h is 1.0. Notice also that the matrix M doesn't need to be log-transformed, as it does for the Euclidean distance function, because the normalization scales the numbers into a common range.
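The claim that correlations reduce to scalar products after normalization can be checked numerically. The sketch below assumes that each row (or column) is mean-centered and then divided by the norm of the centered vector; the helper name center_and_scale_rows is hypothetical, and the sketch covers only the cell-cell and gene-gene correlations, not the cell-gene part of pcg.

```python
import numpy as np

def center_and_scale_rows(M):
    """Subtract each row's mean, then scale the row to unit Euclidean norm."""
    M = np.asarray(M, dtype=float)
    centered = M - M.mean(axis=1, keepdims=True)
    norms = np.linalg.norm(centered, axis=1, keepdims=True)
    norms[norms == 0.0] = 1.0  # constant rows stay all-zero instead of NaN
    return centered / norms

# Toy expression matrix: 3 cells (rows) x 4 genes (columns)
M = np.array([[1.0, 2.0, 3.0, 4.0],
              [2.0, 1.0, 4.0, 3.0],
              [4.0, 3.0, 2.0, 1.0]])

M_rows = center_and_scale_rows(M)      # plays the role of M'
M_cols = center_and_scale_rows(M.T).T  # plays the role of M"

cell_corr = M_rows @ M_rows.T  # scalar products of rows of M'
gene_corr = M_cols.T @ M_cols  # scalar products of columns of M"
```

Here cell_corr and gene_corr agree with the Pearson correlation matrices returned by np.corrcoef(M) and np.corrcoef(M, rowvar=False) respectively.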

As an example, the following picture shows the dual embedding of the data used in the previous example, obtained by applying tSNE with the correlation distance pcg:


Figure 2: Dual Embedding of cells and genes with tSNE and correlation distance  pcg. 

Compared to Figure 1, we notice that in this dual embedding the genes show a clearer and simpler cluster structure, and there is much less overlap between the clusters, although quite a few clusters intertwine with each other. In general, a dual embedding map has the property that genes are highly expressed in nearby cells. As shown in Figure 2, the highest-density gene clusters are rather distant from the major cell clusters; this indicates that the majority of those genes probably don't participate in the cell regulation.

Dual Embedding with Eigengenes

Mathematically, eigengenes are the top principal components of the expression matrix, i.e. those with the largest eigenvalues. Eigengenes are normally obtained with the SVD or PCA algorithm. They are widely used as a way to reduce the dimensionality of a dataset, and therefore the complexity of calculations.
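As a minimal sketch of how eigengenes might be computed, the following NumPy code applies SVD to a column-centered expression matrix. It assumes rows are cells and that each eigengene is represented by its scores across cells, so that it can stand in for a gene column in the dual embedding; the function name eigengene_matrix and the toy data are hypothetical.

```python
import numpy as np

def eigengene_matrix(M, k):
    """Return an n_cells x k matrix whose columns are the top-k eigengenes.

    ASSUMPTION: each eigengene is a principal-component score vector over
    cells (U * S from the SVD of the gene-centered matrix).
    """
    M = np.asarray(M, dtype=float)
    centered = M - M.mean(axis=0, keepdims=True)  # center each gene column
    U, S, Vt = np.linalg.svd(centered, full_matrices=False)
    return U[:, :k] * S[:k]  # singular values come sorted in descending order

# Toy data: 30 cells x 12 genes of simulated counts
rng = np.random.default_rng(0)
M = rng.poisson(2.0, size=(30, 12)).astype(float)
E = eigengene_matrix(M, k=3)
```

The matrix E can then take the place of the expression matrix, with its k columns treated as genes, when building the distances for the dual embedding.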

We can apply the dual embedding method to eigengenes to visualize their relationship with the cell clusters. The following picture shows a dual embedding of the cells of the above example together with their 50 eigengenes, shown in purple. We notice that the cells in this map form a cluster structure similar to that in Figure 2, whereas the genes are replaced by eigengenes in the neighborhood of the cell clusters. We can see that most cell clusters have 2 or 3 nearby eigengenes, which suggests that they are regulated by those eigengenes.


These visual characteristics indirectly support the correlation-based distance as a proper choice for the dual embedding algorithm.

Conclusion

We have introduced a correlation-based distance function for cell and gene expression profiles. With the help of an embedding algorithm like tSNE, we can obtain a low-dimensional map of both cells and genes. These dual embedding maps show not only the clusters among cells and genes, but also their correlation through their proximity in the maps.

As demonstrated with real examples, correlation in the statistical sense seems to be a better metric for visualizing entities of different nature, like the cells and genes in a scRNA-seq dataset, and more generally in large contingency tables of counts. The dual embedding provides an intuitive way to visualize and explore such data.

All simulations and examples in this note were done with the software package VisuMap and a custom plug-in module named Single Cell Analysis, which implements in particular the two distance functions described in this note. The plug-in can be downloaded and installed from VisuMap online.