
1 Road map

2 Application: inferring steps in tumor metastasis in a breast cancer patient

We’ll examine data distributed with a 2021 Genome Biology paper from the Gabor Marth lab.

Clinical sequence of interventions.

Event sequence.

2.1 A view of copy number aberrations for 1Mb tiling

28 tumors were sampled and sequenced in a rapid autopsy procedure. Copy number variation was assessed using FACETS.

Tumors were taken from the following tissues: Br (Breast), Bo (Bone), Bn (Brain), Ln (Lung), Lv (Liver), Pa (Pancreas), Ly (Lymph nodes), Kd (Kidney).

A plotly-based visualization

The vertical ordering of tissues is chosen to bring out similarities between samples.

For example, the block of blue on chr10 appears in only three samples; it indicates a deletion.

2.2 A cluster analysis proposed in support of the evolutionary map

This code is lightly modified from a script distributed at https://github.com/xiaomengh/tumor-evo-rapid-autopsy.git.

2.3 Drilling down on the clustering

2.3.1 Comparing Euclidean and Correlation distances

For a given correlation distance, the Euclidean distance can vary widely, and vice versa.

Open question: What distance metric is most relevant for biological interpretation of CNV?
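The contrast between the two distances can be illustrated with a small, self-contained Python sketch that is independent of the analysis code. It assumes correlation distance means one minus the Pearson correlation, which may differ from the definition used in the original script, and the profiles are made-up values rather than data from the study.

```python
import math

def euclidean_distance(x, y):
    """Straight-line distance between two equal-length profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def correlation_distance(x, y):
    """One minus the Pearson correlation of two equal-length profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return 1.0 - cov / math.sqrt(ss_x * ss_y)

# Hypothetical copy-number-like values over six tiles (not taken from the data).
base = [0.0, 0.5, -1.0, 0.2, 1.0, -0.3]

# Same magnitudes with every gain turned into a loss and vice versa.
flipped = [-v for v in base]

# Same shape moved by a constant, chosen so that its Euclidean
# distance from `base` matches that of `flipped`.
offset = euclidean_distance(base, flipped) / math.sqrt(len(base))
shifted = [v + offset for v in base]

for name, other in [("shifted", shifted), ("flipped", flipped)]:
    print(name,
          round(euclidean_distance(base, other), 3),
          round(correlation_distance(base, other), 3))
```

Both comparisons have the same Euclidean distance, yet the shifted profile is at correlation distance 0 from the base profile while the flipped profile is at the maximum of 2. Which behaviour is preferable depends on whether a uniform offset between samples should count as a difference.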

2.3.3 Redo clustering with alternative distance and agglomeration method

2.3.4 Silhouette measure

From ?silhouette in the cluster package:

For each observation i, the _silhouette width_ s(i) is defined as follows:

     Put a(i) = average dissimilarity between i and all other points of
     the cluster to which i belongs (if i is the _only_ observation in
     its cluster, s(i) := 0 without further calculations).  For all
     _other_ clusters C, put d(i,C) = average dissimilarity of i to all
     observations of C.  The smallest of these d(i,C) is b(i) := \min_C
     d(i,C), and can be seen as the dissimilarity between i and its
     "neighbor" cluster, i.e., the nearest one to which it does _not_
     belong.  Finally,

                   s(i) := ( b(i) - a(i) ) / max( a(i), b(i) ).         
     
     'silhouette.default()' is now based on C code donated by Romain
     Francois (the R version being still available as
     'cluster:::silhouette.default.R').

     Observations with a large s(i) (almost 1) are very well clustered,
     a small s(i) (around 0) means that the observation lies between
     two clusters, and observations with a negative s(i) are probably
     placed in the wrong cluster.
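The definition quoted above can be followed step by step in a short Python sketch. This is not the implementation in the cluster package; the function name `silhouette_widths`, the Euclidean dissimilarity, and the one-dimensional toy points are all assumptions made for illustration.

```python
import math

def euclidean(x, y):
    """Euclidean dissimilarity between two points given as sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def silhouette_widths(points, labels, dist=euclidean):
    """Return s(i) for every observation; needs at least two clusters."""
    clusters = sorted(set(labels))
    widths = []
    for i, (p, own_label) in enumerate(zip(points, labels)):
        # a(i): average dissimilarity to the other members of i's cluster.
        own = [dist(p, q) for j, (q, lab) in enumerate(zip(points, labels))
               if lab == own_label and j != i]
        if not own:
            # i is the only member of its cluster, so s(i) := 0.
            widths.append(0.0)
            continue
        a = sum(own) / len(own)
        # b(i): smallest average dissimilarity to any other cluster.
        b = min(
            sum(dist(p, q) for q, lab in zip(points, labels) if lab == c)
            / labels.count(c)
            for c in clusters if c != own_label
        )
        denom = max(a, b)
        # Guard against identical points, where a(i) = b(i) = 0.
        widths.append((b - a) / denom if denom > 0 else 0.0)
    return widths

# Toy example: two tight groups, plus one point near the first
# group that has been labelled as a member of the second.
points = [(0.0,), (1.0,), (10.0,), (11.0,), (1.5,)]
labels = ["A", "A", "B", "B", "B"]

for point, label, s in zip(points, labels, silhouette_widths(points, labels)):
    print(point, label, round(s, 3))
```

The last point sits next to cluster A but carries label B, so its a(i) exceeds its b(i) and its silhouette width is negative, matching the interpretation in the help text.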

3 Clustering single cell RNA-seq

This code is taken verbatim from the bluster “diagnostics” vignette.

3.1 Acquire Grun et al.'s single cell RNA-seq dataset

Grun et al. 2016 (https://www.sciencedirect.com/science/article/pii/S1934590916300947) define an algorithm, StemID, that infers candidate multipotent cell populations in the human pancreas.

## snapshotDate(): 2022-04-26
## see ?scRNAseq and browseVignettes('scRNAseq') for documentation
## loading from cache
## Warning in .get_med_and_mad(metric, batch = batch, subset = subset,
## share.medians = share.medians, : missing values ignored during outlier detection

3.2 Clustering using a nearest-neighbor graph; visualization via t-SNE and PCA

From bluster’s makeSNNGraph help page:

   The 'makeSNNGraph' function builds a shared nearest-neighbour
   graph using observations as nodes. For each observation, its 'k'
   nearest neighbours are identified using the 'findKNN' function,
   based on distances between their expression profiles (Euclidean by
   default). An edge is drawn between all pairs of observations that
   share at least one neighbour, weighted by the characteristics of
   the shared nearest neighbors - see "Weighting Schemes" below.

   The aim is to use the SNN graph to perform clustering of
   observations via community detection algorithms in the 'igraph'
   package. This is faster and more memory efficient than
   hierarchical clustering for large numbers of observations. In
   particular, it avoids the need to construct a distance matrix for
   all pairs of observations. Only the identities of nearest
   neighbours are required, which can be obtained quickly with
   methods in the 'BiocNeighbors' package.
## clusters
##   1   2   3   4   5   6   7   8   9  10  11  12 
## 285 171 161  59 174  49  70 137  69  65  28  23
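
The graph construction described in the help text can be sketched in a few lines of Python. This is a simplified illustration, not bluster's implementation: the function names are hypothetical, the neighbour search is brute force, each observation is assumed to count as a member of its own neighbour set, and each edge is weighted simply by the number of shared neighbours, which is only one possible weighting choice.

```python
import math
from itertools import combinations

def k_nearest(points, k):
    """For each point, the set of indices of its k nearest neighbours (Euclidean)."""
    nearest = []
    for i, p in enumerate(points):
        ranked = sorted((math.dist(p, q), j)
                        for j, q in enumerate(points) if j != i)
        nearest.append({j for _, j in ranked[:k]})
    return nearest

def snn_edges(points, k):
    """Map each pair (i, j) sharing at least one neighbour to its edge weight."""
    nearest = k_nearest(points, k)
    # Assumption of this sketch: a point belongs to its own neighbour set.
    neighbourhoods = [nbrs | {i} for i, nbrs in enumerate(nearest)]
    edges = {}
    for i, j in combinations(range(len(points)), 2):
        shared = len(neighbourhoods[i] & neighbourhoods[j])
        if shared > 0:
            # Weight by the count of shared neighbours (a simple choice).
            edges[(i, j)] = shared
    return edges

# Toy data: two well-separated groups of three points each.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
edges = snn_edges(points, k=2)

for (i, j), weight in sorted(edges.items()):
    print(i, j, weight)
```

In this toy example every edge joins two points from the same group, so a community detection step run on the graph would have an easy job separating them. The brute-force search here compares all pairs and is only suitable for small inputs; the help text notes that the real function relies on fast neighbour search from BiocNeighbors.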