--- title: "Using crumblr in practice" subtitle: '' author: "Developed by [Gabriel Hoffman](http://gabrielhoffman.github.io/)" date: "Run on `r Sys.time()`" documentclass: article output: BiocStyle::html_document vignette: > %\VignetteIndexEntry{crumblr} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} %\usepackage[utf8]{inputenc} --- ```{r setup, echo=FALSE, results="hide"} knitr::opts_chunk$set( tidy = FALSE, cache = TRUE, echo = TRUE, dev = c("png", "pdf"), package.startup.message = FALSE, message = FALSE, error = FALSE, warning = FALSE ) options(width = 100) ``` ## Introduction Changes in cell type composition play an important role in health and disease. Recent advances in single cell technology have enabled measurement of cell type composition at increasing cell lineage resolution across large cohorts of individuals. Yet this raises new challenges for statistical analysis of these compositional data to identify changes associated with a phenotype. We introduce `crumblr`, a scalable statistical method for analyzing count ratio data using precision-weighted linear models incorporating random effects for complex study designs. Uniquely, `crumblr` performs tests of association at multiple levels of the cell lineage hierarchy using multivariate regression to increase power over tests of a single cell component. In simulations, `crumblr` increases power compared to existing methods, while controlling the false positive rate. The `crumblr` package integrates with the [`variancePartition`](https://www.bioconductor.org/packages/variancePartition/) and [`dreamlet`](https://www.bioconductor.org/packages/dreamlet/) packages in the Bioconductor ecosystem. Here we consider counts for 8 cell types from quantified using single cell RNA-seq data from unstimulated and interferon β stimulated PBMCs from 8 subjects [(Kang, et al., 2018)](https://www.nature.com/articles/nbt.4042). The functions here incorporate the precision weights: - `variancePartition::fitExtractVarPartModel()` - `variancePartition::dream()` - `limma::lmFit()` # Installation To install this package, start R and enter: ```{r install, eval=FALSE} # 1) Make sure Bioconductor is installed if (!require("BiocManager", quietly = TRUE)) { install.packages("BiocManager") } # 2) Install crumblr and dependencies: # From Bioconductor BiocManager::install("crumblr") ``` ## Process data Here we evaluate whether the observed cell proportions change in response to interferon β. Given the results here, we cannot reject the null hypothesis that interferon β does not affect the cell type proportions. ```{r crumblr} library(crumblr) # Load cell counts, clustering and metadata # from Kang, et al. (2018) https://doi.org/10.1038/nbt.4042 data(IFNCellCounts) # Apply crumblr transformation # cobj is an EList object compatable with limma workflow # cobj$E stores transformed values # cobj$weights stores precision weights # corresponding to the regularized inverse variance cobj <- crumblr(df_cellCounts) ``` ## Variance partitioning Decomposing the variance illustrates that more variation is explained by subject than stimulation status. ```{r vp, fig.height=4, fig.width=6} library(variancePartition) # Partition variance into components for Subject (i.e. ind) # and stimulation status, and residual variation form <- ~ (1 | ind) + (1 | StimStatus) vp <- fitExtractVarPartModel(cobj, form, info) # Plot variance fractions fig.vp <- plotPercentBars(vp) fig.vp ``` ## PCA Performing PCA on the transformed cell counts indicates that the samples cluster based on subject rather than stimulation status. ```{r pca, fig.height=5, fig.width=5} library(ggplot2) # Perform PCA # use crumblr::standardize() to get values with # approximately equal sampling variance, # which is a key property for downstream PCA and clustering analysis. pca <- prcomp(t(standardize(cobj))) # merge with metadata df_pca <- merge(pca$x, info, by = "row.names") # Plot PCA # color by Subject # shape by Stimulated vs unstimulated ggplot(df_pca, aes(PC1, PC2, color = as.character(ind), shape = StimStatus)) + geom_point(size = 3) + theme_classic() + theme(aspect.ratio = 1) + scale_color_discrete(name = "Subject") + xlab("PC1") + ylab("PC2") ``` ## Hierarchical clustering The samples from the same subject also cluster together. ```{r hclust} heatmap(cobj$E) ``` ## Differential testing ```{r dream} # Use variancePartition workflow to analyze each cell type # Perform regression on each cell type separately # then use eBayes to shrink residual variance # Also compatible with limma::lmFit() fit <- dream(cobj, ~ StimStatus + ind, info) fit <- eBayes(fit) # Extract results for each cell type topTable(fit, coef = "StimStatusstim", number = Inf) ``` ### Multivariate testing along a tree We can gain power by jointly testing multiple cell types using a multivariate statistical model, instead of testing one cell type at a time. Here we construct a hierarchical clustering between cell types based on gene expression from pseudobulk, and perform a multivariate test for each internal node of the tree based on its leaf nodes. The results for the leaves are the same as from `topTable()` above. At each internal node `treeTest()` performs a fixed effects meta-analysis of the coefficients of the leaves while modeling the covariance between coefficient estimates. In the backend, this is implemented using `variancePartition::mvTest()` and [remaCor](https://cran.r-project.org/package=remaCor) package. ```{r treeTest} # Perform multivariate test across the hierarchy res <- treeTest(fit, cobj, hcl, coef = "StimStatusstim") # Plot hierarchy and testing results plotTreeTest(res) # Plot hierarchy and regression coefficients plotTreeTestBeta(res) ``` #### Combined plotting ```{r combined, fig.width=12} plotTreeTestBeta(res) + theme(legend.position = "bottom", legend.box = "vertical") | plotForest(res, hide = FALSE) | fig.vp ``` # Session Info

```{r session, echo=FALSE} sessionInfo() ```