--- title: "Get Started with DeconvoBuddies" author: - name: Louise Huuki-Myers affiliation: - &libd Lieber Institute for Brain Development, Johns Hopkins Medical Campus email: lahuuki@gmail.com output: BiocStyle::html_document: self_contained: yes toc: true toc_float: true toc_depth: 2 code_folding: show date: "`r doc_date()`" package: "`r pkg_ver('DeconvoBuddies')`" vignette: > %\VignetteIndexEntry{Get Started with DeconvoBuddies} %\VignetteEncoding{UTF-8} %\VignetteEngine{knitr::rmarkdown} editor_options: markdown: wrap: 72 --- ```{r setup, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", crop = NULL ## Related to https://stat.ethz.ch/pipermail/bioc-devel/2020-April/016656.html ) ``` ```{r vignetteSetup, echo=FALSE, message=FALSE, warning = FALSE} ## Bib setup library("RefManageR") ## Write bibliography information bib <- c( R = citation(), AnnotationHub = citation("AnnotationHub")[1], BiocFileCache = citation("BiocFileCache")[1], dplyr = citation("dplyr")[1], ExperimentHub = citation("ExperimentHub")[1], ggplot2 = citation("ggplot2")[1], graphics = citation("graphics")[1], grDevices = citation("grDevices")[1], matrixStats = citation("matrixStats")[1], methods = citation("methods")[1], purrr = citation("purrr")[1], rafalib = citation("rafalib")[1], RColorBrewer = citation("RColorBrewer")[1], reshape2 = citation("reshape2")[1], S4Vectors = citation("S4Vectors")[1], scran = citation("scran")[1], SingleCellExperiment = citation("SingleCellExperiment")[1], spatialLIBD = citation("spatialLIBD")[1], stats = citation("stats")[1], stringr = citation("stringr")[1], SummarizedExperiment = citation("SummarizedExperiment")[1], tibble = citation("tibble")[1], utils = citation("utils")[1], Biobase = citation("Biobase")[1], BiocStyle = citation("BiocStyle")[1], BisqueRNA = citation("BisqueRNA")[1], covr = citation("covr")[1], HDF5Array = citation("HDF5Array")[1], knitr = citation("knitr")[1], RefManageR = citation("RefManageR")[1], rmarkdown = citation("rmarkdown")[1], sessioninfo = citation("sessioninfo")[1], testthat = citation("testthat")[1], tidyr = citation("tidyr")[1], tidyverse = citation("tidyverse")[1], DeconvoBuddies = citation("DeconvoBuddies")[1], DeconvoBuddiespaper = citation("DeconvoBuddies")[2] ) ``` # Introduction `DeconvoBuddies` is an R package developed to assist in running bulk RNA-seq deconvolution. This package provides functions to access a data set designed to evaluate deconvolution method performance, find marker genes, and create plots useful in deconvolution. This package is associated with "Benchmark of cellular deconvolution methods using a multi-assay reference dataset from postmortem human prefrontal cortex" from Huuki-Myers et al. (10.1101/2024.02.09.579665v2)[https://www.biorxiv.org/content/10.1101/2024.02.09.579665v2]. # Basics ## Install `DeconvoBuddies` `R` is an open-source statistical environment which can be easily modified to enhance its functionality via packages. `r Biocpkg("DeconvoBuddies")` is a `R` package available via the [Bioconductor](http://bioconductor.org) repository for packages. `R` can be installed on any operating system from [CRAN](https://cran.r-project.org/) after which you can install `r Biocpkg("DeconvoBuddies")` by using the following commands in your `R` session: ```{r "install", eval = FALSE} if (!requireNamespace("BiocManager", quietly = TRUE)) { install.packages("BiocManager") } BiocManager::install("DeconvoBuddies") ## Check that you have a valid Bioconductor installation BiocManager::valid() ``` ## Required knowledge `r Biocpkg("DeconvoBuddies")` is based on many other packages and in particular in those that have implemented the infrastructure needed for dealing with snRNA-seq data. That is, packages like `r Biocpkg("SingleCellExperiment")`. If you are asking yourself the question "Where do I start using Bioconductor?" you might be interested in [this blog post](http://lcolladotor.github.io/2014/10/16/startbioc/#.VkOKbq6rRuU). ## Asking for help As package developers, we try to explain clearly how to use our packages and in which order to use the functions. But `R` and `Bioconductor` have a steep learning curve so it is critical to learn where to ask for help. The blog post quoted above mentions some but we would like to highlight the [Bioconductor support site](https://support.bioconductor.org/) as the main resource for getting help: remember to use the `DeconvoBuddies` tag and check [the older posts](https://support.bioconductor.org/tag/DeconvoBuddies/). Other alternatives are available such as creating GitHub issues and tweeting. However, please note that if you want to receive help you should adhere to the [posting guidelines](http://www.bioconductor.org/help/support/posting-guide/). It is particularly critical that you provide a small reproducible example and your session information so package developers can track down the source of the error. ## Citing `DeconvoBuddies` We hope that `r Biocpkg("DeconvoBuddies")` will be useful for your research. Please use the following information to cite the package and the overall approach. Thank you! ```{r "citation"} ## Citation info citation("DeconvoBuddies") ``` # Quick start to using `DeconvoBuddies` Let's load some packages we'll use in this vignette. ```{r "load packages", message=FALSE, warning=FALSE} suppressMessages({ library("DeconvoBuddies") library("SummarizedExperiment") library("dplyr") library("tidyr") library("tibble") }) ``` ## Access Data Use `fetch_deconvo_data` to download RNA sequencing data from the [Human DLPFC](https://github.com/LieberInstitute/Human_DLPFC_Deconvolution) `r Citep(bib[["DeconvoBuddiespaper"]])`. - `rse_gene`: 110 samples of bulk RNA-seq. [110 bulk RNA-seq samples x 21k genes] (41 MB). - `sce` : snRNA-seq data from the Human DLPFC. [77k nuclei x 36k genes] (172 MB) - `sce_DLPFC_example`: Sub-set of `sce` useful for testing. [10k nuclei x 557 genes] (49 MB) ```{r `access data} ## Access and snRNA-seq example data if (!exists("sce_DLPFC_example")) sce_DLPFC_example <- fetch_deconvo_data("sce_DLPFC_example") ## Explore snRNA-seq data in sce_DLPFC_example sce_DLPFC_example ## Access Bulk RNA-seq data if (!exists("rse_gene")) rse_gene <- fetch_deconvo_data("rse_gene") ## Explore bulk data in rse_gene rse_gene ``` For more details on this dataset, and an example deconvolution run check out the [Vignette: Deconvolution Benchmark in Human DLPFC](https://research.libd.org/DeconvoBuddies/articles/Deconvolution_Benchmark_DLPFC.html). ## Marker Finding ### Using *MeanRatio* to Find Cell Type Markers Accurate deconvolution requires highly specific marker genes for each cell type to be defined. To select genes specific for each cell type, you can evaluate the *MeanRatio* for each gene x each cell type, where `MeanRatio = mean(Expression of target cell type) / mean(Expression of highest non-target cell type)`. These values can be calculated for a single cell RNA-seq dataset using `get_mean_ratio()`. This can also work for spatially-resolved transcriptomics datasets. That is, `get_mean_ratio()` can also work with `SpatialExperiment::SpatialExperiment()` objects. ```{r `get_mean_ratio demo`} ## find marker genes with get_mean_ratio marker_stats <- get_mean_ratio( sce_DLPFC_example, cellType_col = "cellType_broad_hc", gene_name = "gene_name", gene_ensembl = "gene_id" ) ## explore tibble output, gene with high MeanRatio values are good marker genes marker_stats ``` For more discussion of finding marker genes with `DeconvoBuddies` check out the [Vignette: Finding Marker Genes with DeconvoBuddies.](https://research.libd.org/DeconvoBuddies/articles/Marker_Finding.html) ## Plotting Tools ### Creating A Cell Type Color palette As you work with single-cell data and deconvolution outputs, it is very useful to establish a consistent color palette to use across different plots. The function `create_cell_colors()` returns a named vector of hex values, corresponding to the names of cell types. This list is compatible with functions like `ggplot2::scale_color_manual()`. There are three palettes to choose from to generate colors or users can provide their own color palette: - "classic" (default): classic set of 8 cell type colors from LIBD, checked for visability and color blind accessibility. - "gg": Equi-distant hues, same process for selecting colors as `ggplot` - no maximum number - "tableau": [tableau20](https://jrnold.github.io/ggthemes/reference/tableau_color_pal.html) color set - max 20 colors ```{r `create_cell_colors demo 1`} test_cell_types <- c("cell_A", "cell_B", "cell_C", "cell_D", "cell_E") ## Preview "classic" colors test_cell_colors_classic <- create_cell_colors( cell_types = test_cell_types, palette_name = "classic", preview = TRUE ) ## Preview "gg" colors test_cell_colors_gg <- create_cell_colors( cell_types = test_cell_types, palette_name = "gg", preview = TRUE ) ## Preview "tableau" colors test_cell_colors_tableau <- create_cell_colors( cell_types = test_cell_types, palette_name = "tableau", preview = TRUE ) ## Check the color hex codes for "tableau" test_cell_colors_tableau ## Provide a palette from RColorBrewer test_cell_colors_brew <- create_cell_colors( cell_types = test_cell_types, palette = RColorBrewer::brewer.pal(n = length(test_cell_types), name = "Dark2"), preview = TRUE ) ``` If there are sub-cell types with consistent delimiters, the `split` argument creates a scale of related colors. This helps expand on the maximum number of colors and makes your palette flexible when considering different 'resolutions' of cell types. This works by ignoring any prefixes after the `split` character. In this example below, `Excit_01` and `Excit_02` will just be considered as `Excit` since `split = "_"`. ```{r create_cell_colors demo 2`} my_cell_types <- levels(sce_DLPFC_example$cellType_hc) ## Ignore any suffix after the "_" character by using the "split" argument my_cell_colors <- create_cell_colors( cell_types = my_cell_types, palette_name = "classic", preview = TRUE, split = "_" ) ``` ### Plot Expression of Top Markers The function `plot_marker_express()` helps quickly visualize expression of top marker genes, by ordering and annotating violin plots of expression over cell type. Here we'll plot the expression of the top 6 marker genes for Astrocytes. ```{r `plot_marker_expression demo`} # plot expression of the top 6 Astro marker genes plot_marker_express( sce = sce_DLPFC_example, stats = marker_stats, cell_type = "Astro", n_genes = 6, cellType_col = "cellType_broad_hc", color_pal = my_cell_colors ) ``` The violin plots of gene expression confirm the cell type specificity of these marker genes, most of the nuclei with high expression of these six genes are astrocytes (Astro). ### Plot Composition Bar Plot The output of deconvolution are cell type estimates that sum to 1. A good visulization for these predictions is a stacked bar plot. The function `plot_composition_bar()` creates a stacked bar plot showing the cell type proportion for each sample, or the average proportion for a group of samples. In this example data, the `RNum` is a sample (donor) identifier and `Dx` is a group variable for the diagnosis status of the donors. ```{r `demo plot_composition_bar`} # load example data data("rse_bulk_test") data("est_prop") # access the colData of a test rse dataset pd <- colData(rse_bulk_test) |> as.data.frame() ## pivot data to long format and join with test estimated proportion data est_prop_long <- est_prop |> rownames_to_column("RNum") |> pivot_longer(!RNum, names_to = "cell_type", values_to = "prop") |> left_join(pd) ## explore est_prop_long est_prop_long ## the composition bar plot shows cell type composition for Sample plot_composition_bar(est_prop_long, x_col = "RNum", add_text = FALSE ) + ggplot2::scale_fill_manual(values = test_cell_colors_classic) ## the composition bar plot shows the average cell type composition for each Dx plot_composition_bar(est_prop_long, x_col = "Dx") + ggplot2::scale_fill_manual(values = test_cell_colors_classic) ``` We can see that the mean proportions of cell types A through E are very similar across the `Dx` groups (`Case` and `Control`). In this case, this is expected given that we are using simulated data. Although if you look across each donor with `RNum` we can see more variability across the simulated data. Since you are now familiar with the basic overview of `DeconvoBuddies`, you are now ready to dive deeper into: * the [selection of marker genes](https://research.libd.org/DeconvoBuddies/articles/Marker_Finding.html), * followed by the application of `DeconvoBuddies` to the [Human Brain (DLPFC) Deconvolution dataset](https://research.libd.org/DeconvoBuddies/articles/Deconvolution_Benchmark_DLPFC.html). # Reproducibility The `r Biocpkg("DeconvoBuddies")` package `r Citep(bib[["DeconvoBuddies"]])` was made possible thanks to: * R `r Citep(bib[["R"]])` * `r Biocpkg("AnnotationHub")` `r Citep(bib[["AnnotationHub"]])` * `r Biocpkg("BiocFileCache")` `r Citep(bib[["BiocFileCache"]])` * `r CRANpkg("dplyr")` `r Citep(bib[["dplyr"]])` * `r Biocpkg("ExperimentHub")` `r Citep(bib[["ExperimentHub"]])` * `r CRANpkg("ggplot2")` `r Citep(bib[["ggplot2"]])` * `r CRANpkg("graphics")` `r Citep(bib[["graphics"]])` * `r CRANpkg("grDevices")` `r Citep(bib[["grDevices"]])` * `r CRANpkg("matrixStats")` `r Citep(bib[["matrixStats"]])` * `r CRANpkg("methods")` `r Citep(bib[["methods"]])` * `r CRANpkg("purrr")` `r Citep(bib[["purrr"]])` * `r CRANpkg("rafalib")` `r Citep(bib[["rafalib"]])` * `r CRANpkg("reshape2")` `r Citep(bib[["reshape2"]])` * `r Biocpkg("S4Vectors")` `r Citep(bib[["S4Vectors"]])` * `r Biocpkg("scran")` `r Citep(bib[["scran"]])` * `r Biocpkg("SingleCellExperiment")` `r Citep(bib[["SingleCellExperiment"]])` * `r Biocpkg("spatialLIBD")` `r Citep(bib[["spatialLIBD"]])` * `r CRANpkg("stats")` `r Citep(bib[["stats"]])` * `r CRANpkg("stringr")` `r Citep(bib[["stringr"]])` * `r Biocpkg("SummarizedExperiment")` `r Citep(bib[["SummarizedExperiment"]])` * `r CRANpkg("tibble")` `r Citep(bib[["tibble"]])` * `r CRANpkg("utils")` `r Citep(bib[["utils"]])` This vignette was generated using `r Biocpkg("BiocStyle")` `r Citep(bib[["BiocStyle"]])` with `r CRANpkg("knitr")` `r Citep(bib[["knitr"]])` and `r CRANpkg("rmarkdown")` `r Citep(bib[["rmarkdown"]])` running behind the scenes. Citations made with `r CRANpkg("RefManageR")` `r Citep(bib[["RefManageR"]])`. This package was developed using `r BiocStyle::Biocpkg("biocthis")`. `R` session information. ```{r reproduce3, echo=FALSE} ## Session info library("sessioninfo") options(width = 120) session_info() ``` # Bibliography ```{r vignetteBiblio, results = "asis", echo = FALSE, warning = FALSE, message = FALSE} ## Print bibliography PrintBibliography(bib, .opts = list(hyperlink = "to.doc", style = "html")) ```