--- title: "RCTD Tutorial" author: "Dylan Cable and Gabriel Grajeda" date: "March 11th, 2025" output: BiocStyle::html_document: toc: true toc_depth: 3 vignette: > %\VignetteIndexEntry{rctd-tutorial} %\VignetteEncoding{UTF-8} %\VignetteEngine{knitr::rmarkdown} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", out.width = "100%" ) ``` # Installation You can install the latest version of spacexr from Bioconductor: ```{r installation, eval=FALSE} if (!require("BiocManager", quietly = TRUE)) install.packages("BiocManager") BiocManager::install("spacexr") ``` # Introduction Robust Cell Type Decomposition (RCTD) is a statistical method for decomposing cell type mixtures in spatial transcriptomics data. In this vignette, we will use a simulated dataset to demonstrate how you can run RCTD on spatial transcriptomics data and visualize your results. We will use the term *pixel*, synonymous with *spot* or *bead*, to refer to any observation that measures gene expression at a fixed location. We do not use the term *cell* because, in spatial transcriptomics data, single *pixels* may contain mixtures of multiple cells. The purpose of RCTD is to estimate the proportions of different cell types in each pixel of a dataset. # Setup ```{r setup, message = FALSE} library(ggplot2) library(SpatialExperiment) library(SummarizedExperiment) library(spacexr) ``` # Data Preprocessing Before running RCTD, we must first load in: - an annotated reference dataset, which is used to learn cell type profiles. - a spatial transcriptomics dataset, which contains pixels for deconvolution. ## Reference The reference data should be stored in a `r BiocStyle::Biocpkg("SummarizedExperiment")` object with: *Required components*: 1. `assay`: A matrix (or `dgCMatrix`) of gene expression counts, with genes as rows and cells as columns. Row names should be unique gene names, and column names should be unique cell barcodes. Counts should be untransformed integers. 2. `colData`: A factor column containing cell type annotations for each cell in the reference. The column name is assumed to be "cell_type", but another name may be specified using the `cell_type_col` parameter in `createRctd`. *Optional components*: 1. `colData`: A numeric column named `nUMI`, containing total UMI counts for each cell. If not provided, `nUMI` will be calculated as the column sums of the counts matrix. In practice, your reference could be an annotated dataset from: - Single-nucleus RNA sequencing (snRNA-seq) - Single-cell RNA sequencing (scRNA-seq) - Cell type-specific bulk RNA sequencing Here, we load counts and cell type annotations from the `rctdSim` object, but in a real-world setting, the reference information may be distributed across several files or R objects. ```{r reference} # Load simulated data. data("rctdSim") # Create SummarizedExperiment. reference_se <- SummarizedExperiment( assays = list(counts = rctdSim$reference_counts), colData = rctdSim$reference_cell_types ) # Examine reference. (optional) print(dim(assay(reference_se))) # Gene expression matrix dimensions table(colData(reference_se)$cell_type) # Number of occurrences of each cell type ``` In this vignette, we have a reference dataset of 750 genes for 75 cells. We find that there are three cell types (`ct1`, `ct2`, and `ct3`), each appearing 25 times in the reference. ## Spatial Transcriptomics Data Next, we will load the spatial transcriptomics data into a `r BiocStyle::Biocpkg("SpatialExperiment")` object, though we could use any `r BiocStyle::Biocpkg("SummarizedExperiment")` object with: *Required components*: 1. `assay`: A matrix (or `dgCMatrix`) of gene expression counts, with genes as rows and pixels as columns. Row names should be unique gene names, and column names should be unique pixel barcodes. Counts should be untransformed integers. *Optional components*: 1. `spatialCoords`: A matrix containing x and y coordinates for each pixel. 2. `colData`: A numeric column named `nUMI`, containing total UMI counts for each pixel. If not provided, `nUMI` will be calculated as the column sums of the counts matrix. Here, we load coordinates and counts from the `rctdSim` object. Again, in practice, this step will depend on how you store your spatial transcriptomics data. ```{r spatial, fig.align='center', fig.width=5, fig.height=5, out.width='50%'} # Create SpatialExperiment. spatial_spe <- SpatialExperiment( assay = rctdSim$spatial_rna_counts, spatialCoords = rctdSim$spatial_rna_coords ) # Examine pixels. (optional) print(dim(assay(spatial_spe))) # Gene expression matrix dimensions ggplot(spatialCoords(spatial_spe), aes(x, y)) + geom_point(size = 5) + theme(axis.text = element_blank(), axis.ticks = element_blank() ) ```
We find that we have a dataset of 750 genes for 12 pixels, plotted above. ### Visualizing Ground Truth The ground truth for our cell type mixtures is given in the `r BiocStyle::Biocpkg("SpatialExperiment")` object `rctdSim$proportions_spe`. (The ground truth has been stored in the same format as the RCTD output, which will be discussed later.) Note that this ground truth will not be known in practice---it is what we will need to estimate with RCTD. ```{r trueProportions} unique(t(assay(rctdSim$proportions_spe))) ``` We find that there are two distinct cell type mixtures in our dataset: - 90% `ct1` and 10% `ct2`. - 20% `ct1` and 40% `ct2` and `ct3` each. We can visualize these mixtures with the `plotAllWeights` function, which plots each pixel as a pie chart of cell types. We will return to this function later to visualize our RCTD results. ```{r trueProportionsPlot, fig.height = 5, fig.width = 7, out.width='75%'} plotAllWeights( rctdSim$proportions_spe, r = 0.05, lwd = 0, title = "Ground Truth" ) ```
# Running RCTD Running RCTD is a two-step process: 1. Preprocess the data using the `createRctd` function, which: - Filters pixels and genes based on UMI counts and other thresholds. - Identifies differentially expressed genes. - Creates cell type profiles from the reference data. 2. Run the deconvolution with `runRctd`. ## Step 1: Preprocess Data Before running RCTD, we perform some preprocessing with `createRctd`. Some important configuration options include: - `gene_cutoff`, `fc_cutoff`: Used for filtering genes in platform effect normalization. `gene_cutoff` filters for average expression and `fc_cutoff` filters for log fold change across cell types. - `gene_cutoff_reg`, `fc_cutoff_reg`: Similar to above, but for the RCTD step. - `UMI_min`, `UMI_max`: Minimum and maximum UMI count per pixel. ```{r createRctd, message = FALSE} # Preprocess data rctd_data <- createRctd(spatial_spe, reference_se) ``` ## Step 2: Run RCTD Now we can run RCTD on the preprocessed data with the `runRctd` function. RCTD has three modes, each suited for different spatial technologies: - `"doublet"`: Assigns 1-2 cell types per pixel. Best for high-resolution technologies like Slide-seq and MERFISH. - `"multi"`: Like doublet mode but can assign more cell types (up to `max_multi_types`). Best for lower-resolution technologies like 100-micron Visium. - `"full"`: No restrictions on number of cell types. The `rctd_mode` argument is used to specify the mode in which the RCTD algorithm runs. Other important configuration options include: - `max_cores`: Number of cores used for parallel processing. We recommend setting this to at least 4 to improve efficiency. If set to 1, parallel processing is not used. - `max_multi_types`: Maximum number of cell types per pixel (multi mode only). For the purposes of demonstration, we'll decompose the data into doublets in this vignette. ```{r runRctd, message = FALSE} results_spe <- runRctd(rctd_data, rctd_mode = "doublet", max_cores = 1) ``` # RCTD Results RCTD returns a `r BiocStyle::Biocpkg("SpatialExperiment")` object containing: - Three assays (one in full mode): 1. `weights`: Cell type proportions restricted according to the specified mode 2. `weights_unconfident`: Cell type proportions restricted according to the specified mode, including unconfident predictions (not available in full mode) 2. `weights_full`: Unrestricted cell type proportions (not available in full mode, use `weights` instead) Assays have cell types as rows and pixels as columns, with values representing the proportion (0 to 1) of each cell type in each pixel. Assay columns sum to 1 (except for rejected pixels, which sum to 0). **NOTE**: Weights are transposed compared to the output of the original `r BiocStyle::Githubpkg("dmcable/spacexr", "dmcable/spacexr")` package on GitHub. - Spatial coordinates for each pixel stored in `spatialCoords`. - Additional `colData` including: * For doublet mode: * `spot_class`: Classification as `singlet`, `doublet_certain`, `doublet_uncertain`, or `reject` * `first_type`, `second_type`: Predicted cell types * Additional metrics like `min_score`, `singlet_score` for advanced users * For multi mode: * `cell_type_list`: List of cell types per pixel * `conf_list`: List of whether cell type predictions are confident * Additional metrics like `min_score` for advanced users ## Cell Type Weights We can view the doublet weights for our data: ```{r doubletWeights} # Cell type mixture 1 assay(results_spe, "weights")[, 1:6] # Cell type mixture 2 assay(results_spe, "weights")[, 7:12] ``` You can access the full, unrestricted weights using `assay(results_spe, "weights_full")`. We will visualize these weights later. You may use `assay(results_spe, "weights_unconfident")` to access the less confident predictions (e.g., the `second_type` weight for singlets and uncertain doublets). ## Classifications Below, we can view the results of RCTD deconvolution: ```{r classification} classification_df <- data.frame( pixel = colnames(assay(results_spe)), spot_class = colData(results_spe)$spot_class, first_type = colData(results_spe)$first_type, second_type = colData(results_spe)$second_type ) classification_df ``` In particular, this data frame provides us with the following information: - `spot_class`: A factor variable with the following levels of classification: * `singlet` (1 cell type in the pixel) * `doublet_certain` (2 cell types in the pixel) * `doublet_uncertain` (2 cell types in the pixel, but only confident of 1) * `reject` (no prediction given for pixel) Typically, `reject` pixels should not be used, and `doublet_uncertain` pixels should only be used for applications that do not require knowledge of all cell types in a pixel. - `first_type`: the first cell type predicted on the bead (for all `spot_class` conditions except `reject`). - `second_type`: the second cell type predicted on the bead for doublet `spot_class` conditions (not a confident prediction for `doublet_uncertain`). We can see that RCTD generally succeeded in identifying the components of the cell mixtures, although pixels with 90% `ct1` and 10% `ct2` were classified as singlets. (As an exercise, try changing the `doublet_threshold` parameter in `runRctd` so that these mixtures will be identified as doublets.) # Visualization ## All Cell Types We can visualize the results of our RCTD run with the `plotAllWeights` function. Simply pass in the RCTD results (and some aesthetic arguments), and the cell type proportions will be plotted as pie charts for each pixel. ```{r doubletVisualization, fig.height = 5, fig.width = 7, out.width='75%'} plotAllWeights(results_spe, r = 0.05, lwd = 0, title = "Doublet Results") ```

Note that all mixtures with 90% `ct1` (on the left) were classified as singlets, so only `ct1` is plotted. And for the other mixtures (on the right), only cell types `ct2` and `ct3` are plotted (without any `ct1`) due to the restrictions of doublet mode. We can visualize the full, unrestricted proportions by passing in the name of the relevant assay (`"weights_full"`) to the `plotAllWeights` function: ```{r fullVisualization, fig.height = 5, fig.width = 7, out.width='75%'} plotAllWeights( results_spe, "weights_full", r = 0.05, lwd = 0, title = "Full Results" ) ```

This is very close to the ground truth! ## Single Cell Type Finally, we can visualize how the proportion of a specific cell type varies across space with the `plotCellTypeWeight` function. We show an example with the dummy cell type `ct1`. ```{r densityVisualization, fig.height = 5, fig.width = 7, out.width='75%'} plotCellTypeWeight( results_spe, "ct1", "weights_full", size = 5, stroke = 0.5, title = "\nDensity of Cell Type 1\n" ) ```

You are now ready to apply RCTD to your own spatial transcriptomics datasets! ## Session Information ```{r sessionInfo} sessionInfo() ```