---
title: "RCTD Tutorial"
author: "Dylan Cable and Gabriel Grajeda"
date: "March 11th, 2025"
output:
  BiocStyle::html_document:
    toc: true
    toc_depth: 3
vignette: >
  %\VignetteIndexEntry{rctd-tutorial}
  %\VignetteEncoding{UTF-8}
  %\VignetteEngine{knitr::rmarkdown}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
    collapse = TRUE,
    comment = "#>",
    out.width = "100%"
)
```

# Installation

You can install the latest version of spacexr from Bioconductor:

```{r installation, eval=FALSE}
if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("spacexr")
```

# Introduction

Robust Cell Type Decomposition (RCTD) is a statistical method for decomposing
cell type mixtures in spatial transcriptomics data. In this vignette, we will
use a simulated dataset to demonstrate how you can run RCTD on spatial
transcriptomics data and visualize your results.

We will use the term *pixel*, synonymous with *spot* or *bead*, to refer to any
observation that measures gene expression at a fixed location. We do not use
the term *cell* because, in spatial transcriptomics data, single *pixels* may
contain mixtures of multiple cells. The purpose of RCTD is to estimate the
proportions of different cell types in each pixel of a dataset.

# Setup

```{r setup, message = FALSE}
library(ggplot2)
library(SpatialExperiment)
library(SummarizedExperiment)

library(spacexr)
```

# Data Preprocessing

Before running RCTD, we must first load in:

- an annotated reference dataset, which is used to learn cell type profiles.
- a spatial transcriptomics dataset, which contains pixels for deconvolution.

## Reference

The reference data should be stored in a
`r BiocStyle::Biocpkg("SummarizedExperiment")` object with:

*Required components*:

1. `assay`: A matrix (or `dgCMatrix`) of gene expression counts, with genes as
rows and cells as columns. Row names should be unique gene names, and column
names should be unique cell barcodes. Counts should be untransformed integers.

2. `colData`: A factor column containing cell type annotations for each cell in
the reference. The column name is assumed to be "cell_type", but another name
may be specified using the `cell_type_col` parameter in `createRctd`.

*Optional components*:

1. `colData`: A numeric column named `nUMI`, containing total UMI counts for
each cell. If not provided, `nUMI` will be calculated as the column sums of the
counts matrix.

In practice, your reference could be an annotated dataset from:

- Single-nucleus RNA sequencing (snRNA-seq)
- Single-cell RNA sequencing (scRNA-seq)
- Cell type-specific bulk RNA sequencing

Here, we load counts and cell type annotations from the `rctdSim`
object, but in a real-world setting, the reference information may be
distributed across several files or R objects.

```{r reference}
# Load simulated data.
data("rctdSim")

# Create SummarizedExperiment.
reference_se <- SummarizedExperiment(
    assays = list(counts = rctdSim$reference_counts),
    colData = rctdSim$reference_cell_types
)

# Examine reference. (optional)
print(dim(assay(reference_se))) # Gene expression matrix dimensions
table(colData(reference_se)$cell_type) # Number of occurrences of each cell type
```

In this vignette, we have a reference dataset of 750 genes for 75 cells. We
find that there are three cell types (`ct1`, `ct2`, and `ct3`), each appearing
25 times in the reference.

## Spatial Transcriptomics Data

Next, we will load the spatial transcriptomics data into a
`r BiocStyle::Biocpkg("SpatialExperiment")` object, though we could use any
`r BiocStyle::Biocpkg("SummarizedExperiment")` object with:

*Required components*:

1. `assay`: A matrix (or `dgCMatrix`) of gene expression counts, with genes as
rows and pixels as columns. Row names should be unique gene names, and column
names should be unique pixel barcodes. Counts should be untransformed integers.

*Optional components*:

1. `spatialCoords`: A matrix containing x and y coordinates for each pixel.

2. `colData`: A numeric column named `nUMI`, containing total UMI counts for
each pixel. If not provided, `nUMI` will be calculated as the column sums of the
counts matrix.

Here, we load coordinates and counts from the `rctdSim` object. Again,
in practice, this step will depend on how you store your spatial transcriptomics
data.

```{r spatial, fig.align='center', fig.width=5, fig.height=5, out.width='50%'}
# Create SpatialExperiment.
spatial_spe <- SpatialExperiment(
    assay = rctdSim$spatial_rna_counts,
    spatialCoords = rctdSim$spatial_rna_coords
)

# Examine pixels. (optional)
print(dim(assay(spatial_spe))) # Gene expression matrix dimensions
ggplot(spatialCoords(spatial_spe), aes(x, y)) + 
    geom_point(size = 5) + 
    theme(axis.text = element_blank(), axis.ticks = element_blank()
)
```
<br/>

We find that we have a dataset of 750 genes for 12 pixels, plotted above.

### Visualizing Ground Truth

The ground truth for our cell type mixtures is given in the
`r BiocStyle::Biocpkg("SpatialExperiment")` object `rctdSim$proportions_spe`.
(The ground truth has been stored in the same format as the RCTD output, which
will be discussed later.) Note that this ground truth will not be known in
practice---it is what we will need to estimate with RCTD.

```{r trueProportions}
unique(t(assay(rctdSim$proportions_spe)))
```

We find that there are two distinct cell type mixtures in our dataset:

- 90% `ct1` and 10% `ct2`.
- 20% `ct1` and 40% `ct2` and `ct3` each.

We can visualize these mixtures with the `plotAllWeights` function, which
plots each pixel as a pie chart of cell types. We will return to this function
later to visualize our RCTD results.

```{r trueProportionsPlot, fig.height = 5, fig.width = 7, out.width='75%'}
plotAllWeights(
    rctdSim$proportions_spe,
    r = 0.05, lwd = 0, title = "Ground Truth"
)
```
<br/>

# Running RCTD

Running RCTD is a two-step process:

1. Preprocess the data using the `createRctd` function, which:
    - Filters pixels and genes based on UMI counts and other thresholds.
    - Identifies differentially expressed genes.
    - Creates cell type profiles from the reference data.

2. Run the deconvolution with `runRctd`.

## Step 1: Preprocess Data

Before running RCTD, we perform some preprocessing with `createRctd`. Some
important configuration options include:

- `gene_cutoff`, `fc_cutoff`: Used for filtering genes in platform effect
normalization. `gene_cutoff` filters for average expression and `fc_cutoff`
filters for log fold change across cell types.
- `gene_cutoff_reg`, `fc_cutoff_reg`: Similar to above, but for the RCTD step.
- `UMI_min`, `UMI_max`: Minimum and maximum UMI count per pixel.

```{r createRctd, message = FALSE}
# Preprocess data
rctd_data <- createRctd(spatial_spe, reference_se)
```

## Step 2: Run RCTD

Now we can run RCTD on the preprocessed data with the `runRctd` function.

RCTD has three modes, each suited for different spatial technologies:

- `"doublet"`: Assigns 1-2 cell types per pixel. Best for high-resolution
technologies like Slide-seq and MERFISH.
- `"multi"`: Like doublet mode but can assign more cell types (up to
`max_multi_types`). Best for lower-resolution technologies like 100-micron
Visium.
- `"full"`: No restrictions on number of cell types.

The `rctd_mode` argument is used to specify the mode in which the RCTD algorithm
runs. Other important configuration options include:

- `max_cores`: Number of cores used for parallel processing. We recommend
setting this to at least 4 to improve efficiency. If set to 1, parallel
processing is not used.
- `max_multi_types`: Maximum number of cell types per pixel (multi mode only).

For the purposes of demonstration, we'll decompose the data into doublets in
this vignette.

```{r runRctd, message = FALSE}
results_spe <- runRctd(rctd_data, rctd_mode = "doublet", max_cores = 1)
```

# RCTD Results

RCTD returns a `r BiocStyle::Biocpkg("SpatialExperiment")` object containing:

- Three assays (one in full mode):
    1. `weights`: Cell type proportions restricted according to the specified
    mode
    2. `weights_unconfident`: Cell type proportions restricted according to the
    specified mode, including unconfident predictions (not available in full
    mode)
    2. `weights_full`: Unrestricted cell type proportions (not available in full
    mode, use `weights` instead)
    
    Assays have cell types as rows and pixels as columns, with values
    representing the proportion (0 to 1) of each cell type in each pixel. Assay
    columns sum to 1 (except for rejected pixels, which sum to 0).

    **NOTE**: Weights are transposed compared to the output of the original
    `r BiocStyle::Githubpkg("dmcable/spacexr", "dmcable/spacexr")` package on
    GitHub.

- Spatial coordinates for each pixel stored in `spatialCoords`.

- Additional `colData` including:
    * For doublet mode:
        * `spot_class`: Classification as `singlet`, `doublet_certain`,
        `doublet_uncertain`, or `reject`
        * `first_type`, `second_type`: Predicted cell types
        * Additional metrics like `min_score`, `singlet_score` for advanced
        users
    * For multi mode:
        * `cell_type_list`: List of cell types per pixel
        * `conf_list`: List of whether cell type predictions are confident
        * Additional metrics like `min_score` for advanced users

## Cell Type Weights

We can view the doublet weights for our data:

```{r doubletWeights}
# Cell type mixture 1
assay(results_spe, "weights")[, 1:6]

# Cell type mixture 2
assay(results_spe, "weights")[, 7:12]
```

You can access the full, unrestricted weights using
`assay(results_spe, "weights_full")`. We will visualize these weights later. You
may use `assay(results_spe, "weights_unconfident")` to access the less confident
predictions (e.g., the `second_type` weight for singlets and uncertain
doublets).

## Classifications

Below, we can view the results of RCTD deconvolution:

```{r classification}
classification_df <- data.frame(
    pixel = colnames(assay(results_spe)),
    spot_class = colData(results_spe)$spot_class,
    first_type = colData(results_spe)$first_type,
    second_type = colData(results_spe)$second_type
)
classification_df
```

In particular, this data frame provides us with the following information:

- `spot_class`: A factor variable with the following levels of classification:
    * `singlet` (1 cell type in the pixel)
    * `doublet_certain` (2 cell types in the pixel)
    * `doublet_uncertain` (2 cell types in the pixel, but only confident of 1)
    * `reject` (no prediction given for pixel)

    Typically, `reject` pixels should not be used, and `doublet_uncertain`
    pixels should only be used for applications that do not require knowledge of
    all cell types in a pixel.
- `first_type`: the first cell type predicted on the bead (for all `spot_class`
conditions except `reject`).
- `second_type`: the second cell type predicted on the bead for doublet
`spot_class` conditions (not a confident prediction for `doublet_uncertain`).

We can see that RCTD generally succeeded in identifying the components of the
cell mixtures, although pixels with 90% `ct1` and 10% `ct2` were classified
as singlets. (As an exercise, try changing the `doublet_threshold` parameter
in `runRctd` so that these mixtures will be identified as doublets.)

# Visualization

## All Cell Types

We can visualize the results of our RCTD run with the `plotAllWeights`
function. Simply pass in the RCTD results (and some aesthetic arguments), and
the cell type proportions will be plotted as pie charts for each pixel.

```{r doubletVisualization, fig.height = 5, fig.width = 7, out.width='75%'}
plotAllWeights(results_spe, r = 0.05, lwd = 0, title = "Doublet Results")
```
<br/>
<br/>

Note that all mixtures with 90% `ct1` (on the left) were classified as singlets,
so only `ct1` is plotted. And for the other mixtures (on the right), only cell
types `ct2` and `ct3` are plotted (without any `ct1`) due to the restrictions of
doublet mode. We can visualize the full, unrestricted proportions by passing in
the name of the relevant assay (`"weights_full"`) to the `plotAllWeights`
function:

```{r fullVisualization, fig.height = 5, fig.width = 7, out.width='75%'}
plotAllWeights(
    results_spe, "weights_full", r = 0.05, lwd = 0, title = "Full Results"
)
```
<br/>
<br/>

This is very close to the ground truth!

## Single Cell Type

Finally, we can visualize how the proportion of a specific cell type varies
across space with the `plotCellTypeWeight` function. We show an example with
the dummy cell type `ct1`.

```{r densityVisualization, fig.height = 5, fig.width = 7, out.width='75%'}
plotCellTypeWeight(
    results_spe, "ct1", "weights_full",
    size = 5, stroke = 0.5, title = "\nDensity of Cell Type 1\n"
)
```
<br/>
<br/>

You are now ready to apply RCTD to your own spatial transcriptomics datasets!

## Session Information
```{r sessionInfo}
sessionInfo()
```