---
title: "Find Batch-biased Features in SVGs"
package: "BatchSVG"
author:
  - name: "Christine Hou"
    affiliation: Department of Biostatistics, Johns Hopkins University
    email: chris2018hou@gmail.com
output: BiocStyle::html_document
vignette: >
  %\VignetteIndexEntry{01 Tutorial for spe data object}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
    collapse = TRUE,
    comment = "#>"
)
```

### Introduction

`BatchSVG` is the R/Bioconductor package for spatial transcriptomics data 
quality control (QC). As the feature-based QC method, the package provides 
functions to identify the biased features associated with the batch effect(s) 
(e.g. sample,  slide, and sex) in spatially variable genes (SVGs) using 
binomial deviance model, aiming to develop the downstream clustering 
performances and remove the technical noises caused by batch effects. The 
package works with 
[SpatialExperiment](https://github.com/drighelli/SpatialExperiment) objects. 

### Installation

(After accepted in [Bioconductor](https://bioconductor.org/)). 

```{r install bioc, eval=FALSE}
if (!requireNamespace("BiocManager")) {
    install.packages("BiocManager")
}
BiocManager::install("BatchSVG")
```

Install the development version from 
[GitHub](https://christinehou11.github.io/BatchSVG).

```{r install github, eval = FALSE}
remotes::install("christinehou11/BatchSVG")
```

### Biased Feature Identification

In this section, we will include the standard workflow for using `BatchSVG` to
show how the method help to detect and visualize the biased features in SVGs. 

```{r library, message=FALSE}
library(BatchSVG)
library(spatialLIBD)
library(cowplot)
```

#### Data

We will use the spatially-resolved transcriptomics (SRT) dataset from the 
[spatialLIBD](https://research.libd.org/spatialLIBD/) package.

```{r load data, comment=NA, warning=FALSE, message=FALSE}
spatialLIBD_spe <- fetch_data(type = "spe")
spatialLIBD_spe
```

We will use the spatially variable genes set generated. The 
result is generated from 
[nnSVG]((https://www.nature.com/articles/s41467-023-39748-z)) package.

```{r load nnsvg, comment=NA, warning=FALSE, message=FALSE}
libd_nnsvgs <- read.csv(
    system.file("extdata","libd-all_nnSVG_p-05-features-df.csv",
        package = "BatchSVG"),
    row.names = 1, check.names = FALSE)
```


#### Perform Feature Selection using `featureSelect()`

We will perform feature selection on a subset of spatial transcriptomics data 
(*input*) using a predefined set of spatially variable genes (*VGs*). 
Specifically, we will compute the number of standard deviations for the relative
change in deviance (**nSD_dev_{batch effect}**) and rank difference 
(**nSD_rank_{batch effect}**) before and after adjusting for batch effects.

The `featureSelect()` function enables feature selection while accounting for 
multiple batch effects. It returns a **list** of data frames, where each batch 
effect is associated with a corresponding data frame containing key results, 
including:

- Relative change in deviance before and after batch effect adjustment

- Rank differences between the batch-corrected and uncorrected results

- Number of standard deviations (nSD) for both relative change in deviance and 
rank difference

We will use the example of applying `featureSelect()` to the sample dataset
while adjusting for the batch effect of *subject*.

```{r feature select, comment = NA, warning=FALSE}
list_batch_df <- featureSelect(input = spatialLIBD_spe, 
    batch_effect = "subject", VGs = libd_nnsvgs$gene_id)
```

To suppress the message, let `verbose = FALSE`.

```{r, eval=FALSE}
list_batch_df <- featureSelect(input = spatialLIBD_spe, 
    batch_effect = "subject", VGs = libd_nnsvgs$gene_id, verbose = FALSE)
```

```{r feature select class, comment = NA, warning=FALSE}
class(list_batch_df)
```

```{r feature select print, comment = NA, warning=FALSE}
head(list_batch_df$subject)
```

#### Visualize SVG Selection Using `svg_nSD` for Batch Effects

The `svg_nSD()` function generates visualizations to assess batch effects in 
spatially variable genes (SVGs). It produces bar charts showing the distribution
of SVGs based on relative change in deviance and rank difference, with colors 
representing different nSD intervals. Additionally, scatter plots compare 
deviance and rank values with and without batch effects.

By interpreting these plots, we can determine appropriate nSD thresholds for
filtering biased features. The left panels illustrate the distribution of SVGs 
in terms of deviance and rank difference, while the right panels compare values 
before and after accounting for batch effects.

```{r svg, comment=NA, warning=FALSE, message=FALSE}
plots <- svg_nSD(list_batch_df = list_batch_df, 
            sd_interval_dev = 3, sd_interval_rank = 3)
```

*Figure 1. Visualizations of nSD_dev and nSD_rank threshold selection*

```{r figure 1, warning=FALSE, message=FALSE, fig.width=10, fig.height=8}
plots$subject
```

#### Identify Biased Genes Using `biasDetect()`

The function `biasDetect()` is designed to identify and filter out biased genes
across different batch effects. Using threshold values selected from
the visualization results generated by `svg_nSD()`, this function systematically
detects outliers that exceed a specified number of standard deviation (nSD) 
threshold in either relative deviance change, rank difference, or both.

The function outputs visualizations comparing deviance and rank values with and 
without batch effects. Genes with high deviations, highlighted in color, are 
identified as potentially biased and can be excluded based on the selected nSD 
thresholds.

The function offers flexibility in customizing the plot aesthetics, 
allowing users to adjust the data point size (**plot_point_size**), 
shape (**plot_point_shape**), annotated text size (**plot_text_size**), and 
data point color palette (**plot_palette**). Default values are provided for 
these parameters if not specified. Users should refer to 
[ggplot2](https://ggplot2.tidyverse.org/index.html) aesthetic guidelines to 
ensure appropriate values are assigned for each parameter.

We will use `nSD_dev = 3` and `nSD_rank = 3` as the example. The user should 
adjust the value based on their dataset features.

**Usage of Different Threshold Options**

- `threshold = "dev"`: Filters biased genes based only on the relative change 
in deviance. Genes with deviance changes exceeding the specified `nSD_dev`
threshold are identified as batch-affected and can be removed.

```{r bias detect dev, comment = NA, message=FALSE, warning=FALSE}
bias_dev <- biasDetect(list_batch_df = list_batch_df, 
    threshold = "dev", nSD_dev = 3)
```

*Table 1. Outlier Genes defined by nSD_dev only*

```{r table 1, comment = NA, message=FALSE, warning=FALSE}
head(bias_dev$subject$Table)
```

We can change the data point size using **plot_point_size**.

```{r size change, message=FALSE, warning=FALSE}
# size default = 3
bias_dev_size <- biasDetect(list_batch_df = list_batch_df, 
    threshold = "dev", nSD_dev = 3, plot_point_size = 4)
```

*Figure 2. Customize point size*

```{r figure 2, warning=FALSE, message=FALSE, fig.width= 10, fig.height=4}
plot_grid(bias_dev$subject$Plot, bias_dev_size$subject$Plot)
```

- `threshold = "rank"`: Identifies biased genes based solely on rank difference.
Genes with rank shifts exceeding `nSD_rank` are considered biased.

```{r bias detect rank, comment = NA, message=FALSE, warning=FALSE}
bias_rank <- biasDetect(list_batch_df = list_batch_df, 
    threshold = "rank", nSD_rank = 3)
```

*Table 2. Outlier Genes defined by nSD_rank only*

```{r table 2, comment = NA, message=FALSE, warning=FALSE}
head(bias_rank$subject$Table)
```

We can change the data point shape using **plot_point_shape**.

*Figure 3. Customize point shape*

```{r figure 3, message=FALSE, warning=FALSE, fig.width= 10, fig.height=4}
# shape default = 16
bias_rank_shape <- biasDetect(list_batch_df = list_batch_df, 
    threshold = "rank", nSD_rank = 3, plot_point_shape = 2)

plot_grid(bias_rank$subject$Plot, bias_rank_shape$subject$Plot)
```

- `threshold = "both"`: Detects biased genes based on both deviance change and 
rank difference, providing a more stringent filtering approach.

```{r both, comment = NA, message=FALSE, warning=FALSE}
bias_both <- biasDetect(list_batch_df = list_batch_df, threshold = "both",
    nSD_dev = 3, nSD_rank = 3)
```

*Table 3. Outlier Genes defined by nSD_dev and nSD_rank*

```{r table 3, comment = NA, message=FALSE, warning=FALSE}
head(bias_both$subject$Table)
```

We can change the data point color using **plot_palette**. The color palette
[here](https://r-graph-gallery.com/38-rcolorbrewers-palettes.html) can be 
referenced on since the function uses `RColorBrewer` to generate colors.

*Figure 4. Customize the palette color*

```{r figure 4, message=FALSE, warning=FALSE, fig.width= 10, fig.height=8}
# color default = "YlOrRd"
bias_both_color <- biasDetect(list_batch_df = list_batch_df, 
    threshold = "both", nSD_dev = 3, nSD_rank = 3, plot_palette = "Greens")

plot_grid(bias_both$subject$Plot, bias_both_color$subject$Plot,nrow = 2)
```

We can change the text size using **plot_text_size**. We also specify
the color palettes for both batch effects at the same time.

*Figure 5. Customize text size and color palette*

```{r figure 5, message=FALSE, warning=FALSE, fig.width= 10, fig.height=8}
# text size default = 3
bias_both_color_text <- biasDetect(list_batch_df = list_batch_df, 
    threshold = "both", nSD_dev = 3, nSD_rank = 3, 
    plot_palette = c("Blues"), plot_text_size = 4)

plot_grid(bias_both$subject$Plot, bias_both_color_text$subject$Plot,nrow = 2)
```

#### Refine SVGs by Removing Batch-Affected Outliers

Finally, we obtain a refined set of spatially variable genes (SVGs) by removing 
the identified outliers based on user-defined thresholds for `nSD_dev` and 
`nSD_rank`.

Here, we use the results from bias_both, which applied `threshold = "both"` to 
account for both deviance and rank differences, with the batch effect set to 
sample ID.

```{r new svgs, comment = NA, message=FALSE, warning=FALSE}
bias_both_df <- bias_both$subject$Table
svgs_filt <- setdiff(libd_nnsvgs$gene_id, bias_both_df$gene_id)
svgs_filt_spe <- libd_nnsvgs[libd_nnsvgs$gene_id %in% svgs_filt, ]
nrow(svgs_filt_spe)
```

After obtaining the refined set of SVGs, these genes can be further analyzed 
using established spatial transcriptomics clustering algorithms to explore 
tissue layers and spatial organization.

### `R` session information {.unnumbered}

```{r session info}
## Session info
sessionInfo()
```