--- title: "Find Batch-biased Features in SVGs" package: "BatchSVG" author: - name: "Christine Hou" affiliation: Department of Biostatistics, Johns Hopkins University email: chris2018hou@gmail.com output: BiocStyle::html_document vignette: > %\VignetteIndexEntry{01 Tutorial for spe data object} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` ### Introduction `BatchSVG` is the R/Bioconductor package for spatial transcriptomics data quality control (QC). As the feature-based QC method, the package provides functions to identify the biased features associated with the batch effect(s) (e.g. sample, slide, and sex) in spatially variable genes (SVGs) using binomial deviance model, aiming to develop the downstream clustering performances and remove the technical noises caused by batch effects. The package works with [SpatialExperiment](https://github.com/drighelli/SpatialExperiment) objects. ### Installation (After accepted in [Bioconductor](https://bioconductor.org/)). ```{r install bioc, eval=FALSE} if (!requireNamespace("BiocManager")) { install.packages("BiocManager") } BiocManager::install("BatchSVG") ``` Install the development version from [GitHub](https://christinehou11.github.io/BatchSVG). ```{r install github, eval = FALSE} remotes::install("christinehou11/BatchSVG") ``` ### Biased Feature Identification In this section, we will include the standard workflow for using `BatchSVG` to show how the method help to detect and visualize the biased features in SVGs. ```{r library, message=FALSE} library(BatchSVG) library(spatialLIBD) library(cowplot) ``` #### Data We will use the spatially-resolved transcriptomics (SRT) dataset from the [spatialLIBD](https://research.libd.org/spatialLIBD/) package. ```{r load data, comment=NA, warning=FALSE, message=FALSE} spatialLIBD_spe <- fetch_data(type = "spe") spatialLIBD_spe ``` We will use the spatially variable genes set generated. The result is generated from [nnSVG]((https://www.nature.com/articles/s41467-023-39748-z)) package. ```{r load nnsvg, comment=NA, warning=FALSE, message=FALSE} libd_nnsvgs <- read.csv( system.file("extdata","libd-all_nnSVG_p-05-features-df.csv", package = "BatchSVG"), row.names = 1, check.names = FALSE) ``` #### Perform Feature Selection using `featureSelect()` We will perform feature selection on a subset of spatial transcriptomics data (*input*) using a predefined set of spatially variable genes (*VGs*). Specifically, we will compute the number of standard deviations for the relative change in deviance (**nSD_dev_{batch effect}**) and rank difference (**nSD_rank_{batch effect}**) before and after adjusting for batch effects. The `featureSelect()` function enables feature selection while accounting for multiple batch effects. It returns a **list** of data frames, where each batch effect is associated with a corresponding data frame containing key results, including: - Relative change in deviance before and after batch effect adjustment - Rank differences between the batch-corrected and uncorrected results - Number of standard deviations (nSD) for both relative change in deviance and rank difference We will use the example of applying `featureSelect()` to the sample dataset while adjusting for the batch effect of *subject*. ```{r feature select, comment = NA, warning=FALSE} list_batch_df <- featureSelect(input = spatialLIBD_spe, batch_effect = "subject", VGs = libd_nnsvgs$gene_id) ``` To suppress the message, let `verbose = FALSE`. ```{r, eval=FALSE} list_batch_df <- featureSelect(input = spatialLIBD_spe, batch_effect = "subject", VGs = libd_nnsvgs$gene_id, verbose = FALSE) ``` ```{r feature select class, comment = NA, warning=FALSE} class(list_batch_df) ``` ```{r feature select print, comment = NA, warning=FALSE} head(list_batch_df$subject) ``` #### Visualize SVG Selection Using `svg_nSD` for Batch Effects The `svg_nSD()` function generates visualizations to assess batch effects in spatially variable genes (SVGs). It produces bar charts showing the distribution of SVGs based on relative change in deviance and rank difference, with colors representing different nSD intervals. Additionally, scatter plots compare deviance and rank values with and without batch effects. By interpreting these plots, we can determine appropriate nSD thresholds for filtering biased features. The left panels illustrate the distribution of SVGs in terms of deviance and rank difference, while the right panels compare values before and after accounting for batch effects. ```{r svg, comment=NA, warning=FALSE, message=FALSE} plots <- svg_nSD(list_batch_df = list_batch_df, sd_interval_dev = 3, sd_interval_rank = 3) ``` *Figure 1. Visualizations of nSD_dev and nSD_rank threshold selection* ```{r figure 1, warning=FALSE, message=FALSE, fig.width=10, fig.height=8} plots$subject ``` #### Identify Biased Genes Using `biasDetect()` The function `biasDetect()` is designed to identify and filter out biased genes across different batch effects. Using threshold values selected from the visualization results generated by `svg_nSD()`, this function systematically detects outliers that exceed a specified number of standard deviation (nSD) threshold in either relative deviance change, rank difference, or both. The function outputs visualizations comparing deviance and rank values with and without batch effects. Genes with high deviations, highlighted in color, are identified as potentially biased and can be excluded based on the selected nSD thresholds. The function offers flexibility in customizing the plot aesthetics, allowing users to adjust the data point size (**plot_point_size**), shape (**plot_point_shape**), annotated text size (**plot_text_size**), and data point color palette (**plot_palette**). Default values are provided for these parameters if not specified. Users should refer to [ggplot2](https://ggplot2.tidyverse.org/index.html) aesthetic guidelines to ensure appropriate values are assigned for each parameter. We will use `nSD_dev = 3` and `nSD_rank = 3` as the example. The user should adjust the value based on their dataset features. **Usage of Different Threshold Options** - `threshold = "dev"`: Filters biased genes based only on the relative change in deviance. Genes with deviance changes exceeding the specified `nSD_dev` threshold are identified as batch-affected and can be removed. ```{r bias detect dev, comment = NA, message=FALSE, warning=FALSE} bias_dev <- biasDetect(list_batch_df = list_batch_df, threshold = "dev", nSD_dev = 3) ``` *Table 1. Outlier Genes defined by nSD_dev only* ```{r table 1, comment = NA, message=FALSE, warning=FALSE} head(bias_dev$subject$Table) ``` We can change the data point size using **plot_point_size**. ```{r size change, message=FALSE, warning=FALSE} # size default = 3 bias_dev_size <- biasDetect(list_batch_df = list_batch_df, threshold = "dev", nSD_dev = 3, plot_point_size = 4) ``` *Figure 2. Customize point size* ```{r figure 2, warning=FALSE, message=FALSE, fig.width= 10, fig.height=4} plot_grid(bias_dev$subject$Plot, bias_dev_size$subject$Plot) ``` - `threshold = "rank"`: Identifies biased genes based solely on rank difference. Genes with rank shifts exceeding `nSD_rank` are considered biased. ```{r bias detect rank, comment = NA, message=FALSE, warning=FALSE} bias_rank <- biasDetect(list_batch_df = list_batch_df, threshold = "rank", nSD_rank = 3) ``` *Table 2. Outlier Genes defined by nSD_rank only* ```{r table 2, comment = NA, message=FALSE, warning=FALSE} head(bias_rank$subject$Table) ``` We can change the data point shape using **plot_point_shape**. *Figure 3. Customize point shape* ```{r figure 3, message=FALSE, warning=FALSE, fig.width= 10, fig.height=4} # shape default = 16 bias_rank_shape <- biasDetect(list_batch_df = list_batch_df, threshold = "rank", nSD_rank = 3, plot_point_shape = 2) plot_grid(bias_rank$subject$Plot, bias_rank_shape$subject$Plot) ``` - `threshold = "both"`: Detects biased genes based on both deviance change and rank difference, providing a more stringent filtering approach. ```{r both, comment = NA, message=FALSE, warning=FALSE} bias_both <- biasDetect(list_batch_df = list_batch_df, threshold = "both", nSD_dev = 3, nSD_rank = 3) ``` *Table 3. Outlier Genes defined by nSD_dev and nSD_rank* ```{r table 3, comment = NA, message=FALSE, warning=FALSE} head(bias_both$subject$Table) ``` We can change the data point color using **plot_palette**. The color palette [here](https://r-graph-gallery.com/38-rcolorbrewers-palettes.html) can be referenced on since the function uses `RColorBrewer` to generate colors. *Figure 4. Customize the palette color* ```{r figure 4, message=FALSE, warning=FALSE, fig.width= 10, fig.height=8} # color default = "YlOrRd" bias_both_color <- biasDetect(list_batch_df = list_batch_df, threshold = "both", nSD_dev = 3, nSD_rank = 3, plot_palette = "Greens") plot_grid(bias_both$subject$Plot, bias_both_color$subject$Plot,nrow = 2) ``` We can change the text size using **plot_text_size**. We also specify the color palettes for both batch effects at the same time. *Figure 5. Customize text size and color palette* ```{r figure 5, message=FALSE, warning=FALSE, fig.width= 10, fig.height=8} # text size default = 3 bias_both_color_text <- biasDetect(list_batch_df = list_batch_df, threshold = "both", nSD_dev = 3, nSD_rank = 3, plot_palette = c("Blues"), plot_text_size = 4) plot_grid(bias_both$subject$Plot, bias_both_color_text$subject$Plot,nrow = 2) ``` #### Refine SVGs by Removing Batch-Affected Outliers Finally, we obtain a refined set of spatially variable genes (SVGs) by removing the identified outliers based on user-defined thresholds for `nSD_dev` and `nSD_rank`. Here, we use the results from bias_both, which applied `threshold = "both"` to account for both deviance and rank differences, with the batch effect set to sample ID. ```{r new svgs, comment = NA, message=FALSE, warning=FALSE} bias_both_df <- bias_both$subject$Table svgs_filt <- setdiff(libd_nnsvgs$gene_id, bias_both_df$gene_id) svgs_filt_spe <- libd_nnsvgs[libd_nnsvgs$gene_id %in% svgs_filt, ] nrow(svgs_filt_spe) ``` After obtaining the refined set of SVGs, these genes can be further analyzed using established spatial transcriptomics clustering algorithms to explore tissue layers and spatial organization. ### `R` session information {.unnumbered} ```{r session info} ## Session info sessionInfo() ```