0.0.1 Introduction

BatchSVG is an R/Bioconductor package for quality control (QC) of spatial transcriptomics data. As a feature-based QC method, it provides functions to identify biased features associated with batch effects (e.g., sample, slide, and sex) among spatially variable genes (SVGs) using a binomial deviance model. The aim is to improve downstream clustering performance by removing technical noise caused by batch effects. The package works with SpatialExperiment objects.

0.0.2 Installation

(After acceptance into Bioconductor.)

if (!requireNamespace("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
}
BiocManager::install("BatchSVG")

Install the development version from GitHub.

remotes::install_github("christinehou11/BatchSVG")

0.0.3 Biased Feature Identification

This section walks through the standard BatchSVG workflow and shows how the method helps detect and visualize biased features among SVGs.

library(BatchSVG)
# library(humanHippocampus2024)
library(ExperimentHub)
library(SpatialExperiment)
library(SummarizedExperiment)
library(tidyr)
library(dplyr)
library(tibble)
library(cowplot)

0.0.3.1 Data

We will use a spatially resolved transcriptomics (SRT) dataset of adjacent tissue sections from the anterior human hippocampus of ten adult neurotypical donors. The dataset comes from the humanHippocampus2024 package, which is currently in the development version of Bioconductor (3.21). It is a SpatialExperiment object generated and processed in the spatial_HPC project. See the humanHippocampus2024 package documentation for more details.

(The code for accessing the spe dataset in the humanHippocampus2024 package will be updated after the official release of Bioconductor 3.21.)

ehub <- ExperimentHub()

# Load the datasets of the package
# myfiles <- query(ehub, "humanHippocampus2024")
# Resulting humanHippocampus2024 datasets from ExperimentHub query
# myfiles
# ExperimentHub with 2 records
# # snapshotDate(): 2024-10-24
# # $dataprovider: Lieber Institute for Brain Development (LIBD)
# # $species: Homo sapiens
# # $rdataclass: SpatialExperiment, SingleCellExperiment
# # additional mcols(): taxonomyid, genome, description,
# #   coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
# #   rdatapath, sourceurl, sourcetype 
# # retrieve records with, e.g., 'object[["EH9605"]]' 
# 
#            title
#   EH9605 | spe  
#   EH9606 | sce
#   
# spe <- myfiles[["EH9605"]]

spe <- ehub[["EH9605"]]
spe
class: SpatialExperiment 
dim: 31483 150917 
metadata(1): Obtained_from
assays(2): counts logcounts
rownames(31483): MIR1302-2HG AL627309.1 ... AC007325.4 AC007325.2
rowData names(7): source type ... gene_type gene_search
colnames(150917): AAACAACGAATAGTTC-1_V10B01-086_D1
  AAACAAGTATCTCCCA-1_V10B01-086_D1 ... TTGTTTCCATACAACT-1_Br2720_B1
  TTGTTTGTATTACACG-1_Br2720_B1
colData names(150): sample_id in_tissue ... nmf99 nmf100
reducedDimNames(3): 10x_pca 10x_tsne 10x_umap
mainExpName: NULL
altExpNames(0):
spatialCoords names(2) : pxl_col_in_fullres pxl_row_in_fullres
imgData names(4): sample_id image_id data scaleFactor

We will use the set of spatially variable genes generated in the spatial_HPC project. These results were produced with the nnSVG package.

We will select four samples from the raw data as an example:

  • V11L05-333_B1

  • V11L05-333_D1

  • V11L05-335_D1

  • V11L05-336_A1

fix_order <- distinct(
    as.data.frame(colData(spe)), slide, array, brnum, sample_id, 
    position, sex) %>% 
    arrange(slide, array)
sub4 <- fix_order$sample_id[c(14, 16, 20, 21)]

spe_sub4 <- spe[,spe$sample_id %in% sub4]
spe_sub4 # 31483, 18945
class: SpatialExperiment 
dim: 31483 18945 
metadata(1): Obtained_from
assays(2): counts logcounts
rownames(31483): MIR1302-2HG AL627309.1 ... AC007325.4 AC007325.2
rowData names(7): source type ... gene_type gene_search
colnames(18945): AAACAACGAATAGTTC-1_V11L05-333_B1
  AAACAAGTATCTCCCA-1_V11L05-333_B1 ... TTGTTTGTATTACACG-1_V11L05-336_A1
  TTGTTTGTGTAAATTC-1_V11L05-336_A1
colData names(150): sample_id in_tissue ... nmf99 nmf100
reducedDimNames(3): 10x_pca 10x_tsne 10x_umap
mainExpName: NULL
altExpNames(0):
spatialCoords names(2) : pxl_col_in_fullres pxl_row_in_fullres
imgData names(4): sample_id image_id data scaleFactor

We then restrict the selection to features ranked in the top 2,000 (rank ≤ 2000) within a sample, and keep only genes that meet this cutoff in more than one sample (n > 1).

After applying these criteria, we obtain 2,082 spatially variable genes across the four samples.

# res_ranks: SVG results with rank values (genes in rows, samples in columns)
res_df_sub <- pivot_longer(
    rownames_to_column(as.data.frame(res_ranks), var = "gene_id"), 
    colnames(res_ranks), 
    names_to = "sample_id", 
    values_to = "rank", 
    values_drop_na = TRUE)
    
res_df_sub <- filter(res_df_sub,
    sample_id %in% 
        c("V11L05-333_B1", "V11L05-333_D1", "V11L05-335_D1", "V11L05-336_A1"), 
    rank <= 2000) # top 2k sig features
    
svgs_sub4 <- group_by(res_df_sub, gene_id) |>
    tally() |> 
    filter(n>1)
nrow(svgs_sub4)
[1] 2082
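
The same filtering logic can be seen on a small scale without the tidyverse. The following is a minimal base R sketch; the toy rank matrix, the gene and sample names, and the cutoff of 4 (standing in for 2000) are all made up for illustration.

```r
# Toy rank matrix: rows are genes, columns are samples. NA means the gene
# was not ranked in that sample. All values are hypothetical.
toy_ranks <- matrix(
    c( 1,  3, NA,
       2, NA, NA,
       5,  1,  2,
      NA,  4,  9),
    nrow = 4, byrow = TRUE,
    dimnames = list(c("g1", "g2", "g3", "g4"), c("s1", "s2", "s3")))

max_rank <- 4  # stands in for the rank <= 2000 cutoff

# TRUE where a gene is ranked within the cutoff in a given sample
in_top <- !is.na(toy_ranks) & toy_ranks <= max_rank

# Number of samples in which each gene passes the cutoff
n_samples <- rowSums(in_top)

# Keep genes that pass the cutoff in more than one sample (n > 1)
kept <- names(n_samples)[n_samples > 1]
kept
```

Here g1 and g3 are kept because each is within the cutoff in two samples, while g2 and g4 pass in only one.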

0.0.3.2 Perform Feature Selection using featureSelect()

We will perform feature selection on a subset of the spatial transcriptomics data (input) using a predefined set of spatially variable genes (VGs). Specifically, for each batch effect we compute the relative change in deviance and the rank difference before and after adjusting for that batch effect, and express each as a number of standard deviations (nSD_dev_{batch effect} and nSD_rank_{batch effect}).

The featureSelect() function enables feature selection while accounting for multiple batch effects. It returns a list of data frames, where each batch effect is associated with a corresponding data frame containing key results, including:

  • Relative change in deviance before and after batch effect adjustment

  • Rank differences between the batch-corrected and uncorrected results

  • Number of standard deviations (nSD) for both relative change in deviance and rank difference
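
To make the "relative change in deviance" concrete, the following is a minimal base R sketch of a per-gene binomial deviance, computed once with a single null model and once with a separate null model per batch. It assumes the standard binomial deviance for count data and assumes that batch adjustment means fitting the null within each batch and summing the results. The function names and toy counts are hypothetical, and none of this is taken from the BatchSVG source.

```r
# Binomial deviance of one gene under a constant-proportion null model.
# y: counts of the gene per spot; n: total counts per spot.
binomial_deviance <- function(y, n) {
    pi_hat <- sum(y) / sum(n)  # pooled proportion under the null
    # x * log(ratio), using the convention 0 * log(0) = 0
    xlogr <- function(x, ratio) ifelse(x > 0, x * log(ratio), 0)
    2 * sum(xlogr(y, y / (n * pi_hat)) +
            xlogr(n - y, (n - y) / (n * (1 - pi_hat))))
}

# Assumed batch-aware variant: fit the null within each batch, then sum.
binomial_deviance_batch <- function(y, n, batch) {
    idx <- split(seq_along(y), batch)
    sum(vapply(idx, function(i) binomial_deviance(y[i], n[i]), numeric(1)))
}

# Toy data (made up): one gene, eight spots, expression differs by batch.
y <- c(0, 1, 2, 1, 8, 9, 7, 10)
n <- rep(100, 8)
batch <- rep(c("A", "B"), each = 4)

dev_default <- binomial_deviance(y, n)
dev_batch <- binomial_deviance_batch(y, n, batch)

# Relative change in deviance after accounting for batch
d_diff <- (dev_default - dev_batch) / dev_batch
d_diff
```

A gene whose variation is mostly explained by batch loses much of its deviance once the null model is fitted per batch, so its relative change is large.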

As an example, we apply featureSelect() to the four-sample dataset while adjusting for two batch effects, sample_id and sex.

spe_sub4 <- spe_sub4[rowData(spe_sub4)$gene_id %in% svgs_sub4$gene_id,]
rownames(spe_sub4) <- rowData(spe_sub4)$gene_id

SVGs <- svgs_sub4$gene_id
list_batch_df <- featureSelect(input = spe_sub4, 
    batch_effect = c("sample_id", "sex"), VGs = SVGs)
Running feature selection without batch...
Batch Effect: sample_id
Running feature selection without batch...
Calculating deviance and rank difference...
Batch Effect: sex
Running feature selection without batch...
Calculating deviance and rank difference...
class(list_batch_df)
[1] "list"
head(list_batch_df$sample_id)
          gene_id gene_name dev_default rank_default dev_sample_id
1 ENSG00000131584     ACAP3    16125.31         1262      15900.14
2 ENSG00000175756  AURKAIP1    17344.09         1060      17167.86
3 ENSG00000242485    MRPL20    17629.33         1023      17517.05
4 ENSG00000179403      VWA1    12860.93         1726      12825.66
5 ENSG00000160075     SSU72    16145.20         1255      16136.31
6 ENSG00000078369      GNB1    22402.83          516      22271.32
  rank_sample_id       d_diff nSD_dev_sample_id r_diff nSD_rank_sample_id
1           1269 0.0141612453       -0.09513109      7         0.16380252
2           1058 0.0102651525       -0.16945410     -2        -0.04680072
3           1004 0.0064098269       -0.24299943    -19        -0.44460683
4           1702 0.0027493688       -0.31282741    -24        -0.56160863
5           1220 0.0005506572       -0.35477067    -35        -0.81901258
6            497 0.0059049925       -0.25262980    -19        -0.44460683
head(list_batch_df$sex)
          gene_id gene_name dev_default rank_default  dev_sex rank_sex
1 ENSG00000131584     ACAP3    16125.31         1262 16118.48     1250
2 ENSG00000175756  AURKAIP1    17344.09         1060 17247.44     1064
3 ENSG00000242485    MRPL20    17629.33         1023 17585.70     1013
4 ENSG00000179403      VWA1    12860.93         1726 12860.90     1709
5 ENSG00000160075     SSU72    16145.20         1255 16141.12     1243
6 ENSG00000078369      GNB1    22402.83          516 22314.17      509
        d_diff nSD_dev_sex r_diff nSD_rank_sex
1 4.234208e-04  -0.2615600    -12   -0.3080188
2 5.603690e-03  -0.1013769      4    0.1026729
3 2.480783e-03  -0.1979427    -10   -0.2566824
4 1.811106e-06  -0.2745969    -17   -0.4363600
5 2.527558e-04  -0.2668373    -12   -0.3080188
6 3.973515e-03  -0.1517848     -7   -0.1796776
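
The d_diff and r_diff columns can be reproduced from the other columns. The sketch below uses the six sample_id rows printed above and assumes d_diff = (dev_default - dev_batch) / dev_batch and r_diff = rank_batch - rank_default. These formulas are inferred from the printed values, not taken from the package source.

```r
# Values copied from head(list_batch_df$sample_id) above
dev_default    <- c(16125.31, 17344.09, 17629.33, 12860.93, 16145.20, 22402.83)
dev_sample_id  <- c(15900.14, 17167.86, 17517.05, 12825.66, 16136.31, 22271.32)
rank_default   <- c(1262, 1060, 1023, 1726, 1255, 516)
rank_sample_id <- c(1269, 1058, 1004, 1702, 1220, 497)
nSD_rank       <- c(0.16380252, -0.04680072, -0.44460683,
                    -0.56160863, -0.81901258, -0.44460683)

# Assumed formulas (inferred from the printed output)
d_diff <- (dev_default - dev_sample_id) / dev_sample_id
r_diff <- rank_sample_id - rank_default

d_diff  # compare with the printed d_diff column (up to rounding)
r_diff  # compare with the printed r_diff column

# If nSD_rank is r_diff scaled by one standard deviation computed over all
# genes, the ratio below should be the same constant in every row.
implied_sd <- r_diff / nSD_rank
implied_sd
```

Because the standard deviation is taken over all SVGs rather than these six rows, the nSD columns themselves cannot be recomputed from the head of the table alone.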

0.0.3.3 Visualize SVG Selection Using svg_nSD() for Batch Effects

The svg_nSD() function generates visualizations to assess batch effects in spatially variable genes (SVGs). It produces bar charts showing the distribution of SVGs based on relative change in deviance and rank difference, with colors representing different nSD intervals. Additionally, scatter plots compare deviance and rank values with and without batch effects.

By interpreting these plots, we can determine appropriate nSD thresholds for filtering biased features. The left panels illustrate the distribution of SVGs in terms of deviance and rank difference, while the right panels compare values before and after accounting for batch effects.

plots <- svg_nSD(list_batch_df = list_batch_df, 
                sd_interval_dev = c(5,4), sd_interval_rank = c(4,6))
plots$sample_id

plots$sex

We can also apply svg_nSD() to a single batch effect. Note that the function requires its input to be a list of data frames even when only one batch effect is analyzed, so we subset with single brackets (list_batch_df[1]) to keep the list structure.

plots <- svg_nSD(list_batch_df = list_batch_df[1], 
                sd_interval_dev = 5, sd_interval_rank = 7)
plots$sample_id
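
The nSD intervals used for coloring can be thought of as fixed-width bins. The following is a minimal base R sketch, assuming left-closed bins that start at 0 and have a width equal to the chosen interval. This matches the labels in the nSD_bin_dev column printed in the next section, but the binning code inside BatchSVG is not shown in this vignette, so treat it as an assumption. The values are copied from that later output, where the interval width is 7.

```r
# nSD values copied from head(bias_dev$sample_id$Table) in the next section
nSD_dev <- c(7.669653, 8.138234, 10.287758, 33.629502, 7.903030, 13.485664)
sd_interval <- 7  # bin width, in number of standard deviations

# Upper limit: the smallest multiple of the interval covering the maximum
upper <- ceiling(max(nSD_dev) / sd_interval) * sd_interval
breaks <- seq(0, upper, by = sd_interval)

# Left-closed bins; include.lowest = TRUE closes the last bin on the right
bins <- cut(nSD_dev, breaks = breaks, right = FALSE, include.lowest = TRUE)
as.character(bins)
```

Negative nSD values would fall outside these breaks and become NA in this sketch, so a real implementation would need to handle them separately.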

0.0.3.4 Identify Biased Genes Using biasDetect()

The function biasDetect() is designed to identify and filter out biased genes across different batch effects. Using threshold values selected from the visualization results generated by svg_nSD(), this function systematically detects outliers that exceed a specified number-of-standard-deviations (nSD) threshold in relative deviance change, rank difference, or both.

The function outputs visualizations comparing deviance and rank values with and without batch effects. Genes with high deviations, highlighted in color, are identified as potentially biased and can be excluded based on the selected nSD thresholds.

The function offers flexibility in customizing the plot aesthetics, allowing users to adjust the data point size (plot_point_size), shape (plot_point_shape), annotated text size (plot_text_size), and data point color palette (plot_pallete). Default values are used for any of these parameters that are not specified. Users should refer to the ggplot2 aesthetic guidelines to ensure appropriate values are assigned to each parameter.

We will use nSD_dev = 7 and nSD_rank = 6 as an example. Users should adjust these values based on the characteristics of their own dataset.

Usage of Different Threshold Options

  • threshold = "dev": Filters biased genes based only on the relative change in deviance. Genes with deviance changes exceeding the specified nSD_dev threshold are identified as batch-affected and can be removed.
bias_dev <- biasDetect(list_batch_df = list_batch_df, 
    threshold = "dev", nSD_dev = 7)
head(bias_dev$sample_id$Table)
          gene_id gene_name dev_default rank_default dev_sample_id
1 ENSG00000174576     NPAS4    35003.31          125     24629.414
2 ENSG00000123358     NR4A1    23299.81          457     16115.928
3 ENSG00000170345       FOS    42305.65           73     27146.089
4 ENSG00000256618  MTRNR2L1    69206.34           28     24876.086
5 ENSG00000118271       TTR  4719046.58            1   3292127.945
6 ENSG00000229807      XIST    15223.50         1408      8819.689
  rank_sample_id    d_diff nSD_dev_sample_id r_diff nSD_rank_sample_id
1            363 0.4211996          7.669653    238           5.569286
2           1226 0.4457631          8.138234    769          17.994876
3            263 0.5584435         10.287758    190           4.446068
4            351 1.7820430         33.629502    323           7.558316
5              1 0.4334335          7.903030      0           0.000000
6           2050 0.7260812         13.485664    642          15.023031
  nSD_bin_dev dev_outlier
1      [7,14)        TRUE
2      [7,14)        TRUE
3      [7,14)        TRUE
4     [28,35]        TRUE
5      [7,14)        TRUE
6      [7,14)        TRUE
bias_dev$sample_id$Plot

We can change the data point size using plot_point_size.

# size default = 3
bias_dev_size <- biasDetect(list_batch_df = list_batch_df, 
    threshold = "dev", nSD_dev = 7, plot_point_size = c(2,4))

plot_grid(bias_dev_size$sample_id$Plot,bias_dev_size$sex$Plot)

  • threshold = "rank": Identifies biased genes based solely on rank difference. Genes with rank shifts exceeding nSD_rank are considered biased.
bias_rank <- biasDetect(list_batch_df = list_batch_df, 
    threshold = "rank", nSD_rank = 6)
head(bias_rank$sex$Table)
          gene_id gene_name dev_default rank_default  dev_sex rank_sex
1 ENSG00000159388      BTG2    18311.28          926 14257.70     1543
2 ENSG00000135625      EGR4    20336.84          705 17851.54      972
3 ENSG00000120738      EGR1    19882.54          752 17444.96     1037
4 ENSG00000120129     DUSP1    25054.85          365 19007.41      834
5 ENSG00000204388    HSPA1B    19085.73          841 16440.69     1197
6 ENSG00000130222   GADD45G    16565.74         1191 13815.91     1592
     d_diff nSD_dev_sex r_diff nSD_rank_sex nSD_bin_rank rank_outlier
1 0.2843081    8.516661    617    15.837301      [12,18)         TRUE
2 0.1392207    4.030301    267     6.853419       [6,12)         TRUE
3 0.1397297    4.046039    285     7.315447       [6,12)         TRUE
4 0.3181621    9.563487    469    12.038402      [12,18)         TRUE
5 0.1608836    4.700155    356     9.137892       [6,12)         TRUE
6 0.1990332    5.879808    401    10.292962       [6,12)         TRUE
bias_rank$sex$Plot

We can change the data point shape using plot_point_shape.

# shape default = 16
bias_rank_shape <- biasDetect(list_batch_df = list_batch_df, 
    threshold = "rank", nSD_rank = 6, plot_point_shape = c(2, 18))

plot_grid(bias_rank_shape$sample_id$Plot,bias_rank_shape$sex$Plot)

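  • threshold = "both": Detects biased genes using both deviance change and rank difference. In the example output below, the table includes genes flagged by either criterion (see the dev_outlier and rank_outlier columns), which makes this the most stringent filtering option.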
bias_both <- biasDetect(list_batch_df = list_batch_df, threshold = "both",
    nSD_dev = 7, nSD_rank = 6)
bias_both$sample_id$Plot

head(bias_both$sex$Table)
          gene_id gene_name dev_default rank_default   dev_sex rank_sex
1 ENSG00000173110     HSPA6    9887.942         2011  7867.831     2074
2 ENSG00000159388      BTG2   18311.285          926 14257.704     1543
3 ENSG00000135625      EGR4   20336.842          705 17851.538      972
4 ENSG00000120738      EGR1   19882.541          752 17444.962     1037
5 ENSG00000120129     DUSP1   25054.848          365 19007.410      834
6 ENSG00000204389    HSPA1A   52523.899           47 41069.968       75
     d_diff nSD_dev_sex r_diff nSD_rank_sex nSD_bin_dev dev_outlier
1 0.2567557    7.664691     63    1.6170988      [7,14)        TRUE
2 0.2843081    8.516661    617   15.8373012      [7,14)        TRUE
3 0.1392207    4.030301    267    6.8534188       [0,7)       FALSE
4 0.1397297    4.046039    285    7.3154471       [0,7)       FALSE
5 0.3181621    9.563487    469   12.0384024      [7,14)        TRUE
6 0.2788882    8.349068     28    0.7187106      [7,14)        TRUE
  nSD_bin_rank rank_outlier
1        [0,6)        FALSE
2      [12,18)         TRUE
3       [6,12)         TRUE
4       [6,12)         TRUE
5      [12,18)         TRUE
6        [0,6)        FALSE

We can change the data point color using plot_pallete. Palette names follow RColorBrewer, which the function uses to generate colors.

# color default = "YlOrRd"
bias_both_color <- biasDetect(list_batch_df = list_batch_df, 
    threshold = "both", nSD_dev = 7, nSD_rank = 6, plot_pallete = "Greens")

plot_grid(bias_both_color$sample_id$Plot,bias_both_color$sex$Plot,nrow = 2)

We can change the text size using plot_text_size. We can also specify a different color palette for each batch effect at the same time.

# text size default = 3
bias_both_color_text <- biasDetect(list_batch_df = list_batch_df, 
    threshold = "both", nSD_dev = 7, nSD_rank = 6, 
    plot_pallete = c("Blues","Greens"), plot_text_size = c(2,4))

plot_grid(bias_both_color_text$sample_id$Plot,
    bias_both_color_text$sex$Plot,nrow = 2)
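
The three threshold options can be summarized with a few logical vectors. The sketch below uses the six rows of head(bias_both$sex$Table) printed above and assumes a gene is flagged when its nSD value is greater than or equal to the threshold. The comparison operator and the helper flag_genes() are assumptions made for illustration, not the package's own code.

```r
# Values copied from head(bias_both$sex$Table) above
nSD_dev_sex  <- c(7.664691, 8.516661, 4.030301, 4.046039, 9.563487, 8.349068)
nSD_rank_sex <- c(1.6170988, 15.8373012, 6.8534188,
                  7.3154471, 12.0384024, 0.7187106)

nSD_dev <- 7   # deviance threshold used in this vignette
nSD_rank <- 6  # rank threshold used in this vignette

dev_outlier  <- nSD_dev_sex >= nSD_dev
rank_outlier <- nSD_rank_sex >= nSD_rank

# Hypothetical helper: which genes each threshold option would flag
flag_genes <- function(threshold = c("dev", "rank", "both")) {
    threshold <- match.arg(threshold)
    switch(threshold,
        dev  = dev_outlier,
        rank = rank_outlier,
        both = dev_outlier | rank_outlier)  # flagged by either criterion
}

dev_outlier           # compare with the printed dev_outlier column
rank_outlier          # compare with the printed rank_outlier column
flag_genes("both")    # all six printed rows are flagged by at least one
```

Under this reading, "both" returns the union of the two outlier sets, which is why rows with only one TRUE flag still appear in the printed table.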

0.0.3.5 Refine SVGs by Removing Batch-Affected Outliers

Finally, we obtain a refined set of spatially variable genes (SVGs) by removing the identified outliers based on user-defined thresholds for nSD_dev and nSD_rank.

Here, we use the results from bias_both, which applied threshold = "both" to account for both deviance and rank differences, and take the table for the sample_id batch effect.

bias_both_df <- bias_both$sample_id$Table
svgs_filt <- setdiff(svgs_sub4$gene_id, bias_both_df$gene_id)
svgs_sub4_filt <- svgs_sub4[svgs_sub4$gene_id %in% svgs_filt, ]
nrow(svgs_sub4_filt)
[1] 2067
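
The step above removes only the genes flagged for sample_id. If genes flagged for any batch effect should be removed, the per-batch tables can be pooled first. The following is a minimal sketch, assuming the layout shown above (one list element per batch effect, each holding a Table data frame with a gene_id column). The toy object and gene names are hypothetical.

```r
# Toy stand-in for the biasDetect() result (hypothetical gene IDs)
bias_toy <- list(
    sample_id = list(Table = data.frame(gene_id = c("geneA", "geneB"))),
    sex       = list(Table = data.frame(gene_id = c("geneB", "geneD"))))

# Toy stand-in for the full SVG set
svgs_toy <- c("geneA", "geneB", "geneC", "geneD", "geneE")

# Pool the flagged genes across all batch effects, dropping duplicates
biased_any <- unique(unlist(
    lapply(bias_toy, function(x) x$Table$gene_id),
    use.names = FALSE))

# Keep only SVGs that were not flagged under any batch effect
svgs_toy_filt <- setdiff(svgs_toy, biased_any)
svgs_toy_filt
```

Whether to pool across batch effects or filter on one of them is a design choice that depends on which batch variables are considered technical in a given study.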

After obtaining the refined set of SVGs, these genes can be further analyzed using established spatial transcriptomics clustering algorithms to explore tissue layers and spatial organization.

R session information

## Session info
sessionInfo()
#> R Under development (unstable) (2025-02-19 r87757)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.2 LTS
#> 
#> Matrix products: default
#> BLAS:   /home/biocbuild/bbs-3.21-bioc/R/lib/libRblas.so 
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0  LAPACK version 3.12.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_GB              LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: America/New_York
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats4    stats     graphics  grDevices utils     datasets  methods  
#> [8] base     
#> 
#> other attached packages:
#>  [1] humanHippocampus2024_0.99.8 cowplot_1.1.3              
#>  [3] tibble_3.2.1                dplyr_1.1.4                
#>  [5] tidyr_1.3.1                 SpatialExperiment_1.17.0   
#>  [7] SingleCellExperiment_1.29.1 SummarizedExperiment_1.37.0
#>  [9] Biobase_2.67.0              GenomicRanges_1.59.1       
#> [11] GenomeInfoDb_1.43.4         IRanges_2.41.3             
#> [13] S4Vectors_0.45.4            MatrixGenerics_1.19.1      
#> [15] matrixStats_1.5.0           ExperimentHub_2.15.0       
#> [17] AnnotationHub_3.15.0        BiocFileCache_2.15.1       
#> [19] dbplyr_2.5.0                BiocGenerics_0.53.6        
#> [21] generics_0.1.3              BatchSVG_0.99.2            
#> [23] BiocStyle_2.35.0           
#> 
#> loaded via a namespace (and not attached):
#>  [1] tidyselect_1.2.1        farver_2.1.2            blob_1.2.4             
#>  [4] filelock_1.0.3          Biostrings_2.75.4       fastmap_1.2.0          
#>  [7] digest_0.6.37           rsvd_1.0.5              mime_0.12              
#> [10] lifecycle_1.0.4         KEGGREST_1.47.0         RSQLite_2.3.9          
#> [13] magrittr_2.0.3          compiler_4.5.0          rlang_1.1.5            
#> [16] sass_0.4.9              tools_4.5.0             yaml_2.3.10            
#> [19] knitr_1.49              labeling_0.4.3          S4Arrays_1.7.3         
#> [22] bit_4.6.0               curl_6.2.1              DelayedArray_0.33.6    
#> [25] RColorBrewer_1.1-3      abind_1.4-8             BiocParallel_1.41.2    
#> [28] withr_3.0.2             purrr_1.0.4             grid_4.5.0             
#> [31] beachmat_2.23.6         colorspace_2.1-1        ggplot2_3.5.1          
#> [34] scales_1.3.0            tinytex_0.56            cli_3.6.4              
#> [37] rmarkdown_2.29          crayon_1.5.3            rjson_0.2.23           
#> [40] httr_1.4.7              DBI_1.2.3               cachem_1.1.0           
#> [43] parallel_4.5.0          AnnotationDbi_1.69.0    BiocManager_1.30.25    
#> [46] XVector_0.47.2          vctrs_0.6.5             Matrix_1.7-2           
#> [49] jsonlite_1.9.1          bookdown_0.42           BiocSingular_1.23.0    
#> [52] bit64_4.6.0-1           ggrepel_0.9.6           scry_1.19.0            
#> [55] irlba_2.3.5.1           magick_2.8.5            jquerylib_0.1.4        
#> [58] glue_1.8.0              codetools_0.2-20        gtable_0.3.6           
#> [61] BiocVersion_3.21.1      UCSC.utils_1.3.1        ScaledMatrix_1.15.0    
#> [64] munsell_0.5.1           pillar_1.10.1           rappdirs_0.3.3         
#> [67] htmltools_0.5.8.1       GenomeInfoDbData_1.2.13 R6_2.6.1               
#> [70] evaluate_1.0.3          lattice_0.22-6          png_0.1-8              
#> [73] memoise_2.0.1           bslib_0.9.0             Rcpp_1.0.14            
#> [76] SparseArray_1.7.6       xfun_0.51               pkgconfig_2.0.3