BatchSVG 0.99.2
BatchSVG is an R/Bioconductor package for quality control (QC) of spatial
transcriptomics data. As a feature-based QC method, the package provides
functions to identify biased features associated with batch effects
(e.g. sample, slide, and sex) among spatially variable genes (SVGs) using a
binomial deviance model, aiming to improve downstream clustering
performance and remove technical noise caused by batch effects. The
package works with SpatialExperiment objects.
(After acceptance into Bioconductor.)
if (!requireNamespace("BiocManager")) {
install.packages("BiocManager")
}
BiocManager::install("BatchSVG")
Install the development version from GitHub.
remotes::install_github("christinehou11/BatchSVG")
In this section, we walk through the standard BatchSVG workflow to show how
the method helps to detect and visualize biased features among SVGs.
library(BatchSVG)
# library(humanHippocampus2024)
library(ExperimentHub)
library(SpatialExperiment)
library(SummarizedExperiment)
library(tidyr)
library(dplyr)
library(tibble)
library(cowplot)
We will use the spatially resolved transcriptomics (SRT) dataset from
adjacent tissue sections of the anterior human hippocampus across ten adult
neurotypical donors. The dataset is obtained from the humanHippocampus2024
package, which is currently in the development version on Bioconductor 3.21.
It is the SpatialExperiment object generated and processed in the spatial_HPC
project. Please refer to the humanHippocampus2024 package documentation for
more about this data package.
(The code to access the spe dataset in the humanHippocampus2024 package will
be updated after the official release of Bioconductor 3.21.)
ehub <- ExperimentHub()
# Load the datasets of the package
# myfiles <- query(ehub, "humanHippocampus2024")
# Resulting humanHippocampus2024 datasets from ExperimentHub query
# myfiles
# ExperimentHub with 2 records
# # snapshotDate(): 2024-10-24
# # $dataprovider: Lieber Institute for Brain Development (LIBD)
# # $species: Homo sapiens
# # $rdataclass: SpatialExperiment, SingleCellExperiment
# # additional mcols(): taxonomyid, genome, description,
# # coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
# # rdatapath, sourceurl, sourcetype
# # retrieve records with, e.g., 'object[["EH9605"]]'
#
# title
# EH9605 | spe
# EH9606 | sce
#
# spe <- myfiles[["EH9605"]]
spe <- ehub[["EH9605"]]
spe
class: SpatialExperiment
dim: 31483 150917
metadata(1): Obtained_from
assays(2): counts logcounts
rownames(31483): MIR1302-2HG AL627309.1 ... AC007325.4 AC007325.2
rowData names(7): source type ... gene_type gene_search
colnames(150917): AAACAACGAATAGTTC-1_V10B01-086_D1
AAACAAGTATCTCCCA-1_V10B01-086_D1 ... TTGTTTCCATACAACT-1_Br2720_B1
TTGTTTGTATTACACG-1_Br2720_B1
colData names(150): sample_id in_tissue ... nmf99 nmf100
reducedDimNames(3): 10x_pca 10x_tsne 10x_umap
mainExpName: NULL
altExpNames(0):
spatialCoords names(2) : pxl_col_in_fullres pxl_row_in_fullres
imgData names(4): sample_id image_id data scaleFactor
We will use the set of spatially variable genes generated in the spatial_HPC project with the nnSVG package.
We will select four samples from the raw data as an example:
V11L05-333_B1
V11L05-333_D1
V11L05-335_D1
V11L05-336_A1
fix_order <- distinct(
as.data.frame(colData(spe)), slide, array, brnum, sample_id,
position, sex) %>%
arrange(slide, array)
sub4 <- fix_order$sample_id[c(14,16, 20,21)]
spe_sub4 <- spe[,spe$sample_id %in% sub4]
spe_sub4 # 31483, 18945
class: SpatialExperiment
dim: 31483 18945
metadata(1): Obtained_from
assays(2): counts logcounts
rownames(31483): MIR1302-2HG AL627309.1 ... AC007325.4 AC007325.2
rowData names(7): source type ... gene_type gene_search
colnames(18945): AAACAACGAATAGTTC-1_V11L05-333_B1
AAACAAGTATCTCCCA-1_V11L05-333_B1 ... TTGTTTGTATTACACG-1_V11L05-336_A1
TTGTTTGTGTAAATTC-1_V11L05-336_A1
colData names(150): sample_id in_tissue ... nmf99 nmf100
reducedDimNames(3): 10x_pca 10x_tsne 10x_umap
mainExpName: NULL
altExpNames(0):
spatialCoords names(2) : pxl_col_in_fullres pxl_row_in_fullres
imgData names(4): sample_id image_id data scaleFactor
We will refine our selection to include only the top 2,000 ranked features in each sample (rank ≤ 2000) and only genes that meet this cutoff in more than one sample (n > 1).
After applying these criteria, we obtain 2,082 spatially variable genes across the four samples.
# res_ranks: SVGs results with rank values
res_df_sub <- pivot_longer(
rownames_to_column(as.data.frame(res_ranks), var = "gene_id"),
colnames(res_ranks),
names_to="sample_id",
values_to="rank",
values_drop_na=TRUE)
res_df_sub <- filter(res_df_sub,
sample_id %in%
c("V11L05-333_B1", "V11L05-333_D1", "V11L05-335_D1", "V11L05-336_A1"),
rank <= 2000) # top 2k sig features
svgs_sub4 <- group_by(res_df_sub, gene_id) |>
tally() |>
filter(n>1)
nrow(svgs_sub4)
[1] 2082
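Because res_ranks is not constructed in this vignette, the selection rule may be easier to follow on a toy matrix. The sketch below uses base R only; toy_ranks, the gene and sample names, and the cutoff of 2 are all made up for illustration. It keeps genes that rank within the cutoff in more than one sample, mirroring the rank <= 2000 and n > 1 criteria.

```r
# Hypothetical rank matrix: rows are genes, columns are samples,
# NA means the gene has no rank in that sample.
toy_ranks <- matrix(
  c(1,  2, NA,
    2,  5,  1,
    4,  1,  2,
    3, NA,  5,
    5,  3,  4),
  nrow = 5, byrow = TRUE,
  dimnames = list(paste0("gene", 1:5), c("s1", "s2", "s3"))
)

top_k <- 2  # stands in for the cutoff of 2,000 used above

# Per gene, count the samples in which it ranks within the cutoff.
# NA ranks are ignored, similar to values_drop_na = TRUE in pivot_longer().
n_top <- rowSums(toy_ranks <= top_k, na.rm = TRUE)

# Keep genes that reach the cutoff in more than one sample
toy_svgs <- names(n_top)[n_top > 1]
toy_svgs
```

Here gene1, gene2, and gene3 each reach the cutoff in two samples and are kept; gene4 and gene5 never do.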
featureSelect()
We will perform feature selection on a subset of spatial transcriptomics data (input) using a predefined set of spatially variable genes (VGs). Specifically, for each gene we will compare the results obtained with and without adjusting for each batch effect, and express the relative change in deviance (nSD_dev_{batch effect}) and the rank difference (nSD_rank_{batch effect}) as a number of standard deviations.
The featureSelect()
function enables feature selection while accounting for
multiple batch effects. It returns a list of data frames, where each batch
effect is associated with a corresponding data frame containing key results,
including:
Relative change in deviance before and after batch effect adjustment
Rank differences between the batch-corrected and uncorrected results
Number of standard deviations (nSD) for both relative change in deviance and rank difference
As an example, we will apply featureSelect() to the four-sample dataset
while adjusting for the batch effects sample_id and sex.
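To give a feel for what a deviance with and without batch adjustment means, here is a minimal base R sketch of a binomial deviance on a toy count matrix. It is not the package's code: it assumes a binomial deviance of the kind implemented in the scry package (which appears among the loaded namespaces in the session information), with the batch-adjusted version computed within each batch and summed. The functions binomial_deviance() and batch_deviance() and the toy data are hypothetical.

```r
# Binomial deviance per gene for a genes x spots count matrix (sketch).
binomial_deviance <- function(counts) {
  n <- colSums(counts)                  # total counts per spot
  pi_hat <- rowSums(counts) / sum(n)    # pooled proportion per gene (null model)
  mu <- outer(pi_hat, n)                # expected counts under the null
  n_mat <- matrix(n, nrow(counts), ncol(counts), byrow = TRUE)
  # x * log(x / m), taken to be 0 when x is 0
  xlogx <- function(x, m) ifelse(x > 0, x * log(x / m), 0)
  2 * rowSums(xlogx(counts, mu) + xlogx(n_mat - counts, n_mat - mu))
}

# Batch-adjusted version: fit the null within each batch, then sum (assumption).
batch_deviance <- function(counts, batch) {
  per_batch <- lapply(split(seq_len(ncol(counts)), batch), function(idx) {
    binomial_deviance(counts[, idx, drop = FALSE])
  })
  Reduce(`+`, per_batch)
}

# Toy data: six spots, three from batch "a" and three from batch "b"
toy_counts <- rbind(
  stable     = c(10, 12,  9, 11, 10, 12),
  batch_gene = c( 2,  3,  2, 20, 22, 19),  # differs between batches only
  spatial    = c( 1, 15, 30,  2, 14, 29),  # varies within each batch
  background = rep(100, 6)
)
toy_batch <- rep(c("a", "b"), each = 3)

dev_default <- binomial_deviance(toy_counts)
dev_batch   <- batch_deviance(toy_counts, toy_batch)

# Relative change in deviance, relative to the batch-adjusted value
d_diff <- (dev_default - dev_batch) / dev_batch
round(cbind(dev_default, dev_batch, d_diff), 2)
```

In this toy setting the gene whose expression differs only between batches loses most of its deviance once the batch is accounted for, so it shows the largest relative change. The gene that varies within each batch keeps a high deviance either way.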
spe_sub4 <- spe_sub4[rowData(spe_sub4)$gene_id %in% svgs_sub4$gene_id,]
rownames(spe_sub4) <- rowData(spe_sub4)$gene_id
SVGs <- svgs_sub4$gene_id
list_batch_df <- featureSelect(input = spe_sub4,
batch_effect = c("sample_id", "sex"), VGs = SVGs)
Running feature selection without batch...
Batch Effect: sample_id
Running feature selection without batch...
Calculating deviance and rank difference...
Batch Effect: sex
Running feature selection without batch...
Calculating deviance and rank difference...
class(list_batch_df)
[1] "list"
head(list_batch_df$sample_id)
gene_id gene_name dev_default rank_default dev_sample_id
1 ENSG00000131584 ACAP3 16125.31 1262 15900.14
2 ENSG00000175756 AURKAIP1 17344.09 1060 17167.86
3 ENSG00000242485 MRPL20 17629.33 1023 17517.05
4 ENSG00000179403 VWA1 12860.93 1726 12825.66
5 ENSG00000160075 SSU72 16145.20 1255 16136.31
6 ENSG00000078369 GNB1 22402.83 516 22271.32
rank_sample_id d_diff nSD_dev_sample_id r_diff nSD_rank_sample_id
1 1269 0.0141612453 -0.09513109 7 0.16380252
2 1058 0.0102651525 -0.16945410 -2 -0.04680072
3 1004 0.0064098269 -0.24299943 -19 -0.44460683
4 1702 0.0027493688 -0.31282741 -24 -0.56160863
5 1220 0.0005506572 -0.35477067 -35 -0.81901258
6 497 0.0059049925 -0.25262980 -19 -0.44460683
head(list_batch_df$sex)
gene_id gene_name dev_default rank_default dev_sex rank_sex
1 ENSG00000131584 ACAP3 16125.31 1262 16118.48 1250
2 ENSG00000175756 AURKAIP1 17344.09 1060 17247.44 1064
3 ENSG00000242485 MRPL20 17629.33 1023 17585.70 1013
4 ENSG00000179403 VWA1 12860.93 1726 12860.90 1709
5 ENSG00000160075 SSU72 16145.20 1255 16141.12 1243
6 ENSG00000078369 GNB1 22402.83 516 22314.17 509
d_diff nSD_dev_sex r_diff nSD_rank_sex
1 4.234208e-04 -0.2615600 -12 -0.3080188
2 5.603690e-03 -0.1013769 4 0.1026729
3 2.480783e-03 -0.1979427 -10 -0.2566824
4 1.811106e-06 -0.2745969 -17 -0.4363600
5 2.527558e-04 -0.2668373 -12 -0.3080188
6 3.973515e-03 -0.1517848 -7 -0.1796776
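The d_diff and r_diff columns can be reproduced from the other printed columns. The sketch below copies the first six rows of list_batch_df$sample_id from the output above and recomputes both quantities. The formulas are inferred from the printed values rather than taken from the package source, so treat them as an assumption.

```r
# First six rows of list_batch_df$sample_id, copied from the printed output
head_sample <- data.frame(
  gene_name      = c("ACAP3", "AURKAIP1", "MRPL20", "VWA1", "SSU72", "GNB1"),
  dev_default    = c(16125.31, 17344.09, 17629.33, 12860.93, 16145.20, 22402.83),
  rank_default   = c(1262, 1060, 1023, 1726, 1255, 516),
  dev_sample_id  = c(15900.14, 17167.86, 17517.05, 12825.66, 16136.31, 22271.32),
  rank_sample_id = c(1269, 1058, 1004, 1702, 1220, 497),
  nSD_rank_sample_id = c(0.16380252, -0.04680072, -0.44460683,
                         -0.56160863, -0.81901258, -0.44460683)
)

# Relative change in deviance, relative to the batch-adjusted deviance
d_diff <- with(head_sample, (dev_default - dev_sample_id) / dev_sample_id)

# Rank difference: batch-adjusted rank minus default rank
r_diff <- with(head_sample, rank_sample_id - rank_default)

# Dividing each rank difference by its nSD value recovers one common scale
scale_rank <- r_diff / head_sample$nSD_rank_sample_id

data.frame(gene_name = head_sample$gene_name, d_diff, r_diff, scale_rank)
```

The recomputed d_diff values agree with the printed column up to the rounding of the displayed deviances. The ratio of r_diff to nSD_rank_sample_id is the same for all six genes (roughly 42.7), which is consistent with nSD being the difference expressed in units of a single standard deviation estimated across all SVGs.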
svg_nSD for Batch Effects
The svg_nSD() function generates visualizations to assess batch effects in
spatially variable genes (SVGs). It produces bar charts showing the distribution
of SVGs based on relative change in deviance and rank difference, with colors
representing different nSD intervals. Additionally, scatter plots compare
deviance and rank values with and without batch effects.
By interpreting these plots, we can determine appropriate nSD thresholds for filtering biased features. The left panels illustrate the distribution of SVGs in terms of deviance and rank difference, while the right panels compare values before and after accounting for batch effects.
plots <- svg_nSD(list_batch_df = list_batch_df,
sd_interval_dev = c(5,4), sd_interval_rank = c(4,6))
plots$sample_id
plots$sex
We can also apply svg_nSD()
to a single batch effect. Note that the function
requires the input to be a list of data frames, even when analyzing only one
batch.
plots <- svg_nSD(list_batch_df = list_batch_df[1],
sd_interval_dev = 5, sd_interval_rank = 7)
plots$sample_id
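The requirement that the input stay a list explains why the call above uses single brackets. In R, `[` returns a sub-list and keeps the element name, whereas `[[` extracts the bare data frame. The short sketch below illustrates the difference on a made-up list; toy_list is a hypothetical stand-in for the output of featureSelect().

```r
# Stand-in for the output of featureSelect(): a named list of data frames
toy_list <- list(
  sample_id = data.frame(gene_id = c("g1", "g2"), d_diff = c(0.01, 0.40)),
  sex       = data.frame(gene_id = c("g1", "g2"), d_diff = c(0.02, 0.03))
)

one_batch_list <- toy_list[1]    # still a list of length 1, name kept
one_batch_df   <- toy_list[[1]]  # the bare data frame, name dropped

class(one_batch_list)  # "list"
names(one_batch_list)  # "sample_id"
class(one_batch_df)    # "data.frame"

# Selecting by name with single brackets gives the same one-element list
identical(toy_list["sample_id"], one_batch_list)  # TRUE
```

Passing `list_batch_df["sample_id"]` is therefore equivalent to `list_batch_df[1]` here and may be easier to read.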
biasDetect()
The function biasDetect() is designed to identify and filter out biased genes
across different batch effects. Using threshold values selected from
the visualization results generated by svg_nSD(), this function systematically
detects outliers that exceed a specified number of standard deviations (nSD)
in relative deviance change, rank difference, or both.
The function outputs visualizations comparing deviance and rank values with and without batch effects. Genes with high deviations, highlighted in color, are identified as potentially biased and can be excluded based on the selected nSD thresholds.
The function offers flexibility in customizing the plot aesthetics, allowing users to adjust the data point size (plot_point_size), shape (plot_point_shape), annotated text size (plot_text_size), and data point color palette (plot_pallete). Default values are used for these parameters if they are not specified. Users should refer to the ggplot2 aesthetic guidelines to ensure that appropriate values are assigned to each parameter.
We will use nSD_dev = 7 and nSD_rank = 6 as the example. Users should
adjust these values based on the features of their own dataset.
Usage of Different Threshold Options
threshold = "dev": Filters biased genes based only on the relative change
in deviance. Genes with deviance changes exceeding the specified nSD_dev
threshold are identified as batch-affected and can be removed.
bias_dev <- biasDetect(list_batch_df = list_batch_df,
threshold = "dev", nSD_dev = 7)
head(bias_dev$sample_id$Table)
gene_id gene_name dev_default rank_default dev_sample_id
1 ENSG00000174576 NPAS4 35003.31 125 24629.414
2 ENSG00000123358 NR4A1 23299.81 457 16115.928
3 ENSG00000170345 FOS 42305.65 73 27146.089
4 ENSG00000256618 MTRNR2L1 69206.34 28 24876.086
5 ENSG00000118271 TTR 4719046.58 1 3292127.945
6 ENSG00000229807 XIST 15223.50 1408 8819.689
rank_sample_id d_diff nSD_dev_sample_id r_diff nSD_rank_sample_id
1 363 0.4211996 7.669653 238 5.569286
2 1226 0.4457631 8.138234 769 17.994876
3 263 0.5584435 10.287758 190 4.446068
4 351 1.7820430 33.629502 323 7.558316
5 1 0.4334335 7.903030 0 0.000000
6 2050 0.7260812 13.485664 642 15.023031
nSD_bin_dev dev_outlier
1 [7,14) TRUE
2 [7,14) TRUE
3 [7,14) TRUE
4 [28,35] TRUE
5 [7,14) TRUE
6 [7,14) TRUE
bias_dev$sample_id$Plot
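The nSD_bin_dev labels in the table above look like left-closed bins whose width equals the chosen threshold. The sketch below reproduces those labels with base R's cut() for the six printed genes. How the package builds its bins internally is not shown in this vignette, so the use of cut() with these arguments is an assumption.

```r
# nSD_dev_sample_id values of the six genes printed above
nsd_dev <- c(NPAS4 = 7.669653, NR4A1 = 8.138234, FOS = 10.287758,
             MTRNR2L1 = 33.629502, TTR = 7.903030, XIST = 13.485664)

nSD_dev <- 7  # threshold, also used as the bin width in this sketch

# Left-closed bins of width nSD_dev covering the observed range
upper  <- ceiling(max(nsd_dev) / nSD_dev) * nSD_dev
breaks <- seq(0, upper, by = nSD_dev)
bins   <- cut(nsd_dev, breaks = breaks, right = FALSE, include.lowest = TRUE)

data.frame(nSD = nsd_dev, bin = bins, dev_outlier = nsd_dev >= nSD_dev)
```

With these settings five genes fall in [7,14) and MTRNR2L1 falls in [28,35], matching the printed nSD_bin_dev column.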
We can change the data point size using plot_point_size.
# size default = 3
bias_dev_size <- biasDetect(list_batch_df = list_batch_df,
threshold = "dev", nSD_dev = 7, plot_point_size = c(2,4))
plot_grid(bias_dev_size$sample_id$Plot,bias_dev_size$sex$Plot)
threshold = "rank": Identifies biased genes based solely on rank difference.
Genes with rank shifts exceeding nSD_rank are considered biased.
bias_rank <- biasDetect(list_batch_df = list_batch_df,
threshold = "rank", nSD_rank = 6)
head(bias_rank$sex$Table)
gene_id gene_name dev_default rank_default dev_sex rank_sex
1 ENSG00000159388 BTG2 18311.28 926 14257.70 1543
2 ENSG00000135625 EGR4 20336.84 705 17851.54 972
3 ENSG00000120738 EGR1 19882.54 752 17444.96 1037
4 ENSG00000120129 DUSP1 25054.85 365 19007.41 834
5 ENSG00000204388 HSPA1B 19085.73 841 16440.69 1197
6 ENSG00000130222 GADD45G 16565.74 1191 13815.91 1592
d_diff nSD_dev_sex r_diff nSD_rank_sex nSD_bin_rank rank_outlier
1 0.2843081 8.516661 617 15.837301 [12,18) TRUE
2 0.1392207 4.030301 267 6.853419 [6,12) TRUE
3 0.1397297 4.046039 285 7.315447 [6,12) TRUE
4 0.3181621 9.563487 469 12.038402 [12,18) TRUE
5 0.1608836 4.700155 356 9.137892 [6,12) TRUE
6 0.1990332 5.879808 401 10.292962 [6,12) TRUE
bias_rank$sex$Plot
We can change the data point shape using plot_point_shape.
# shape default = 16
bias_rank_shape <- biasDetect(list_batch_df = list_batch_df,
threshold = "rank", nSD_rank = 6, plot_point_shape = c(2, 18))
plot_grid(bias_rank_shape$sample_id$Plot,bias_rank_shape$sex$Plot)
threshold = "both": Detects biased genes based on both deviance change and
rank difference, providing a more stringent filtering approach.
bias_both <- biasDetect(list_batch_df = list_batch_df, threshold = "both",
nSD_dev = 7, nSD_rank = 6)
bias_both$sample_id$Plot
head(bias_both$sex$Table)
gene_id gene_name dev_default rank_default dev_sex rank_sex
1 ENSG00000173110 HSPA6 9887.942 2011 7867.831 2074
2 ENSG00000159388 BTG2 18311.285 926 14257.704 1543
3 ENSG00000135625 EGR4 20336.842 705 17851.538 972
4 ENSG00000120738 EGR1 19882.541 752 17444.962 1037
5 ENSG00000120129 DUSP1 25054.848 365 19007.410 834
6 ENSG00000204389 HSPA1A 52523.899 47 41069.968 75
d_diff nSD_dev_sex r_diff nSD_rank_sex nSD_bin_dev dev_outlier
1 0.2567557 7.664691 63 1.6170988 [7,14) TRUE
2 0.2843081 8.516661 617 15.8373012 [7,14) TRUE
3 0.1392207 4.030301 267 6.8534188 [0,7) FALSE
4 0.1397297 4.046039 285 7.3154471 [0,7) FALSE
5 0.3181621 9.563487 469 12.0384024 [7,14) TRUE
6 0.2788882 8.349068 28 0.7187106 [7,14) TRUE
nSD_bin_rank rank_outlier
1 [0,6) FALSE
2 [12,18) TRUE
3 [6,12) TRUE
4 [6,12) TRUE
5 [12,18) TRUE
6 [0,6) FALSE
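In the table above, some genes are flagged by only one of the two criteria (for example HSPA6 by deviance only and EGR4 by rank only), which suggests that threshold = "both" lists a gene when it exceeds either threshold. The sketch below recomputes the two flags from the printed nSD values. The use of >= rather than > is an assumption that makes no difference for these six genes.

```r
# nSD values for the six genes in head(bias_both$sex$Table), copied from above
head_sex <- data.frame(
  gene_name    = c("HSPA6", "BTG2", "EGR4", "EGR1", "DUSP1", "HSPA1A"),
  nSD_dev_sex  = c(7.664691, 8.516661, 4.030301, 4.046039, 9.563487, 8.349068),
  nSD_rank_sex = c(1.6170988, 15.8373012, 6.8534188,
                   7.3154471, 12.0384024, 0.7187106)
)

nSD_dev  <- 7
nSD_rank <- 6

head_sex$dev_outlier  <- head_sex$nSD_dev_sex  >= nSD_dev
head_sex$rank_outlier <- head_sex$nSD_rank_sex >= nSD_rank

# Flagged by at least one criterion (all six genes appear in the table)
head_sex$either <- head_sex$dev_outlier | head_sex$rank_outlier

# Flagged by both criteria at once
flagged_twice <- head_sex$gene_name[head_sex$dev_outlier & head_sex$rank_outlier]

head_sex
flagged_twice
```

The dev_outlier and rank_outlier columns can then be used to keep only genes flagged by both criteria, if a narrower definition of bias is preferred.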
We can change the data point color using plot_pallete. Available palette
names can be looked up in the RColorBrewer documentation, since the function
uses RColorBrewer to generate colors.
# color default = "YlOrRd"
bias_both_color <- biasDetect(list_batch_df = list_batch_df,
threshold = "both", nSD_dev = 7, nSD_rank = 6, plot_pallete = "Greens")
plot_grid(bias_both_color$sample_id$Plot,bias_both_color$sex$Plot,nrow = 2)
We can change the text size using plot_text_size. Here we also specify a separate color palette for each of the two batch effects.
# text size default = 3
bias_both_color_text <- biasDetect(list_batch_df = list_batch_df,
threshold = "both", nSD_dev = 7, nSD_rank = 6,
plot_pallete = c("Blues","Greens"), plot_text_size = c(2,4))
plot_grid(bias_both_color_text$sample_id$Plot,
bias_both_color_text$sex$Plot,nrow = 2)
Finally, we obtain a refined set of spatially variable genes (SVGs) by removing
the outliers identified with the user-defined thresholds nSD_dev and nSD_rank.
Here, we use the results from bias_both, which applied threshold = "both" to
account for both deviance and rank differences, with the batch effect set to
sample_id.
bias_both_df <- bias_both$sample_id$Table
svgs_filt <- setdiff(svgs_sub4$gene_id, bias_both_df$gene_id)
svgs_sub4_filt <- svgs_sub4[svgs_sub4$gene_id %in% svgs_filt, ]
nrow(svgs_sub4_filt)
[1] 2067
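The step above removes the 15 genes flagged for sample_id (2,082 minus 2,067). If genes flagged under several batch effects should all be removed, one option is to take the union of the flagged gene IDs first. The sketch below shows this on made-up data; toy_bias is a hypothetical object shaped like the biasDetect() output used in this vignette (one Table per batch effect), and the gene IDs are invented.

```r
# Hypothetical stand-in for a biasDetect() result: one Table per batch effect
toy_bias <- list(
  sample_id = list(Table = data.frame(gene_id = c("g2", "g5"))),
  sex       = list(Table = data.frame(gene_id = c("g5", "g7")))
)
toy_svgs <- paste0("g", 1:8)

# Union of flagged gene IDs across all batch effects
flagged <- unique(unlist(lapply(toy_bias, function(x) x$Table$gene_id)))

# Remove every flagged gene from the SVG set
toy_svgs_filt <- setdiff(toy_svgs, flagged)

flagged
toy_svgs_filt
length(toy_svgs) - length(toy_svgs_filt)  # number of genes removed
```

Whether to filter on one batch effect or on several depends on the study design, so this is a design choice rather than a required step.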
After obtaining the refined set of SVGs, these genes can be further analyzed using established spatial transcriptomics clustering algorithms to explore tissue layers and spatial organization.
R session information
## Session info
sessionInfo()
#> R Under development (unstable) (2025-02-19 r87757)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.2 LTS
#>
#> Matrix products: default
#> BLAS: /home/biocbuild/bbs-3.21-bioc/R/lib/libRblas.so
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0 LAPACK version 3.12.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_GB LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: America/New_York
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats4 stats graphics grDevices utils datasets methods
#> [8] base
#>
#> other attached packages:
#> [1] humanHippocampus2024_0.99.8 cowplot_1.1.3
#> [3] tibble_3.2.1 dplyr_1.1.4
#> [5] tidyr_1.3.1 SpatialExperiment_1.17.0
#> [7] SingleCellExperiment_1.29.1 SummarizedExperiment_1.37.0
#> [9] Biobase_2.67.0 GenomicRanges_1.59.1
#> [11] GenomeInfoDb_1.43.4 IRanges_2.41.3
#> [13] S4Vectors_0.45.4 MatrixGenerics_1.19.1
#> [15] matrixStats_1.5.0 ExperimentHub_2.15.0
#> [17] AnnotationHub_3.15.0 BiocFileCache_2.15.1
#> [19] dbplyr_2.5.0 BiocGenerics_0.53.6
#> [21] generics_0.1.3 BatchSVG_0.99.2
#> [23] BiocStyle_2.35.0
#>
#> loaded via a namespace (and not attached):
#> [1] tidyselect_1.2.1 farver_2.1.2 blob_1.2.4
#> [4] filelock_1.0.3 Biostrings_2.75.4 fastmap_1.2.0
#> [7] digest_0.6.37 rsvd_1.0.5 mime_0.12
#> [10] lifecycle_1.0.4 KEGGREST_1.47.0 RSQLite_2.3.9
#> [13] magrittr_2.0.3 compiler_4.5.0 rlang_1.1.5
#> [16] sass_0.4.9 tools_4.5.0 yaml_2.3.10
#> [19] knitr_1.49 labeling_0.4.3 S4Arrays_1.7.3
#> [22] bit_4.6.0 curl_6.2.1 DelayedArray_0.33.6
#> [25] RColorBrewer_1.1-3 abind_1.4-8 BiocParallel_1.41.2
#> [28] withr_3.0.2 purrr_1.0.4 grid_4.5.0
#> [31] beachmat_2.23.6 colorspace_2.1-1 ggplot2_3.5.1
#> [34] scales_1.3.0 tinytex_0.56 cli_3.6.4
#> [37] rmarkdown_2.29 crayon_1.5.3 rjson_0.2.23
#> [40] httr_1.4.7 DBI_1.2.3 cachem_1.1.0
#> [43] parallel_4.5.0 AnnotationDbi_1.69.0 BiocManager_1.30.25
#> [46] XVector_0.47.2 vctrs_0.6.5 Matrix_1.7-2
#> [49] jsonlite_1.9.1 bookdown_0.42 BiocSingular_1.23.0
#> [52] bit64_4.6.0-1 ggrepel_0.9.6 scry_1.19.0
#> [55] irlba_2.3.5.1 magick_2.8.5 jquerylib_0.1.4
#> [58] glue_1.8.0 codetools_0.2-20 gtable_0.3.6
#> [61] BiocVersion_3.21.1 UCSC.utils_1.3.1 ScaledMatrix_1.15.0
#> [64] munsell_0.5.1 pillar_1.10.1 rappdirs_0.3.3
#> [67] htmltools_0.5.8.1 GenomeInfoDbData_1.2.13 R6_2.6.1
#> [70] evaluate_1.0.3 lattice_0.22-6 png_0.1-8
#> [73] memoise_2.0.1 bslib_0.9.0 Rcpp_1.0.14
#> [76] SparseArray_1.7.6 xfun_0.51 pkgconfig_2.0.3