scHiCcompare 0.99.16
scHiCcompare is designed for the imputation, joint normalization, and detection of differential chromatin interactions between two groups of chromosome-specific single-cell Hi-C datasets (scHi-C). The groups can be pre-defined based on biological conditions or created by clustering cells according to their chromatin interaction patterns. Clustering can be performed using methods like Higashi, scHiCcluster methods, etc.
scHiCcompare works with processed Hi-C data, specifically chromosome-specific chromatin interaction matrices, and accepts five-column tab-separated text files in a sparse matrix format.
The package provides two key functionalities:
if (!requireNamespace("BiocManager", quietly = TRUE)) {
install.packages("BiocManager")
}
BiocManager::install("scHiCcompare")
# For the latest version install from GitHub
# devtools::install_github("dozmorovlab/scHiCcompare")
library(scHiCcompare)
library(HiCcompare)
library(tidyr)
library(ggplot2)
library(gridExtra)
library(lattice)
library(data.table)
library(mclust)
scHiCcompare()
function conducts a differential analysis workflow, including
imputation and normalization, using chromosome-specific chromatin interaction
matrices split into two conditions (or two cell type groups).
Imputation - First, scHiC data in each group can optionally undergo
imputation
to address data sparsity. As resolution increases, the percentage
of ‘0’ values or missing data also rises drastically. To address this,
scHiCompare applies the random forest (RF
) imputation method on each genomic
distance (no pooling
) or pooled bands with the option of a pooling technique.
There are two pooling strategies:
Progressive pooling
, where each subsequent band combines interaction
frequencies (IFs) within a linearly increasing range of distances.
Fibonacci pooling
, where each subsequent band combines IFs within
increasing distance ranges, and this increase follows the Fibonacci
sequence.
Figure 1: An example of how different pooling styles assign genomic distance into each band
Due to extreme sparsity at higher genomic distances and randomness of long-range
interactions, scHiCcompare, by default, workflow focuses on a distance range of
1 to 10MB (main.Distance
). If any pool band falls outside this range and has
a percentage of missing values exceeding the specified threshold (default
missPerc.threshold is 95%), the missing interaction frequencies (IFs) values
are imputed using the mean IF values from the available data within that pool
band.
Pseudo-bulk scHi-C - Second, imputed single-cell Hi-C data in each group
are transformed into group-specific pseudo-bulk
scHi-C matrices by summing
all group-specific single-cell Hi-C matrices.
Normalization - After obtaining two group-specific pseudo-bulk matrices,
normalization
removes global and local biases between two pseudo-bulk
matrices. The scHiCcompare workflow applies a LOESS
regression model from HiCcompare to jointly
normalize two imputed pseudo-bulk matrices. Briefly,
Differential analysis - differential analysis
is performed on the
processed pseudo-bulk matrices to identify differential chromatin interactions
between the two cell types or conditions. This involves separating the
normalized log fold changes of interaction frequencies (the M-values) into
difference and non-difference groups. This analysis is performed on a
per-distance basis.
fprControl.logfc
is applied to
values in the difference group, excluding low log fold changes.To use scHiCcompare, you’ll need to define two groups of cells to compare and save cell-specific scHi-C data (individual files in .txt format) in two folders.
Each cell-specific scHi-C .txt file should be formatted as modified sparse upper triangular matrices in R, which consist of five columns (chr1, start1, chr2, start2, IF). Since the full matrix of chromatin interactions is symmetric, only the upper triangular portion, including the diagonal and excluding any 0, is stored in a sparse matrix format. The required sparse matrix format of each single-cell Hi-C is:
The ‘.txt’ files need to be saved in tab-separated columns and no row names, column names, or quotes around character strings with the example format below.
## chr1 start1 chr2 start2 IF
## 17669 chr20 0 chr20 0 128
## 17670 chr20 0 chr20 1000000 1
## 17671 chr20 1000000 chr20 1000000 179
## 17672 chr20 0 chr20 2000000 1
## 17673 chr20 1000000 chr20 2000000 1
## 17674 chr20 2000000 chr20 2000000 174
To run scHiCcompare()
, you need two folders with condition-specific scHiC
‘.txt’ files. The condition-specific groups of cells should be pre-defined
based on criteria such as experimental conditions, clustering results, or
biological characteristics.
In section Others, we shows examples of steps to download and import scHi-C data into R. User can refer to it for more information.
Here is an example workflow using scHiC human brain datasets (Lee et al., 2019) with ODC and MG cell types at chromosome 20 with a 1MB resolution.
For the following example sections, we will load samples of 10 single-cell Hi-C
data (in ‘.txt’) for each cell type group in two example folders (ODCs_example
and MGs_axample
). The files follow the same format as those downloaded via
download_schic()
of Bandnorm. You can extract the
folder path by the code below, which could be used as input for scHiCcompare()
function.
## Load folder of ODC file path
ODCs_example_path <- system.file("extdata/ODCs_example",
package = "scHiCcompare"
)
## Load folder of MG file path
MGs_example_path <- system.file("extdata/MGs_example",
package = "scHiCcompare"
)
Since the data downloaded by Bandnorm has the required input format (5 columns of [chr1, start1, chr2, start2, IF]), we don’t need an extra step for data modification. If, after importing your data into R, its format does not follow the sparse upper triangular input format requirement, you need to modify the data.
scHiCcompare(
file.path.1, file.path.2,
select.chromosome = "chr1",
imputation = "RF",
normalization = "LOESS",
differential.detect = "MD.cluster",
save.output.path = "results/",
...
)
Core Parameter :
The core workflow consists of the following steps:
file.path.1, file.path.2
- Character strings specifying paths to folders
containing scHi-C data for the first and second cell type or condition groups.select.chromosome
- Integer or character indicating the chromosome to be
analyzed (e.g., ‘chr1’ or ‘chr10’.)imputation
: Handle missing values in sparse single-cell matrices. 'RF'
enables Random Forest-based imputation.normalization
: Apply 'LOESS'
normalization to correct for systematic biases.differential.detect
: Identify significant changes in chromatin interactions. "MD.cluster"
indicates scHiCcompare’s differential detection test on MD plot.Optional Parameter :
In addition to core functionalities, scHiCcompare provides multiple optional parameters to customize imputation, normalization, differential analysis, and output handling.
pool.style
), set the number of imputations (n.imputation
), and control iteration limits (maxit
). Additionally, users can include/exclude extreme interaction frequency (IF) values (outlier.rm
) and set a missing data threshold (missPerc.threshold
), which determines the maximum allowable percentage of missing data in pool bands outside the focal genomic distances.A.min
parameter helps filter low interaction frequencies, ensuring robust outlier detection during differential analysis.fprControl.logfc
and alpha
regulate the false positive rate for GMM difference clusters and define the significance level for outlier detection.save.output.path
allow external storage of results (see
output example), while visualization settings (Plot
, Plot.normalize
) generate MD plots for differential detection and normalization effects.For further detail, user can refer to ?scHiCcompare()
In the following example, we will work with scHi-C data from 10 single cells in both ODC and MG cell types at a 1 MG resolution. We will focus on chromosome 20, applying the full workflow of scHiCcompare, which includes imputation, pseudo-bulk normalization, and differential analysis. Our goal is to detect differences for loci with genomic distances ranging from 1 to 10,000,000 bp. The progressive pooling style will be selected to create pool bands for the random forest imputation. For the differential analysis step, we will set the log fold change - false positive control threshold to 0.8.
The input file path was included in the package and conducted in the Prepare input folders section.
## Imputation with 'progressive' pooling
result <- scHiCcompare(
file.path.1 = ODCs_example_path,
file.path.2 = MGs_example_path,
select.chromosome = "chr20",
main.Distances = 1:10000000,
imputation = "RF",
normalization = "LOESS",
differential.detect = "MD.cluster",
pool.style = "progressive",
fprControl.logfc = 0.8,
Plot = TRUE
)
Figure 2: Chromatin differential detection between ODC and MG in chromosome 20 in example above
From the visualizations above, normalization effectively reduces the irregular trend in the M values between the imputed pseudo-bulk matrices of the two cell types. At a 1MB resolution, the differential analysis reveals that most of the detected differences occur at closer genomic distances, particularly below 5MB.
The scHiCcompare()
function will return an object that contains plots,
differential results, pseudo-bulk matrices, normalized results, and imputation
tables. The full differential results are available in $Differential_Analysis
.
Intermediate results can be accessed with $Intermediate
, including the
imputation result table ($Intermediate$Imputation
), the pseudo-bulk matrix in
sparse format ($Intermediate$PseudoBulk
), and the normalization table ($Intermediate$Bulk.Normalization
). These output table objects have the
following structure:
$Intermediate$PseudoBulk
for each condition group ($condition1
and
$condition2
) has a standard sparse upper triangular format with 3 columns
of [region1, region2, IF].
$Intermediate$Imputation
for each condition group ($condition1
and
$condition2
) has modified sparse upper triangular format:
$Intermediate$Bulk.Normalization
has 15 columns
$Differential_Analysis
has same structure as
$Intermediate$Bulk.Normalization
with addition of 2 differential detection
results columns
You also can have the option to save the results into the chosen directory by
a parameter in scHiCcompare()
function. This will
save the normalization result table, differential result table, and imputed cell
scHi-C data (each group is a sub-folder). The sample of the saved output folder
structure is:
|– Bulk_normalization_table.txt
|– Differential_analysis_table.txt
|– Imputed_{group 1’s name}/
|– Imputed_{group 2’s name}/
The normalization result Bulk_normalization_table.txt
has the same format as
the output object from the scHiCcompare()
function,
$Intermediate$Bulk.Normalization
, which is shown in the structure
example below.
The differential result table Differential_analysis_table.txt
also has the
same format as the output object $Differential_Analysis
from the function.
The imputed cell’s scHiC data is saved in a folder for each group, which has a modified sparse upper triangular format of five columns [chr1, start1, chr2, start2, IF].
Below is a continuous example from
Example of real anlysis above, showing how you can
extract different result options from the scHiCcompare()
function.
### Summary the analysis
print(result)
## -----------------------------------------------------------------
## ScHiCcompare - Differential Analysis for single cell Hi-C
## -----------------------------------------------------------------
## ScHiCcompare analyzes 10 cells of condition 1 group and 10 cells of condition 2 group at chromosome chr20.
##
## The process includes:
## Imputation, Bulk Normalization, Differential Analysis
##
## Note: See full differential result in $Differential_Analysis. Intermediate results can be accessed with $Intermediate.
### Extract imputed differential result
diff_result <- result$Differential_Analysis
DT::datatable(head(diff_result), options = list(scrollX = TRUE), width = 700)
### Extract imputed pseudo bulk matrices normalization
norm_result <- result$Intermediate$Bulk.Normalization
DT::datatable(head(norm_result), options = list(scrollX = TRUE), width = 700)
### Extract imputed ODC cell type table
imp_ODC_table <- result$Intermediate$Imputation$condition1
DT::datatable(head(imp_ODC_table), options = list(scrollX = TRUE), width = 700)
## Extract Pseudo-bulk matrix from imputed scHi-C data
## Pseudo bulk matrix in standard sparse format
psudobulk_result <- result$Intermediate$PseudoBulk$condition1
DT::datatable(head(psudobulk_result),
options = list(scrollX = TRUE), width = 700
)
Furthermore, you also have some parameter options in the function to indicate which plots to output and an option to save the results in a given directory.
There are several other functions included in scHiCcompare
package.
plot_HiCmatrix_heatmap()
produces a heatmap visualization for HiC and scHiC
matrices. It requires, as input, a modified sparse matrix, the same format from
scHiCcompare()
Input with five columns of chr1, start1, chr2 start2,
IF. More information can be found in its help document and the example below.
data("ODC.bandnorm_chr20_1")
plot_HiCmatrix_heatmap(
scHiC.sparse = ODC.bandnorm_chr20_1,
main = "Figure 3. Heatmap of a single cell matrix", zlim = c(0, 5)
)
## Matrix dimensions: 63x63
plot_imputed_distance_diagnostic()
generates a diagnostic visualization of
imputation across genomic distances for all single cells. It compares the
distribution of all cells’ interaction frequency at a given distance data before
and after imputation. It requires, as input, the scHiC table format of the
original and imputed scHiC datasets. ScHiC table format includes columns of
genomic loci coordinates and interaction frequencies (IF) of each cell (cell,
chromosome, start1, end1, IF1, IF2, IF3, etc).
The output of $Intermediate$Imputation
of scHiCcompare()
function is
directly compatible with this format. For more details, see the sections
on Output)
# Extract imputed table result
imp_MG_table <- result$Intermediate$Imputation$condition2
imp_ODC_table <- result$Intermediate$Imputation$condition1
We need to create the table input for original IFs values in the same format. Below is a continuous example from Example of real anlysis above, showing how you can construct scHiC table for original IF values and compare them with the output of imputed IF values.
# Create scHiC table object for original ODC interaction frequencies (IF)
scHiC.table_ODC <- imp_ODC_table[c("region1", "region2", "cell", "chr")]
# List all files in the specified directory for original ODC data
file.names <- list.files(
path = ODCs_example_path,
full.names = TRUE, recursive = TRUE
)
# Loop through each file to read and merge data
for (i in 1:length(file.names)) {
# Read the current file into a data frame
data <- read.delim(file.names[[i]])
names(data) <- c("chr", "region1", "chr2", "region2", paste0("IF_", i))
data <- data[, names(data) %in%
c("chr", "region1", "region2", paste0("IF_", i))]
# Merge the newly read data with the existing scHiC.table_ODC
scHiC.table_ODC <- merge(scHiC.table_ODC, data,
by = c("region1", "region2", "chr"), all = TRUE
)
}
# Create scHiC table object for original MG interaction frequencies (IF)
scHiC.table_MG <- imp_MG_table[c("region1", "region2", "cell", "chr")]
# List all files in the specified directory for original MG data
file.names <- list.files(
path = MGs_example_path,
full.names = TRUE, recursive = TRUE
)
# Loop through each file to read and merge data
for (i in 1:length(file.names)) {
# Read the current file into a data frame
data <- read.delim(file.names[[i]])
names(data) <- c("chr", "region1", "chr2", "region2", paste0("IF_", i))
data <- data[, names(data) %in%
c("chr", "region1", "region2", paste0("IF_", i))]
# Merge the newly read data with the existing scHiC.table_MG
scHiC.table_MG <- merge(scHiC.table_MG, data,
by = c("region1", "region2", "chr"), all = TRUE
)
}
# plot imputed Distance Diagnostic of MG
plot1 <- plot_imputed_distance_diagnostic(
raw_sc_data = scHiC.table_MG,
imp_sc_data = imp_MG_table, D = 1
)
plot2 <- plot_imputed_distance_diagnostic(
raw_sc_data = scHiC.table_MG,
imp_sc_data = imp_MG_table, D = 2
)
plot3 <- plot_imputed_distance_diagnostic(
raw_sc_data = scHiC.table_MG,
imp_sc_data = imp_MG_table, D = 3
)
plot4 <- plot_imputed_distance_diagnostic(
raw_sc_data = scHiC.table_MG,
imp_sc_data = imp_MG_table, D = 4
)
gridExtra::grid.arrange(plot1, plot2, plot3, plot4, ncol = 2, nrow = 2)
Figure 3: Distribution diagnostic plot of imputed MG cells in some genomic distance
The diagnostic visualizations demonstrate that with a sample of only 10 single cells per group (note: this small sample size is for demonstration purposes only), the imputed values for MG closely match the original distribution only at shorter genomic distances (e.g., D1, D2). Increasing the number of single cells per group enhances imputation accuracy across distances. We recommend using a minimum of 80 single cells per group for optimal imputation performance.
To find and download single-cell Hi-C (scHi-C) data, you can use publicly available repositories and databases that host this type of data. Common sources include the Gene Expression Omnibus (GEO), the 4D Nucleome Data Portal (4DN), or data published in research articles, etc. Below are some examples of how to access and download scHi-C data.
You can search data on GEO using queries
such as single-cell Hi-C
, scHi-C
, or a GEO (GSE) series numbers (e.g., GSE80006), etc.
Once you select a dataset, processed data (.txt, .csv, .bed, .hic, etc.)
can be found under the “Supplementary files” section and downloaded via the
FTP
links at the bottom of the page. Additionally, GEO offers various download formats using
different mechanisms. For more details about downloading data in different
formats, visit the GEO download guide:
https://www.ncbi.nlm.nih.gov/geo/info/download.html.
You can also download these data by R. The example below shows steps to download the mouse scHi-C dataset (Flyamer et al.2017) on GEO.
if (!requireNamespace("BiocManager", quietly = TRUE)) {
install.packages("BiocManager")
}
BiocManager::install("GEOquery")
library(GEOquery)
# For example, we want to download (Flyamer et al.2017) data
geo_id <- "GSE80006"
# Download the information, notation, feature data, etc
gse <- getGEO(geo_id, GSEMatrix = TRUE)
# You can read more about this function
?getGEO()
# Download and extract the supplementary files (processed data)
getGEOSuppFiles(geo_id, baseDir = "path/to/save")
Other sources:
Some research papers provide data from external sources, which are usually mentioned in the paper. For example, human brain datasets (Lee et al., 2019) are available through a public box directory located https://salkinstitute.box.com/s/fp63a4j36m5k255dhje3zcj5kfuzkyj1.
Additionally, some tools collect scHi-C data from various studies, like https://noble.gs.washington.edu/proj/schic-topic-model/. Below is an example of an R function from Bandnorm, which also accesses existing single-cell Hi-C data at a 1mbp resolution
To download human brain oligodendrocytes (ODC) and microglia (MG) cell type
(Lee et al., 2019), we used the download_schic()
function of Bandnorm package to download the scHiC data of ODC and
MG cell types groups in 1MB resolution.
install.packages(c("ggplot2", "dplyr", "data.table", "Rtsne", "umap"))
if (!requireNamespace("BiocManager", quietly = TRUE)) {
install.packages("BiocManager")
}
BiocManager::install("immunogenomics/harmony")
devtools::install_github("sshen82/BandNorm", build_vignettes = FALSE)
library(BandNorm)
### Download scHiC data of ODC and MG
download_schic("Lee2019", cell_type = "ODC", cell_path = "ODCs_example")
download_schic("Lee2019", cell_type = "MG", cell_path = "MGs_example")
After downloading scHi-C data, the next step is to import the data into R for analysis.
ScHi-C data is available in various formats from different sources. Below are examples of how to extract chromosome-specific data for analysis in R.
.bedepe, .csv, or .txt
formats :If your raw scHiC data has been processed to ‘.bedepe’, ‘.csv’, or ‘.txt’
formats, it can be read using read.delim()
, read.table()
, etc. Once the data
is loaded, if you have full Hi-C contact matrices, you can convert them to a
sparse upper triangular format using the full2sparse()
function of the HiCcompare package, then reformatting the columns to
achieve the sparse upper triangular input format.
.hic
format:To access and import .hic
files into R, you can use tools such as strawr for reading and processing .hic
files. You can
read other similar example in the HiCcompare package vignette
An example of reading .hic
with strawr is shown
below. straw()
reads the .hic
file of each single-cell Hi-C and outputs a
data.frame in a sparse upper triangular format. This step must be repeated for
single-cell Hi-C of the same cell type (condition) group.
if (!requireNamespace("BiocManager", quietly = TRUE)) {
install.packages("BiocManager")
}
BiocManager::install("strawr")
library(strawr)
# Example to read the contact matrix from a .hic file of a single-cell
filepath <- "path/to/your/schic.hic"
contact_matrix <- straw(
norm = "NONE", # Normalization method (KR, VC, or NONE)
filepath, # Path to .hic file
chr1loc = "chr1", # Chromosome 1
chr2loc = "chr1", # Chromosome 2 (intra-chromosomal interactions)
unit = "BP", # Base pair (BP) resolution or fragment (FRAG) resolution
binsize = 200000 # Bin size (e.g., 200kb)
)
.cool
format:To access and import .cool
files into R, you can use cooler2bedpe()
function of HiCcompare package or cooler to access the
data. You can read example of
cooler on HiCcompare vignette
For example, the files can be read directly into R by cooler2bedpe()
function,
which will return a list object in the format of BEDPE, containing two elements:
“cis” - Contains the intra-chromosomal contact matrices, one per chromosome;
“trans” - Contains the inter-chromosomal contact matrix. You can read about this
in more detail by ?cooler2bedpe()
.
# Install BiocManager if not already installed
if (!requireNamespace("BiocManager", quietly = TRUE)) {
install.packages("BiocManager")
}
# Install HiCcompare package from Bioconductor
BiocManager::install("HiCcompare")
# Load the HiCcompare package
library(HiCcompare)
cool.file <- read_files("path/to/schic.cool")
## R Under development (unstable) (2025-03-13 r87965)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.2 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.21-bioc/R/lib/libRblas.so
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0 LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_GB LC_COLLATE=C LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: America/New_York
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] mclust_6.1.1 data.table_1.17.0 lattice_0.22-7 gridExtra_2.3 ggplot2_3.5.1 tidyr_1.3.1 HiCcompare_1.29.0 dplyr_1.1.4 scHiCcompare_0.99.16 BiocStyle_2.35.0
##
## loaded via a namespace (and not attached):
## [1] Rdpack_2.6.3 DBI_1.2.3 rlang_1.1.5 magrittr_2.0.3 miceadds_3.17-44 matrixStats_1.5.0 compiler_4.6.0 mgcv_1.9-3 vctrs_0.6.5 pkgconfig_2.0.3 shape_1.4.6.1 crayon_1.5.3 fastmap_1.2.0 magick_2.8.6 backports_1.5.0 XVector_0.47.2 labeling_0.4.3 rmarkdown_2.29 UCSC.utils_1.3.1 nloptr_2.2.1 tinytex_0.56 purrr_1.0.4 xfun_0.52 glmnet_4.1-8 jomo_2.7-6 cachem_1.1.0 GenomeInfoDb_1.43.4 jsonlite_2.0.0 rhdf5filters_1.19.2 DelayedArray_0.33.6 pan_1.9 Rhdf5lib_1.29.2 BiocParallel_1.41.5 broom_1.0.8 parallel_4.6.0
## [36] R6_2.6.1 bslib_0.9.0 RColorBrewer_1.1-3 ranger_0.17.0 car_3.1-3 boot_1.3-31 rpart_4.1.24 GenomicRanges_1.59.1 jquerylib_0.1.4 Rcpp_1.0.14 bookdown_0.42 SummarizedExperiment_1.37.0 iterators_1.0.14 knitr_1.50 IRanges_2.41.3 Matrix_1.7-3 splines_4.6.0 nnet_7.3-20 tidyselect_1.2.1 abind_1.4-8 yaml_2.3.10 codetools_0.2-20 tibble_3.2.1 withr_3.0.2 InteractionSet_1.35.0 Biobase_2.67.0 evaluate_1.0.3 survival_3.8-3 pillar_1.10.2 BiocManager_1.30.25 carData_3.0-5 MatrixGenerics_1.19.1 mice_3.17.0 KernSmooth_2.23-26 DT_0.33
## [71] foreach_1.5.2 stats4_4.6.0 reformulas_0.4.0 generics_0.1.3 S4Vectors_0.45.4 munsell_0.5.1 scales_1.3.0 minqa_1.2.8 gtools_3.9.5 glue_1.8.0 pheatmap_1.0.12 tools_4.6.0 lme4_1.1-37 rhdf5_2.51.2 grid_4.6.0 mitools_2.4 crosstalk_1.2.1 rbibutils_2.3 colorspace_2.1-1 nlme_3.1-168 GenomeInfoDbData_1.2.14 Formula_1.2-5 cli_3.6.4 S4Arrays_1.7.3 gtable_0.3.6 rstatix_0.7.2 sass_0.4.9 digest_0.6.37 BiocGenerics_0.53.6 SparseArray_1.7.7 htmlwidgets_1.6.4 farver_2.1.2 htmltools_0.5.8.1 lifecycle_1.0.4 httr_1.4.7
## [106] mitml_0.4-5 MASS_7.3-65