‘Site2Target’ is an R package to associate sets of sites/peaks to target genes. It provides peakwise-associations to associate target genes for a given set of peaks. It also provides genewise-associations which start from genes (ex. differentially expressed genes) and associate peaks/sites to these genes.
Site2Target 0.99.5
To install this package, start R (version “4.4” or above) and either use BiocManager:
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("Site2Target")
or directly install from github:
devtools::install_github("fls-bioinformatics-core/Site2Target")
The goal of the Site2Target package is to provide bioinformaticians and computational biologists with computational tools to study the regulatory networks by integrating high-throughput omics datasets. Site2Target facilitates associating peaks from different high-throughput experiments to target genes. We have submitted Site2Target to Bioconductor because of its strong alignment with the mission of the Bioconductor project. Bioconductor provides a rich ecosystem for tools designed to analyze and interpret regulatory networks by integrating various high-throughput experiments. Our package fits well within this ecosystem, as it enables users to perform integrative analyses that combine various omics datasets that describe the regulatory networks with the differential gene expression.
Associating peaks/sites from high-throughput experiments and target genes is a fundamental question in gene regulatory studies. Peaks can be transcription factor binding sites from ChIP-seq experiments, or can be open chromatin regions derived from ATAC-seq experiments as well as histone modification marks derived from ChIP-seq experiments. Gene regulatory perturbation often measured by standard high-throughput expression measurement such as RNA-seq. The effect of sites/peaks on genes expression have been implemented in peakwise-association of Site2Target. The peaks are associated to genes by given distance (default 50K base pairs). Site2Target can also associate sites/peaks to genes with extra user provided DNA-DNA interactions. This can be known as enhancer-promoter interactions or high-throughput interactions from HiC or similar experiments. Alternatively, Site2Target can associate peaks to genes inside user given genomic ranges. These genomic ranges can be TADs, subTADs, loops, or any similar associations to genomic regions. Site2Target work with both peaks/sites with and without intensity values. For example, peaks may only consist of differential accessibility of open chromatin or binding sites of transcription factors. On the other hand, peaks/sites can also have intensity values such transcription factor binding sites intensities. In cases of binary associations, the number of associated peaks to genes follow negative binomial distribution similar to any other count data. Therefore, Site2Target provided negative binomial tests (default) as well as Poisson test to detect target genes of given peaks. Generally, peaks such as transcription factor binding sites, and to some extent histone marks and open chromatin accessibility follow log-normal distribution. Therefore, Site2Target models the intensity values assigned to each gene with log-normal distribution. A similar approach has been used to model upstream regulator and target gene interaction (Krämer et al. 2014). Site2Target used GenomicRanges (Lawrence et al. 2013) to perform all the functions related to genome interactions/associations, and used MASS (Venables and Ripley 2002) to perform statistical modeling. Site2Target also performs genewise-associations. For this aim it associates a set of user provided genes such as differential expressed genes to a set of user provided peaks/sites regions such as (differential or total) open chromatin regions, TF binding regions, etc. The associations can be done by genomic distance (ex. 50k or 100k base pairs), user provided interactions (ex. HiC interactions), or user provided regions (ex. TADs, subTADs, loops, etc). Three tables are the output of genewise-association: genes, peaks, and interaction tables. Functions are developed in Site2Target User to add further columns to these tables such as fold changes, intensity values of genes or peaks as well as interactome values such as HiC intensity values in interaction table.
First install and load the libraries needed to run the examples of this document:
library(Site2Target)
The functions, implemented in Site2Target, perform two major tasks: 1. peakwise-association: To predict target genes of given peaks/sites. 2. genewise-association: To associate given genes (ex. differential genes) to given peaks/sites (ex. differential open chromatins).
Site2Target predicts the target genes of given peaks/sites. getTargetGenesPvals
and getTargetGenesPvalsWithDNAInteractions
functions predict the target genes for peaks without using peaks intensities. getTargetGenesPvalsWithIntensities
and getTargetGenesPvalsWithIntensitiesAndDNAInteractions
functions predict the target genes for peaks using peak intensity values.
As an example, we use binding sites of a transcription factor MEIS as peak sets in human cardiomyocyte progenerators, and associate them to genes located up to 500000 base pairs. Only chr21 is used to make the data smaller. getTargetGenesPvals
function associates peaks to genes in this example. The number of associated peaks to each gene follow negative binomial distribution, so by default the negative binomial test is performed. Alternatively, users can choose a Poisson test to be performed. In this example, five top predicted targets based on the test results, and the expression log fold change WT vs MEIS KO are presented as a data frame.
# Read gene coordinates
geneFile=system.file("extdata", "gene_expression.tsv",
package="Site2Target")
geneCoords <- Table2Granges(geneFile)
# Read gene table
geneTable <- read.table(geneFile, header=TRUE)
# Read peak coordinates
tfFile =system.file("extdata", "MEIS_binding.tsv",
package="Site2Target")
TFCoords <- Table2Granges(tfFile)
# Predict targets
pvals <- getTargetGenesPvals( geneCoordinates=geneCoords,
sites=TFCoords, distance = 50000)
topTargetNum <- 5
topTargetIndex <- order(pvals)[1:topTargetNum]
# Make a data frame of peak targets pvalues and expression logFCs
dfTopTarget <-
data.frame(name=geneTable$name[topTargetIndex],
pvalue=pvals[topTargetIndex],
exprLogC=geneTable$logFC[topTargetIndex]
)
dfTopTarget
## name pvalue exprLogC
## 1 RUNX1 2.225948e-06 0.2180413
## 2 TIAM1 3.182140e-03 -0.7820419
## 3 TSPEAR 1.425856e-02 0.7062724
## 4 AP001043.1 1.720710e-02 -1.2158644
## 5 BX322557.1 2.507016e-02 0.7213611
getTargetGenesPvals
can also associate peaks and genes in user-provided genomic ranges. In the next example, we will associate peaks and genes by topologically associating domains (TADs) up to 1M bp distance.
# Read TAD regions
TADsFile =system.file("extdata", "TADs.tsv",
package="Site2Target")
TADs <- Table2Granges(TADsFile)
# Predict targets
pvals <-
getTargetGenesPvals(geneCoordinates=geneCoords,
sites=TFCoords, givenRegions=TADs, distance = 1000000)
topTargetNum <- 5
topTargetIndex <- order(pvals)[1:topTargetNum]
# Make a data frame of peak targets pvalues and expression logFCs
dfTopTarget <-
data.frame(name=geneTable$name[topTargetIndex],
pvalue=pvals[topTargetIndex],
exprLogC=geneTable$logFC[topTargetIndex]
)
dfTopTarget
## name pvalue exprLogC
## 1 RUNX1 0.05636843 0.21804131
## 2 BX322557.1 0.08365992 0.72136111
## 3 LINC00334 0.08696297 2.52135425
## 4 POFUT2 0.08696297 0.26165277
## 5 ADARB1 0.09038317 0.02902255
getTargetGenesPvalsWithDNAInteractions
associates peaks and genes using both distance (default 50k bp) and user provided DNA-DNA interactions. In the next example, we will associate peaks and genes by both distance (up to 50K bp), and HiC interactions.
# Read HiC interactions
HiCFile =system.file("extdata", "HiC_intensities.tsv",
package="Site2Target")
HiCstr1 <- Table2Granges(HiCFile, chrColName="Strand1_chr",
startColName="Strand1_start", endColName="Strand1_end")
HiCstr2 <- Table2Granges(HiCFile, chrColName="Strand2_chr",
startColName="Strand2_start", endColName="Strand2_end")
# Predict targets
pvals <-
getTargetGenesPvalsWithDNAInteractions(
geneCoordinates=geneCoords,
sites=TFCoords,
strand1=HiCstr1,
strand2=HiCstr2)
topTargetNum <- 5
topTargetIndex <- order(pvals)[1:topTargetNum]
# Make a data frame of peak targets pvalues and expression logFCs
dfTopTarget <-
data.frame(name=geneTable$name[topTargetIndex],
pvalue=pvals[topTargetIndex],
exprLogC=geneTable$logFC[topTargetIndex]
)
dfTopTarget
## name pvalue exprLogC
## 1 RUNX1 2.225948e-06 0.2180413
## 2 TIAM1 3.182140e-03 -0.7820419
## 3 TSPEAR 1.425856e-02 0.7062724
## 4 AP001043.1 1.720710e-02 -1.2158644
## 5 BX322557.1 2.507016e-02 0.7213611
Similar functions have been implemented in Site2Target to predict the target genes for peaks with intensity values. intensity values of ChIP-seq, ATAC-seq, and similar experiments generally follows log-normal distribution. Therefore, normal test of log-scaled intensity values associated to genes have been implemented for peaks with intensity values.
getTargetGenesPvalsWithIntensities
is similar to getTargetGenesPvals
, but it also take the intensity values of peaks. It associates peaks by distance or by distance and user provided regions such as TADs. In the following example, MEIS target genes are predicted by associating peaks to genes by default distance 50K bp.
# Read MEIS binding intensities
tfTable <- read.table(tfFile, header=TRUE)
tfIntensities <- tfTable$intensities
# Predict targets
pvals <-
getTargetGenesPvalsWithIntensities(
geneCoordinates=geneCoords,
sites=TFCoords,
intensities=tfIntensities
)
topTargetNum <- 5
topTargetIndex <- order(pvals)[1:topTargetNum]
# Make a data frame of peak targets pvalues and expression logFCs
dfTopTarget <-
data.frame(name=geneTable$name[topTargetIndex],
pvalue=pvals[topTargetIndex],
exprLogC=geneTable$logFC[topTargetIndex]
)
dfTopTarget
## name pvalue exprLogC
## 1 RUNX1 0.001699371 0.2180413
## 2 BX322557.1 0.006635404 0.7213611
## 3 LINC00334 0.008806410 2.5213542
## 4 POFUT2 0.008831260 0.2616528
## 5 ABCG1 0.019641655 -0.9459097
getTargetGenesPvalsWithIntensitiesAndDNAInteractions
is similar to getTargetGenesPvalsWithDNAInteractions
, but it also takes the intensity values of peaks. It associates peaks by distance or by distance and user provided DNA-DNA interactions such as HiC. In the following example, MEIS target genes are predicted by associating peaks to genes (default 50K bp) and HiC interactions.
# Predict targets
pvals <-
getTargetGenesPvalsWithIntensitiesAndDNAInteractions(
geneCoordinates=geneCoords,
sites=TFCoords,
intensities=tfIntensities,
strand1=HiCstr1,
strand2=HiCstr2
)
topTargetNum <- 5
topTargetIndex <- order(pvals)[1:topTargetNum]
# Make a data frame of peak targets pvalues and expression logFCs
dfTopTarget <-
data.frame(name=geneTable$name[topTargetIndex],
pvalue=pvals[topTargetIndex],
exprLogC=geneTable$logFC[topTargetIndex]
)
dfTopTarget
## name pvalue exprLogC
## 1 RUNX1 0.001699371 0.2180413
## 2 BX322557.1 0.006635404 0.7213611
## 3 LINC00334 0.008806410 2.5213542
## 4 POFUT2 0.008831260 0.2616528
## 5 ABCG1 0.019641655 -0.9459097
Site2Target also provides functions to start with genes such as differential genes and associate peaks/sites to these genes. genewiseAssociation
is the main function to associate peaks to the user provided genes. Extra information regarding genes (ex. expression logFC) or peaks (ex. peak intensities) can be added using addColumn2geneWiseAssociation
function. Extra information regarding interactions (ex. HiC interaction intensities) can be added using addRelation2geneWiseAssociation
function.
genewiseAssociation
function detects associate peaks to genes and saves the results in three tables: 1. gene.tsv
consists of gene information. 2. peak.tsv
consists of peak information. 3. link.tsv
consists of linking gene-peak information such as gene and peak names as well as their distances. These files can be found in a folder with the user-provided name. In the following example, we start with differential genes WT vs MEIS KO, and associate them to MEIS binding sites by using distance (50k bp), TADs (up to 1M bp distance), and HiC interactions. The function provides two values: the ratio of genes which are associated to peaks called geneCoverage
as well as the ratio of peaks which are associated to genes called peakCoverage
.
# Take genes with Log fold change larger than one
geneDEIndices <- which((abs(geneTable$logFC)>1)==TRUE)
indicesLen <- length(geneDEIndices)
if(indicesLen >0)
{
geneTable <- geneTable[geneDEIndices,]
geneCoords <- geneCoords[geneDEIndices]
}
geneDENames <- geneTable$name
geneDElogFC <- geneTable$logFC
geneCoordsDE <- geneCoords
# Associate differential genes to TF binding by 50K distance
statsDist <-
genewiseAssociation(associationBy="distance",
geneCoordinates=geneCoordsDE,
geneNames=geneDENames,
peakCoordinates=TFCoords,
distance=50000,
outFile="Gene_TF_50K"
)
statsDist
## geneCoverage peakCoverage
## 1 0.8064516 0.1721698
# Associate differential genes to TF binding by TADs (up to 1M bp)
statsTADs <-
genewiseAssociation(associationBy="regions",
geneCoordinates=geneCoordsDE,
geneNames=geneDENames,
peakCoordinates=TFCoords,
givenRegions=TADs,
distance=1000000,
outFile="Gene_TF_TADs"
)
statsTADs
## geneCoverage peakCoverage
## 1 0.9677419 0.4681604
# Associate differential genes to TF binding by HiC
statsHiC <-
genewiseAssociation(associationBy="DNAinteractions",
geneCoordinates=geneCoordsDE,
geneNames=geneDENames,
peakCoordinates=TFCoords,
distance=50000,
strand1=HiCstr1,
strand2=HiCstr2,
outFile="Gene_TF_HiC"
)
statsHiC
## geneCoverage peakCoverage
## 1 0.8064516 0.1721698
addColumn2geneWiseAssociation
function adds information regarding genes and peaks to the tables. In the first example, this function adds log fold changes values of gene expression WT vs MEIS KO to gene.tsv
and link.tsv
tables. In the second example, this function adds MEIS binding intensities to peak.tsv
and link.tsv
tables.
# Add log fold changes of gene expressoin
addColumn2geneWiseAssociation(type="gene", name=geneDENames,
columnName="Expr_logFC",
column=geneDElogFC,
inFile="Gene_TF_50K",
outFile="Gene_TF_50K"
)
# Add binding intensities of MEIS
addColumn2geneWiseAssociation(type="peak",
coordinates=TFCoords,
columnName="Binding_Intensities",
column=tfIntensities,
inFile="Gene_TF_50K",
outFile="Gene_TF_50K"
)
addRelation2geneWiseAssociation
function adds a column of information only to link.tsv
table. In the following example, HiC values between genes and peak regions are added to the link.tsv
table.
HiCTable <- read.table(HiCFile, header=TRUE)
HiCintensities <- HiCTable$intensities
addRelation2geneWiseAssociation(strand1=HiCstr1,
strand2=HiCstr2,
columnName="HiC_Intensities",
column=HiCintensities,
inFile="Gene_TF_50K",
outFile="Gene_TF_50K"
)
sessionInfo()
## R Under development (unstable) (2025-03-13 r87965)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.2 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.21-bioc/R/lib/libRblas.so
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0 LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: America/New_York
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] Site2Target_0.99.5 BiocStyle_2.35.0
##
## loaded via a namespace (and not attached):
## [1] httr_1.4.7 cli_3.6.4 knitr_1.50
## [4] rlang_1.1.5 xfun_0.51 UCSC.utils_1.3.1
## [7] generics_0.1.3 jsonlite_2.0.0 S4Vectors_0.45.4
## [10] htmltools_0.5.8.1 sass_0.4.9 stats4_4.6.0
## [13] rmarkdown_2.29 evaluate_1.0.3 jquerylib_0.1.4
## [16] MASS_7.3-65 fastmap_1.2.0 yaml_2.3.10
## [19] IRanges_2.41.3 lifecycle_1.0.4 GenomeInfoDb_1.43.4
## [22] bookdown_0.42 BiocManager_1.30.25 compiler_4.6.0
## [25] XVector_0.47.2 digest_0.6.37 R6_2.6.1
## [28] GenomeInfoDbData_1.2.14 GenomicRanges_1.59.1 bslib_0.9.0
## [31] tools_4.6.0 BiocGenerics_0.53.6 cachem_1.1.0
Krämer, A., J. Green, J. Jr. Pollard, and S. Tugendreich. 2014. “Causal Analysis Approaches in Ingenuity Pathway Analysis.” Bioinformatics 30 (4): 523–30. https://doi.org/10.1093/bioinformatics/btt703.
Lawrence, M., W. Huber, H. Pagès, P. Aboyoun, M. Carlson, R. Gentleman, M. Morgan, and V. Carey. 2013. “Software for Computing and Annotating Genomic Ranges.” PLoS Computational Biology 9 (8). https://doi.org/10.1371/journal.pcbi.1003118.
Venables, W. N., and B. D. Ripley. 2002. Modern Applied Statistics with S. Fourth. New York: Springer. https://www.stats.ox.ac.uk/pub/MASS4/.