crupR Vignette

Persia Akbari Omgba1*, Verena Laupert and Martin Vingron2

1Department of Genome Regulation, Max Planck Institute for Molecular Genetics, Berlin, 14195, Germany
2Department of Computational Molecular Biology, Max Planck Institute for Molecular Genetics, Berlin, 14195, Germany

*omgba@molgen.mpg.de

15 April 2025

Abstract

crupR is the R package implementation of the enhancer prediction pipeline CRUP (Condition-specific Regulatory Units Prediction) (Ramisch, A., Heinrich, V., Glaser, L.V. et al.). It aims to facilitate a streamlined analysis of histone modification ChIP-seq data in R and offers four main steps and two side functions. The main steps include normalization of ChIP-seq counts, application of a machine learning based approach to predict enhancer activity, grouping of condition-specific enhancer clusters, and detection of putative target genes. Additionally, it improves the original approach for the identification of enhancer clusters by utilising a sliding window method for the pairwise comparisons of conditions.

Package

crupR 1.0.0

1 Installation
2 Overview
3 Getting started
- 3.1 What will I need?
- 3.2 Prepare the metadata file
4 Run the crupR pipeline
5 Exporting the files
6 Session Info

1 Installation

if(!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("crupR")

2 Overview

crupR is a pipeline which aims to efficiently combine all its functions into a simple workflow. The main steps are:

normalize() - Input normalizes the enrichment counts derived from histone modification (HM) ChIP-seq experiments
getPrediction() - Uses the normalized ChIP-seq counts to predict the active enhancers by applying a random forest based classifier
getDynamics() - Uses the predicted probabilities to find dynamically changing, condition-specific enhancer clusters
getTargets() - Uses RNA-seq experiments to find possible target genes for the dynamic enhancers

Beyond that, crupR offers an additional function to summarize the enhancer probabilities into single enhancers/enhancer peaks and to detect super-enhancer(super-enhancers), clusters of proximal enhancers,(getSE()), a function to visualize the condition-specific enhancer clusters (plotSummary()) and a function (saveFiles()) to export all the objects produced by each crupR step as appropriate formats (i.e. BED, BigWig or bedGraph).

For a more detailed description of the computational method behind every step, please check the original CRUP publication: https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1860-7

3 Getting started

3.1 What will I need?

In order to run crupR, you need ChIP-seq experiments for the histone modifications H3K27ac, H3K4me1 and H3K4me3. Additionally, it is recommended to also include an input experiment for those experiments. However, if input experiments aren’t available, you can run crupR without them. The samples need to be in the BAM format and indexed.

3.2 Prepare the metadata file

The metadata contains the paths to all the necessary BAM files. It also contains further information about them, such as condition, replicate, histone modification and input file.

Let’s look at the metadata for the example files. First, we’ll need the paths to the BAM files of the histone modifications and the input experiments.

library(crupR)
files <- c(system.file("extdata", "Condition1.H3K4me1.bam", package="crupR"),
           system.file("extdata", "Condition1.H3K4me3.bam", package="crupR"),
           system.file("extdata", "Condition1.H3K27ac.bam", package="crupR"),
           system.file("extdata", "Condition2.H3K4me1.bam", package="crupR"),
           system.file("extdata", "Condition2.H3K4me3.bam", package="crupR"),
           system.file("extdata", "Condition2.H3K27ac.bam", package="crupR"))
                     
inputs <- c(rep(system.file("extdata", "Condition1.Input.bam",
                            package="crupR"), 3), 
            rep(system.file("extdata", "Condition2.Input.bam", 
                            package="crupR"), 3))

Now, we can build the data frame with the metadata by adding the information about the histone modification, condition and replicate for each file. In this example there is only one replicate per condition, thus all of the files get replicate “1”.

metaData <- data.frame(HM = rep(c("H3K4me1","H3K4me3","H3K27ac"),2),
                       condition = c(1,1,1,2,2,2), replicate = c(1,1,1,1,1,1),
                       bamFile = files, inputFile = inputs)

metaData
#>        HM condition replicate
#> 1 H3K4me1         1         1
#> 2 H3K4me3         1         1
#> 3 H3K27ac         1         1
#> 4 H3K4me1         2         1
#> 5 H3K4me3         2         1
#> 6 H3K27ac         2         1
#>                                                                    bamFile
#> 1 /tmp/RtmpGZ0ie1/Rinst168b1e66d7f627/crupR/extdata/Condition1.H3K4me1.bam
#> 2 /tmp/RtmpGZ0ie1/Rinst168b1e66d7f627/crupR/extdata/Condition1.H3K4me3.bam
#> 3 /tmp/RtmpGZ0ie1/Rinst168b1e66d7f627/crupR/extdata/Condition1.H3K27ac.bam
#> 4 /tmp/RtmpGZ0ie1/Rinst168b1e66d7f627/crupR/extdata/Condition2.H3K4me1.bam
#> 5 /tmp/RtmpGZ0ie1/Rinst168b1e66d7f627/crupR/extdata/Condition2.H3K4me3.bam
#> 6 /tmp/RtmpGZ0ie1/Rinst168b1e66d7f627/crupR/extdata/Condition2.H3K27ac.bam
#>                                                                inputFile
#> 1 /tmp/RtmpGZ0ie1/Rinst168b1e66d7f627/crupR/extdata/Condition1.Input.bam
#> 2 /tmp/RtmpGZ0ie1/Rinst168b1e66d7f627/crupR/extdata/Condition1.Input.bam
#> 3 /tmp/RtmpGZ0ie1/Rinst168b1e66d7f627/crupR/extdata/Condition1.Input.bam
#> 4 /tmp/RtmpGZ0ie1/Rinst168b1e66d7f627/crupR/extdata/Condition2.Input.bam
#> 5 /tmp/RtmpGZ0ie1/Rinst168b1e66d7f627/crupR/extdata/Condition2.Input.bam
#> 6 /tmp/RtmpGZ0ie1/Rinst168b1e66d7f627/crupR/extdata/Condition2.Input.bam

After creating the data frame with the metadata, you are ready to run crupR.

4 Run the crupR pipeline

4.1 Step 0: Normalize the ChIP-seq counts

This is a preparatory function that generates input-normalized genome-wide enrichment profiles of the 3 HMs and computes the ratio of H3K4me1 and H3K4me3 counts for a genome of choice. The genome builds mm9, mm10, hg19 and hg38 are already supported by normalize() and custom genomes can be used if provided by the user as a Seqinfo object. normalize() first creates a binned genome object (binsize = 100bp) and then computes the ChIP-seq counts for every bin derived from the BAM files. Low quality reads are filtered during this process (default min. quality: 10). For the input-normalization the log2 fold change between the ChIP-seq counts of the histone modifications and of the input experiments is formed: \(Count^{HM}_{norm}(bin) = log_2 \frac{Count^{HM}_{raw}(bin)}{Count^{Input}_{norm}(bin)}\). The ratio is computed as \(ratio(bin) = log2 \frac{Count^{H3K4me1}_{norm}(bin) + |min^{H3K4me1}_{norm}| + 1}{Count^{H3K4me3}_{norm}(bin) + |min^{H3K4me1}_{norm}| + 1}\), with \(min^{HM}\) being the minimum input-normalized count over all bins for histone modification \(HM\). The ratio column is necessary as it is used as a feature for the classifier in the next step getEnhancers().

The function needs to be run for every sample which is identified by specifying condition and replicate.

normalized_1 <- normalize(metaData = metaData, condition = 1, replicate = 1, 
                          genome = "mm10", sequencing = "paired")
#> 
#> Prepare the binned genome ...
#> Get summarized counts from ChIP-seq experiments ...
#> Normalize histone modifications by Input ...
#> Create summarized data matrix ...
#> time: 2.138133 mins
normalized_2 <- normalize(metaData = metaData, condition = 2, replicate = 1, 
                          genome = "mm10", sequencing = "paired")
#> 
#> Prepare the binned genome ...
#> Get summarized counts from ChIP-seq experiments ...
#> Normalize histone modifications by Input ...
#> Create summarized data matrix ...
#> time: 1.658465 mins

The output of normalize() is a GRanges object containing the normalized ChIP-seq counts and the H3K4me1/H3K4me3 ratio for the binned genome of choice and the metadata of the samples that were normalized.

normalized_1 #the object with the normalized counts
#> GRanges object with 26337756 ranges and 4 metadata columns:
#>              seqnames              ranges strand |   H3K4me1   H3K4me3
#>                 <Rle>           <IRanges>  <Rle> | <numeric> <numeric>
#>          [1]     chr1               1-100      * |         0         0
#>          [2]     chr1             101-200      * |         0         0
#>          [3]     chr1             201-300      * |         0         0
#>          [4]     chr1             301-400      * |         0         0
#>          [5]     chr1             401-500      * |         0         0
#>          ...      ...                 ...    ... .       ...       ...
#>   [26337752]     chrX 171030701-171030800      * |         0         0
#>   [26337753]     chrX 171030801-171030900      * |         0         0
#>   [26337754]     chrX 171030901-171031000      * |         0         0
#>   [26337755]     chrX 171031001-171031100      * |         0         0
#>   [26337756]     chrX 171031101-171031200      * |         0         0
#>                H3K27ac     ratio
#>              <numeric> <numeric>
#>          [1]         0 -0.228071
#>          [2]         0 -0.228071
#>          [3]         0 -0.228071
#>          [4]         0 -0.228071
#>          [5]         0 -0.228071
#>          ...       ...       ...
#>   [26337752]         0 -0.228071
#>   [26337753]         0 -0.228071
#>   [26337754]         0 -0.228071
#>   [26337755]         0 -0.228071
#>   [26337756]         0 -0.228071
#>   -------
#>   seqinfo: 20 sequences from GRCm38.p6 genome
S4Vectors::metadata(normalized_1) #meta data of the samples
#>        HM condition replicate
#> 1 H3K4me1         1         1
#> 2 H3K4me3         1         1
#> 3 H3K27ac         1         1
#>                                                                    bamFile
#> 1 /tmp/RtmpGZ0ie1/Rinst168b1e66d7f627/crupR/extdata/Condition1.H3K4me1.bam
#> 2 /tmp/RtmpGZ0ie1/Rinst168b1e66d7f627/crupR/extdata/Condition1.H3K4me3.bam
#> 3 /tmp/RtmpGZ0ie1/Rinst168b1e66d7f627/crupR/extdata/Condition1.H3K27ac.bam
#>                                                                inputFile
#> 1 /tmp/RtmpGZ0ie1/Rinst168b1e66d7f627/crupR/extdata/Condition1.Input.bam
#> 2 /tmp/RtmpGZ0ie1/Rinst168b1e66d7f627/crupR/extdata/Condition1.Input.bam
#> 3 /tmp/RtmpGZ0ie1/Rinst168b1e66d7f627/crupR/extdata/Condition1.Input.bam

By default, the files are input normalized. However, if input files are not available, crupR also offers the possibility of an input free mode. In this case, the raw counts are returned and used for the ratio.

normalized_1_inputFree <- normalize(metaData = metaData, condition = 1, 
                                    replicate = 1, genome = "mm10", 
                                    sequencing = "paired", input.free = TRUE)
#> 
#> Prepare the binned genome ...
#> Get summarized counts from ChIP-seq experiments ...
#> Create summarized data matrix ...
#> time: 2.140712 mins
normalized_1_inputFree
#> GRanges object with 26337756 ranges and 4 metadata columns:
#>              seqnames              ranges strand |   H3K4me1   H3K4me3
#>                 <Rle>           <IRanges>  <Rle> | <integer> <integer>
#>          [1]     chr1               1-100      * |         0         0
#>          [2]     chr1             101-200      * |         0         0
#>          [3]     chr1             201-300      * |         0         0
#>          [4]     chr1             301-400      * |         0         0
#>          [5]     chr1             401-500      * |         0         0
#>          ...      ...                 ...    ... .       ...       ...
#>   [26337752]     chrX 171030701-171030800      * |         0         0
#>   [26337753]     chrX 171030801-171030900      * |         0         0
#>   [26337754]     chrX 171030901-171031000      * |         0         0
#>   [26337755]     chrX 171031001-171031100      * |         0         0
#>   [26337756]     chrX 171031101-171031200      * |         0         0
#>                H3K27ac     ratio
#>              <integer> <numeric>
#>          [1]         0         0
#>          [2]         0         0
#>          [3]         0         0
#>          [4]         0         0
#>          [5]         0         0
#>          ...       ...       ...
#>   [26337752]         0         0
#>   [26337753]         0         0
#>   [26337754]         0         0
#>   [26337755]         0         0
#>   [26337756]         0         0
#>   -------
#>   seqinfo: 20 sequences from GRCm38.p6 genome

Additionally, crupR offers the possibility to reduce the analysis to chromosomes specified by the user, if other chromosomes are not of interest. This setting can improve runtime and yields a smaller GRanges object. To specify the chromosomes the parameter chroms needs to be set to a vector that contains the relevant chromosome names. Please note that the style of the chromosome names needs to match the style of the chromosome names in the respective BAM file. For example, if a BAM file uses the prefix “chr”, then the chromosome names in the vector also need to include the prefix. For the subsequent pipeline steps only the specified chromosomes will be considered. Here is an example:

normalized_1_chr8 <- normalize(metaData = metaData, condition = 1, 
                               replicate = 1, genome = "mm10", 
                               sequencing = "paired", chroms = c("chr8"))
#> 
#> Prepare the binned genome ...
#> Get summarized counts from ChIP-seq experiments ...
#> Normalize histone modifications by Input ...
#> Create summarized data matrix ...
#> time: 3.195976 secs

4.2 Step 1: Predict active enhancers with crupR

Now, the normalized ChIP-seq counts can be used to predict the occurrence of active enhancers on the genome of your choice.

prediction_1 <- getEnhancers(data = normalized_1)
#> 
#> Get classifier and empirical distribution function ...
#> Quantile normalize counts for features ...
#> Extend data matrix ...
#> Predict enhancers for each 100 bp bin ...
#> time: 56.80134 secs
prediction_2 <- getEnhancers(data = normalized_2)
#> 
#> Get classifier and empirical distribution function ...
#> Quantile normalize counts for features ...
#> Extend data matrix ...
#> Predict enhancers for each 100 bp bin ...
#> time: 54.24032 secs

By default, this function uses a classifier consisting of two random forests to predict the probability of a bin being an active enhancer. They use the histone modification patterns of the current bin and the 5 flanking bins (11 bins in total) to make their predictions. One classifier has been trained to predict the probability of the bin being active (\(P(bin = active)\)) and the other has been trained to distinguish between active enhancers and promoters. So, it predicts the probability of a bin being an enhancer, given that it is active (\(P(bin + enhancer| bin = active)\)). The final predictions is the product of the two probabilities.

The two default random forest classifiers provided by crupR are objects of class ‘randomForest’ as implemented by the randomForest R package. It is possible to use custom classifiers instead. They must be objects of the same class (saved as RDS objects). For more information on the format, please check the documentation of the function randomForest() of the randomForest package. Additionally, the custom classifiers should be trained to predict the same cases and use the same feature sets as the default random forests.

getEnhancers() expects a string containing the directory in which the two classifiers are stored. The two classifiers also need to be named “active_vs_inactive.rds” and “enhancer_vs_active_promoter.rds” respectively.

prediction_1_own_class <- getEnhancers(data = normalized_1, 
                                       classifier = "path/to/classifier")

The output of this step is a GRanges object containing the binned genome with the enhancer prediction values for each bin (and the same metadata as the input object). By adding the parameter setting all = TRUE, the GRanges object will also contain the individual probabilities computed by the two classifiers.

4.3 Step 1.5: Find enhancer peaks and super enhancers with crupR

As optional intermediate step, crupR offers the getSE() function. In this step, the genome-wide enhancer probabilities computed in the prior step (getEnhancers()) are summarized into single enhancer peaks. The enhancer peaks are then further grouped together into super-enhancers. Super-enhancers are defined as large domains with a high enhancer density. Here, spatially proximate enhancer peaks are considered to form super-enhancers.

To find the enhancer peaks, all bins with an enhancer probability over a certain threshold \(t\) (default \(0.5\)) are ranked according to their probability. The bins are then expanded by 5 bins on every side, resulting in enhancer regions of size 1100bp (11 bins). In case several enhancer regions are overlapping, the higher-ranking region is chosen and the remaining ones are discarded. The parameter cutoff can be used to set the probability threshold and i.e. make the enhancer peak calling process stricter. called single enhancers within a certain distance from each other are then clustered together as super enhancers (default distance: 12500bp). The parameter distance can be used to adjust the maximum distance between enhancers.

se <- getSE(data = prediction_2)
#> 
#> 2 single enhancer peak(s) found.
#> 1 enhancer cluster(s) found.
#> time: 1.128917 secs
se_strict <- getSE(data = prediction_2, cutoff = 0.9)
#> 
#> 0 single enhancer peak(s) found.
#> 0 enhancer cluster(s) found.
#> time: 0.9404922 secs
se_close <- getSE(data = prediction_2, distance=10000)
#> 
#> 2 single enhancer peak(s) found.
#> 1 enhancer cluster(s) found.
#> time: 1.061118 secs

Here, increasing the threshold from 0.5 to 0.9 resulted in less enhancer peaks being identified.

The output of this step is a list containing the same GRanges object that was produced by getEnhancers(), a GRanges object containing the single enhancer peak calls (can be saved as a bedGraph file) and a GRanges object containing the super-enhancers or rather peak clusters (can be saved as a BED file).

This step is not necessary for the next ones and can be skipped if one is not interested in the single enhancers or super-enhancers.

4.4 Step 2: Find conditon-specific enhancer clusters

The crupR function getDynamics() defines dynamic enhancer regions by applying a Kolmogorov-Smirnov (KS) test directly on the enhancer probabilities in a pairwise manner.

Before running the actual function, the predictions of the prior step must all be put into one list.

predictions <- list(prediction_1, prediction_2)

Now, everything is ready for getDynamics() to run.

clusters <- getDynamics(data = predictions)
#> 
#> Merge enhancer probabilites for all samples and conditions ...
#> Calculate pairwise p-values ...
#> Get condition-specific enhancer peaks ...
#> Combine peaks by significance pattern ...
#> time: 2.040225 mins

The output of this step is a GRanges objects with the condition-specific enhancers (and the full meta data of the samples).

#clusters
clusters
#> GRanges object with 1 range and 3 metadata columns:
#>     seqnames            ranges strand |     cluster   cond1_1   cond2_1
#>        <Rle>         <IRanges>  <Rle> | <character> <numeric> <numeric>
#>   1     chr8 26843401-26846500      * |          c1     0.273      0.86
#>   -------
#>   seqinfo: 20 sequences from GRCm38.p6 genome

#meta data
S4Vectors::metadata(clusters)
#>        HM condition replicate
#> 1 H3K4me1         1         1
#> 2 H3K4me3         1         1
#> 3 H3K27ac         1         1
#> 4 H3K4me1         2         1
#> 5 H3K4me3         2         1
#> 6 H3K27ac         2         1
#>                                                                    bamFile
#> 1 /tmp/RtmpGZ0ie1/Rinst168b1e66d7f627/crupR/extdata/Condition1.H3K4me1.bam
#> 2 /tmp/RtmpGZ0ie1/Rinst168b1e66d7f627/crupR/extdata/Condition1.H3K4me3.bam
#> 3 /tmp/RtmpGZ0ie1/Rinst168b1e66d7f627/crupR/extdata/Condition1.H3K27ac.bam
#> 4 /tmp/RtmpGZ0ie1/Rinst168b1e66d7f627/crupR/extdata/Condition2.H3K4me1.bam
#> 5 /tmp/RtmpGZ0ie1/Rinst168b1e66d7f627/crupR/extdata/Condition2.H3K4me3.bam
#> 6 /tmp/RtmpGZ0ie1/Rinst168b1e66d7f627/crupR/extdata/Condition2.H3K27ac.bam
#>                                                                inputFile
#> 1 /tmp/RtmpGZ0ie1/Rinst168b1e66d7f627/crupR/extdata/Condition1.Input.bam
#> 2 /tmp/RtmpGZ0ie1/Rinst168b1e66d7f627/crupR/extdata/Condition1.Input.bam
#> 3 /tmp/RtmpGZ0ie1/Rinst168b1e66d7f627/crupR/extdata/Condition1.Input.bam
#> 4 /tmp/RtmpGZ0ie1/Rinst168b1e66d7f627/crupR/extdata/Condition2.Input.bam
#> 5 /tmp/RtmpGZ0ie1/Rinst168b1e66d7f627/crupR/extdata/Condition2.Input.bam
#> 6 /tmp/RtmpGZ0ie1/Rinst168b1e66d7f627/crupR/extdata/Condition2.Input.bam

4.4.1 The pairwise comparisons

To identify condition-specific (or differential) enhancers, the genome-wide probabilities of the samples are first merged per condition. Merging is performed by taking the mean of all replicates in one condition and normalizing it by the variance. Then, the genome is scanned with a sliding window approach. The window contains the current bin and the \(w\) flanking bins, resulting in having the size \(2w +1\) (default: \(w = 10\)). For every possible pair of conditions, the merged enhancer probabilities within the window are compared. If they have a mean difference surpassing a threshold (default: 0.5) a KS test is performed on the two enhancer distributions. The resulting p-values are corrected using the Bonferroni method.

4.4.2 The clustering

Regions with significant p-values (default: \(\alpha = 0.05\)) are grouped according to their activity patterns. These patterns are derived from the pairwise comparisons and encoded in a binary manner. For every pairwise combination between conditions there are two bits in the final pattern code, one in case the enhancer is activate in the first condition but not in the second and one for the opposite case. For the list of differential enhancers resulting from the prior step, the results of every pairwise comparison is checked: If the enhancer probabilities show no significant differences, the two bits in the activity pattern code are set to 0. If they show significant differences according to the KS test, the direction is inferred by using the sign of the difference between the mean probabilities of the enhancer in the two conditions. Enhancers with the same patterns are grouped together and the clusters are enumerated.

The window size \(w\), the threshold for the mean probability difference and the significance level \(\alpha\) can be manipulated by using the parameters W, w_0 and cutoff respectively.

4.4.3 Visualization of the clusters using `plotSummary()`

As trying to decode the patterns for each cluster can be complicated and time-consuming, crupR offers the visualization function plotSummary() for the detected enhancer clusters. The function shows the boxplots of the maximum probabilities of the condition-specific enhancers in each condition for every cluster. This is a simple way of seeing which enhancers are active in which conditions.

Let’s take a look at the summary plot of the example files.

crupR::plotSummary(clusters)

In the example, there is only one cluster containing one enhancer region. The plot shows the maximum enhancer probability of this enhancer in both conditions. The enhancer seems to be active in condition 2, but not in condition 1.

4.5 Step 3: Find target genes of the condition-specific enhancer clusters

The last step in the pipeline is the identification of the target genes that are regulated by the condition-specific enhancers found in the previous step. In this step, the expression counts of a set of candidate target genes are correlated to the enhancer probability of the differential enhancers. If the correlation surpassed a threshold (default: 0.9) the gene is considered as a target. There are two possible settings for defining the candidate target genes. Either genes falling within the same topologically associated domain (TAD) as the enhancer are considered as candidates or the nearest gene is used.

4.5.1 Gene expression counts

For the function getTargets() the output of the step before is not sufficient. Additionally, crupR requires the annotated gene expression counts for each condition and replicate in a SummarizedExperiment object.

crupR offers such an expression file for the the example sets.

expression <- readRDS(file = system.file("extdata", "expressions.rds", 
                                         package="crupR"))
expression
#> class: RangedSummarizedExperiment 
#> dim: 1028 2 
#> metadata(0):
#> assays(1): expression
#> rownames(1028): 100037278 100038489 ... 97487 99375
#> rowData names(0):
#> colnames(2): cond1_1 cond2_1
#> colData names(0):

However, there is no in-built function that calculates the gene expression counts from the RNA-seq experiments. That means the user has to generate this object by themselves. Please make sure that the column names for the columns that contain the gene expression counts of each sample are the same as the column names of the columns that contain the enhancer probabilities of each sample (the getDynmics() output), i.e. “cond1_1”, “cond1_2”, “cond2_1” and so on. Additionally, the start and end sites of every gene needs to be included in the object.

4.5.2 Running `getTargets()`

Looking for the enhancer targets requires - besides the gene expression counts - a BED file containing TADs for the genome of your choice. In case mm10 is used, crupR provides a suitable BED file which is also used by default for mm10 data. In this case, simply running the code below is sufficient:

targets <- crupR::getTargets(data=clusters, expr = expression, genome = "mm10")
#> 
#> time: 0.2383614 secs

In case another genome is used or a different BED file is preferred, providing the path to this BED file is necessary.

path_to_bed <- system.file("extdata", "mESC_mapq30_KR_all_TADs.bed", 
                           package="crupR")
targets <- getTargets(data = clusters, expr = expression, 
                             genome = "mm10", TAD.file = path_to_bed)
#> 
#> time: 0.1987033 secs

4.5.2.1 Using the nearest genes

By default crupR chooses the genes that are in the same TAD as the enhancer cluster as candidate genes. Then, the expression values of those candidate genes and enhancer probabilities are correlated to identify putative target genes. However, crupR also offers an alternative approach that identifies the nearest gene as target gene. This approach does not require TAD regions or gene expression counts and is convenient when those are not available.

targets_nearest <- getTargets(data = clusters, expr = NULL, 
                              genome = "mm10", nearest = TRUE)
#> 
#> Warning in getTargets(data = clusters, expr = NULL, genome = "mm10", nearest =
#> TRUE): No gene expression counts provided... setting nearest to TRUE
#> time: 1.046195 secs

4.5.2.2 Output

The output of getTargets() is a GRanges object containing the computed units.

targets
#> GRanges object with 2 ranges and 9 metadata columns:
#>          seqnames            ranges strand |     cluster   cond1_1   cond2_1
#>             <Rle>         <IRanges>  <Rle> | <character> <numeric> <numeric>
#>   353310     chr8 26843401-26846500      * |          c1     0.273      0.86
#>    72316     chr8 26843401-26846500      * |          c1     0.273      0.86
#>                 TAD_COORDINATES CORRELATED_GENE CORRELATED_GENE_CHR
#>                     <character>     <character>            <factor>
#>   353310 chr8:26200000-27000000          353310                chr8
#>    72316 chr8:26200000-27000000           72316                chr8
#>          CORRELATED_GENE_PROMOTER_START CORRELATED_GENE_PROMOTER_END
#>                               <integer>                    <integer>
#>   353310                       26975325                     26977524
#>    72316                       26882582                     26880383
#>          CORRELATION
#>            <numeric>
#>   353310           1
#>    72316           1
#>   -------
#>   seqinfo: 20 sequences from an unspecified genome; no seqlengths

As shown above, there are 9 metadata columns in this GRanges, thus 12 columns in total:

seqnames: chr of the dynamic enhancer region
ranges: start and end of the dynamic enhancer region
strand: strand of the dynamic enhancer region
cluster: associated clusters of the dynamic enhancer region
cond1_1 & cond1_2: the max. probability values for each region per sample
TAD_COORDINATES: the chromosome and coordinates of the TAD in which the dynamic enhancer region is located
CORRELATED_GENE: the ID of the gene that is correlated with the dynamic enhancer region
CORRELATED_GENE_CHR: the chromosome of the gene that is correlated with the dynamic enhancer region
CORRELATED_GENE_PROMOTER_START: start of the promoter of the gene that is correlated with the dynamic enhancer region
CORRELATED_GENE_PROMOTER_END: end of the promoter of the gene that is correlated with the dynamic enhancer region
CORRELATION: correlation value

5 Exporting the files

After running the crupR pipeline, you might want to save the results. crupR provides the function saveFiles()that exports the different files in suitable formats.

In total, there are 6 possible formats: “bigWig”, “bedGraph”, “rds”, “bed”, “beds” and “UCSC”. However, the format to export the object to depends on the crupR step:

Output of getEnhancers(): The GRanges file that is produced in this step can be saved as bigWig file or an .rds file. Thus, the options “bigWig” and/or “rds” can be chosen.
Output of getSE(): This step produces two additional GRanges objects - single enhancer peak calls and peak clusters/super enhancers. The single enhancer peak calls can be export as a bedGraph file and the peak clusters can be exported as a BED file.
Output of getDynamics(): The GRanges object produced in this step can be exported as multiple BED files (“beds”). Each BED file contains the dynamic enhancer regions of one cluster.
Output of getTargets(): The GRanges object produced in this step can be exported in the UCSC interaction format (“UCSC”).

out_dir <- file.path(tempdir(), 'crupR') #let's use a temporary direcotry for the outputs
dir.create(out_dir) #create the directory
#save the GRanges object of the getEnhancers() step
saveFiles(data = prediction_1, modes = c("bigWig", "rds"), outdir = out_dir)
#save the GRanges object of the getSE() step
saveFiles(data = se, modes = c("bedGraph", "bed"), outdir = out_dir)
#save the GRanges object of the getDynamics() step
saveFiles(data = clusters, modes = "beds", outdir = out_dir)
#save the GRanges object of the getTargets() step
saveFiles(data = targets, modes = "UCSC", outdir = out_dir)

Note that the UCSC format needs a specification regarding whether or not the “nearest” mode was used when running getTargets().

saveFiles(data = targets_nearest, modes = "UCSC", outdir = out_dir, 
          nearest = TRUE)

The “nearest” parameter is FALSE by default and only needs to be changed when the “UCSC” mode was chosen and the units were predicted in the “nearest” mode.

Note that you could theoretically also use the modes “rds” and “bigWig” for the list se, as the output of getSE()also contains the same GRanges object as the output of getEnhancers(). Also note that “bed” can only be used for a getSE()output, while “beds” is exclusively used for a getDynamics() output.

6 Session Info

sessionInfo()
#> R version 4.5.0 RC (2025-04-04 r88126)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.2 LTS
#> 
#> Matrix products: default
#> BLAS:   /home/biocbuild/bbs-3.21-bioc/R/lib/libRblas.so 
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0  LAPACK version 3.12.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_GB              LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: America/New_York
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] crupR_1.0.0      BiocStyle_2.36.0
#> 
#> loaded via a namespace (and not attached):
#>   [1] tidyselect_1.2.1                         
#>   [2] farver_2.1.2                             
#>   [3] dplyr_1.1.4                              
#>   [4] blob_1.2.4                               
#>   [5] Biostrings_2.76.0                        
#>   [6] bitops_1.0-9                             
#>   [7] fastmap_1.2.0                            
#>   [8] RCurl_1.98-1.17                          
#>   [9] GenomicAlignments_1.44.0                 
#>  [10] bamsignals_1.40.0                        
#>  [11] XML_3.99-0.18                            
#>  [12] digest_0.6.37                            
#>  [13] lifecycle_1.0.4                          
#>  [14] KEGGREST_1.48.0                          
#>  [15] TxDb.Hsapiens.UCSC.hg38.knownGene_3.21.0 
#>  [16] RSQLite_2.3.9                            
#>  [17] magrittr_2.0.3                           
#>  [18] compiler_4.5.0                           
#>  [19] rlang_1.1.6                              
#>  [20] sass_0.4.10                              
#>  [21] tools_4.5.0                              
#>  [22] yaml_2.3.10                              
#>  [23] rtracklayer_1.68.0                       
#>  [24] knitr_1.50                               
#>  [25] labeling_0.4.3                           
#>  [26] S4Arrays_1.8.0                           
#>  [27] bit_4.6.0                                
#>  [28] curl_6.2.2                               
#>  [29] DelayedArray_0.34.0                      
#>  [30] plyr_1.8.9                               
#>  [31] abind_1.4-8                              
#>  [32] BiocParallel_1.42.0                      
#>  [33] withr_3.0.2                              
#>  [34] BiocGenerics_0.54.0                      
#>  [35] grid_4.5.0                               
#>  [36] stats4_4.5.0                             
#>  [37] preprocessCore_1.70.0                    
#>  [38] colorspace_2.1-1                         
#>  [39] ggplot2_3.5.2                            
#>  [40] scales_1.3.0                             
#>  [41] tinytex_0.57                             
#>  [42] SummarizedExperiment_1.38.0              
#>  [43] cli_3.6.4                                
#>  [44] rmarkdown_2.29                           
#>  [45] crayon_1.5.3                             
#>  [46] generics_0.1.3                           
#>  [47] reshape2_1.4.4                           
#>  [48] httr_1.4.7                               
#>  [49] rjson_0.2.23                             
#>  [50] DBI_1.2.3                                
#>  [51] cachem_1.1.0                             
#>  [52] stringr_1.5.1                            
#>  [53] zlibbioc_1.54.0                          
#>  [54] parallel_4.5.0                           
#>  [55] AnnotationDbi_1.70.0                     
#>  [56] BiocManager_1.30.25                      
#>  [57] XVector_0.48.0                           
#>  [58] restfulr_0.0.15                          
#>  [59] matrixStats_1.5.0                        
#>  [60] vctrs_0.6.5                              
#>  [61] Matrix_1.7-3                             
#>  [62] jsonlite_2.0.0                           
#>  [63] bookdown_0.43                            
#>  [64] IRanges_2.42.0                           
#>  [65] S4Vectors_0.46.0                         
#>  [66] bit64_4.6.0-1                            
#>  [67] magick_2.8.6                             
#>  [68] GenomicFeatures_1.60.0                   
#>  [69] jquerylib_0.1.4                          
#>  [70] glue_1.8.0                               
#>  [71] codetools_0.2-20                         
#>  [72] stringi_1.8.7                            
#>  [73] gtable_0.3.6                             
#>  [74] TxDb.Mmusculus.UCSC.mm10.knownGene_3.10.0
#>  [75] GenomeInfoDb_1.44.0                      
#>  [76] BiocIO_1.18.0                            
#>  [77] GenomicRanges_1.60.0                     
#>  [78] UCSC.utils_1.4.0                         
#>  [79] munsell_0.5.1                            
#>  [80] tibble_3.2.1                             
#>  [81] pillar_1.10.2                            
#>  [82] htmltools_0.5.8.1                        
#>  [83] randomForest_4.7-1.2                     
#>  [84] GenomeInfoDbData_1.2.14                  
#>  [85] R6_2.6.1                                 
#>  [86] evaluate_1.0.3                           
#>  [87] Biobase_2.68.0                           
#>  [88] lattice_0.22-7                           
#>  [89] png_0.1-8                                
#>  [90] Rsamtools_2.24.0                         
#>  [91] memoise_2.0.1                            
#>  [92] TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2  
#>  [93] TxDb.Mmusculus.UCSC.mm9.knownGene_3.2.2  
#>  [94] bslib_0.9.0                              
#>  [95] Rcpp_1.0.14                              
#>  [96] SparseArray_1.8.0                        
#>  [97] xfun_0.52                                
#>  [98] MatrixGenerics_1.20.0                    
#>  [99] fs_1.6.6                                 
#> [100] pkgconfig_2.0.3