--- title: "diffloop: Identifying differential DNA loops from ChIA-PET data" author: "Caleb Lareau & Martin Aryee" date: "`r Sys.Date()`" output: rmarkdown::html_vignette: fig_caption: yes vignette: > %\VignetteIndexEntry{diffloop: Identifying differential DNA loops from ChIA-PET data} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, echo=FALSE, message=FALSE, warning=FALSE} library(diffloop) library(diffloopdata) library(ggplot2) library(GenomicRanges) ``` ## Preprocessing The raw FASTQ read files were preprocessed with the `dnaloop` (https://github.com/aryeelab/dnaloop) package, a `PyPi` package that relies on `samtools`, `bedtools`, `cutadapt`, and `MACS2` to align reads, call anchor peaks, and summarize PETs per sample in the ChIA-PET experiment.

To use the `loopsMake` function from a different preprocessing step, have files `X.loop_counts.bedpe`,`Y.loop_counts.bedpe`, `Z.loop_counts.bedpe` in `bed_dir` for `samples = (X,Y,Z)` where the first lines should resemble:

1 10002272 10004045 10 120968807 120969483 . 1
1 10002272 10004045 10 99551498 99552470 . 1
1 10002272 10004045 1 10002272 10004045 . 17


where the first three columns specify the position (chr:start:stop) of the first anchor, the second three columns specify the position (chr:start:stop) of the secnd anchor, the 7th column is "." and the 8th column is the number of paired-end reads support that particular PET. ## Loading data We read in data from the preprocessing pipeline using the `loopsMake` function. The example below uses sample data included in the `diffloopdata` package. It contains interactions involving chromosome 1 from a set of six Cohesin ChIA-PET samples.$^{1,2}$ You would normally set `bed_dir` to your real data directory. ``` {r load_data} bed_dir <- system.file("extdata", "esc_jurkat", package="diffloopdata") bed_dir samples <- c("naive_esc_1", "naive_esc_2", "primed_esc_1", "primed_esc_2", "jurkat_1", "jurkat_2") ``` Here, we use 1500 bp as a value to merge nearby anchors into a single peak. In the `loopsMake` function, the default is zero and should be parameterized based on the resolution of the ChIA-PET Experiment. ```{r read_data} full <- loopsMake(bed_dir, samples, 1500) celltypes <- c("naive", "naive", "primed", "primed", "jurkat", "jurkat") full <- updateLDGroups(full, celltypes) head(full, 3) ``` Alternatively, one could run `full <- loopsMake(bed_dir)` and the `samples` would be inferred from the directory contents. The default call to `loopsMake` does not merge nearby anchors but instead keeps them distinct. If we want to group nearby anchors after the initial data read, one can run the `mergeAnchors` command with the appropriate gap value.

We'll now look at a PCA plot of the raw data: ```{r pca_prenorm, fig.width=7, fig.height=4, fig.cap ="PCA plot of**un-normalized** ChIA-PET loop data"} pcaPlot(full) + geom_text(aes(label=samples)) ``` From this initial PCA plot, it's not clear how the samples separate based on their topologies, suggesting some quality control may be warranted. ## QC First, we normalize the `loops` object, which updates the `sizeFactor` of each sample by the number of reads identical to the normalization technique employed in `DESeq2`. We recommend computing the size factors on the original `loops` object to generate the best normalization factors. ```{r sizeFactors} full <- calcLDSizeFactors(full) head(full,3) ``` Looking at the full dataset, we can see that a majority of the counts were within self-loops: ```{r qc} loopMetrics(full) ``` We define "valid" looping in a combined experiment as those supported by at least 3 replicates in 2 or more samples. We can generate valid loops by using the `filterLoops` command ```{r qc2} full_5kb <- filterLoops(full, width=5000, nreplicates=3, nsamples = 1) dim(full_5kb) valid_full <- filterLoops(full, width=5000, nreplicates=3, nsamples = 2) dim(valid_full) ``` We can also see the proportion of loops that occur within the same chromosome (intrachromosomal) or have loops that bridge chromosomes (interchromosomal). ```{r qc3} dim(intrachromosomal(valid_full)) dim(interchromosomal(valid_full)) ``` Finally, we can determine the number of anchors supported by each sample as well as the loops with different thresholds of counts. ```{r qc4} numLoops(valid_full,1:5) numAnchors(valid_full) ``` ## Post-Normalization ```{r pca_postnorm, fig.width=7, fig.height=4, fig.cap = "PCA plot of normalized ChIA-PET loop data"} pcaPlot(valid_full) + geom_text(aes(label=samples)) + expand_limits(x = c(-20, 34)) ``` Here, the PC plot is much better as the first principal component separates the stem cells and the jurkat samples while the second PC separates the two stem cell samples. ## Differential Looping Here, we describe the most general association method first fitting the model then doing pairwise comparisons by our choice of parameterizing the loopTest function. ```{r geneAssoc} # Fit edgeR Models Pairwise between cell types fit <- loopFit(valid_full) jn_res <- loopTest(fit,coef=2) jp_res <- loopTest(fit,coef=3) np_res <- loopTest(fit,contrast=c(0,-1,1)) head(np_res) ``` We could also have computed pairwise associations by subsetting the data and using quickAssoc, which only allows loops objects with two group classes. ```{r quickAssoc} np <- valid_full[,1:4] summary(np[1:3,]) np_res2 <- quickAssoc(np) ``` Instead of determining the loops that distinguish these conditions from each other pairwise, this framework allows us to determine the loops that are significantly different between all of these conditions: ```{r assocResults} jnp_res <- loopTest(fit,coef=2:3) head(jnp_res) ``` ## Differential Looping Regions What regions are significantly different between jurkat and naive? We can answer this by doing a sliding window test by specifying the window size and the step size in the `slidingWindowTest` function. ```{r swAssoc} sw_jn <- slidingWindowTest(jn_res,200000,200000) head(sw_jn) ``` What gene bodies have significantly different looping between the cell lines? Whereas the `slidingWindowTest` makes no prior assumptions about the regions to be tested, `featureTest` requires a `GRanges` object of coordinates to determine what regions will be tested for association. ```{r featAssoc} chr1 <- getHumanGenes(c("1")) ft_jnp <- featureTest(jnp_res,chr1) head(ft_jnp,10) ``` ## Quantifying Associations How many differential loops are present pairwise between these conditions? ```{r pairwiseSum, message=FALSE, warning=FALSE} np_sig_res <- topLoops(np_res, FDR = 0.2) dim(np_sig_res) dim(topLoops(jp_res, FDR = 0.2)) dim(topLoops(jn_res, FDR = 0.2)) summary(np_sig_res) ``` At FDR < 0.2, only 7 loops are significantly different from the naive and primed stem cells whereas the jurkat differs between the primed stem cells by 169 significant loops and the naive by 208 significant loops on chromosome 1. These results are in line with the underlying biology of these cell types. \newline \newline Now we can visualize a PC plot using only loops that are significantly different between celltypes ```{r pca_sigfix, fig.width=7, fig.height=4, fig.cap = "PCA plot of differential loops"} pcaPlot(topLoops(jnp_res, FDR = 0.05)) + geom_text(aes(label=samples)) ``` Here, we see a much tighter cluster of the cell types, which may be expected since the only variables introduced into the principal components are the loops that are statistically different between the cell types. ## Loop Plots We can easily visualize the looping structure of a region specfied by a `GRanges` object in each sample. We also afford the annotation of the region with the genes for a particular organism. Currently, mouse and human (default) are supported. ```{r dloplots2,fig.width=7, fig.height=10} regA <- GRanges(c("1"),ranges=IRanges(c(36000000),c(36300000))) full.plot <- loopPlot(jn_res, regA) ``` $^1$Hnisz, Denes, et al. "Activation of proto-oncogenes by disruption of chromosome neighborhoods." Science (2016): aad9024.

$^2$Ji, Xiong, et al. "3D Chromosome Regulatory Landscape of Human Pluripotent Cells." Cell stem cell (2015).