%
% NOTE -- ONLY EDIT THE .Rnw FILE!!!  The .tex file is
% likely to be overwritten.
%
% \VignetteIndexEntry{A Sample ChIP-Seq analysis workflow}
%\VignetteDepends{IRanges,lattice, TxDb.Mmusculus.UCSC.mm9.knownGene}
%\VignetteKeywords{chip-seq, sequencing, visualization}
%\VignettePackage{chipseq}


\documentclass[11pt]{article}
\title{Some Basic Analysis of ChIP-Seq Data}


\usepackage{times}
\usepackage[text={178mm,230mm},centering]{geometry}
\usepackage{Sweave}

\date{July 23, 2010}

\SweaveOpts{keep.source=TRUE,eps=FALSE,pdf=TRUE,width=9,height=6,prefix.string=figs-islands}
\setkeys{Gin}{width=0.98\textwidth}

\newcommand{\Rpackage}[1]{\textsf{#1}}
\newcommand{\Rfunction}[1]{\texttt{#1}}
\newcommand{\Rclass}[1]{{\textit{#1}}}
\newcommand{\code}[1]{\texttt{#1}}

\begin{document}

\maketitle

\raggedright


Our goal is to describe the use of Bioconductor software to perform
some basic tasks in the analysis of ChIP-Seq data.  We will use
several functions in the as-yet-unreleased \Rpackage{chipseq} package,
which provides convenient interfaces to other powerful packages such
as \Rpackage{ShortRead} and \Rpackage{IRanges}.  We will also use the
\Rpackage{lattice} and \Rpackage{rtracklayer} packages for visualization.

<<setup,echo=TRUE,results=hide>>=

library(chipseq)
library(GenomicFeatures)
library(lattice)

@

\section*{Example data}

\noindent The \code{cstest} data set is included in the
\Rpackage{chipseq} package to help demonstrate its capabilities.  The
dataset contains data for three chromosomes from Solexa lanes, one
from a CTCF mouse ChIP-Seq, and one from a GFP mouse ChIP-Seq.  The
raw reads were aligned to the reference genome (mouse in this case)
using an external program (MAQ), and the results read in using the the
\code{readAligned} function in the \Rpackage{ShortRead}, in
conjunction with a filter produced by the \code{chipseqFilter}
function.  This step filtered the reads to remove duplicates, to
restrict mappings to the canonical, autosomal chromosomes and
ensure that only a single read maps to a given position. A quality
score cutoff was also applied.  The remaining data were reduced to a
set of aligned intervals (including orientation). This saves a great
deal of memory, as the sequences, which are unnecessary, are
discarded. Finally, we subset the data for chr10 to chr12, simply for
convenience in this vignette.

We outline this process with this unevaluated code block:
<<preprocess, eval=FALSE>>=
qa_list <- lapply(sampleFiles, qa)
report(do.call(rbind, qa_list))
## spend some time evaluating the QA report, then procede
filter <- compose(chipseqFilter(), alignQualityFilter(15))
cstest <- GenomicRangesList(lapply(sampleFiles, function(file) {
  as(readAligned(file, filter), "GRanges")
}))
cstest <- cstest[seqnames(cstest) %in% c("chr10", "chr11", "chr12")]
@ 
%
The above step has been performed in advance, and the output has been
included as a dataset in this package. We load it now:
<<>>=
data(cstest)
cstest
@
%
<<convert-cstest, echo=FALSE, eval=FALSE>>=
## code used to convert the GenomeDataList to a GRangesList
cstest <- GenomicRangesList(lapply(cstest, function(gd) {
  gr <- do.call(c, lapply(names(gd), function(chr) {
    pos <- gd[[chr]]
    starts <- c(pos[["-"]] - 23L, pos[["+"]])
    GRanges(chr, IRanges(starts, width = 24), 
            rep(c("-", "+"), elementLengths(pos)))
  }))
}))
@
%
\code{cstest} is an object of class \textit{GRangesList}, and has a
list-like structure, each component representing the alignments from
one lane, as a \textit{GRanges} object of stranded intervals.
<<>>=
cstest$ctcf
@ 
%

\section*{Extending reads}


Solexa gives us the first few (24 in this example) bases of each
fragment it sequences, but the actual fragment is longer.  By design,
the sites of interest (transcription factor binding sites) should be
somewhere in the fragment, but not necessarily in its initial part.
Although the actual lengths of fragments vary, extending the alignment
of the short read by a fixed amount in the appropriate direction,
depending on whether the alignment was to the positive or negative
strand, makes it more likely that we cover the actual site of
interest.
\\

It is possible to estimate the fragment length, through a variety of
methods. There are several implemented by the
\Rfunction{estimate.mean.fraglen} function. Generally, this only needs
to be done for one sample from each experimental protocol. Here, we
use SSISR, the default method:
<<estimate.mean.fraglen>>=
fraglen <- estimate.mean.fraglen(cstest$ctcf, method="correlation")
fraglen[!is.na(fraglen)]
@
%
Given the suggestion of $~190$ nucleotides, we extend all reads to be
200 bases long.  This is done using the \code{resize} function, which
considers the strand to determine the direction of extension:
<<>>=
ctcf.ext <- resize(cstest$ctcf, width = 200)
ctcf.ext
@ 
%
We now have intervals for the CTCF lane that represent the original
fragments that were precipitated.

\section*{Coverage, islands, and depth}

A useful summary of this information is the
\emph{coverage}, that is, how many times each base in the genome was
covered by one of these intervals.
<<>>=
cov.ctcf <- coverage(ctcf.ext)
cov.ctcf
@ 
%
For efficiency, the result is stored in a run-length encoded form.


\newpage

The regions of interest are contiguous segments of non-zero coverage,
also known as \emph{islands}.
<<>>=
islands <- slice(cov.ctcf, lower = 1)
islands
@ 
%
For each island, we can compute the number of reads in the island, and
the maximum coverage depth within that island.
<<>>=
viewSums(islands)
viewMaxs(islands)

nread.tab <- table(viewSums(islands) / 200)
depth.tab <- table(viewMaxs(islands))

nread.tab[,1:10]
depth.tab[,1:10]
@ 


\newpage

\section*{Processing multiple lanes}

Although data from one lane is often a natural analytical unit, we
typically want to apply any procedure to all lanes. 
Here is a simple summary function that
computes the frequency distribution of the number of reads.
<<>>=
islandReadSummary <- function(x)
{
    g <- resize(x, 200)
    s <- slice(coverage(g), lower = 1)
    tab <- table(viewSums(s) / 200)
    df <- DataFrame(tab)
    colnames(df) <- c("chromosome", "nread", "count")
    df$nread <- as.integer(df$nread)
    df
}
@
%
Applying it to our test-case, we get
<<>>=
head(islandReadSummary(cstest$ctcf))
@ 
%
We can now use it to summarize the full dataset, flattening the
returned \Rclass{DataFrameList} with the \code{stack} function.
<<>>=
nread.islands <- DataFrameList(lapply(cstest, islandReadSummary))
nread.islands <- stack(nread.islands, "sample")
nread.islands
@ 


\newpage

<<>>=
xyplot(log(count) ~ nread | sample, as.data.frame(nread.islands), 
       subset = (chromosome == "chr10" & nread <= 40), 
       layout = c(1, 2), pch = 16, type = c("p", "g"))
@ 

<<fig=TRUE,height=10,echo=FALSE>>=
plot(trellis.last.object())
@ 

\newpage


If reads were sampled randomly from the genome, then the null
distribution number of reads per island would have a geometric
distribution; that is,
\[
P(X = k) = p^{k-1} (1-p)
\]
In other words, $\log P(X = k)$ is linear in $k$.  Although our
samples are not random, the islands with just one or two reads may be
representative of the null distribution.
%
<<>>=
xyplot(log(count) ~ nread | sample, as.data.frame(nread.islands), 
       subset = (chromosome == "chr10" & nread <= 40), 
       layout = c(1, 2), pch = 16, type = c("p", "g"),
       panel = function(x, y, ...) {
           panel.lmline(x[1:2], y[1:2], col = "black")
           panel.xyplot(x, y, ...)
       })
@ 

<<fig=TRUE,height=8,echo=FALSE>>=
plot(trellis.last.object())
@ 

\newpage

We can create a similar plot of the distribution of depths.
%
<<>>=
islandDepthSummary <- function(x)
{
  g <- resize(x, 200)
  s <- slice(coverage(g), lower = 1)
  tab <- table(viewMaxs(s) / 200)
  df <- DataFrame(tab)
  colnames(df) <- c("chromosome", "depth", "count")
  df$depth <- as.integer(df$depth)
  df
}

depth.islands <- DataFrameList(lapply(cstest, islandDepthSummary))
depth.islands <- stack(depth.islands, "sample")

xyplot(log(count) ~ depth | sample, as.data.frame(depth.islands), 
       subset = (chromosome == "chr10" & depth <= 20), 
       layout = c(1, 2), pch = 16, type = c("p", "g"),
       panel = function(x, y, ...) {
           lambda <- 2 * exp(y[2]) / exp(y[1])
           null.est <- function(xx) {
               xx * log(lambda) - lambda - lgamma(xx + 1)
           }
           log.N.hat <- null.est(1) - y[1]
           panel.lines(1:10, -log.N.hat + null.est(1:10), col = "black")
           panel.xyplot(x, y, ...)
       })

## depth.islands <- summarizeReads(cstest, summary.fun = islandDepthSummary)

@ 

<<fig=TRUE,height=10,echo=FALSE>>=
plot(trellis.last.object())
@ 

The above plot is very useful for detecting peaks, discussed in the
next section. As a convenience, it can be created for the coverage
over all chromosomes for a single sample by calling the
\code{islandDepthPlot} function:
<<islandDepthPlot>>=
islandDepthPlot(cov.ctcf)
@

\newpage

\section*{Peaks}

To obtain a set of putative binding sites, i.e., peaks, we need to
find those regions that are significantly above the noise level.
Using the same Poisson-based approach for estimating the noise
distribution as in the plot above, the \Rfunction{peakCutoff} function
returns a cutoff value for a specific FDR:
<<peakCutoff>>=
peakCutoff(cov.ctcf, fdr = 0.0001)
@ 
%
Considering the above calculation of $7$ at an FDR of $0.0001$, and
looking at the above plot, we might choose $8$ as a conservative peak
cutoff:
<<>>=
peaks.ctcf <- slice(cov.ctcf, lower = 8)
peaks.ctcf
@ 

To summarize the peaks for exploratory analysis, we call the
\Rfunction{peakSummary} function:
<<peakSummary>>=
peaks <- peakSummary(peaks.ctcf)
@ 
%
The result is a \Rclass{RangedData} object with two columns: the view
maxs and the view sums. Beyond that, this object is often useful as a
scaffold for adding additional statistics.

It is meaningful to ask about the contribution of each strand to each
peak, as the sequenced region of pull-down fragments would be on
opposite sides of a binding site depending on which strand it matched.
We can compute strand-specific coverage, and look at the individual
coverages under the combined peaks as follows:
<<>>=
peak.depths <- viewMaxs(peaks.ctcf)

cov.pos <- coverage(ctcf.ext[strand(ctcf.ext) == "+"])
cov.neg <- coverage(ctcf.ext[strand(ctcf.ext) == "-"])

peaks.pos <- Views(cov.pos, ranges(peaks.ctcf))
peaks.neg <- Views(cov.neg, ranges(peaks.ctcf))

wpeaks <- tail(order(peak.depths$chr10), 4)
wpeaks

@ 
%
Below, we plot the four highest peaks on chromosome 10.

\newpage

<<>>=
coverageplot(peaks.pos$chr10[wpeaks[1]], peaks.neg$chr10[wpeaks[1]])
@ 
<<fig=TRUE,height=5,echo=FALSE>>=
plot(trellis.last.object())
@ 

<<>>=
coverageplot(peaks.pos$chr10[wpeaks[2]], peaks.neg$chr10[wpeaks[2]])
@ 
<<fig=TRUE,height=5,echo=FALSE>>=
plot(trellis.last.object())
@ 

\newpage

<<>>=
coverageplot(peaks.pos$chr10[wpeaks[3]], peaks.neg$chr10[wpeaks[3]])
@ 
<<fig=TRUE,height=5,echo=FALSE>>=
plot(trellis.last.object())
@ 

<<>>=
coverageplot(peaks.pos$chr10[wpeaks[4]], peaks.neg$chr10[wpeaks[4]])
@ 
<<fig=TRUE,height=5,echo=FALSE>>=
plot(trellis.last.object())
@ 

\newpage

\section*{Differential peaks}


One common question is: which peaks are different in two samples?  One
simple strategy is the following: combine the two peak sets, and
compare the two samples by calculating summary statistics for
the combined peaks on top of each coverage vector.
<<>>=

## find peaks for GFP control
cov.gfp <- coverage(resize(cstest$gfp, 200))
peaks.gfp <- slice(cov.gfp, lower = 8)

peakSummary <- diffPeakSummary(peaks.gfp, peaks.ctcf)

xyplot(asinh(sums2) ~ asinh(sums1) | space,
       data = as.data.frame(peakSummary), 
       panel = function(x, y, ...) {
           panel.xyplot(x, y, ...)
           panel.abline(median(y - x), 1)
       },
       type = c("p", "g"), alpha = 0.5, aspect = "iso")

@ 
<<fig=TRUE,height=5,echo=FALSE>>=
plot(trellis.last.object())
@ 


We use a simple cutoff to flag peaks that are different.
<<>>=
peakSummary <- 
    within(peakSummary,
       {
           diffs <- asinh(sums2) - asinh(sums1)
           resids <- (diffs - median(diffs)) / mad(diffs)
           up <- resids > 2
           down <- resids < -2
           change <- ifelse(up, "up", ifelse(down, "down", "flat"))
       })
@ 


\section*{Placing peaks in genomic context}


Locations of individual peaks may be of interest.  Alternatively, a
global summary might look at classifying the peaks of interest in the
context of genomic features such as promoters, upstream regions, etc.
The \Rpackage{GenomicFeatures} package facilitates obtaining gene
annotations from different data sources. We could download the UCSC
gene predictions for the mouse genome and generate a \Rclass{GRanges}
object with the transcript regions (from the first to last exon,
contiguous) using \Rfunction{makeTxDbFromUCSC}; here we use
a library containing a recent snapshot.
<<>>=
library(TxDb.Mmusculus.UCSC.mm9.knownGene)
gregions <- transcripts(TxDb.Mmusculus.UCSC.mm9.knownGene)
gregions
@

We can now estimate the promoter for each transcript: 
<<>>=
promoters <- flank(gregions, 1000, both = TRUE)
@ 
%
And count the peaks that fall into a promoter:
<<>>=
peakSummary$inPromoter <- peakSummary %over% promoters
xtabs(~ inPromoter + change, peakSummary)
@ 
Or somewhere upstream or in a gene:
<<>>=
peakSummary$inUpstream <- peakSummary %over% flank(gregions, 20000)
peakSummary$inGene <- peakSummary %over% gregions
@ 

<<>>=
sumtab <- 
    as.data.frame(rbind(total = xtabs(~ change, peakSummary),
                        promoter = xtabs(~ change, 
                          subset(peakSummary, inPromoter)),
                        upstream = xtabs(~ change, 
                          subset(peakSummary, inUpstream)),
                        gene = xtabs(~ change, subset(peakSummary, inGene))))
##cbind(sumtab, ratio = round(sumtab[, "down"] /  sumtab[, "up"], 3))
@ 

\section*{Visualizing peaks in genomic context}

While it is generally informative to calculate statistics
incorporating the genomic context, eventually one wants a picture. The
traditional genome browser view is an effective method of visually
integrating multiple annotations with experimental data along the
genome. 

Using the \Rpackage{rtracklayer} package, we can upload our coverage
and peaks for both samples to the UCSC Genome Browser:
<<rtracklayer-upload, eval=FALSE>>=
library(rtracklayer)
session <- browserSession()
genome(session) <- "mm9"
session$gfpCov <- cov.gfp
session$gfpPeaks <- peaks.gfp
session$ctcfCov <- cov.ctcf
session$ctcfPeaks <- peaks.ctcf
@ 
%
Once the tracks are uploaded, we can choose a region to view, such as
the tallest peak on chr10 in the CTCF data:
<<rtracklayer-view, eval=FALSE>>=
peak.ord <- order(unlist(peak.depths), decreasing=TRUE)
peak.sort <- as(peaks.ctcf, "GRanges")[peak.ord]
view <- browserView(session, peak.sort[1], full = c("gfpCov", "ctcfCov"))
@ 
%
We coerce to \Rclass{GRanges} so that we can sort the ranges across
chromosomes.  By passing the \code{full} parameter to
\Rfunction{browserView} we instruct UCSC to display the coverage
tracks as a bar chart.  Next, we might programmatically display a view
for the top 5 tallest peaks:
<<rtracklayer-view-5, eval=FALSE>>=
views <- browserView(session, head(peak.sort, 5), full = c("gfpCov", "ctcfCov"))
@ 

\section*{Version information}

<<>>=
sessionInfo()
@ 

\end{document}