\documentclass{article}

\usepackage[colorlinks=true,linkcolor=blue]{hyperref}
\usepackage{theorem}
\usepackage{underscore}
\usepackage{color}

\newcommand{\Rfunction}[1]{{\texttt{#1}}}
\newcommand{\Rmethod}[1]{{\texttt{#1}}}
\newcommand{\Robject}[1]{{\texttt{#1}}}
\newcommand{\Rpackage}[1]{{\textsf{#1}}}
\newcommand{\Rfunarg}[1]{{\texttt{#1}}}
\newcommand{\Rclass}[1]{{\textit{#1}}}

\newcommand{\code}[1]{{\texttt{#1}}}
\newcommand{\file}[1]{{\texttt{#1}}}

\newcommand{\software}[1]{\textsf{#1}}
\newcommand{\R}{\software{R}}
\newcommand{\bioc}{\software{BioConductor}}
\newcommand{\ELAND}{\software{ELAND}}
\newcommand{\MAQ}{\software{MAQ}}
\newcommand{\Bowtie}{\software{Bowtie}}

%% Excercises and Questions
\theoremstyle{break} \newtheorem{Ex}{Exercise}
\theoremstyle{break} \newtheorem{Q}{Question}
%% And solution or answer
\newenvironment{solution}{\color{blue}}{\bigskip}

\usepackage{Sweave}

\SweaveOpts{keep.source=TRUE,eps=FALSE,pdf=TRUE,width=9,height=6,prefix.string=WorkflowFig}


\title{Basic ChIP-Seq Data Analyis}

\usepackage[text={170mm,220mm},centering]{geometry}
\setkeys{Gin}{width=0.98\textwidth}

\date{June 6, 2009}


\begin{document}

\maketitle

\tableofcontents

%% \raggedright

\section{Introduction}

Our goal in this section of the course is to describe the use of
Bioconductor software to perform some basic tasks in the analysis of
ChIP-Seq data.  We will use tools from the \Rpackage{Iranges} and
\Rpackage{ShortRead} packages, and also use the \Rpackage{lattice}
package for visualization.  The next release of Bioconductor is set to
include a new package called \Rpackage{chipseq} that will provide a
more high-level interface to common tasks relevant for ChIP-Seq data
analysis.

<<echo=FALSE,results=hide>>=
library("ShortRead")
@ 


<<setup,echo=TRUE,results=hide>>=
library("ShortRead")
library("lattice")
@

\subsection{Example data}

\noindent The \code{data} folder contains two data files, each
containing data for three chromosomes from one Solexa lane, one from a
CTCF mouse ChIP-Seq, and one from a GFP mouse ChIP-Seq (a background
control).  
% FIXME (wh 6 June 2009): since the mouse genome does not contain GFP,
%   would it be more precise to say that these 
%   were ChIP-Seq experiments done with CTCF and GFP antibodies? 
%   (btw, why is the GFP antibody considered a useful control for the 
%    CTCF antibody?)
The raw reads were aligned to the reference genome (mouse
in this case) using an external program (MAQ), and the results read in
using the \Rfunction{readAligned} function in the \Rpackage{ShortRead}
package.  All duplicate reads were removed and a quality score cutoff
of 5 was used.
<<>>=
load("../data/ctcf.rda")
load("../data/gfp.rda")
@ 


\noindent
\code{ctcf} and \code{gfp} are objects of class \Rclass{\Sexpr{class(ctcf)}}.
<<>>=
ctcf
gfp
@ 
%
Further information on each alignment can be obtained using various
accessor functions whose names are hinted at in the summarized
display.  For example,
<<>>=
table(strand(ctcf))
table(chromosome(gfp))
@


\subsection{The mouse genome}

The data we have refer to alignments to a genome, and only makes sense
in that context.  Bioconductor has genome packages containing the full
sequences of several genomes.  The one relevant for us is
<<>>=
library("BSgenome.Mmusculus.UCSC.mm9")
mouse.chromlens <- seqlengths(Mmusculus)
head(mouse.chromlens)
@ 
%
We will only make use of the chromosome lengths, but the actual
sequence will be needed for motif finding, etc.


\section{Coverage, islands, and depth}

\paragraph{Extending reads}
Solexa gives us the first few (24 in this example) bases of each
fragment it sequences, but the actual fragment is longer.  By design,
the sites of interest (transcription factor binding sites) should be
somewhere in the fragment, but not necessarily in its initial part.
Although the actual lengths of fragments vary, extending the alignment
of the short read by a fixed amount in the appropriate direction,
depending on whether the alignment was to the positive or negative
strand, makes it more likely that we cover the actual site of
interest.  We will extend the aligned regions to a length of 150 bases.  
% The result is essentially a collection of intervals (ranges) over the
% reference genome. 

The extended regions can be summarized by their \emph{coverage},
that is, how many times each base in the genome was covered by one of
them.
<<coverage>>=
cov.ctcf <- coverage(ctcf, width = mouse.chromlens, extend = 126L)
cov.ctcf
cov.ctcf$chr10
@ 
%
For efficiency, the result is stored in a run-length encoded 
(\Rclass{Rle}) form.

The regions of interest are contiguous segments of non-zero coverage,
also known as \emph{islands}.  Islands can be identified by
\emph{slicing} the coverage at a depth of 1:
<<>>=
islands <- slice(cov.ctcf$chr10, lower = 1)
islands
@ 
%
For each island, we can compute its area, i.\,e.\ the sum
of the coverage values within the island, and
the maximum coverage value (here, we use the function
\Rfunction{head} to display results only for the first few islands).
%
<<viewSums>>=
viewSums(head(islands))
viewMaxs(head(islands))

nread.tab <- table(viewSums(islands) / 150L)
depth.tab <- table(viewMaxs(islands))

head(nread.tab, 10)
head(depth.tab, 10)

@ 

\begin{Ex}
  Repeat these steps for the \code{gfp} dataset.
\end{Ex}


\subsection{Processing multiple lanes}

Although data from one chromosome within one lane is often the natural
unit to work with, we typically want to apply any procedure to the data
from all chromosomes and from all lanes.  
We can recursively apply a summary function
to all chromosomes using the \Rfunction{lapply} function.  Here is a
simple summary function that computes the frequency distribution of
the number of reads per island.
<<>>=
islandReadSummary <- function(cov)
{
    s <- slice(cov, lower = 1)
    tab <- table(viewSums(s) / 150)
    ans <- data.frame(nread = as.numeric(names(tab)), 
                      count = as.numeric(tab))
    ans
}
@
%
Applying it to our test-case, we get
<<>>=
head(islandReadSummary(cov.ctcf$chr10))
@ 
%
We can now use it to summarize the full dataset.
<<make.groups>>=
nread.islands <- lapply(cov.ctcf, islandReadSummary)
nread.islands <- do.call(make.groups, nread.islands)
head(nread.islands)
@ 
%
Note the use of the \Rfunction{make.groups} function from the
\Rpackage{lattice} package, which combines several data frames into a
single data frame that includes a further column \Robject{which}
indicating which of the data frames each row came from.
%
<<xyplot>>=
xyplot(log(count) ~ nread | which, data = nread.islands, 
       subset = (nread <= 20), pch = 16, type = c("p", "g"))
@ 
%
<<fig=TRUE,height=3,echo=FALSE>>=
plot(trellis.last.object())
@ 


If reads were sampled randomly from the genome, then the null
distribution of the number of reads per island would have a geometric
distribution; that is,
\[
P(X = k) = p^{k-1} (1-p)
\]
where $p$ is the probability a randomly drawn read starts within a 
given interval of length 150.  
In other words, $\log P(X = k)$ is linear in $k$.
Although our samples are not random, we can estimate $p$ if we assume
that the islands with just one or two reads are representative of the
null distribution.
%
<<>>=
xyplot(log(count) ~ nread | which, data = nread.islands, 
       subset = (nread <= 20),
       pch = 16, 
       panel = function(x, y, ...) {
           panel.grid(h = -1, v = -1)
           panel.lmline(x[1:2], y[1:2], col = "black")
           panel.xyplot(x, y, ...)
       })
@ 

<<fig=TRUE,height=3,echo=FALSE>>=
plot(trellis.last.object())
@ 


\noindent
We can create a similar plot of the distribution of depths.
%
<<>>=
islandDepthSummary <- function(cov)
{
    s <- slice(cov, lower = 1)
    tab <- table(viewMaxs(s))
    ans <- data.frame(depth = as.numeric(names(tab)), count = as.numeric(tab))
    ans
}

depth.islands <- lapply(cov.ctcf, islandDepthSummary)
depth.islands <- do.call(make.groups, depth.islands)

xyplot(log(count) ~ depth | which, depth.islands, 
       subset = (depth <= 20), pch = 16,
       panel = function(x, y, ...) {
           panel.grid(h = -1, v = -1)
           lambda <- 2 * exp(y[2]) / exp(y[1])
           null.est <- function(xx) {
               xx * log(lambda) - lambda - lgamma(xx + 1)
           }
           log.N.hat <- null.est(1) - y[1]
           panel.lines(1:10, -log.N.hat + null.est(1:10), col = "black")
           panel.xyplot(x, y, ...)
       })

@ 
%
This assumes that the null distribution of depths has a Poisson
distribution, which is not strictly true, but seems to give a
reasonble fit.


<<fig=TRUE,height=3,echo=FALSE>>=
plot(trellis.last.object())
@ 


\begin{Ex}
  Produce similar plots for the \code{gfp} dataset.  What qualitatitve
  differences do you see?  Based on your findings, what would be a
  reasonbale cutoff for deciding that the depth of an island is too
  high to be explained by chance, and hence is likely to contain a
  CTCF binding site?
\end{Ex}


\subsection{Peaks}

Going back to our example of chr10 of the first sample, let us define
``peaks'' to be contiguous regions of the genome where coverage is 8
or more. (More sophisticated, model-based or adaptive algorithms exist,
and we refer to the literature in this active area of research). 
%
<<peaks>>=
peaks <- slice(cov.ctcf$chr10, lower = 8)
peaks
@ 
%
Interesting properties of peaks are their maximum depth and area under
the peak (a relative measure of how localized the peak is).
<<>>=
peak.depths <- viewMaxs(peaks)
peak.areas <- viewSums(peaks)
xyplot(peak.areas ~  peak.depths)

@ 
%
<<fig=TRUE,echo=FALSE>>=
plot(trellis.last.object())
@ 
\begin{Ex}
  Produce a similar plot for the \code{gfp} dataset.  What differences
  do you see, particularly in terms of the number of peaks and the
  distribution of depths?
\end{Ex}


\noindent
We can order the peaks by depth
<<>>=
wpeaks <- tail(order(peak.depths), 4)
peaks[wpeaks]
@ 
%
and plot individual peaks using this function:
<<>>=

coverageplot <- function (peaks, xlab = "Position", ylab = "Coverage", ...)
{
    pos1 <- seq(start(peaks[1]), end(peaks[1]))
    cov1 <- as.integer(peaks[[1]])
    pos1 <- c(head(pos1, 1), pos1, tail(pos1, 1))
    cov1 <- c(0, cov1, 0)
    xyplot(cov1 ~ pos1, ..., panel = panel.polygon, 
           col = "lightgrey", xlab = xlab, ylab = ylab)

}

@ 


<<>>=
coverageplot(peaks[wpeaks[1]])
@ 
<<fig=TRUE,height=5,echo=FALSE>>=
plot(trellis.last.object())
@ 
\begin{Ex}
  How does the amount by which each read is extended affect the
  analysis?  In calls to \Rfunction{coverage}, we have used
  \Rfunarg{extend=126L} to get a total length of 150 for each read.
  Try lengths of 100 and 200 and see how the results change.
\end{Ex}


\section{Version information}

<<echo=FALSE,results=tex>>=
toLatex(sessionInfo())
@ 

\end{document}