\documentclass{article} \usepackage[colorlinks=true,linkcolor=blue]{hyperref} \usepackage{theorem} \usepackage{underscore} \usepackage{color} \newcommand{\Rfunction}[1]{{\texttt{#1}}} \newcommand{\Rmethod}[1]{{\texttt{#1}}} \newcommand{\Robject}[1]{{\texttt{#1}}} \newcommand{\Rpackage}[1]{{\textsf{#1}}} \newcommand{\Rfunarg}[1]{{\texttt{#1}}} \newcommand{\Rclass}[1]{{\textit{#1}}} \newcommand{\code}[1]{{\texttt{#1}}} \newcommand{\file}[1]{{\texttt{#1}}} \newcommand{\software}[1]{\textsf{#1}} \newcommand{\R}{\software{R}} \newcommand{\bioc}{\software{BioConductor}} \newcommand{\ELAND}{\software{ELAND}} \newcommand{\MAQ}{\software{MAQ}} \newcommand{\Bowtie}{\software{Bowtie}} %% Excercises and Questions \theoremstyle{break} \newtheorem{Ex}{Exercise} \theoremstyle{break} \newtheorem{Q}{Question} %% And solution or answer \newenvironment{solution}{\color{blue}}{\bigskip} \usepackage{Sweave} \SweaveOpts{keep.source=TRUE,eps=FALSE,pdf=TRUE,width=9,height=6,prefix.string=WorkflowFig} \title{Basic ChIP-Seq Data Analyis} \usepackage[text={170mm,220mm},centering]{geometry} \setkeys{Gin}{width=0.98\textwidth} \date{June 6, 2009} \begin{document} \maketitle \tableofcontents %% \raggedright \section{Introduction} Our goal in this section of the course is to describe the use of Bioconductor software to perform some basic tasks in the analysis of ChIP-Seq data. We will use tools from the \Rpackage{Iranges} and \Rpackage{ShortRead} packages, and also use the \Rpackage{lattice} package for visualization. The next release of Bioconductor is set to include a new package called \Rpackage{chipseq} that will provide a more high-level interface to common tasks relevant for ChIP-Seq data analysis. <>= library("ShortRead") @ <>= library("ShortRead") library("lattice") @ \subsection{Example data} \noindent The \code{data} folder contains two data files, each containing data for three chromosomes from one Solexa lane, one from a CTCF mouse ChIP-Seq, and one from a GFP mouse ChIP-Seq (a background control). % FIXME (wh 6 June 2009): since the mouse genome does not contain GFP, % would it be more precise to say that these % were ChIP-Seq experiments done with CTCF and GFP antibodies? % (btw, why is the GFP antibody considered a useful control for the % CTCF antibody?) The raw reads were aligned to the reference genome (mouse in this case) using an external program (MAQ), and the results read in using the \Rfunction{readAligned} function in the \Rpackage{ShortRead} package. All duplicate reads were removed and a quality score cutoff of 5 was used. <<>>= load("../data/ctcf.rda") load("../data/gfp.rda") @ \noindent \code{ctcf} and \code{gfp} are objects of class \Rclass{\Sexpr{class(ctcf)}}. <<>>= ctcf gfp @ % Further information on each alignment can be obtained using various accessor functions whose names are hinted at in the summarized display. For example, <<>>= table(strand(ctcf)) table(chromosome(gfp)) @ \subsection{The mouse genome} The data we have refer to alignments to a genome, and only makes sense in that context. Bioconductor has genome packages containing the full sequences of several genomes. The one relevant for us is <<>>= library("BSgenome.Mmusculus.UCSC.mm9") mouse.chromlens <- seqlengths(Mmusculus) head(mouse.chromlens) @ % We will only make use of the chromosome lengths, but the actual sequence will be needed for motif finding, etc. \section{Coverage, islands, and depth} \paragraph{Extending reads} Solexa gives us the first few (24 in this example) bases of each fragment it sequences, but the actual fragment is longer. By design, the sites of interest (transcription factor binding sites) should be somewhere in the fragment, but not necessarily in its initial part. Although the actual lengths of fragments vary, extending the alignment of the short read by a fixed amount in the appropriate direction, depending on whether the alignment was to the positive or negative strand, makes it more likely that we cover the actual site of interest. We will extend the aligned regions to a length of 150 bases. % The result is essentially a collection of intervals (ranges) over the % reference genome. The extended regions can be summarized by their \emph{coverage}, that is, how many times each base in the genome was covered by one of them. <>= cov.ctcf <- coverage(ctcf, width = mouse.chromlens, extend = 126L) cov.ctcf cov.ctcf$chr10 @ % For efficiency, the result is stored in a run-length encoded (\Rclass{Rle}) form. The regions of interest are contiguous segments of non-zero coverage, also known as \emph{islands}. Islands can be identified by \emph{slicing} the coverage at a depth of 1: <<>>= islands <- slice(cov.ctcf$chr10, lower = 1) islands @ % For each island, we can compute its area, i.\,e.\ the sum of the coverage values within the island, and the maximum coverage value (here, we use the function \Rfunction{head} to display results only for the first few islands). % <>= viewSums(head(islands)) viewMaxs(head(islands)) nread.tab <- table(viewSums(islands) / 150L) depth.tab <- table(viewMaxs(islands)) head(nread.tab, 10) head(depth.tab, 10) @ \begin{Ex} Repeat these steps for the \code{gfp} dataset. \end{Ex} \subsection{Processing multiple lanes} Although data from one chromosome within one lane is often the natural unit to work with, we typically want to apply any procedure to the data from all chromosomes and from all lanes. We can recursively apply a summary function to all chromosomes using the \Rfunction{lapply} function. Here is a simple summary function that computes the frequency distribution of the number of reads per island. <<>>= islandReadSummary <- function(cov) { s <- slice(cov, lower = 1) tab <- table(viewSums(s) / 150) ans <- data.frame(nread = as.numeric(names(tab)), count = as.numeric(tab)) ans } @ % Applying it to our test-case, we get <<>>= head(islandReadSummary(cov.ctcf$chr10)) @ % We can now use it to summarize the full dataset. <>= nread.islands <- lapply(cov.ctcf, islandReadSummary) nread.islands <- do.call(make.groups, nread.islands) head(nread.islands) @ % Note the use of the \Rfunction{make.groups} function from the \Rpackage{lattice} package, which combines several data frames into a single data frame that includes a further column \Robject{which} indicating which of the data frames each row came from. % <>= xyplot(log(count) ~ nread | which, data = nread.islands, subset = (nread <= 20), pch = 16, type = c("p", "g")) @ % <>= plot(trellis.last.object()) @ If reads were sampled randomly from the genome, then the null distribution of the number of reads per island would have a geometric distribution; that is, \[ P(X = k) = p^{k-1} (1-p) \] where $p$ is the probability a randomly drawn read starts within a given interval of length 150. In other words, $\log P(X = k)$ is linear in $k$. Although our samples are not random, we can estimate $p$ if we assume that the islands with just one or two reads are representative of the null distribution. % <<>>= xyplot(log(count) ~ nread | which, data = nread.islands, subset = (nread <= 20), pch = 16, panel = function(x, y, ...) { panel.grid(h = -1, v = -1) panel.lmline(x[1:2], y[1:2], col = "black") panel.xyplot(x, y, ...) }) @ <>= plot(trellis.last.object()) @ \noindent We can create a similar plot of the distribution of depths. % <<>>= islandDepthSummary <- function(cov) { s <- slice(cov, lower = 1) tab <- table(viewMaxs(s)) ans <- data.frame(depth = as.numeric(names(tab)), count = as.numeric(tab)) ans } depth.islands <- lapply(cov.ctcf, islandDepthSummary) depth.islands <- do.call(make.groups, depth.islands) xyplot(log(count) ~ depth | which, depth.islands, subset = (depth <= 20), pch = 16, panel = function(x, y, ...) { panel.grid(h = -1, v = -1) lambda <- 2 * exp(y[2]) / exp(y[1]) null.est <- function(xx) { xx * log(lambda) - lambda - lgamma(xx + 1) } log.N.hat <- null.est(1) - y[1] panel.lines(1:10, -log.N.hat + null.est(1:10), col = "black") panel.xyplot(x, y, ...) }) @ % This assumes that the null distribution of depths has a Poisson distribution, which is not strictly true, but seems to give a reasonble fit. <>= plot(trellis.last.object()) @ \begin{Ex} Produce similar plots for the \code{gfp} dataset. What qualitatitve differences do you see? Based on your findings, what would be a reasonbale cutoff for deciding that the depth of an island is too high to be explained by chance, and hence is likely to contain a CTCF binding site? \end{Ex} \subsection{Peaks} Going back to our example of chr10 of the first sample, let us define ``peaks'' to be contiguous regions of the genome where coverage is 8 or more. (More sophisticated, model-based or adaptive algorithms exist, and we refer to the literature in this active area of research). % <>= peaks <- slice(cov.ctcf$chr10, lower = 8) peaks @ % Interesting properties of peaks are their maximum depth and area under the peak (a relative measure of how localized the peak is). <<>>= peak.depths <- viewMaxs(peaks) peak.areas <- viewSums(peaks) xyplot(peak.areas ~ peak.depths) @ % <>= plot(trellis.last.object()) @ \begin{Ex} Produce a similar plot for the \code{gfp} dataset. What differences do you see, particularly in terms of the number of peaks and the distribution of depths? \end{Ex} \noindent We can order the peaks by depth <<>>= wpeaks <- tail(order(peak.depths), 4) peaks[wpeaks] @ % and plot individual peaks using this function: <<>>= coverageplot <- function (peaks, xlab = "Position", ylab = "Coverage", ...) { pos1 <- seq(start(peaks[1]), end(peaks[1])) cov1 <- as.integer(peaks[[1]]) pos1 <- c(head(pos1, 1), pos1, tail(pos1, 1)) cov1 <- c(0, cov1, 0) xyplot(cov1 ~ pos1, ..., panel = panel.polygon, col = "lightgrey", xlab = xlab, ylab = ylab) } @ <<>>= coverageplot(peaks[wpeaks[1]]) @ <>= plot(trellis.last.object()) @ \begin{Ex} How does the amount by which each read is extended affect the analysis? In calls to \Rfunction{coverage}, we have used \Rfunarg{extend=126L} to get a total length of 150 for each read. Try lengths of 100 and 200 and see how the results change. \end{Ex} \section{Version information} <>= toLatex(sessionInfo()) @ \end{document}