%% LyX 1.6.5 created this file. For more info, see http://www.lyx.org/. %% Do not edit unless you really know what you are doing. \documentclass[english]{article} \usepackage[T1]{fontenc} \usepackage[utf8]{inputenc} \usepackage{babel} \usepackage{varioref} \usepackage[unicode=true, pdfusetitle, bookmarks=true,bookmarksnumbered=false,bookmarksopen=false, breaklinks=true,pdfborder={0 0 0},backref=false,colorlinks=false] {hyperref} \makeatletter %%%%%%%%%%%%%%%%%%%%%%%%%%%%%% User specified LaTeX commands. %\VignetteIndexEntry{Introduction to the les package: Identifying Loci of Enhanced Significance in Tiling Array Data} %\VignettePackage{les} %\usepackage{fancyvrb} %\fvset{listparameters={\setlength{\topsep}{0pt}}} \newcommand{\Robject}[1]{{\texttt{#1}}} \newcommand{\Rfunction}[1]{{\texttt{#1}}} \newcommand{\Rpackage}[1]{{\textit{#1}}} \newcommand{\Rclass}[1]{{\textit{#1}}} \newcommand{\Rmethod}[1]{{\textit{#1}}} \newcommand{\Rfunarg}[1]{{\texttt{#1}}} \makeatother \begin{document} \title{Introduction to the les package:\\ Identifying Loci of Enhanced Significance in Tiling Array Data} \author{Julian Gehring} \maketitle \begin{abstract} In this vignette we describe how to find Loci of Enhanced Significance (LES) in tiling microarray data using the \Rpackage{les} package. With an example of a general framework we illustrate how to apply the package for exploring data to identify regions of differential regulation. \end{abstract} \section{Introduction} Tiling microarrays have become an important platform for the investigation of regulation in expression and DNA-protein interaction. Due to their lower bias towards annotation compared to other microarray platforms they provide an powerful tool for biological research. Beside the analysis of transcription for single conditions investigation of regulation between multiple experimental conditions is important for current research. A common approach consists in applying some suitable statistical tests on the level of single probes and thereby computing p-values $p_{i}$ for each probe $i$ independently. Since the targets of such experiments cover regions with several probes a reasonable next step involves combining information from neighboring probes. Many approaches use a smoothing window to obtain this. While these methods may work well most of them produce a feature that lacks a statistical interpretation. In regions with differential effects the test statistics change their distribution and are referred to as \emph{Loci of Enhanced Significance (LES)}. The changes of the test statistics depend on the underlying test applied. The \Rpackage{les} package provides the ability to detect such \emph{LES} independent of the underlying statistical test and can therefore be used for a wide range of applications. This vignette illustrates how \emph{LES} can be identified with a general framework in tiling microarray data sets. <>= set.seed(1) options(width=65) @ \section{Data and statistics on probe level} For the analysis in this vignette we will use a simulated data set describing differential expression between two conditions. Please note that this data does not reflect real biological data. It is rather meant to illustrate the workflow than to provide real life data. The data set consists of 1000 probes and two groups, e.g. treatment and control, each measured with three replicates. The expression values are stored in an expression set. We will extract the position of the probes, the conditions of the samples and the expression intensities. There are two regions of regulation present in the data each 50 bp long. For experimental data, the positions of the probes are given by the design of the microarray. For simplicity we assume regularly spaced probes with 1 bp resolution in this illustration. <>= library(les) library(Biobase) data(simTile) treatment <- as.factor(phenoData(simTile)$condition == "treatment") pos <- featureData(simTile)$position exprs <- exprs(simTile) regions <- c(100, 150, 600, 650) cols <- rep(c("lightblue3", "lightgreen"), each=2) @ Next we have a look at the expression values. \begin{center} <>= matplot(exprs, pch=".", xlab="Probe position", ylab="Expression") abline(v=regions, col=cols) @ \end{center} In the next step we will compute compute a test statistics assessing regulatory changes between the two conditions for each probe. Since the sample size as for most tiling microarray experiments is small we will use a modified t-test provided by the \Rpackage{limma} package \cite{smyth_limma:_2005}. We will also plot the p-values $p_{i}$ against the probe positions $i$. <>= library(limma) design <- cbind(mean=1, diff=treatment) fit <- lmFit(exprs, design) fit <- eBayes(fit) pval <- as.numeric(fit$p.value[, "diff"]) @ \begin{center} <>= plot(pos, pval, pch=".", xlab="Probe position", ylab="P-value") abline(v=regions, col=cols) @ \end{center} \section{Incorporation information of neighboring probes} In tiling array experiments regions of regulation extend over several neighboring probes. Thereby p-values belonging to neighboring probes matching of the same region should contain mutual information and incorporation of such will be beneficial. In the \Rpackage{les} package this in done in the following manner: For each probe $i$ the surrounding p-values $p_{i}$ get weights assigned by a windowing function. A weighted cumulative density is then computed and the fraction of significant $p_{i}$ is estimated by iterative linear fitting. The method is based on the fact that p-values are uniformly distributed under the null hypothesis $H_{0}$ whereas p-values violating $H_{0}$ are shifted towards smaller values \cite{bartholom_estimation_2009}. This results in $\Lambda_{i}$ constituting an estimate of the fraction of p-values violating $H_{0}$ around the evaluated position and therefore the degree of significance in the local surrounding. It should be noted that this approach is closely related to the estimation of a false discovery rate and $\Lambda_{i}$ can be interpreted in a similar way. For the analysis we will first store our data in an object of class \Rclass{Les} by calling the \Rfunction{Les} function. The only data required for the analysis are the position of the probes $i$, the corresponding p-values $p_{i}$ from the statistical test and optionally their chromosomal location. <>= res <- Les(pos, pval) @ Then we can compute our first estimate of $\Lambda_{i}$ for which we have to specify a window size. The power of detecting a region will be high if the size of the window matches approximately the size of the regulated region. In many experiments a rough prior knowledge on the region size is available which can be sufficient for choosing a window size. By default a triangular weighting function will be chosen. We can also take different weighting functions such as a rectangular one or write our own function and pass it with the \Rfunarg{weighting} argument. This step is described below in section \vref{sec:Specification-of-own}. Further we can specify whether we want to include the Grenander correction for the cumulative density or use multicore processing. <>= win <- 30 res <- estimate(res, win) @ All data, results and parameters are now stored in the object named \Robject{res}. We can get a short summary on the results by calling \Rmethod{print}, \Rmethod{summary} or by plotting it. \begin{center} <>= res plot(res) abline(v=regions, col=cols) @ \end{center} For comparison we will analyze and plot the same data with a different window size. This allows us to explore our data set. <>= win2 <- 50 res <- estimate(res, win2) @ \begin{center} <>= res plot(res) abline(v=regions, col=cols) @ \end{center} We can already see two distinct peaks that correspond well to the simulated regions of regulation. The \Rmethod{plot} method provides additional arguments that help customizing the figure. \section{Parameter estimation from the data} To turn the continuous $\Lambda_{i}$ into distinct regions of interest we have to define a threshold $\Theta$. It can be derived from the data by estimating the number of probes with a significant effect $R$ on the whole array. Then $\Theta$ can be chosen such that $\mid\Lambda_{i}\geq\Theta\mid=R$. The content of any slot can be accessed by using the \Rmethod{[}-method. <>= res <- threshold(res, grenander=TRUE, verbose=TRUE) @ Based on $\Theta$ we can look for regions that have a continuous $\Lambda_{i}\geq\Theta$. The \Rmethod{regions} method takes by default the estimated $\hat{\Theta}$ as shown before. We can also pass our own estimate for $\Theta$ with the \Rfunarg{limit} argument. Further restrictions can be imposed on the regions such as the minimal length of a region and the maximum gap allowed between probes of one region. A data frame with the estmated regions can be accessed with the \Rmethod{[}-method. This can also be used to access any other data slot of a \Robject{LES} object. <>= theta <- 0.3 res <- regions(res, limit=theta, verbose=TRUE) res res["regions"] @ \begin{center} <>= region <- res["regions"] borders <- c(region$start, region$end) plot(res) abline(v=regions, col=cols) abline(v=borders, col="darkgray", lty=2) @ \end{center} \section{Calculation of confidence intervals} In some cases it is also useful to provide confidence intervals (CI) for $\Lambda_{i}$. These are computed by bootstrapping the probes in the window. Since bootstrapping is by its nature computationally demanding and CI are primarily interesting in regions of interest it it possible to compute CI for a subset of probes and to specify the number of bootstraps. <>= subset <- pos >= 580 & pos <= 670 res <- ci(res, subset, nBoot=50) @ \begin{center} <>= plot(res, error="ci", limit=theta, xlim=c(580, 670)) @ \end{center} \section{Plotting capabilities} The \Rmethod{plot} method provides many arguments for customizing figures. The following command plots a smaller region of the chromosome with confidence intervals and the estimated region. Further the positions of the probes are shown at the top and the representation of the probes are changed. \begin{center} <>= plot(res, error="ci", region=TRUE, rug=TRUE, rugSide=3, main="LES for simulated data", probePch="*", probeCol="blue", errorCol="lightgray", regionCol="lightgreen", xlim=c(580, 670) , ylim=c(0, 0.8), limit=FALSE, lty=0) @ \end{center} \section{Exporting result to external software} The estimated regions as well as $\Lambda$ can be saved to a file with the \Rmethod{export} method. The regions can be exported to the \emph{bed} and \emph{gff} formats, the test statistic $\Lambda$ to the \emph{wig} format. These formats can be directly loaded into many genome software packages and browsers. \section{Specification of own window functions\label{sec:Specification-of-own}} With the \Rfunction{triangWeight}, \Rfunction{rectangWeight}, \Rfunction{epWeight} and \Rfunction{gaussWeight} four window functions are already included in the \Rpackage{les} package, providing a triangular, rectangular, Epanechnikov and Gaussian window, respectively. We can also specify own window functions and pass it via the \Rfunarg{weighting} argument in the \Rmethod{estimate} method. They have to be specified in the following format, here illustrated with a triangular weighting. <>= weightFoo <- function(distance, win) { weight <- 1 - distance/win return(weight) } res2 <- estimate(res, 20, weighting=weightFoo) @ \newpage{} \bibliographystyle{plain} \bibliography{ref_les} \section*{Session information} <>= toLatex(sessionInfo(), locale=FALSE) @ \end{document}