%\VignetteIndexEntry{Analysis of NimbleGen Expression Data with the oligo Package } %\VignetteKeywords{Expression, NimbleGen, Oligonucleotide Arrays} %\VignettePackage{oligo} \documentclass{article} \newcommand{\Rfunction}[1]{{\texttt{#1}}} \newcommand{\Rmethod}[1]{{\texttt{#1}}} \newcommand{\Rcode}[1]{{\texttt{#1}}} \newcommand{\Robject}[1]{{\texttt{#1}}} \newcommand{\Rpackage}[1]{{\textsf{#1}}} \newcommand{\Rclass}[1]{{\textit{#1}}} \newcommand{\oligo}{\Rpackage{oligo}} \usepackage{graphicx} \begin{document} \title{Analysis of NimbleGen Expression Data with the oligo Package} \author{Benilton Carvalho} \maketitle <>= options(width=60) options(continue=" ") options(prompt="R> ") options(width=45, digits=2) @ \section{Introduction} \label{sec:intro} This document presents a non-trivial use of the \oligo{} Package for the analysis of NimbleGen Expression data. This vignette follows the structure of the chapter \textbf{From CEL files to a list of interesting genes} by R. A. Irizarry \textit{in} \textit{Bioinformatics and Computational Biology Solutions Using R and Bioconductor}, which shows a case study for Affymetrix Expression arrays. In order to analyze microarray data using \oligo, the user is expected to have installed on the system a package with the annotation for the particular array design on which the experiment was performed. For the example in question here, the design is hg18\_60mer\_expr and the annotation package associated to it is \Rpackage{pd.hg18.60mer.expr}, which is built by using the \Rpackage{pdInfoBuilder} package. \section{Initialization of the environment} \label{sec:init} We start by loading the packages that are going to be used in this session. The \Rpackage{maqcExpression4plex} package provides a set of six samples on the MAQC Study; the set is comprised of samples on two groups: universal reference and brain. The remaining packages offer additional functionality, like tools for filtering, plotting and visualization. <>= library(oligo) library(maqcExpression4plex) library(genefilter) library(limma) library(RColorBrewer) palette(brewer.pal(8, "Dark2")) @ Once the package is loaded, we can easily get the location of the XYS files that contain the intensities by calling \Rfunction{list.xysfiles}, which takes the same arguments as \Rfunction{list.files}. To minimize the chance of problems, we strongly recommend the use of \Rcode{full.names=TRUE}. <>= extdata <- system.file("extdata", package="maqcExpression4plex") xys.files <- list.xysfiles(extdata, full.names=TRUE) basename(xys.files) @ %def To read the XYS files, we provide the \Rfunction{read.xysfiles} function, which also takes \Robject{phenoData}, \Robject{experimentData} and \Robject{featureData} objects and returns an appropriate subclass of \Rclass{FeatureSet}. <>= pd <- dir(extdata, pattern="phenoData", full.names=TRUE) pd <- read.AnnotatedDataFrame(pd) maqc <- read.xysfiles(xys.files, phenoData=pd) class(maqc) @ %def \section{Exploring the feature-level data} \label{sec:exploring} The \Rfunction{read.xysfiles} function returns, in this case, an instance of \Rclass{ExpressionFeatureSet} and the intensities of these files are stored in its \Robject{exprs} slot, which can be accessed with a method with the same name. <>= exprs(maqc)[10001:10010, 1:2] @ %def The \Rmethod{boxplot} method can be used to produce boxplots for the feature-level data. <>= boxplot(maqc, main="MAQC Sample Data") @ %def Similarly, a smoothed histogram for the feature-level data can be obtained with the \Rmethod{hist} method. <>= hist(maqc, main="MAQC Sample Data") @ %def \section{RMA algorithm} \label{sec:rma} The RMA algorithm can be applied to the raw data of expression arrays. It is available via the \Rmethod{rma} method. The algorithm will perform background subtraction, quantile normalization and summarization via median polish. The result of \Rmethod{rma} is an instance of \Rclass{ExpressionSet} class, which also contains an \Rmethod{exprs} slot and method. <>= eset <- rma(maqc) class(eset) show(eset) exprs(eset)[1:10, 1:2] @ %def The \Rmethod{boxplot} and \Rmethod{hist} methods are also implemented for \Rclass{ExpressionSet} objects. Note that \Rmethod{rma}'s output is in the $\log_2$ scale, so we call such methods using the argument \Rcode{transfo=identity}, so the data are not transformed in any way. <>= boxplot(eset, transfo=identity, main="After RMA") @ %def <>= hist(eset, transfo=identity, main="After RMA") @ %def \section{Assessing differential expression} \label{sec:diffexp} One simple approach to assess differential expression is to flag units with log-ratios greater (in absolute value) than 1, i.e. a change greater than 2-fold when comparing brain vs. universal reference. <>= e <- exprs(eset) index <- which(eset[["Key"]] == "brain") d <- rowMeans(e[, index])-rowMeans(e[, -index]) a <- rowMeans(e) sum(abs(d)>1) @ Another approach is to use $t$-tests to infer whether or not there is differential expression. <>= tt <- rowttests(e, factor(eset[["Key"]])) lod <- -log10(tt[["p.value"]]) @ The MA plot can be used to visualize the behavior of the log-ratio as a function of average log-intensity. Features with log-ratios greater (in absolute value) than 1 are candidates for being classified as differentially expressed. <>= smoothScatter(a, d, xlab="Average Intensity", ylab="Log-ratio", main="MAQC Sample Data") abline(h=c(-1, 1), col=2) @ The use of $t$-tests allows us to use the volcano plot to visualize candidates for differential expression. Below, we highlight, in blue, the top 25 in log-ratio and, in red, the top 25 in effect size. <>= o1 <- order(abs(d),decreasing=TRUE)[1:25] o2 <- order(abs(tt[["statistic"]]),decreasing=TRUE)[1:25] o <- union(o1,o2) smoothScatter(d, lod, main="A Better view") points(d[o1], lod[o1], pch=18, col="blue") points(d[o2], lod[o2], pch=1, col="red") abline(h=2, v=c(-1, 1), col=2) @ The \Rpackage{limma} Package can also be used to assess difference in expression between the two groups. <>= design <- model.matrix(~factor(eset[["Key"]])) fit <- lmFit(eset, design) ebayes <- eBayes(fit) lod <- -log10(ebayes[["p.value"]][,2]) mtstat<- ebayes[["t"]][,2] @ The Empirical Bayes approach implemented in \Rpackage{limma} provides moderated $t$-statistic, shown to have a better performance when compared to the standard $t$-statistic. Below, we reconstruct the volcano plot, but using the moderated $t$-statisic. <>= o1 <- order(abs(d), decreasing=TRUE)[1:25] o2 <- order(abs(mtstat), decreasing=TRUE)[1:25] o <- union(o1, o2) smoothScatter(d, lod, main="Moderated t", xlab="Log-ratio", ylab="LOD") points(d[o1], lod[o1], pch=18,col="blue") points(d[o2], lod[o2], pch=1,col="red") abline(h=2, v=c(-1, 1)) @ The \Rmethod{topTable} command provides us a way of ranking genes for further evaluation. In the case below, we adjust for multiple testing by FDR and look at the Top-10 genes. <>= tab <- topTable(ebayes, coef=2, adjust="fdr", n=10) tab @ \section{Session Info} \label{sec:sessionInfo} This document was created using the following: <>= sessionInfo() @ \end{document}