%\VignetteIndexEntry{4. XPS Vignette: Function express()} %\VignetteDepends{xps} %\VignetteKeywords{Expression, Affymetrix, Oligonucleotide Arrays} %\VignettePackage{xps} \documentclass{article} \usepackage{hyperref} \usepackage[authoryear,round]{natbib} \textwidth=6.2in \textheight=8.5in %\parskip=.3cm \oddsidemargin=.1in \evensidemargin=.1in \headheight=-.3in \newcommand{\Rfunction}[1]{{\texttt{#1}}} \newcommand{\Rmethod}[1]{{\texttt{#1}}} \newcommand{\Rcode}[1]{{\texttt{#1}}} \newcommand{\Robject}[1]{{\texttt{#1}}} \newcommand{\Rpackage}[1]{{\textsf{#1}}} \newcommand{\Rclass}[1]{{\textit{#1}}} \newcommand{\xps}{\Rpackage{xps}} \newcommand{\ROOT}{\Robject{ROOT}} \begin{document} %----------------------------------------------------------------------- \title{Introduction to the xps Package: Function \Rfunction{express()}} %----------------------------------------------------------------------- \date{September, 2010} \author{Christian Stratowa} \maketitle \tableofcontents %----------------------------------------------------------------------- \section{Introduction} %----------------------------------------------------------------------- This document describes how to use function \Rfunction{express} to combine different preprocessing methods, which are currently available in package \xps\, i.e. methods for (i) background correction, (ii) probe-level normalization, and (iii) summarization. These different methods, or a subset thereof, can be computed with a single call to function \Rfunction{express}, or in a stepwise manner by calling \Rfunction{express} consecutively with the result of one call used as input for the next call. Applying \Rfunction{express} stepwise has certain advantages, which will be explained later. Since \Rfunction{express} is a wrapper function to method \Rmethod{xpsPreprocess} both function calls can be used. \\ In addition to the mentioned preprocessing steps it is sometimes advisable to add a further preprocessing step after the summarization step, i.e. (iv) probeset-level normalization. One example is the MAS5 algorithm where the expression measures are usually scaled to a predefined target intensity \citep{affy:tech:2002}. This additional step is handled by function \Rfunction{normalize}, which is a wrapper to method \Rmethod{xpsNormalize}. \\ Using a subset of the Affymetrix Exon Array Data Set "Tissue Mixture" for the expression array "Human Genome U133 Plus 2.0", examples will be presented for all available algorithms for the different preprocessing methods. The aim of this document is to explain all parameters for all methods in detail based on the examples presented. {\bf Note:} In order to keep this document short, only example code will be presented here, the full source code is available in file "script4xpsPreprocess.R" located in directory "xps/examples". %----------------------------------------------------------------------- \section{Computing RMA using function \Rfunction{express()}} %----------------------------------------------------------------------- In the following subsections we will demonstrate how to compute RMA using the general function \Rfunction{express}. However, first it is necessary to load the \ROOT\ \Rclass{scheme} and \ROOT\ \Rclass{data} files, as explained in vignette "xps.pdf" (Overview): <<eval=FALSE>>= scheme.u133p2 <- root.scheme(file.path(scmdir, "Scheme_HGU133p2_na28.root")) data.u133p2 <- root.data(scheme.u133p2, file.path(datdir, "HuTissuesU133P2_cel.root")) @ Usually, RMA is computed with a simple call to function \Rfunction{rma}: <<eval=FALSE>>= data.rma <- rma(data.u133p2, "MixU133P2RMA", filedir=outdir) @ %- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - \subsection{Compute RMA with a single call to \Rfunction{express}} %- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - Using a single call to function \Rfunction{express} RMA is computed as follows: <<eval=FALSE>>= expr.rma <- express(data.u133p2, "U133P2Exprs", filedir=outdir, tmpdir="", update=FALSE, bgcorrect.method="rma", bgcorrect.select="none", bgcorrect.option="pmonly:epanechnikov", bgcorrect.params=c(16384), normalize.method="quantile", normalize.select="pmonly", normalize.option="transcript:together:none", normalize.logbase="0", normalize.params=c(0.0), summarize.method="medianpolish", summarize.select="pmonly", summarize.option="transcript", summarize.logbase="log2", summarize.params=c(10, 0.01, 1.0)) @ Valid expression levels can be obtained by calling <<eval=FALSE>>= expr <- validExpr(expr.rma) @ %- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - \subsection{Compute RMA stepwise} %- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - Alternatively, function \Rfunction{express} can be used to compute RMA step-by-step. This can have the following advantages: \begin{itemize} \item different algorithms can be tested at each step and combined w/o the need to re-calculate the earlier step(s) \item quality control tests for each array can be done separately at each step \item stepwise computation is useful when computing RMA for 10,000 CEL-files or more \\ \end{itemize} \begin{tabbing} {\bf 1. step:} background correction for \Rclass{DateTreeSet} \Robject{data.u133p2}: \end{tabbing} The code for computing \Rcode{rma} background correction is as follows: <<eval=FALSE>>= bgrd.rma <- express(data.u133p2, "BgrdRMA", filedir=outdir, tmpdir="", update=FALSE, bgcorrect.method="rma", bgcorrect.select="none", bgcorrect.option="pmonly:epanechnikov", bgcorrect.params=c(16384)) @ {\bf Note:} For background correction is is not allowed to define a \Rcode{tmpdir}, otherwise all trees would be stored in a temporary \ROOT\ file and file \Rcode{BgrdRMA.root} would be empty. \\ Now we can do some quality controls such as drawing images, density plots and boxplots. \\ First we draw the images for background values and background-corrected intensities, respectively, for e.g. array "ProstateA": <<eval=FALSE>>= bgrdnames <- colnames(validBgrd(bgrd.rma)) datanames <- colnames(validData(bgrd.rma)) image(bgrd.rma, bg=TRUE, transfo=log2, col=heat.colors(12), names=bgrdnames[4]) image(bgrd.rma, transfo=log2, col=heat.colors(12), names=datanames[4]) @ \begin{figure}[h] \begin{center} \includegraphics[width=70mm]{Image_BgrdRMA_bgrd4.png} \includegraphics[width=70mm]{Image_BgrdRMA_data4.png} % \caption{xxx} % \label{xxx} \end{center} \end{figure} \pagebreak Then we draw the density plots for raw and background-corrected intensities, respectively: <<eval=FALSE>>= hist(data.u133p2, which="pm") hist(bgrd.rma, which="pm") @ \begin{figure}[h] \begin{center} \includegraphics[width=70mm]{DensityPlot_DataU133P2.png} \includegraphics[width=70mm]{DensityPlot_BgrdRMA.png} % \caption{xxx} % \label{xxx} \end{center} \end{figure} Finally we draw the boxplots for raw and background-corrected intensities, respectively: <<eval=FALSE>>= boxplot(data.u133p2, which="pm") boxplot(bgrd.rma, which="pm") @ \begin{figure}[h] \begin{center} \includegraphics[width=70mm]{BoxPlot_DataU133P2.png} \includegraphics[width=70mm]{BoxPlot_BgrdRMA.png} % \caption{xxx} % \label{xxx} \end{center} \end{figure} Alternatively, it is also possible to draw boxplots for background values and background-corrected intensities, respectively, as follows: <<eval=FALSE>>= bgrd <- validBgrd(bgrd.rma, which="pm") data <- validData(bgrd.rma, which="pm") boxplot(log2(bgrd), las=2) boxplot(log2(data), las=2) @ \begin{figure}[h] \begin{center} \includegraphics[width=70mm]{BoxPlot_BgrdRMA_bgrd.png} \includegraphics[width=70mm]{BoxPlot_BgrdRMA_data.png} % \caption{xxx} % \label{xxx} \end{center} \end{figure} \pagebreak \begin{tabbing} {\bf 2. step:} quantile normalization for resulting \Rclass{DateTreeSet} \Robject{bgrd.rma}: \end{tabbing} Using the background-corrected data \Rcode{bgrd.rma} as input the code for computing \Rcode{quantile} normalization is as follows: <<eval=FALSE>>= norm.qu <- express(bgrd.rma, "NormQuan", filedir=outdir, tmpdir="", update=FALSE, normalize.method="quantile", normalize.select="pmonly", normalize.option="transcript:together:none", normalize.logbase="0", normalize.params=c(0.0)) @ {\bf Note:} For the normalization step is is also not allowed to define a \Rcode{tmpdir}, otherwise all trees would be stored in a temporary \ROOT\ file and file \Rcode{NormQuan.root} would be empty. \\ Now we draw density plots and boxplots of the normalized intensities as quality controls: <<eval=FALSE>>= hist(norm.qu, which="pm") boxplot(norm.qu, which="pm") @ \begin{figure}[h] \begin{center} \includegraphics[width=70mm]{DensityPlot_NormQuant.png} \includegraphics[width=70mm]{BoxPlot_NormQuant.png} % \caption{xxx} % \label{xxx} \end{center} \end{figure} \pagebreak In addition we can draw scatter plots between normalized intensities at the probe-level: <<eval=FALSE>>= data <- validData(norm.qu, which="pm") plot(log2(data[,1]), log2(data[,2]), xlab="BreastA", ylab="BreastB") plot(log2(data[,1]), log2(data[,4]), xlab="BreastA", ylab="ProstateA") @ \begin{figure}[h] \begin{center} \includegraphics[width=70mm]{ScatterPlot_NormQuant_PM_BrABrB.png} \includegraphics[width=70mm]{ScatterPlot_NormQuant_PM_BrAPrA.png} % \caption{xxx} % \label{xxx} \end{center} \end{figure} \begin{tabbing} {\bf 3. step:} median-polish summarization for resulting \Rclass{DateTreeSet} \Robject{norm.qu}: \end{tabbing} Finally we do \Rcode{medianpolish} summarization, using the normalized data \Rcode{norm.qu} as input: <<eval=FALSE>>= expr.mp <- express(norm.qu, "ExprMedpol", filedir=outdir, tmpdir="", update=FALSE, summarize.method="medianpolish", summarize.select="pmonly", summarize.option="transcript", summarize.logbase="log2", summarize.params=c(10, 0.01, 1.0), bufsize=32000) @ {\bf Note 1:} For the summarization step is is possible to define a \Rcode{tmpdir}; when handling hundreds of trees on computers with 1 GB RAM only, it is even the recommended way for multichip summarization methods such as \Rcode{medianpolish}. \\ {\bf Note 2:} When using a multichip summarization method with data consisting of many thousand trees it could also be necessary to reduce the basket size of each \ROOT\ tree, e.g. to a size of \Rcode{bufsize=2000}. (Internally, each \ROOT\ tree consists of many tree baskets, and when reading a tree, one basket after the other is loaded into memory and not the whole tree, thus reducing memory consumption.) \\ Now we can draw density plots and boxplots of the final expression values: <<eval=FALSE>>= hist(expr.mp) boxplot(expr.mp) @ \begin{figure}[h] \begin{center} \includegraphics[width=70mm]{DensityPlot_ExprMedpol.png} \includegraphics[width=70mm]{BoxPlot_ExprMedpol.png} % \caption{xxx} % \label{xxx} \end{center} \end{figure} \pagebreak In addition we can draw scatter plots between expression values at the transcript-level: <<eval=FALSE>>= expr <- validData(expr.mp) plot(log2(expr[,1]), log2(expr[,2]), xlab="BreastA", ylab="BreastB") plot(log2(expr[,1]), log2(expr[,4]), xlab="BreastA", ylab="ProstateA") @ \begin{figure}[h] \begin{center} \includegraphics[width=70mm]{ScatterPlot_ExprMedpol_BrABrB.png} \includegraphics[width=70mm]{ScatterPlot_ExprMedpol_BrAPrA.png} % \caption{xxx} % \label{xxx} \end{center} \end{figure} \pagebreak %----------------------------------------------------------------------- \section{Background Correction using \Rfunction{express()}} %----------------------------------------------------------------------- Currently the following background correction methods are available as \Rcode{bgcorrect.method}: \begin{itemize} \item {\tt rma}: background is calculated using a global model for the distribution of probe intensities. \item {\tt sector}: a constant background is calculated for each sector separately and subtracted from intensities. \item {\tt weightedsector}: first, a constant background is calculated for each sector separately, then a weight is calculated and each background value multiplied with the weight to smooth the transition between sectors, and finally the smoothed background is subtracted from intensities. \item {\tt gccontent}: background is calculated based on the mean intensity of probes with identical GC content. \end{itemize} {\bf Note:} By default the summarization methods \Rcode{farms} and \Rcode{dfw} skip the background correction step. %- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - \subsection{rma} %- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - The \Rcode{rma} background adjustment method is described in Irizarry \citep{iriz:etal:2003}. By default PM probe intensities are corrected by using a global model for the distribution of probe intensities. When using function \Rfunction{express()} the following \Rcode{bgcorrect} parameters can be used for \Rcode{bgcorrect.method="rma"}: \begin{tabbing} xxxxxxxxxxxxx\=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\kill Settings to be used for \Rcode{bgcorrect.select}: \\ {\tt none}: \> scheme mask is not changed by selector \end{tabbing} \begin{tabbing} xxxxxxxxxxxxx\=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\kill Settings to be used for \Rcode{bgcorrect.option = "<option>:<kernel>"}: \\ {\tt option}: \> pmonly, mmonly, both \\ {\tt kernel}: \> epanechnikov, gaussian, rectangular, triangular, biweight, cosine, optcosine \end{tabbing} \begin{tabbing} xxxxxxxxxxxxx\=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\kill Settings to be used for \Rcode{bgcorrect.params}: \\ {\tt numpoints}: \> size of density array, preferrable power of 2 (default: 16384) \end{tabbing} The default settings for \Rcode{rma} background correction are: <<eval=FALSE>>= bgrd.rma <- express(data.u133p2, "BgrdRMA", filedir=outdir, tmpdir="", update=FALSE, bgcorrect.method="rma", bgcorrect.select="none", bgcorrect.option="pmonly:epanechnikov", bgcorrect.params=c(16384)) @ %- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - \subsection{sector (mas4)} %- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - The \Rcode{sector} background subtraction method divides the array into a certain number of sectors, identifies the lowest 2\% of probe intensities for each sector, and computes their average, whereby all probes are used \citep{affy:tech:2001}. When using function \Rfunction{express()} the following \Rcode{bgcorrect} parameters can be used for \Rcode{bgcorrect.method="sector"}: \begin{tabbing} xxxxxxxxxxxxx\=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\kill Settings to be used for \Rcode{bgcorrect.select}: \\ {\tt all}: \> all probes are selected (default) \\ {\tt pmonly}: \> only PM probes are selected \\ {\tt mmonly}: \> only MM probes are selected \\ {\tt both}: \> both PM and MM probes are selected \\ {\tt none}: \> scheme mask is not changed by selector \end{tabbing} \begin{tabbing} xxxxxxxxxxxxx\=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\kill Settings to be used for \Rcode{bgcorrect.option}: \\ {\tt subtractbg}: \> subtract bgrd from intensity (default) - result can be negative \\ {\tt correctbg}: \> correct bgrd with noise fraction to avoid negative results \\ {\tt attenuatebg}: \> use generalized log-transform to avoid negative results \end{tabbing} \begin{tabbing} xxxxxxxxxxxxx\=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\kill Settings to be used for \Rcode{bgcorrect.params}: \\ The following parameters are necessary: \\ {\tt pcntcells}: \> precent of cells with lowest intensities in each sector (default: 0.02) \\ {\tt secrows}: \> number of sector rows (default: 4) \\ {\tt seccols}: \> number of sector columns (default: 4) \\ {\tt smoothiter}: \> number of iterations used for smoothing of sector bgrd values (default: 0) \\ Optional parameter when using \Rcode{bgcorrect.option="correctbg"}: \\ {\tt noisefrac}: \> fraction of global background variation (default: 0.5) \\ Optional parameters when using \Rcode{bgcorrect.option="attenuatebg"}: \\ {\tt l}: \> tunable parameter, 0<=l<=1 (default: 0.005) \\ {\tt h}: \> parameter (default: -1) \end{tabbing} Using the default settings the MAS4 \Rcode{sector} background correction is obtained as follows: <<eval=FALSE>>= bgrd.mas4 <- express(data.u133p2, "BgrdMAS4", filedir=outdir, tmpdir="", update=FALSE, bgcorrect.method="sector", bgcorrect.select="all", bgcorrect.option="subtractbg", bgcorrect.params=c(0.02, 4, 4, 0)) @ However, using the default settings has severe disadvantages: First, using \Rcode{bgcorrect.option="subtractbg"} results in a high percentage of background-corrected probe intensities with negative values! This was the most severe problem of the MAS4 algorithm \citep{affy:tech:2001} since it resulted in negative expression levels, too. Second, to divide an array into 16 sectors was sufficient for the initial expression arrays, e.g. HuGeneFL, but for the newer arrays, e.g. HG-U133\_Plus\_2, and especially the whole genome and exon arrays it would be of advantage to divide the array into more sectors, e.g. 64 sectors. \\ Thus, the following setup is recommended for \Rcode{sector} background correction: <<eval=FALSE>>= bgrd.mas <- express(data.u133p2, "BgrdMAS8x8", filedir=outdir, tmpdir="", update=FALSE, bgcorrect.method="sector", bgcorrect.select="both", bgcorrect.option="correctbg", bgcorrect.params=c(0.02, 8, 8, 3, 0.5)) @ Here we divide each array into 8x8 sectors, use only PM+MM probes, and use \Rcode{bgcorrect.option="correctbg"}, which is the default option for MAS5 and avoids negative probe intensities. \\ Now we can compare the resulting background values by drawing the corresponding images: <<eval=FALSE>>= bgrdnames <- colnames(validBgrd(bgrd.mas4)) image(bgrd.mas4, bg=TRUE, transfo=log2, col=heat.colors(12), names=bgrdnames[4]) bgrdnames <- colnames(validBgrd(bgrd.mas)) image(bgrd.mas, bg=TRUE, transfo=log2, col=heat.colors(12), names=bgrdnames[4]) @ \begin{figure}[h] \begin{center} \includegraphics[width=70mm]{Image_BgrdMAS4_bgrd4.png} \includegraphics[width=70mm]{Image_BgrdMAS8x8x_bgrd4.png} % \caption{xxx} % \label{xxx} \end{center} \end{figure} \pagebreak As both figures show there is a density gradient indicating that there might have been a problem with the staining/washing step of this specific array. However, the right image shows this effect more clearly, and additionally it results in more smooth transitions between adjacent sectors. {\bf Note:} To test the quality of arrays with respect to potential density gradients, I would propose to use the \Rcode{sector} background with 8x8 sectors and option \Rcode{"correctbg"}. In addition, these settings could also replace the default MAS5 \Rcode{weightedsector} background algorithm described below. %- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - \subsection{weightedsector (mas5)} %- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - The \Rcode{weightedsector} background correction method divides the array into a certain number of sectors and identifies the lowest 2\% of probe intensities for each sector. However, in contrast to \Rcode{sector} only PM and MM probes are used and each probe is adjusted based upon a weighted average of the backgrounds for each of the sectors. The weights are based on the distances between the location of the probe and the centriods of the different sectors \citep{affy:tech:2002}. When using function \Rfunction{express()} the following \Rcode{bgcorrect} parameters can be used for \Rcode{bgcorrect.method="weightedsector"}: \begin{tabbing} xxxxxxxxxxxxx\=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\kill Settings to be used for \Rcode{bgcorrect.select}: \\ {\tt all}: \> all probes are selected \\ {\tt pmonly}: \> only PM probes are selected \\ {\tt mmonly}: \> only MM probes are selected \\ {\tt both}: \> both PM and MM probes are selected (default) \\ {\tt none}: \> scheme mask is not changed by selector \end{tabbing} \begin{tabbing} xxxxxxxxxxxxx\=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\kill Settings to be used for \Rcode{bgcorrect.option}: \\ {\tt subtractbg}: \> subtract bgrd from intensity - result can be negative \\ {\tt correctbg}: \> correct bgrd with noise fraction to avoid negative results (default) \\ {\tt attenuatebg}: \> use generalized log-transform to avoid negative results \end{tabbing} \begin{tabbing} xxxxxxxxxxxxx\=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\kill Settings to be used for \Rcode{bgcorrect.params}: \\ The following parameters are necessary: \\ {\tt pcntcells}: \> precent of cells with lowest intensities in each sector (default: 0.02) \\ {\tt secrows}: \> number of sector rows (default: 4) \\ {\tt seccols}: \> number of sector columns (default: 4) \\ {\tt smoothiter}: \> number of iterations used for smoothing of sector bgrd values (default: 0) \\ {\tt smooth}: \> smoothing parameter added to avoid infinite weights (default: 100) \\ Parameter when using default \Rcode{mas5} background \Rcode{bgcorrect.option="correctbg"}: \\ {\tt noisefrac}: \> fraction of global background variation (default: 0.5) \\ Optional parameters when using alternative background \Rcode{bgcorrect.option="attenuatebg"}: \\ {\tt l}: \> tunable parameter, 0<=l<=1 (default: 0.005) \\ {\tt h}: \> parameter (default: -1) \\ \end{tabbing} Using the default settings the MAS5 \Rcode{weightedsector} background correction is obtained as follows: <<eval=FALSE>>= bgrd.mas5 <- express(data.u133p2, "BgrdMAS5", filedir=outdir, tmpdir="", update=FALSE, bgcorrect.method="weightedsector", bgcorrect.select="both", bgcorrect.option="correctbg", bgcorrect.params=c(0.02, 4, 4, 0, 100, 0.5)) @ For quality control purposes we can e.g. draw density plots of the background-corrected intensities, and boxplots of the background values: <<eval=FALSE>>= hist(bgrd.mas5, which="both") bgrd <- validBgrd(bgrd.mas5) boxplot(log2(bgrd), las=2) @ \begin{figure}[h] \begin{center} \includegraphics[width=70mm]{DensityPlot_BgrdMAS5_both.png} \includegraphics[width=70mm]{BoxPlot_BgrdMAS5_bgrd.png} % \caption{xxx} % \label{xxx} \end{center} \end{figure} The density plot of the background-corrected intensities reveals a small peak around zero log intensity indicating that for a certain fraction of probes intensities were replaced by the noise fraction. Reducing the lowest percentage of cells used for background correction to 0.5\%, i.e. \Rcode{pcntcells=0.005} eliminated this peak almost completely. \\ As already explained for \Rcode{sector} background, it is of advantage to increase the number of sectors used to 8x8 sectors: <<eval=FALSE>>= bgrd.mas <- express(data.u133p2, "BgrdMAS5_8x8", filedir=outdir, tmpdir="", update=FALSE, bgcorrect.method="weightedsector", bgcorrect.select="both", bgcorrect.option="correctbg", bgcorrect.params=c(0.02, 8, 8, 3, 100, 0.5)) @ As above, we can compare the resulting background values by drawing the corresponding images: <<eval=FALSE>>= bgrdnames <- colnames(validBgrd(bgrd.mas5)) image(bgrd.mas5, bg=TRUE, transfo=log2, col=heat.colors(12), names=bgrdnames[4]) bgrdnames <- colnames(validBgrd(bgrd.mas4)) image(bgrd.mas, bg=TRUE, transfo=log2, col=heat.colors(12), names=bgrdnames[4]) @ \begin{figure}[h] \begin{center} \includegraphics[width=70mm]{Image_BgrdMAS5_bgrd4.png} \includegraphics[width=70mm]{Image_BgrdMAS5_8x8_bgrd4.png} % \caption{xxx} % \label{xxx} \end{center} \end{figure} Due to the weighting algorithm used, each sector is smoothed so that it is a "cone" centered in the middle of the sector. Once again, the right image shows the density gradient more clearly and results in more smooth transitions between adjacent sectors. {\bf Note:} Due to the weighting algorithm used computing the \Rcode{weightedsector} background consumes large amounts of memory, which can be a severe problem when trying to use 8x8 sectors for high density arrays, especially exon arrays. %- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - \subsection{gccontent} %- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - The \Rcode{gccontent} background attenuation method identifies MM probes and PM probes with the same GC content and uses the mean intensity of the MM probes to "attenuate" the corresponding PM intensities. It was introduced by Affymetrix for the exon arrays containing special "genomic" and "antigenomic" probes \citep{affy:tech:2005c}, but can also be used for expression arrays. When using function \Rfunction{express()} the following \Rcode{bgcorrect} parameters can be used for \Rcode{bgcorrect.method="gccontent"}: \begin{tabbing} xxxxxxxxxxxxx\=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\kill Settings to be used for \Rcode{bgcorrect.select}: \\ {\tt all}: \> all probes are selected \\ {\tt pmonly}: \> only PM probes are selected \\ {\tt mmonly}: \> only MM probes are selected \\ {\tt both}: \> both PM and MM probes are selected \\ {\tt none}: \> scheme mask is not changed by selector \end{tabbing} \begin{tabbing} xxxxxxxxxxxxx\=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\kill Settings to be used for \Rcode{bgcorrect.option}: \\ {\tt subtractbg}: \> subtract bgrd from intensity - result can be negative \\ {\tt correctbg}: \> correct bgrd with noise fraction to avoid negative results \\ {\tt attenuatebg}: \> use generalized log-transform to avoid negative results (default) \end{tabbing} \begin{tabbing} xxxxxxxxxxxxx\=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\kill Settings to be used for \Rcode{bgcorrect.params}: \\ The following parameters are necessary: \\ {\tt trim}: \> trim value for trimmed mean (default: 0.5) \\ Parameters when using default background \Rcode{bgcorrect.option="attenuatebg"}: \\ {\tt l}: \> tunable parameter, 0<=l<=1 (default: 0.005) \\ {\tt h}: \> parameter (default: -1) \\ Optional parameter when using background \Rcode{bgcorrect.option="correctbg"}: \\ {\tt noisefrac}: \> fraction of global background variation (default: 0.5) \\ \end{tabbing} The \Rcode{gccontent} background correction is computed as follows: <<eval=FALSE>>= bgrd.gc <- express(data.u133p2, "BgrdGC", filedir=outdir, tmpdir="", update=FALSE, bgcorrect.method="gccontent", bgcorrect.select="none", bgcorrect.option="attenuatebg", bgcorrect.params=c(0.4, 0.005, -1.0)) @ For quality control purposes we can draw density plots of the background-corrected intensities, and boxplots of the background values: <<eval=FALSE>>= hist(bgrd.gc, which="pm") bgrd <- validBgrd(bgrd.gc, which="pm") boxplot(log2(bgrd), las=2) @ \begin{figure}[h] \begin{center} \includegraphics[width=70mm]{DensityPlot_BgrdGC_pm.png} \includegraphics[width=70mm]{BoxPlot_BgrdGC_bgrd.png} % \caption{xxx} % \label{xxx} \end{center} \end{figure} In addition we can draw the image for the background values: <<eval=FALSE>>= bgrdnames <- colnames(validBgrd(bgrd.gc)) image(bgrd.gc, bg=TRUE, transfo=log2, col=heat.colors(12), names=bgrdnames[4]) @ \begin{figure}[h] \begin{center} \includegraphics[width=70mm]{Image_BgrdGC_bgrd4.png} % \caption{xxx} % \label{xxx} \end{center} \end{figure} \pagebreak %----------------------------------------------------------------------- \section{Probe-level Normalization using \Rfunction{express()}} %----------------------------------------------------------------------- Currently the following probe-level normalization methods are available as \Rcode{normalize.method}: \begin{itemize} \item {\tt quantile}: normalization is done by replacing sorted intensities with the respective column means. \item {\tt mean}: all arrays are scaled so that they have the same trimmed mean value. \item {\tt median}: all arrays are scaled so that they have the same median value. \item {\tt lowess}: each array is normalized to a reference array using locally-weighted polynomial regression. \item {\tt supsmu}: each array is normalized to a reference array by applying Friedmann's super smoother. \end{itemize} {\bf Note 1:} The following examples will be based on \Rcode{sector} background \Robject{bgrd.mas} although we could use any of the above background algorithms, or even skip the background correction by using the raw data \Robject{data.u133p2}. \\ {\bf Note 2:} By default the summarization methods \Rcode{avgdiff} (MAS4) and \Rcode{tukeybiweight} (MAS5) skip the probe-level normalization step, i.e. use \Robject{data.u133p2}. %- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - \subsection{quantile (rma)} %- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - The quantile normalization method was introduced by Bolstad \citep{bols:etal:2003} with the goal to give each array the same empirical distribution. To allow quantile normalization of many thousand arrays on computers with 1-2 GB RAM only, package \xps\ uses \ROOT\ trees for quantile normalization: Intensities for each array are sorted and stored in \ROOT\ trees together with their ranks. After replacing the sorted tree entries with the mean entries of all trees, the trees are sorted according to their original ranks. When using function \Rfunction{express()} the following \Rcode{normalize} parameters can be used for \Rcode{normalize.method="quantile"}: \begin{tabbing} xxxxxxxxxxxxx\=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\kill Settings to be used for \Rcode{normalize.select}: \\ {\tt pmonly}: \> only PM probes are selected \\ {\tt mmonly}: \> only MM probes are selected \\ {\tt both}: \> both PM and MM probes are selected \\ {\tt none}: \> scheme mask is not changed by selector \end{tabbing} \begin{tabbing} xxxxxxxxxxxxx\=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\kill Settings to be used for \Rcode{normalize.option = "<option>:<dataopt>:<bgrdopt>"}: \\ {\tt option}: \> use unit tree for "transcript", "exon", "probeset" \\ {\tt dataopt}: \> applied to PM and MM "together" or "separate" \\ {\tt bgrdopt}: \> "none" - no background subtraction (already done) \end{tabbing} \begin{tabbing} xxxxxxxxxxxxx\=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\kill Settings to be used for \Rcode{normalize.params}: \\ {\tt trim}: \> trim value for trimmed mean (default: 0.0) \\ {\tt delta}: \> use robust ties (0.4 for package affy >=1.16.0) or not (default: 1.0) \end{tabbing} The \Rcode{quantile} probe-level normalization is computed as follows: <<eval=FALSE>>= norm.qu <- express(bgrd.mas, "NormQuan", filedir=outdir, update=FALSE, normalize.method="quantile", normalize.select="all", normalize.option="transcript:together:none", normalize.logbase="0", normalize.params=c(0.0, 1.0)) @ As for the initial example of step-wise RMA computation (see chapter 2.2 above), density plots, boxplots and scatter plots can be drawn as quality controls. %- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - \subsection{mean/median (constant)} %- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - All arrays are scaled so that they have the same mean/median value. This value is determined either by the mean/median value of a reference tree or by a given target intensity. When using function \Rfunction{express()} the following \Rcode{normalize} parameters can be used for \Rcode{normalize.method="mean"} or for \Rcode{normalize.method="median"}, respectively: \begin{tabbing} xxxxxxxxxxxxx\=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\kill Settings to be used for \Rcode{normalize.select}: \\ {\tt all}: \> all probes are selected \\ {\tt pmonly}: \> only PM probes are selected \\ {\tt mmonly}: \> only MM probes are selected \\ {\tt both}: \> both PM and MM probes are selected \end{tabbing} \begin{tabbing} xxxxxxxxxxxxx\=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\kill Settings to be used for \Rcode{normalize.option = "<option>:<dataopt>"}: \\ {\tt option}: \> use unit tree for "transcript", "exon", "probeset" \\ {\tt dataopt}: \> applied to "all" selected probes \end{tabbing} \begin{tabbing} xxxxxxxxxxxxx\=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\kill Settings to be used for \Rcode{normalize.logbase}: \\ {\tt logbase}: \> "0" - linear, "log", "log2", "log10" - log with base e, 2, 10 \end{tabbing} \begin{tabbing} xxxxxxxxxxxxx\=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\kill Settings to be used for \Rcode{normalize.params}: \\ {\tt trim}: \> trim value for trimmed mean (only for method "mean") \\ {\tt targetinten}: \> if > 0 then scale mean/median to value of target intensity \end{tabbing} The trimmed \Rcode{mean} probe-level normalization is computed as follows (by default, a mean reference tree is computed first): <<eval=FALSE>>= norm.mn <- express(bgrd.mas, "NormMean", filedir=outdir, tmpdir="", update=FALSE, normalize.method="mean", normalize.select="both", normalize.option="transcript:all", normalize.logbase="0", normalize.params=c(0.0, -1)) @ Analogously, when determining a certain target intensity (e.g. 500) as reference, the \Rcode{median} probe-level normalization is computed as follows: <<eval=FALSE>>= norm.md <- express(bgrd.mas, "NormMedian", filedir=outdir, tmpdir="", update=FALSE, normalize.method="median", normalize.select="pmonly", normalize.option="transcript:all", normalize.logbase="log2", normalize.params=c(500), reference.index=1) @ Here the reference index is set to \Rcode{reference.index=1} to prevent computing a mean reference tree which will not be used since the intensities are normalized to target intensity 500 and not to a reference tree. %- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - \subsection{lowess} %- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - Each array is normalized to a reference array using locally-weighted polynomial regression. The reference array can be one of the arrays or the mean/median of all arrays. When using function \Rfunction{express()} the following \Rcode{normalize} parameters can be used for \Rcode{normalize.method="lowess"}: \begin{tabbing} xxxxxxxxxxxxx\=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\kill Settings to be used for \Rcode{normalize.select}: \\ {\tt all}: \> all probes are selected \\ {\tt pmonly}: \> only PM probes are selected \\ {\tt mmonly}: \> only MM probes are selected \\ {\tt both}: \> both PM and MM probes are selected \end{tabbing} \begin{tabbing} xxxxxxxxxxxxx\=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\kill Settings to be used for \Rcode{normalize.option = "<option>:<dataopt>"}: \\ {\tt option}: \> use unit tree for "transcript", "exon", "probeset" \\ {\tt dataopt}: \> applied to "all" selected probes \end{tabbing} \begin{tabbing} xxxxxxxxxxxxx\=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\kill Settings to be used for \Rcode{normalize.logbase}: \\ {\tt logbase}: \> "0" - linear, "log", "log2", "log10" - log with base e, 2, 10 \end{tabbing} \begin{tabbing} xxxxxxxxxxxxx\=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\kill Settings to be used for \Rcode{normalize.params}: \\ The following parameters are necessary: \\ {\tt span}: \> smoother span (default: 2/3) \\ {\tt iter}: \> number of robustifying iterations (default: 3) \\ {\tt rule}: \> 0, 1 or 2, determines how approx interpolation should be done at ends \\ {\tt f}: \> for approx method "constant" a value between 0 and 1 (not used) \end{tabbing} The \Rcode{lowess} probe-level normalization is computed as follows: <<eval=FALSE>>= norm.low <- express(bgrd.mas, "NormLowess", filedir=outdir, tmpdir="", update=FALSE, normalize.method="lowess", normalize.select="pmonly", normalize.option="transcript:all", normalize.logbase="log2", normalize.params=c(0.67, 3.0, 0.0, 0.0)) @ %- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - \subsection{supsmu} %- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - Each array is normalized to a reference array by applying Friedmann's super smoother. The reference array can be one of the arrays or the mean/median of all arrays. When using function \Rfunction{express()} the following \Rcode{normalize} parameters can be used for \Rcode{normalize.method="supsmu"}: \begin{tabbing} xxxxxxxxxxxxx\=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\kill Settings to be used for \Rcode{normalize.select}: \\ {\tt all}: \> all probes are selected \\ {\tt pmonly}: \> only PM probes are selected \\ {\tt mmonly}: \> only MM probes are selected \\ {\tt both}: \> both PM and MM probes are selected \end{tabbing} \begin{tabbing} xxxxxxxxxxxxx\=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\kill Settings to be used for \Rcode{normalize.option = "<option>:<dataopt>"}: \\ {\tt option}: \> use unit tree for "transcript", "exon", "probeset" \\ {\tt dataopt}: \> applied to "all" selected probes \end{tabbing} \begin{tabbing} xxxxxxxxxxxxx\=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\kill Settings to be used for \Rcode{normalize.logbase}: \\ {\tt logbase}: \> "0" - linear, "log", "log2", "log10" - log with base e, 2, 10 \end{tabbing} \begin{tabbing} xxxxxxxxxxxxx\=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\kill Settings to be used for \Rcode{normalize.params}: \\ The following parameters are necessary: \\ {\tt bass}: \> controls smoothness of fitted curve (default: 0.0) \\ {\tt span}: \> fraction of observations in span (default: 0.0) \\ {\tt rule}: \> 0, 1 or 2, determines how approx interpolation should be done at ends \\ {\tt f}: \> for approx method "constant" a value between 0 and 1 (not used) \end{tabbing} The \Rcode{supsmu} probe-level normalization is computed as follows: <<eval=FALSE>>= norm.sup <- express(bgrd.mas, "NormSupsmuR2", filedir=outdir, tmpdir="", update=FALSE, normalize.method="supsmu", normalize.select="pmonly", normalize.option="transcript:all", normalize.logbase="log2", normalize.params=c(0.0, 0.0, 2.0, 0.0)) @ As quality control we can draw density plots and boxplots of the normalized probe intensities: <<eval=FALSE>>= hist(norm.sup, which="pm") boxplot(norm.sup, which="pm") @ \begin{figure}[h] \begin{center} \includegraphics[width=70mm]{DensityPlot_NormSup_BgrdMAS.png} \includegraphics[width=70mm]{BoxPlot_NormSup_BgrdMAS.png} % \caption{xxx} % \label{xxx} \end{center} \end{figure} \pagebreak %----------------------------------------------------------------------- \section{Summarization using \Rfunction{express()}} %----------------------------------------------------------------------- Summarization converts the (background-corrected and/or normalized) probe intensities to the corresponding expression levels for probe sets defined with \Rcode{summarize.option}: \begin{tabbing} xxx\=xxxxxxxxxxxxx\=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\kill %Settings to be used for \Rcode{summarize.option}: \\ \> {\tt transcript}: \> summarization units are transcripts (default) \\ \> {\tt exon}: \> summarization units are exons (exon arrays only) \\ \> {\tt probeset}: \> summarization units are probesets (exon arrays only) \end{tabbing} \begin{tabbing} Currently the following summarization methods are available as \Rcode{summarize.method}: \end{tabbing} \begin{itemize} \item {\tt medianpolish}: fits an additive model using Tukey's median polish procedure. \item {\tt avgdiff}: computes the average difference between PM and MM. \item {\tt tukeybiweight}: a Tukey biweight estimator is used to compute a robust mean of (adjusted) intensities. \item {\tt dfw}: a distribution-free weight is applied to each probe before summarization. \item {\tt farms}: a factor analysis model is used for noise correction of measurement values before summarization. \end{itemize} %- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - \subsection{medianpolish (rma)} %- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - This multichip summarization method fits a robust linear model using Tukey's median polish procedure. When using function \Rfunction{express()} the following \Rcode{summarize} parameters can be used for \Rcode{summarize.method="medianpolish"}: \begin{tabbing} xxxxxxxxxxxxx\=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\kill Settings to be used for \Rcode{summarize.select}: \\ {\tt pmonly}: \> only PM probes are selected (default) \\ {\tt mmonly}: \> only MM probes are selected \end{tabbing} \begin{tabbing} xxxxxxxxxxxxx\=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\kill Settings to be used for \Rcode{summarize.logbase}: \\ {\tt logbase}: \> "0" - linear, "log", "log2", "log10" - log with base e, 2, 10 \end{tabbing} \begin{tabbing} xxxxxxxxxxxxx\=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\kill Settings to be used for \Rcode{summarize.params}: \\ {\tt maxiter}: \> maximal number of iterations (default: 10) \\ {\tt eps}: \> epsilon of test for convergence (default: 0.01) \\ {\tt neglog}: \> substitution for logarithm of negative values (default: 1.0) \end{tabbing} The default RMA \Rcode{medianpolish} summarization is computed as follows: <<eval=FALSE>>= expr.mp <- express(norm.qu, "ExprMedpol", filedir=outdir, update=FALSE, summarize.method="medianpolish", summarize.select="pmonly", summarize.option="transcript", summarize.logbase="log2", summarize.params=c(10, 0.01, 1.0)) @ As for the initial example of step-wise RMA computation (see chapter 2.2 above), density plots, boxplots and scatter plots can be drawn as quality controls. \pagebreak %- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - \subsection{avgdiff (mas4)} %- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - This summarization method takes the difference PM-MM of every probe pair and averages the differences over the entire probe set. When using function \Rfunction{express()} the following \Rcode{summarize} parameters can be used for \Rcode{summarize.method="avgdiff"}: \begin{tabbing} xxxxxxxxxxxxx\=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\kill Settings to be used for \Rcode{summarize.select}: \\ {\tt none}: \> scheme mask is not changed by selector (default) \end{tabbing} \begin{tabbing} xxxxxxxxxxxxx\=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\kill Settings to be used for \Rcode{summarize.logbase}: \\ {\tt logbase}: \> "0" - linear, "log", "log2", "log10" - log with base e, 2, 10 \end{tabbing} \begin{tabbing} xxxxxxxxxxxxx\=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\kill Settings to be used for \Rcode{summarize.params}: \\ {\tt STP}: \> number of standard deviations (default: 3) \end{tabbing} The MAS4 \Rcode{avgdiff} summarization is computed as follows (here we use the background-corrected intensities \Rcode{bgrd.mas} w/o normalization): <<eval=FALSE>>= expr.adf <- express(bgrd.mas, "ExprAvgDif", filedir=outdir, update=FALSE, summarize.method="avgdiff", summarize.select="none", summarize.option="transcript", summarize.logbase="0", summarize.params=c(3.0)) @ %- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - \subsection{tukeybiweight (mas5)} %- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - After subtracting an ideal mismatch value from PM, a 1-step Tukey biweight estimator is used to compute a robust mean of the resulting values. When using function \Rfunction{express()} the following \Rcode{summarize} parameters can be used for \Rcode{summarize.method="tukeybiweight"}: \begin{tabbing} xxxxxxxxxxxxx\=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\kill Settings to be used for \Rcode{summarize.select}: \\ {\tt none}: \> scheme mask is not changed by selector (default) \end{tabbing} \begin{tabbing} xxxxxxxxxxxxx\=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\kill Settings to be used for \Rcode{summarize.logbase}: \\ {\tt logbase}: \> "0" - linear, "log", "log2", "log10" - log with base e, 2, 10 \end{tabbing} \begin{tabbing} xxxxxxxxxxxxx\=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\kill Settings to be used for \Rcode{summarize.params}: \\ {\tt tau}: \> contrast tau (default: 0.03) \\ {\tt scaletau}: \> scale tau (default: 10) \\ {\tt delta}: \> delta for probe value calculation (default: 2.0e-20) \\ {\tt c}: \> a tuning constant (default: 5.0) \\ {\tt eps}: \> small value to avoid zeros in division (default: 0.0001) \\ {\tt neglog}: \> substitution for logarithm of negative values (default: 1.0) \\ {\tt noisefrac}: \> fraction of global background variation (default: 0.5) \end{tabbing} The MAS5 \Rcode{tukeybiweight} summarization is computed as follows (here we use the background-corrected intensities \Rcode{bgrd.mas} w/o normalization): <<eval=FALSE>>= expr.tbw <- express(bgrd.mas, "ExprTukey", filedir=outdir, update=FALSE, summarize.method="tukeybiweight", summarize.select="none", summarize.option="transcript", summarize.logbase="log2", summarize.params=c(0.03, 10.0, 2.0e-20, 5.0, 0.0001, 1.0, 0.5)) @ \pagebreak Now we can draw density plots and boxplots of the final expression values: <<eval=FALSE>>= hist(expr.tbw) boxplot(expr.tbw) @ \begin{figure}[h] \begin{center} \includegraphics[width=70mm]{DensityPlot_ExprTukey.png} \includegraphics[width=70mm]{BoxPlot_ExprTukey.png} % \caption{xxx} % \label{xxx} \end{center} \end{figure} As the boxplot reveals no probe-level normalization was done since the MAS5 algorithm usually scales the expression levels to a certain target intensity (see function \Rcode{normalize}). \\ In addition we can draw scatter plots between expression values at the transcript-level: <<eval=FALSE>>= expr <- validData(expr.tbw) plot(log2(expr[,1]), log2(expr[,2]), xlab="BreastA", ylab="BreastB") plot(log2(expr[,1]), log2(expr[,4]), xlab="BreastA", ylab="ProstateA") @ \begin{figure}[h] \begin{center} \includegraphics[width=70mm]{ScatterPlot_ExprTukey_BrABrB.png} \includegraphics[width=70mm]{ScatterPlot_ExprTukey_BrAPrA.png} % \caption{xxx} % \label{xxx} \end{center} \end{figure} \pagebreak %- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - \subsection{dfw} %- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - This multichip summarization method uses the Tukey weight function to give probes with unusually high or low variability across arrays small weights. These weighted probe intensities are used to obtain the summarized expression values \citep{chen:etal:2007}. When using function \Rfunction{express()} the following \Rcode{summarize} parameters can be used for \Rcode{summarize.method="dfw"}: \begin{tabbing} xxxxxxxxxxxxx\=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\kill Settings to be used for \Rcode{summarize.select}: \\ {\tt pmonly}: \> only PM probes are selected (default) \\ {\tt mmonly}: \> only MM probes are selected \\ {\tt both}: \> both PM and MM probes are selected \end{tabbing} \begin{tabbing} xxxxxxxxxxxxx\=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\kill Settings to be used for \Rcode{summarize.logbase}: \\ {\tt logbase}: \> "0" - linear, "log", "log2", "log10" - log with base e, 2, 10 \end{tabbing} \begin{tabbing} xxxxxxxxxxxxx\=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\kill Settings to be used for \Rcode{summarize.params}: \\ {\tt m}: \> exponent for range WR (default: 3) \\ {\tt n}: \> exponent for stdev WSD (default: 1) \\ {\tt c}: \> scale parameter (default: 0.01) \end{tabbing} The \Rcode{dfw} summarization is computed as follows (w/o background correction): <<eval=FALSE>>= expr.dfw <- express(norm.qu, "ExprDFW", filedir=outdir, update=FALSE, summarize.method="dfw", summarize.select="pmonly", summarize.option="transcript", summarize.logbase="log2", summarize.params=c(3.0, 1.0, 0.01)) @ Now we can draw density plots and boxplots of the final expression values: <<eval=FALSE>>= hist(expr.dfw) boxplot(expr.dfw) @ \begin{figure}[h] \begin{center} \includegraphics[width=70mm]{DensityPlot_ExprDFW.png} \includegraphics[width=70mm]{BoxPlot_ExprDFW.png} % \caption{xxx} % \label{xxx} \end{center} \end{figure} In addition we can draw scatter plots between expression values at the transcript-level: <<eval=FALSE>>= expr <- validData(expr.dfw) plot(log2(expr[,1]), log2(expr[,2]), xlab="BreastA", ylab="BreastB") plot(log2(expr[,1]), log2(expr[,4]), xlab="BreastA", ylab="ProstateA") @ \begin{figure}[h] \begin{center} \includegraphics[width=70mm]{ScatterPlot_ExprDFW_BrABrB.png} \includegraphics[width=70mm]{ScatterPlot_ExprDFW_BrAPrA.png} % \caption{xxx} % \label{xxx} \end{center} \end{figure} %- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - \subsection{farms} %- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - This multichip summarization method is based on a factor analysis model for which a Bayesian maximum a posteriori method optimizes the model parameters under the assumption of Gaussian measurement noise. Afterwards, the expression levels are estimated from the model \citep{hochr:etal:2006}. When using function \Rfunction{express()} the following \Rcode{summarize} parameters can be used for \Rcode{summarize.method="farms"}: \begin{tabbing} xxxxxxxxxxxxx\=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\kill Settings to be used for \Rcode{summarize.select}: \\ {\tt pmonly}: \> only PM probes are selected (default) \\ {\tt mmonly}: \> only MM probes are selected \\ {\tt both}: \> both PM and MM probes are selected \end{tabbing} \begin{tabbing} xxxxxxxxxxxxx\=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\kill Settings to be used for \Rcode{summarize.logbase}: \\ {\tt logbase}: \> "0" - linear, "log", "log2", "log10" - log with base e, 2, 10 \end{tabbing} \begin{tabbing} xxxxxxxxxxxxx\=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\kill Settings to be used for \Rcode{summarize.params}: \\ {\tt version}: \> version of farms package, 131 (farms\_1.3.1); 130 (farms\_1.3) (default: 131) \\ {\tt weight}: \> hyperparameter (default: 0.5) \\ {\tt mu}: \> hyperparameter (default: 0.0) \\ {\tt scale}: \> scaling parameter (default: 1.0) \\ {\tt tol}: \> termination tolerance (default: 0.00001) \\ {\tt cyc}: \> maximum number of cycles (default: 100) \\ {\tt weighted}: \> weighted mean (for 131 only) (default: 1.0) \end{tabbing} The default \Rcode{farms} summarization is computed as follows (w/o background correction): <<eval=FALSE>>= expr.frm <- express(norm.qu, "ExprFARMS", filedir=outdir, update=FALSE, summarize.method="farms", summarize.select="pmonly", summarize.option="transcript", summarize.logbase="log2", summarize.params=c(131, 0.5, 0.0, 1.0, 0.00001, 100, 1)) @ \pagebreak Now we can draw density plots and boxplots of the final expression values: <<eval=FALSE>>= hist(expr.frm) boxplot(expr.frm) @ \begin{figure}[h] \begin{center} \includegraphics[width=70mm]{DensityPlot_ExprFARMS.png} \includegraphics[width=70mm]{BoxPlot_ExprFARMS.png} % \caption{xxx} % \label{xxx} \end{center} \end{figure} In addition we can draw scatter plots between expression values at the transcript-level: <<eval=FALSE>>= expr <- validData(expr.frm) plot(log2(expr[,1]), log2(expr[,2]), xlab="BreastA", ylab="BreastB") plot(log2(expr[,1]), log2(expr[,4]), xlab="BreastA", ylab="ProstateA") @ \begin{figure}[h] \begin{center} \includegraphics[width=70mm]{ScatterPlot_ExprFARMS_BrABrB.png} \includegraphics[width=70mm]{ScatterPlot_ExprFARMS_BrAPrA.png} % \caption{xxx} % \label{xxx} \end{center} \end{figure} {\bf Note:} Although package \Rpackage{farms} is available under the GNU GPL License, the authors state on their web site that: "This package (i.e. \Rpackage{farms\_1.x}) is only free for non-commercial users. Non-academic users must have a valid license." Since I do not know if this statement applies for my C++ implementation, too, it is recommended that respective users contact the authors of the original package. %----------------------------------------------------------------------- \section{Probeset-level Normalization using \Rfunction{normalize()}} %----------------------------------------------------------------------- Currently the following probeset-level normalization methods are available as \Rcode{method}: \begin{itemize} \item {\tt mean}: all arrays are scaled so that they have the same trimmed mean value. \item {\tt median}: all arrays are scaled so that they have the same median value. \item {\tt lowess}: each array is normalized to a reference array using locally-weighted polynomial regression. \item {\tt supsmu}: each array is normalized to a reference array by applying Friedmann's super smoother. \end{itemize} %- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - \subsection{mean/median (mas)} %- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - All arrays are scaled so that they have the same mean/median value. This value is determined either by the mean/median value of a reference tree or by a given target intensity. When using function \Rfunction{normalize()} the following parameters can be used for \Rcode{method="mean"} or for \Rcode{method="median"}, respectively: \begin{tabbing} xxxxxxxxxxxxx\=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\kill Settings to be used for \Rcode{select}: \\ {\tt separate}: \> use separate mask for each normalization (default) \\ {\tt together}: \> use combined masks for normalization \end{tabbing} \begin{tabbing} xxxxxxxxxxxxx\=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\kill Settings to be used for \Rcode{normalize.option = "<option>:<dataopt>"}: \\ {\tt option}: \> use unit tree for "transcript", "exon", "probeset" \\ {\tt dataopt}: \> applied to "all" selected probes \end{tabbing} \begin{tabbing} xxxxxxxxxxxxx\=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\kill Settings to be used for \Rcode{normalize.logbase}: \\ {\tt logbase}: \> "0" - linear, "log", "log2", "log10" - log with base e, 2, 10 \end{tabbing} \begin{tabbing} xxxxxxxxxxxxx\=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\kill Settings to be used for \Rcode{normalize.params}: \\ {\tt trim}: \> trim value for trimmed mean (only for method "mean") \\ {\tt targetinten}: \> if > 0 then scale mean/median to value of target intensity \end{tabbing} The trimmed \Rcode{mean} probeset-level normalization is computed as follows: <<eval=FALSE>>= nrxp.mn <- normalize(expr.tbw, "NrmExpMean", filedir=outdir, update=FALSE, select="separate", method="mean", option="transcript:all", logbase="0", refindex=1, refmethod="mean", params=c(0.02, 500)) @ Here the reference index is set to \Rcode{refindex=1} to prevent computing a mean reference tree which will not be used since the intensities are normalized to target intensity 500 and not to a reference tree. \\ Now we can draw density plots and boxplots of the normalized expression values: <<eval=FALSE>>= hist(nrxp.mn) boxplot(nrxp.mn) @ \begin{figure}[h] \begin{center} \includegraphics[width=70mm]{DensityPlot_NrmExpMean.png} \includegraphics[width=70mm]{BoxPlot_NrmExpMean.png} % \caption{xxx} % \label{xxx} \end{center} \end{figure} \pagebreak {\bf Note:} Probeset-level normalization is usually necessary for the MAS4 and MAS5 algorithms, since no probe-level normalization is done and the average expression level (after removing the 2\% highest and lowest expression values) \citep{affy:tech:2002} is usually scaled to a "Target Intensity" such as \Rcode{targetinten=500} for expression arrays or \Rcode{targetinten=100} for exon arrays. %- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - \subsection{lowess} %- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - Each array is normalized to a reference array using locally-weighted polynomial regression. The reference array can be one of the arrays or the mean/median of all arrays. When using function \Rfunction{normalize()} the following parameters can be used for \Rcode{method="lowess"}: \begin{tabbing} xxxxxxxxxxxxx\=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\kill Settings to be used for \Rcode{select}: \\ {\tt separate}: \> use separate mask for each normalization (default) \\ {\tt together}: \> use combined masks for normalization \end{tabbing} \begin{tabbing} xxxxxxxxxxxxx\=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\kill Settings to be used for \Rcode{option = "<option>:<dataopt>"}: \\ {\tt option}: \> use unit tree for "transcript", "exon", "probeset" \\ {\tt dataopt}: \> applied to "all" probes, selected "sel" probes only, or percentage probes (range (0,1]) \end{tabbing} \begin{tabbing} xxxxxxxxxxxxx\=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\kill Settings to be used for \Rcode{logbase}: \\ {\tt logbase}: \> "0" - linear, "log", "log2", "log10" - log with base e, 2, 10 \end{tabbing} \begin{tabbing} xxxxxxxxxxxxx\=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\kill Settings to be used for \Rcode{params}: \\ The following parameters are necessary: \\ {\tt span}: \> smoother span (default: 2/3) \\ {\tt iter}: \> number of robustifying iterations (default: 3) \\ {\tt rule}: \> 0, 1 or 2, determines how approx interpolation should be done at ends \\ {\tt f}: \> for approx method "constant" a value between 0 and 1 (not used) \end{tabbing} The \Rcode{lowess} probeset-level normalization is computed as follows: <<eval=FALSE>>= nrxp.low <- normalize(expr.tbw, "tmp_NrmExpLow", filedir=outdir, update=FALSE, select="separate", method="lowess", option="transcript:all", logbase="log2", refindex=0, refmethod="mean", params=c(0.67, 3, 0.0, 0.0)) @ %- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - \subsection{supsmu} %- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - Each array is normalized to a reference array by applying Friedmann's super smoother. The reference array can be one of the arrays or the mean/median of all arrays. When using function \Rfunction{normalize()} the following parameters can be used for \Rcode{method="supsmu"}: \begin{tabbing} xxxxxxxxxxxxx\=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\kill Settings to be used for \Rcode{select}: \\ {\tt separate}: \> use separate mask for each normalization (default) \\ {\tt together}: \> use combined masks for normalization \end{tabbing} \begin{tabbing} xxxxxxxxxxxxx\=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\kill Settings to be used for \Rcode{option = "<option>:<dataopt>"}: \\ {\tt option}: \> use unit tree for "transcript", "exon", "probeset" \\ {\tt dataopt}: \> applied to "all" probes, selected "sel" probes only, or percentage probes (range (0,1]) \end{tabbing} \begin{tabbing} xxxxxxxxxxxxx\=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\kill Settings to be used for \Rcode{logbase}: \\ {\tt logbase}: \> "0" - linear, "log", "log2", "log10" - log with base e, 2, 10 \end{tabbing} \begin{tabbing} xxxxxxxxxxxxx\=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\kill Settings to be used for \Rcode{params}: \\ The following parameters are necessary: \\ {\tt bass}: \> controls smoothness of fitted curve (default: 0.0) \\ {\tt span}: \> fraction of observations in span (default: 0.0) \\ {\tt rule}: \> 0, 1 or 2, determines how approx interpolation should be done at ends \\ {\tt f}: \> for approx method "constant" a value between 0 and 1 (not used) \end{tabbing} The \Rcode{supsmu} probeset-level normalization is computed as follows: <<eval=FALSE>>= nrxp.sup <- normalize(expr.tbw, "tmp_NrmExpSupR2", filedir=outdir, update=FALSE, select="separate", method="supsmu", option="transcript:all", logbase="log2", refindex=0, refmethod="mean", params=c(0.0, 0.0, 2, 0.0)) @ \bibliographystyle{plainnat} \bibliography{xps} \end{document}