% -*- mode: noweb; noweb-default-code-mode: R-mode; -*-
%\VignetteIndexEntry{1. Primer}
%\VignetteKeywords{Preprocessing, Affymetrix}
%\VignetteDepends{affy}
%\VignettePackage{affy}
%documentclass[12pt, a4paper]{article}
\documentclass[12pt]{article}

\usepackage{amsmath}
\usepackage{hyperref}
\usepackage[authoryear,round]{natbib}

\textwidth=6.2in
\textheight=8.5in
%\parskip=.3cm
\oddsidemargin=.1in
\evensidemargin=.1in
\headheight=-.3in

\newcommand{\scscst}{\scriptscriptstyle}
\newcommand{\scst}{\scriptstyle}
\newcommand{\Rfunction}[1]{{\texttt{#1}}}
\newcommand{\Robject}[1]{{\texttt{#1}}}
\newcommand{\Rpackage}[1]{{\textit{#1}}}

\author{Laurent Gautier, Rafael Irizarry, Leslie Cope, and Ben Bolstad}
\begin{document}
\title{Description of affy}

\maketitle
\tableofcontents
\section{Introduction}
The \Rpackage{affy} package is part of the
BioConductor\footnote{\url{http://bioconductor.org/}} project.
It is meant to be an extensible, interactive environment for data analysis
and exploration of Affymetrix oligonucleotide array probe level data.

The software utilities provided with the Affymetrix software suite
summarizes the probe set intensities to form one {\it expression measure}
for each gene. The expression measure is the data available for analysis.
However, as pointed out by \cite{li:wong:2001a}, much can be learned from
studying the individual probe intensities, or as we call them, the
{\it probe level data}. This is why we developed this package. The package
includes plotting functions for the probe level data useful for quality
control, RNA degradation assessments, different probe level normalization
and background correction procedures, and flexible functions that permit
the user to convert probe level data to expression measures. The package
includes utilities for computing expression measures similar to MAS 4.0's
AvDiff \citep{affy4}, MAS 5.0's signal \citep{affy5},
DChip's MBEI \citep{li:wong:2001a}, and RMA \citep{iriz:etal:2003}.

We assume that the reader is already familiar with oligonucleotide
arrays and with the design of the Affymetrix GeneChip arrays.
If you are not, we recommend the Appendix of the Affymetrix MAS manual
\cite{affy4,affy5}.

The following terms are used throughout this document:
\begin{description}
\item[probe] oligonucleotides of 25 base pair length used to
probe RNA targets.
\item[perfect match] probes intended to match perfectly the
target sequence.
\item[$PM$] intensity value read from the perfect matches.
\item[mismatch] the probes having one base mismatch
  with the target sequence intended to account for non-specific binding.
\item[$MM$] intensity value read from the mis-matches.
\item[probe pair] a unit composed of a  perfect
match and its mismatch.
\item[affyID] an identification for a probe set (which can be a gene
or a fraction of a gene) represented on the array.
\item[probe pair set] $PM$s and $MM$s related to a common {\it affyID}.
\item[{\it CEL} files] contain measured intensities and locations for
an array that has been hybridized.
\item[{\it CDF} file] contain the information relating probe pair sets
to locations on the array.
\end{description}

Section \ref{whatsnew} describes the main differences between version
1.5 and
this version (1.6). Section \ref{sec:get.started} describes a quick
way of getting started and getting expression measures. Section
\ref{qc} describes some quality control tools. Section \ref{s1.4}
describes normalization routines. Section \ref{classes} describes
the different classes in the package. \ref{sec:probesloc} describes
our strategy to map probe locations to probe set membership. Section
\ref{configure.options} describes how to change the package's default
options. Section \ref{whatwasnew} describes earlier changes.

%%%make sure to change this when we get a publication about version 2.
{\bf Note:} If you use this package please cite
\cite{gaut:cope:bols:iriz:2003} and/or \cite{iriz:gaut:cope:2003}.

\section{Changes for affy in BioC 1.8 release}
\label{whatsnew}
There were relatively few changes.

\begin{itemize}
\item MAplot now accepts the argument \Rfunction{plot.method} which can be
used to call smoothScatter.
\item \Rfunction{normalize.quantiles.robust} has had minor changes.
\item \Rfunction{ReadAffy} can optionally return the SD values stored in
the cel file.
\item The C parsing code has been moved to the \Rpackage{affyio} package,
which is now a dependency of the affy package. This change should be
transparent to users as \Rpackage{affyio} will be automatically loaded
when affy is loaded.
\item Added a cdfname argument to \Rfunction{justRMA} and
\Rfunction{ReadAffy} to allow for the use of alternative cdf packages.
\end{itemize}

\section{Getting Started: From probe level data to expression values}
\label{sec:get.started}

The first thing you need to do is {\bf load the package}.
\begin{Sinput}
R> library(affy) ##load the affy package
\end{Sinput}
<<echo=F,results=hide>>=
library(affy)
@
This release of the \Rpackage{affy} package will automatically
download the appropriate cdf environment when you require it. However,
if you wish you may download and install the cdf environment you need
from
\url{http://bioconductor.org/help/bioc-views/release/data/annotation/}
manually. If there is no cdf environment currently built for your
particular chip and you have access to the CDF file then you may use
the \Rpackage{makecdfenv} package to create one yourself. To make the
cdf packaes, Microsoft Windows users will need to use the tools
described in \url{http://www.murdoch-sutherland.com/Rtools/}.


\subsection{Quick start}
If all you want is to go from probe level data ({\it Cel} files) to
expression measures here are some quick ways.

If you want is RMA, the quickest way of reading in data and getting
expression measures  is the following:
\begin{enumerate}
\item Create a directory, move all the relevant
{\it CEL} files to that directory
\item If using linux/unix, start R in that directory.
\item If using the Rgui for
Microsoft Windows make sure your working directory contains the {\it
Cel} files (use ``File -> Change Dir'' menu item).
\item Load the library.
\begin{Sinput}
R> library(affy) ##load the affy package
\end{Sinput}
\item Read in the data and create an expression, using RMA for example.
\begin{Sinput}
R> Data <- ReadAffy() ##read data in working directory
R> eset <- rma(Data)
\end{Sinput}
\end{enumerate}

Depending on the size of your dataset and on the memory available to your
system, you might experience errors like `Cannot allocate vector \ldots'.
An obvious option is to increase the memory available to your R process
(by adding memory and/or closing external applications\footnote{UNIX-like
systems users might also want to check {\it ulimit} and/or compile
{\bf R} and the package for 64 bits when possible.}.
An another option is to use the function \Rfunction{justRMA}.
\begin{Sinput}
R> eset <- justRMA()
\end{Sinput}
This reads the data and performs the `RMA' way to preprocess them at
the {\it C} level. One does not need to call \verb+ReadAffy+, probe level
data is never stored in an AffyBatch. \verb+rma+ continues to be the
recommended function for computing RMA.


The \Rfunction{rma} function was written in C for speed and
efficiency. It uses the expression measure described in \cite{iriz:etal:2003}.

For other popular methods use \Rfunction{expresso} instead of
\Rfunction{rma} (see Section \ref{expresso}). For example for our
version of MAS 5.0 signal uses
expresso (see code). To get mas 5.0  you can use
\begin{Sinput}
R> eset <- mas5(Data)
\end{Sinput}
which will also normalize the expression values. The normalization can
be turned off through the \verb+normalize+ argument.

In all the above examples, the variable \Robject{eset} is an object of
class \Robject{ExpressionSet} described in the Biobase vignette. Many
of the packages in BioConductor work on objects of this class. See the
\Rpackage{genefilter} and \Rpackage{geneplotter} packages for some examples.

If you want to use some other analysis package, you can write out the
expression values to file using the following command:
\begin{Sinput}
R> write.exprs(eset, file="mydata.txt")
\end{Sinput}

\subsection{Reading CEL file information}
The function \Rfunction{ReadAffy} is quite flexible. It lets you specify the
filenames, phenotype, and MIAME information. You can enter them by reading
files (see the help file) or widgets (you need to have the tkWidgets package
installed and working).

\begin{Sinput}
R> Data <- ReadAffy(widget=TRUE) ##read data in working directory
\end{Sinput}
This function call will pop-up a file browser widget, see Figure \ref{fig:widget.filechooser},
that provides an easy way of choosing cel files.

\newpage
\begin{figure}[htbp]
  \begin{center}
    \includegraphics{widgetfilechooser}
    \caption{\label{fig:widget.filechooser}Graphical display for selecting
      {\it CEL} files. This widget is part of the {\it tkWidgets} package.
      (function written by Jianhua (John) Zhang).
    }
  \end{center}
\end{figure}

Next, a widget (not shown) permits the user to enter the \verb+phenoData+.
%%See Figure \ref{fig:widget.pd}.
%% \begin{figure}[htbp]
%% \begin{center}
%% \begin{tabular}{c}
%% \includegraphics{numcovariates}\\
%% \includegraphics{namecovariates}\\
%% \includegraphics{assigncovariates}
%% \end{tabular}
%% \caption{\label{fig:widget.pd}Graphical display for entering phenoData
%% This widget is part
%% of the {\it tkWidgets} package.}
%% % (functions written by Majnu John.}
%% \end{center}
%% \end{figure}
Finally the a widget is presented for the user to enter MIAME information.
%%Seen in Figure \ref{fig:widget.tkMIAME}.


%% \begin{figure}[htbp]
%% \begin{center}
%% \includegraphics[width=0.5\textwidth]{widgettkMIAME}
%% \caption{\label{fig:widget.tkMIAME}Graphical display for entering {\it
%%     MIAME} informations. This widget is part of the {\it tkWidgets}
%%     package.}
%% % (function written by Majnu John).}
%% \end{center}
%% \end{figure}

Notice that it is not necessary to use widgets to enter this
information. Please read the help file for more information on how to
read it from flat files or to enter it programmatically.

The function \Rfunction{ReadAffy} is a wrapper for the functions
\Rfunction{read.affybatch}, \Rfunction{tkSampleNames},
\Rfunction{read.AnnotatedDataFrame}, and \Rfunction{read.MIAME}.
The function \Rfunction{read.affybatch} has some nice feature that
make it quite flexible. For example, the \verb+compression+ argument
permit the user to read compressed {\it CEL} files. The argument
{\it compress} set to {\it TRUE} will inform the readers that your files
are compressed and let you read them while they remain compressed.
The compression formats {\it zip} and {\it gzip} are known to be recognized.

A comprehensive description of all these options is found in
the help file:
\begin{Sinput}
R> ?read.affybatch
R> ?read.AnnotatedDataFrame
R> ?read.MIAME
\end{Sinput}


\subsection{Expression measures}

The most common operation is certainly to convert  probe level data to
expression values.
Typically this is achieved through the following sequence:
\begin{enumerate}
\item reading in probe level data.
\item background correction.
\item normalization.
\item probe specific background correction, e.g. subtracting $MM$.
\item summarizing the probe set values into one expression measure
and, in some cases, a standard error for this summary.
\end{enumerate}
We detail what we believe is a good way to proceed below.
As mentioned the function \Rfunction{expresso} provides many
options. For example,
\begin{Sinput}
R> eset <- expresso(Dilution, normalize.method="qspline",
               bgcorrect.method="rma",pmcorrect.method="pmonly",
               summary.method="liwong")
\end{Sinput}

This will store expression values, in the object \Robject{eset}, as an
object of class \Robject{ExpressionSet} (see the \Rpackage{Biobase}
package). You can either use R and the BioConductor packages to analyze
your expression data or if you rather use another package you can write
it out to a tab delimited file like this

\begin{Sinput}
R> write.exprs(eset, file="mydata.txt")
\end{Sinput}

In the \verb+mydata.txt+ file, row will represent genes and columns
will represent samples/arrays. The first row will be a header describing
the columns. The first column will have the {\it affyID}s. The
\Rfunction{write.exprs} function is quite flexible on what it writes (see
the help file).


\subsubsection{expresso}
\label{expresso}
The function \Rfunction{expresso} performs the steps background
correction, normalization, probe specific correction, and summary
value computation. We now show this using an \Robject{AffyBatch}
included in the package for examples. The command
\verb+data(Dilution)+ is used to load these data.

Important parameters for the expresso function are:

\begin{description}
\item[bgcorrect.method]. The background correction method to use.
The available methods are
<<>>=
bgcorrect.methods()
@

\item[normalize.method]. The normalization method to use. The
   available methods can be queried by using
\verb+normalize.methods+.
<<>>=
library(affydata)
data(Dilution) ##data included in the package for examples
normalize.methods(Dilution)
@

\item[pmcorrect.method]
The method for probe specific correction. The available methods are
<<>>=
pmcorrect.methods()
@

\item[summary.method]. The summary method to use. The available
methods are
<<>>=
express.summary.stat.methods()
@
Here we use \Rfunction{mas} to refer to the methods described in the
Affymetrix manual version 5.0.

\item[widget]
Making the \verb+widget+ argument \verb+TRUE+,  will let you select
missing parameters (like the normalization method, the background correction
method or the summary method).
Figure \ref{fig:expressochooser} shows the widget for the selection of
preprocessing methods for each of the steps.

\begin{Sinput}
R> expresso(Dilution, widget=TRUE)
\end{Sinput}

\begin{figure}[htbp]
\begin{center}
\includegraphics[width=0.5\textwidth]{EWSnap}
\caption{\label{fig:expressochooser}Graphical display for
selecting expresso methods.}
\end{center}
\end{figure}

\end{description}

There is a separate vignette {\bf affy: Built-in Processing Methods}
which explains in more detail what each of the preprocessing options
does.

\subsubsection{MAS 5.0}
To obtain expression values that correspond to those from MAS 5.0, use
\Rfunction{mas5}, which wraps \Rfunction{expresso} and
\Rfunction{affy.scalevalue.exprSet}.

<<>>=
eset <- mas5(Dilution)
@

To obtain MAS 5.0 presence calls you can use the \verb+mas5calls+ method.

<<>>=
Calls <- mas5calls(Dilution)
@
This returns an \verb+ExpressionSet+ object containing P/M/A calls and
their associated Wilcoxon p-values.

\subsubsection{Li and Wong's MBEI (dchip)}
To obtain our version of Li and Wong's MBEI one can use
\begin{Sinput}
R> eset <- expresso(Dilution, normalize.method="invariantset",
                 bg.correct=FALSE,
                 pmcorrect.method="pmonly",summary.method="liwong")
\end{Sinput}

This gives the current $PM$-only default. The reduced model (previous
default) can be obtained using \verb+pmcorrect.method="subtractmm"+.

\subsubsection{C implementation of RMA}
One of the quickest ways to compute expression using the \Rpackage{affy}
package is to use the \Rfunction{rma} function. We have found that this
method allows a user to compute the RMA expression measure in a matter
of minutes for datasets that may have taken hours in previous versions
of \Rpackage{affy}. The function serves as an interface to a hard coded C
implementation of the RMA method \citep{iriz:etal:2003}. Generally, the
following would be sufficient to compute RMA expression measures:
<<>>=
eset <- rma(Dilution)
@

Currently the \Rfunction{rma} function implements RMA in the following manner
\begin{enumerate}
\item Probe specific correction of the PM probes using a model based
  on observed intensity being the sum of signal and noise
\item Normalization of corrected PM probes using quantile
  normalization \citep{bols:etal:2003}
\item Calculation of Expression measure using median polish.
\end{enumerate}

The \Rfunction{rma} function is likely to be improved and extended in the
future as the RMA method is fine-tuned.

\newpage


\section{Quality Control through Data Exploration}
\label{qc}

For the users convenience we have included the \verb+Dilution+ sample data set:

<<>>=
Dilution
@

This will create the \verb+Dilution+ object of class \Robject{AffyBatch}.
\Rfunction{print} (or \Rfunction{show}) will display summary information.
These objects represent data from one experiment. The \Robject{AffyBatch}
class combines the information of various {\it CEL} files with a common
{\it CDF} file. This class is designed to keep information of one experiment.
The probe level data is contained in this object.

The data in \verb+Dilution+ is a small sample of probe sets from 2
sets of duplicate arrays hybridized with different concentrations of the
same RNA. This information is part of the \Robject{AffyBatch} and can be
accessed with the \verb+phenoData+ and \verb+pData+ methods:
<<>>=
phenoData(Dilution)
pData(Dilution)
@

Several of the functions for plotting summarized probe level data are
useful for diagnosing problems with the data. The plotting functions
\Rfunction{boxplot} and \Rfunction{hist} have methods for
\Robject{AffyBatch} objects.  Each of these functions presents
side-by-side graphical summaries of intensity information from each
array.  Important differences in the distribution of intensities are
often evident in these plots. The function \Rfunction{MAplot}
(applied, for example, to \verb+pm(Dilution)+), offers pairwise
graphical comparison of intensity data. The option \verb+pairs+ permits
you to chose between all pairwise comparisons (when \verb+TRUE+) or
compared to a reference array (the default). These plots can be particularly
useful in diagnosing problems in replicate sets of arrays. The function
argument \verb+plot.method+ can be used to create a MAplot using a
smoothScatter, rather than the default method which is to draw every point.

\begin{figure}[htbp]
\begin{center}
<<fig=TRUE>>=
data(Dilution)
MAplot(Dilution,pairs=TRUE,plot.method="smoothScatter")
@
\end{center}
\caption{Pairwise MA plots}
\end{figure}


\subsection{Accessing $PM$ and $MM$ Data}
The $PM$ and $MM$  intensities and corresponding {\it affyID} can be
accessed with the \Rfunction{pm}, \Rfunction{mm}, and
\Rfunction{probeNames} methods. These will be matrices with rows
representing probe pairs and columns representing arrays. The gene
name associated with the probe pair in row $i$ can be found in the
$i$th entry of the vector returned by \Rfunction{probeNames}.

<<>>=
Index <- c(1,2,3,100,1000,2000) ##6 arbitrary probe positions
pm(Dilution)[Index,]
mm(Dilution)[Index,]
probeNames(Dilution)[Index]
@
\verb+Index+ contains six arbitrary probe positions.

Notice that the column names of $PM$ and $MM$ matrices are the sample names
and the row names are the {\it affyID}, e.g. \verb+1001_at+ and
\verb+1000_at+ together with the probe number (related to position in the
target sequence).

<<>>=
sampleNames(Dilution)
@

{\bf Quick example:} To see what percentage of the $MM$ are larger
than the $PM$ simply type
<<>>=
mean(mm(Dilution)>pm(Dilution))
@

The \Rfunction{pm} and \Rfunction{mm} functions can be used to extract
specific probe set intensities.
<<>>=
gn <- geneNames(Dilution)
pm(Dilution, gn[100])
@
The method \Rfunction{geneNames} extracts the unique {\it affyID}s. Also
notice that the 100th probe set is different from the 100th probe! The
100th probe is not part of the the 100th probe set.

The methods \Rfunction{boxplot}, \Rfunction{hist}, and \Rfunction{image}
are useful for quality control. Figure \ref{f3} shows kernel density
estimates (rather than histograms) of $PM$ intensities for the 1st and
2nd array of the \verb+Dilution+ also included in the package.

\subsection{Histograms, Images, and Boxplots}
\begin{figure}[htbp]
\begin{center}
<<fig=TRUE>>=
hist(Dilution[,1:2]) ##PM histogram of arrays 1 and 2
@
\caption{\label{f3} Histogram of $PM$ intensities for 1st and 2nd array}
\end{center}
\end{figure}

As seen in the previous example, the sub-setting method \verb+[+ can be
used to extract specific arrays. {\bf NOTE: Sub-setting is different in
this version. One can no longer subset by gene. We can only define subsets
by one dimension: the columns, i.e. the arrays. Because the \verb+Cel+
class is no longer available \verb+[[+ is no longer available.} %]]

The method \verb+image()+ can be used to detect spatial artifacts.
By default we look at log transformed intensities. This can be
changed through the \verb+transfo+ argument.

<<eval=FALSE>>=
par(mfrow=c(2,2))
image(Dilution)
@

\begin{figure}[htbp]
\begin{center}
\includegraphics{image}
\caption{\label{f1} Image of the log intensities.}
\end{center}
\end{figure}

These images are quite useful for quality control. We recommend
examining these images as a first step in data exploration.

The method \Rfunction{boxplot} can be used to show $PM$, $MM$ or both intensities.
\begin{figure}[htbp]
\begin{center}
<<fig=TRUE>>=
par(mfrow=c(1,1))
boxplot(Dilution, col=c(2,3,4))
@
\caption{\label{f4}Boxplot of arrays in Dilution data.}
\end{center}
\end{figure}
As discussed in the next section this plot shows that we need to
normalize these arrays.


\subsection{RNA degradation plots}
The functions \Rfunction{AffyRNAdeg}, \Rfunction{summaryAffyRNAdeg}, and
\Rfunction{plotAffyRNAdeg} aid in assessment of RNA quality. Individual
probes in a probeset are ordered by location relative to the $5'$ end of
the targeted RNA molecule.\cite{affy4}  Since RNA degradation typically
starts from the $5'$ end of the molecule, we would expect probe
intensities to be systematically lowered at that end of a probeset when
compared to the $3'$ end.  On each chip, probe intensities are averaged
by location in probeset, with the average taken over probesets. The
function \Rfunction{plotAffyRNAdeg} produces a side-by-side plots of these
means, making it easy to notice any $5'$ to $3'$ trend.  The function
\Rfunction{summaryAffyRNAdeg} produces a single summary statistic for each
array in the batch, offering a convenient measure of the severity of
degradation and significance level.  For an example
<<>>=
deg <- AffyRNAdeg(Dilution)
names(deg)
@
does the degradation analysis and returns a list with various
components. A summary can be obtained using
<<>>=
summaryAffyRNAdeg(deg)
@

Finally a plot can be created using \Rfunction{plotAffyRNAdeg}, see Figure \ref{f4.3}.

\begin{figure}[htbp]
\begin{center}
<<fig=TRUE>>=
plotAffyRNAdeg(deg)
@
\caption{\label{f4.3} Side-by-side plot produced by plotAffyRNAdeg.}
\end{center}
\end{figure}

\newpage

\section{Normalization}
\label{s1.4}
Various researchers have pointed out the need for normalization of Affymetrix
arrays. See for example \cite{bols:etal:2003}. The method \verb+normalize+
lets one normalize at the probe level

<<>>=
Dilution.normalized <- normalize(Dilution)
@

For an extended example on normalization please refer to the vignette
in the affydata package.

\section{Classes}
\label{classes}
\verb+AffyBatch+ is the main class in this package. There are three
other auxiliary classes that we also describe in this Section.
\subsection{AffyBatch}
The AffyBatch class has slots to keep all the probe level information
for a batch of {\it Cel} files, which usually represent an experiment.
It also stores phenotypic and MIAME information as does the
\verb+ExpressionSet+ class in the Biobase package (the base package for
BioConductor). In fact, \verb+AffyBatch+ extends \verb+ExpressionSet+.

The expression matrix in \verb+AffyBatch+ has columns representing the
intensities read from the different arrays. The rows represent the
{\it cel} intensities for all position on the array. The cel intensity
with physical coordinates\footnote{Note that in the {\it .CEL} files
the indexing starts at zero while it starts at 1 in the package
(as indexing starts at 1 in {\bf R}).} $(x,y)$ will be in row
\[i = x + \mathtt{nrow} \times (y - 1)\]. The \verb+ncol+ and
\verb+nrow+ slots contain the physical rows of the array. Notice that
this is different from the dimensions of the expression matrix. The
number of row of the expression matrix is equal to
\verb+ncol+$\times$\verb+nrow+. We advice the use of the functions
\verb+xy2indices+ and \verb+indices2xy+ to shuttle from X/Y
coordinates to indices.

For compatibility with previous versions the accessor method
\verb+intensity+ exists for obtaining the expression matrix.

The \verb+cdfName+ slot contains the necessary information for the
package to find the locations of the probes for each probe set. See
Section \ref{sec:probesloc} for more on this.

\subsection{ProbeSet}
The \verb+ProbeSet+ class holds the information of all
the probes related to an {\it affyID}. The components are \verb+pm+ and
\verb+mm+.

The method \verb+probeset+ extracts probe sets from \verb+AffyBatch+
objects. It takes as arguments an \verb+AffyBatch+ object and a vector of
{\it affyIDs} and returns a list of objects of class \verb+ProbeSet+
<<>>=
gn <- featureNames(Dilution)
ps <- probeset(Dilution, gn[1:2])
#this is what i should be using: ps
show(ps[[1]])
@

The \verb+pm+ and \verb+mm+ methods can be used to extract these
matrices (see below).

This function is general in the way it defines a probe set. The
default is to use the definition of a probe set given by Affymetrix in
the CDF file. However, the user can define arbitrary probe sets. The
argument \verb+locations+ lets the user decide the row numbers in the
\verb+intensity+ that define a probe set. For example, if we are
interested in redefining the \verb+AB000114_at+ and \verb+AB000115_at+
probe sets, we could do the following:

First, define the locations of the $PM$ and $MM$ on the array of the
\verb+1000_at+ and \verb+1001_at+ probe sets
<<>>=
mylocation <- list("1000_at"=cbind(pm=c(1,2,3),mm=c(4,5,6)),
                   "1001_at"=cbind(pm=c(4,5,6),mm=c(1,2,3)))
@
The first column of the matrix defines the location of the $PM$s and
the second column the $MM$s.

Now we are ready to extract the \verb+ProbSet+s using the
\verb+probeset+ function:
<<>>=
ps <- probeset(Dilution, genenames=c("1000_at","1001_at"),
                 locations=mylocation)
@
Now, \verb+ps+ is list of \verb+ProbeSet+s. We can see the $PM$s and
$MM$s of each component using the \verb+pm+ and \verb+mm+ accessor methods.
<<>>=
pm(ps[[1]])
mm(ps[[1]])
pm(ps[[2]])
mm(ps[[2]])
@

This can be useful in situations where the user wants to determine if
leaving out certain probes improves performance at the expression level.
It can also be useful to combine probes from different human chips, for
example by considering only probes common to both arrays.

Users can also define their own environment for probe set location
mapping. More on this in Section \ref{sec:probesloc}.

An example of a \verb+ProbeSet+ is included in the package. A spike-in
data set is included in the package in the form of a list of
\verb+ProbeSet+s. The help file describes the data set.
Figure \ref{f5.3} uses this data set to demonstrate that the $MM$
also detect transcript signal.

\begin{figure}[htbp]
\begin{center}
<<fig=TRUE>>=
data(SpikeIn) ##SpikeIn is a ProbeSets
pms <- pm(SpikeIn)
mms <- mm(SpikeIn)

##pms follow concentration
par(mfrow=c(1,2))
concentrations <- matrix(as.numeric(sampleNames(SpikeIn)),20,12,byrow=TRUE)
matplot(concentrations,pms,log="xy",main="PM",ylim=c(30,20000))
lines(concentrations[1,],apply(pms,2,mean),lwd=3)
##so do mms
matplot(concentrations,mms,log="xy",main="MM",ylim=c(30,20000))
lines(concentrations[1,],apply(mms,2,mean),lwd=3)
@
\caption{\label{f5.3}PM and MM intensities plotted against SpikeIn concentration}
\end{center}
\end{figure}


\section{Location to ProbeSet Mapping}
\label{sec:probesloc}
On Affymetrix GeneChip arrays, several probes are used to represent genes
in the form of probe sets. From a {\it CEL} file we get for each physical
location, or cel, (defined by $x$ and $y$ coordinates) an intensity. The
{\it CEL} file also contains the name of the {\it CDF} file needed for the
location-probe-set mapping. The {\it CDF} files store the probe set related
to each location on the array. The computation of a summary expression values
from the probe intensities requires a fast way to map an {\it affyid} to
corresponding probes. We store this mapping information in {\bf R}
environments\footnote{Please refer to the {\bf R} documentation to know more about environments.}.
They only contain a part of the information that can be found in the {\it CDF}
files. The {\it cdfenvs} are sufficient to perform the numerical processing
methods included in the package. For each {\it CDF} file there is package,
available from \url{http://bioconductor.org/help/bioc-views/release/data/annotation/}, that
contains exactly one of these environments. The {\it cdfenvs} we store the
$x$ and $y$ coordinates as one number (see above).

In instances of {\it AffyBatch}, the {\it cdfName} slot gives the name of the
appropriate {\it CDF} file for arrays represented in the \verb+intensity+ slot.
The functions \verb+read.celfile+, \verb+read.affybatch+, and \verb+ReadAffy+
extract the {\it CDF} filename from the {\it CEL} files being read. Each
{\it CDF} file corresponds to exactly one environment. The function
\verb+cleancdfname+ converts the Affymetrix given {\it CDF} name to a
BioConductor environment and annotation name. Here are two examples:

These give environment names:
<<>>=
cat("HG_U95Av2 is",cleancdfname("HG_U95Av2"),"\n")
cat("HG-133A is",cleancdfname("HG-133A"),"\n")
@

This gives annotation name:
<<>>=
cat("HG_U95Av2 is",cleancdfname("HG_U95Av2",addcdf=FALSE),"\n")
@

An environment representing the corner of an Hu6800 array is available with
the package. In the following, we load the environment, look at the names for
the first 5 objects defined in the environment, and finally look at the first
object in the environment:
<<>>=
data(cdfenv.example)
ls(cdfenv.example)[1:5]
get(ls(cdfenv.example)[1],cdfenv.example)
@

The package needs to know what locations correspond to which probe
sets. The \verb+cdfName+ slot contains the necessary information to
find the environment with this location information. The method
\verb+getCdfInfo+ takes as an argument an \verb+AffyBatch+ and
returns the necessary environment. If \verb+x+ is an \verb+AffyBatch+,
this function will look for an environment with name
\verb+cleancdfname(x@cdfName)+.

<<>>=
print(Dilution@cdfName)
myenv <- getCdfInfo(Dilution)
ls(myenv)[1:5]
@

By default we search for the environment first in the global
environment, then in a package named \verb+cleancdfname(x@cdfName)+.

Various methods exist to obtain locations of probes as demonstrated
in the following examples:
<<>>=
Index <- pmindex(Dilution)
names(Index)[1:2]
Index[1:2]
@
\verb+pmindex+ returns a list with probe set names as names and
locations in the components. We can also get specific probe sets:
<<>>=
pmindex(Dilution, genenames=c("1000_at","1001_at"))
@
The locations are ordered from 5' to 3' on the target transcript.
The function \verb+mmindex+ performs in a similar way:
<<>>=
mmindex(Dilution, genenames=c("1000_at","1001_at"))
@
They both use the method \verb+indexProbes+
<<>>=
indexProbes(Dilution, which="pm")[1]
indexProbes(Dilution, which="mm")[1]
indexProbes(Dilution, which="both")[1]
@
The \verb+which="both"+ options returns the location of the $PM$s
followed by the $MM$s.


\section{Configuring the package options}
\label{configure.options}
Package-wide options can be configured, as shown below through examples.

\begin{itemize}
 \item Getting the names for the options:
<<>>=
opt <- getOption("BioC")
affy.opt <- opt$affy
print(names(affy.opt))
@
%$
 \item Default processing methods:
<<>>=
opt <- getOption("BioC")
affy.opt <- opt$affy
affy.opt$normalize.method <- "constant"
opt$affy <- affy.opt
options(BioC=opt)
@
%$
\item Compression of files: if you are always compressing your CEL files,
you might find annoying to specify it each time you call a reading function.
It can be specified once for all in the options.
<<>>=
opt <- getOption("BioC")
affy.opt <- opt$affy
affy.opt$compress.cel <- TRUE
opt$affy <- affy.opt
options(BioC=opt)
@
%$
\item Priority rule for the use of a cdf environment:
The option {\it probesloc} is a list. Each element of the list is itself a
list with two elements {\it what} and {\it where}. When looking for the
information related to the locations of the probes on the array, the elements
in the list will be looked at sequentially. The first one leading to the
information is used (an error message is returned if none permits to find the
information). The element {\it what} can be one of {\it package},
{\it environment}.
\end{itemize}


\section{Where can I get more information?}
\label{moreinfo}

There are several other vignettes addressing more specialised topics related to the {\tt affy} package.
\begin{itemize}
\item {\bf affy: Custom Processing Methods (HowTo)}: A description of how to use custom preprocessing methods with the package. This document gives examples of how you might write your own preprocessing method and use it with the package.
\item {\bf affy: Built-in Processing Methods}: A document giving fuller descriptions of each of the preprocessing methods that are available within the {\tt affy} package.
\item {\bf affy: Import Methods (HowTo)}: A discussion of the data structures used and how you might import non standard data into the package.
\item {\bf affy: Loading Affymetrix Data (HowTo)}: A quick guide to loading Affymetrix data into R.
\item {\bf affy: Automatic downloading of cdfenvs (HowTo)}: How you can configure the automatic downloading of the appropriate {\it cdfenv} for your analysis.
\end{itemize}


\appendix

\section{Previous Release Notes}

\subsection{Changes in versions 1.6.x}
There were very few changes.

\begin{itemize}
\item The function \verb+MAplot+ has been added. It works on instances of AffyBatch.
You can decide if you want to make all pairwise MA plots or compare to a
reference array using the pairs argument.
\item Minor bugs fixed in the parsers.
\item The path of celfiles is now removed by ReadAffy.
\end{itemize}


\subsection{Changes in versions 1.5.x}
There are some minor differences in what you can do but little
functionality has disappeared. Memory efficiency and speed have improved.

\begin{itemize}
\item The widgets used by ReadAffy have changed.
\item The path of celfiles is now removed by ReadAffy.
\end{itemize}

\subsection{Changes in versions 1.4.x}
There are some minor differences in what you can do but little
functionality has disappeared. Memory efficiency and speed have improved.

\begin{itemize}
\item For instances of \verb+AffyBatch+ the subsetting has
  changed. For consistency with \verb+exprSets+ one can only subset by
  the second dimension. So to obtain the first array, \verb+abatch[1]+
  and \verb+abatch[1,]+ will give warnings (errors in the next
  release). The correct code is \verb+abatch[,1]+.
\item mas5calls is now faster and reproduces Affymetrix's official
  version much better.
\item If you use \verb+pm+ and \verb+mm+ to get the entire set of
  probes, e.g. by typing \verb+pm(abatch)+ then the method will be, on
  average, about 2-3 times faster than in version 1.3.
\end{itemize}

\bibliographystyle{plainnat}
\bibliography{affy}

\end{document}