%\VignetteIndexEntry{Agi4x44PreProcess}
%\VignetteKeywords{Preprocessing, Agilent}
%\VignetteDepends{Agi4x44PreProcess}
%\VignettePackage{Agi4x44PreProcess}

\documentclass{article}
\usepackage[latin1]{inputenc}
\usepackage[english]{babel}

\newcommand{\Robject}[1]{{\texttt{#1}}}
\newcommand{\Rfunction}[1]{{\texttt{#1}}}
\newcommand{\Rpackage}[1]{{\textit{#1}}}

\begin{document}

\title{Agi4x44PreProcess}
\author{Pedro Lopez-Romero}
\maketitle

\section{Package Overview}

The Agi4x44PreProcess package has been designed to read 
Agilent 4 x 44 gene expression array data files 
into R \cite{R} for pre-processing with other Bioconductor 
functions. The package needs the plain text files exported 
by the Agilent Feature Extraction 9.1.3.1 (or later version) 
image analysis software (AFE) \cite{AFE}. The pre-processing steps 
implemented in the package are the following: 

\begin{enumerate}
\item read the target file 
\item read the array samples
\item background correction and normalization between samples
\item filtering probes by quality flag
\item summarizing replicated probes 
\item creating an ExpressionSet object with the processed data
\end{enumerate}

The package also contains other utilities that allow the user to explore the 
architecture of the chip in terms of probe replication and gene 
replication. 

Agi4x44PreProcess contains standard graphical 
utilities to evaluate the quality of the data. These graphics 
can help users decide which foreground and background 
signals, amongst those provided by the AFE, they want to use in 
their analysis, and which background correction and 
between-sample normalization methods they want to apply. 

Finally, Agi4x44PreProcess also generates the files DataSet.gct 
and Phenotypes.cls that are used by the  Gene Set Enrichment Analysis
tool (GSEA) \cite{GSEA}.

Agi4x44PreProcess employs the corresponding Bioconductor annotation 
packages (human: "hgug4112a.db"; mouse: "mgug4122a.db") to assign 
to each probe its ACCNUM, SYMBOL, ENTREZID, DESCRIPTION, GO terms and GO IDs.
The annotation package that is going to be used should 
be loaded into the R session. In our example we use the 
hgug4112a.db annotation package. The annotation package can be loaded
using \Rfunction{library} 

<<>>=
library("hgug4112a.db")
@

The package allows choosing between different alternatives at each 
pre-processing step. Currently only the plain text files exported 
by AFE are supported. To read these files into R the \Rpackage{limma} 
function \Rfunction{read.maimages} \cite{limma} is used. The user can choose 
the foreground signal (gProcessedSignal or gMeanSignal) and the 
background signal used in the background correction (gBGMedianSignal or gBGUsed). 
The user also decides the method for the background correction 
("none", "half", "normexp") and for the normalization 
between arrays ("none", "quantile", "vsn"). The \Rfunction{backgroundCorrect}
and \Rfunction{normalizeBetweenArrays} \Rpackage{limma} functions \cite{limma}
are used for these steps. If the user wants to use other methods implemented 
in \Rfunction{backgroundCorrect} and in \Rfunction{normalizeBetweenArrays}
but not incorporated in the Agi4x44PreProcess package, they can easily 
call any of these functions externally, as sketched below. The users also choose 
whether they want to filter out the probes that do not reach a minimum quality, 
which is also established by the user. In this sense one can be more or less 
demanding depending on the data at hand (a low number of replicates probably demands 
being more restrictive about the quality limits). There is a final summarization step 
that collapses the non-control replicated probes into a single value. 
The users can also skip the summarization step in order to keep all the probes, 
replicated or not. Finally, the processed data are stored in an RGList that can be 
transformed into an \Robject{ExpressionSet}. The \Robject{ExpressionSet} object can be used to 
statistically analyze the data (differential expression, functional analysis, etc.) 
using other analytical packages such as \Rpackage{limma}. 
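
For example, if one wanted to use the \Rpackage{limma} methods "minimum" 
(background correction) and "scale" (normalization), which are not wrapped by 
Agi4x44PreProcess, a minimal sketch of such an external call (assuming an 
\Robject{RGList} called dd, read as described in section 3) would be:

\begin{Sinput}
#  NOT RUN # 

> library("limma")
> # background-correct the single-channel signal stored in dd$G
> # with a limma method not offered by Agi4x44PreProcess
> dd.bc=backgroundCorrect(dd, method="minimum")
> # scale-normalize the log2 intensities between arrays
> exprs.norm=normalizeBetweenArrays(log2(dd.bc$G), method="scale")

#  NOT RUN # 
\end{Sinput}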

\section{Target File}

We have to specify, in a target file, the experimental conditions under which the data 
have been generated, relating each sample to an experimental group.
This is done in a plain text file that can be loaded into R using the Agi4x44PreProcess
function \Rfunction{read.targets}. 

First we load the package and then we read the target file 

\begin{Sinput}
#  NOT RUN # 

> library("Agi4x44PreProcess")
> targets=read.targets(infile="targets.txt")

#  NOT RUN # 
\end{Sinput}

The function \Rfunction{read.targets} returns a \Robject{data.frame} object similar to this:

<<data>>=
library("Agi4x44PreProcess")
data(targets)
targets
@

In the target file, the fields FileName, Treatment and GErep are mandatory. 
FileName specifies the name of each data file, and Treatment 
specifies which treatment level corresponds to that file. 
Other variables might also be included if the users want to use them in 
further downstream statistical analysis. In our example, 
human cells treated with BMP2 are to be compared with untreated cells, 
so the design considers a Treatment effect with two levels, 
Stimulated and Unstimulated. We have 
collected two replicates of each treatment level, and both treatments 
have been applied to cells of the same individuals, that is, the design 
has been blocked by Subject. As we only have two levels of the blocking 
variable Subject, this kind of design is normally known as a 
paired design. To account for the paired design in future downstream analysis, 
using limma for instance, we have to add a Subject variable 
that relates each individual to its sample. GErep is a redundant 
variable that mirrors the Treatment variable using a numeric code,
i.e., each treatment level (Stimulated and Unstimulated) is codified 
numerically as $1, \ldots, n$, with $n$ the number of levels of the treatment effect. 
We have included the Array variable as an example of another sort of variable 
that we might want to use. In this case, the Array variable refers to the platform 
on which the sample has been hybridized. Recall that the Agilent 4x44 platform 
allows the hybridization of 4 different samples onto the same platform. 
In our example, since all 4 samples have been hybridized on the same 
platform, the inclusion of this variable does not make much sense; but if we had 
more samples hybridized on different platforms we could consider the 
platform (Array) as a blocking variable, using designs such as complete block 
designs or incomplete block designs.  
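
As an illustration only, a hypothetical tab-delimited targets.txt with the fields 
described above might look as follows (file names and codes are invented; the exact 
column names and capitalization must match those expected by \Rfunction{read.targets}):

\begin{Sinput}
FileName      Treatment     GErep  Subject  Array
sample1.txt   Stimulated    1      1        1
sample2.txt   Unstimulated  2      1        1
sample3.txt   Stimulated    1      2        1
sample4.txt   Unstimulated  2      2        1
\end{Sinput}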

\section{Reading the data}

The chips were scanned using the Agilent G2567AA Microarray Scanner System 
(Agilent Technologies) with the extended dynamic range option turned on. 
Image analysis and data collection were carried out using the Agilent Feature 
Extraction 9.1.3.1 (AFE) \cite{AFE}. For the background signal the AFE was 
set to use the spatial detrend surface value, which estimates 
the noise due to a systematic gradient on the array and whose computation 
is based on a loess algorithm. Details of how the spatial detrend algorithm 
works can be found in the AFE reference guide.

Currently only the plain text files exported by AFE are supported. To read 
these files into R the \Rpackage{limma} function \Rfunction{read.maimages} 
\cite{limma} is used. In our example we use real data from Agilent 4x44 
Human chips. However, to reduce disk space the original 
data have been trimmed, leaving 12,015 features out of the 45,015 that should be 
found on a Human Agilent 4x44 chip. Despite this, the example data are perfectly 
valid to illustrate all the features and the performance of the functions included 
in the package, although some of the functions regarding counting of replicated 
probes, etc., will produce numbers that do not coincide with the real data.

To read the files we use the \Rfunction{read.AgilentFE} function: 

\begin{Sinput}

#  NOT RUN # 

> dd=read.AgilentFE(targets, makePLOT=FALSE)

#  NOT RUN # 

\end{Sinput}

The result will be an RGList similar to this: 

<<data>>=
data(dd) 
class(dd)
dim(dd)
@

The RGList contains the following slots

<<>>=
names(dd)
@ 

The data stored in the RGList are the following: 

\begin{table}[ht]
\begin{center}
\begin{tabular}{|c|l|}
\hline \em variable & \em data \\
\hline
dd\$R & gProcessedSignal \\
dd\$G & gMeanSignal \\
dd\$Rb & gBGMedianSignal \\
dd\$Gb& gBGUsed \\
dd\$targets & File names \\
dd\$genes\$ProbeName & Probe Name \\
dd\$genes\$GeneName & Gene Name \\
dd\$genes\$SystematicName & Systematic Name \\
dd\$genes\$Description & Description Name \\
dd\$genes\$Sequence & 60 bases Sequence \\
dd\$genes\$ControlType & FLAG to specify the sort of feature \\
dd\$other\$gIsWellAboveBG & FLAG IsWellAboveBG \\
dd\$other\$gIsFound & FLAG IsFound \\
dd\$other\$gIsSaturated & FLAG IsSaturated \\
dd\$other\$gIsFeatPopnOL & FLAG IsFeatPopnOL \\
dd\$other\$gIsFeatNonUnifOL & FLAG IsFeatNonUnifOL \\
dd\$other\$chr\_coord & CHR coordinate from Agilent data files \\
\hline
\end{tabular}
\label{tab:Flag}
\end{center}
\end{table}

The MeanSignal is the raw mean signal of the spot. The ProcessedSignal is the signal processed 
by AFE \cite{AFE}. It contains the Multiplicatively Detrended, Background-Subtracted Signal if the 
detrending is selected and it helps; if the detrending does not help, the ProcessedSignal will be the 
Background-Subtracted Signal. The BGMedianSignal is the median local background signal. The 
BGUsed depends on the scanner settings for the type of background method and the setting for the 
spatial detrend. Usually, the Background Signal Used is the sum of the local background plus the spatial 
detrending surface value computed by the AFE software. See \cite{AFE} for the values used to calculate 
this variable under the different background signals and settings of spatial detrend and global 
background adjust.

dd\$targets contains the names of the files, equal to those in the target file. dd\$genes\$ProbeName,
dd\$genes\$GeneName, dd\$genes\$SystematicName and dd\$genes\$Description
provide the annotation mapped to each dd\$genes\$ProbeName according to Agilent. In Agi4x44PreProcess, 
if an annotation package exists, the fields SystematicName, GeneName and Description are replaced, 
respectively, by the corresponding ACCNUM, SYMBOL and DESCRIPTION obtained from the 
corresponding annotation package.

The AFE attaches to each feature a flag that identifies different 
quantification errors of the signal. These
quantification flags can be used to filter 
out signals that do not reach a minimum established criterion of quality. 
We will come back to the filtering process in section 7. 

The dd\$other\$chr\_coord variable contains the chromosomal coordinates provided by the Agilent
manufacturer. These coordinates are used by the function genes.rpt.agi to create links to the ENSEMBL 
database for the different probes interrogating the same gene. See section 5. 

If we set the variable makePLOT = TRUE in the function \Rfunction{read.AgilentFE}, we will get a 
density plot and a boxplot of the following signals:

\begin{table}[ht]
\begin{center}
\begin{tabular}{|c|l|}
\hline \em variable & \em data \\
\hline
dd\$R & gProcessedSignal \\
dd\$G & gMeanSignal \\
\hline
\end{tabular}
\label{tab:RGsignals}
\end{center}
\end{table}

and a boxplot of the following background signals:

\begin{table}[ht]
\begin{center}
\begin{tabular}{|c|l|}
\hline \em variable & \em data \\
\hline
dd\$Rb & gBGMedianSignal \\
dd\$Gb& gBGUsed \\
\hline
\end{tabular}
\label{tab:BGsignals}
\end{center}
\end{table}

These plots can also be obtained with the corresponding plotting 
functions (see section 12).

\begin{center}
<<BoxPlot1,fig=TRUE>>=
BoxPlot(log2(dd$R),"ProcessedSignal", "red",
xlab="Samples", ylab="expression")
@ 
\end{center}

\begin{center}
<<plotDensity1, fig=TRUE>>=
plotDensity(log2(dd$R),"ProcessedSignal")
@
\end{center}

\begin{center}
<<BoxPlot2,fig=TRUE>>=
BoxPlot(log2(dd$G),"MeanSignal", "green",
xlab="Samples", ylab="expression")
@ 
\end{center}

\begin{center}
<<plotDensity2,fig=TRUE>>=
plotDensity(log2(dd$G),"MeanSignal")
@
\end{center}

\begin{center}
<<BoxPlot3,fig=TRUE>>=
BoxPlot(log2(dd$Gb),"BGUsed", "orange",
xlab="Samples", ylab="expression")
@ 
\end{center}

\begin{center}
<<BoxPlot4,fig=TRUE>>=
BoxPlot(log2(dd$Rb),"BGMedianSignal", "blue",
xlab="Samples", ylab="expression")
@ 
\end{center}


\section{Replicated Probes}

The Agilent arrays contain a number of non-control 
probes, replicated up to ten times, which are spread 
across the array. This allows computing the 
\%CV (coefficient of variation, in percent) for each array. 
Agi4x44PreProcess incorporates a specific function, CV.rep.probes, 
that identifies the
non-control replicated probes (we call them probe sets) 
and computes their coefficient of variation
(\%CV). The CV is computed for every set of replicated 
probes. Within each array, the median of the CVs 
of all probe sets is reported as the CV of the array. 
A lower median CV indicates a better reproducibility 
of the array. The CV can be used as a measure of the 
quality of the arrays, and it can help to detect a sample 
that deviates from the rest as an erroneous one. This measure 
is also reported by the QC report of AFE. 
The CV.rep.probes function also writes a file (Probe.Sets.txt) 
that contains the non-control 
replicated probes, along with their PROBE ID, the number 
of replicates, the ACCNUM code, the SYMBOL 
code, the DESCRIPTION of the gene, and the \%CV 
of the probe in each array. The CVs are also given in 
a boxplot. 
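
For reference, the \%CV of a probe set is the usual coefficient of variation 
(stated here as a sketch of the definition rather than the exact code of 
CV.rep.probes): for replicate intensities $x_1, \ldots, x_k$ of a probe within one array,
\[
\%\mathrm{CV} \;=\; 100 \cdot \frac{\mathrm{sd}(x_1, \ldots, x_k)}{\mathrm{mean}(x_1, \ldots, x_k)},
\]
and the value reported for the array is the median of these \%CVs over all probe sets.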

<<CV.rep.probes, fig=FALSE>>=
CV.rep.probes(dd,"hgug4112a.db",
              foreground="MeanSignal", raw.data=TRUE,
              writeR=FALSE, targets)
@ 

\section{Replicated Genes}

Agi4x44PreProcess incorporates a function, genes.rpt.agi, 
to identify what we call 
gene sets, i.e. groups of identical genes, 
according to the ACCNUM code obtained from 
the corresponding Bioconductor annotation 
package, which are interrogated by different probes 
at different locations. Genes with an NA ACCNUM 
reference, as well as non-control replicated 
probes, are excluded from these computations. 
The genes.rpt.agi function deals with these 
gene sets and generates an HTML output that 
contains links to the ENSEMBL database in order to 
allow checking the exact location of the 
Agilent probe in the chromosome. Agi4x44PreProcess 
uses the chromosome coordinates provided by 
the Agilent manufacturer in the data extracted by 
AFE. The Agi4x44PreProcess library also reports 
information such as the distance amongst the
members of a gene set, the members of the same 
gene set that are reported to be located in different 
chromosomes, and the distribution of gene-set sizes. 


<<genes.rpt.agi>>=
genes.rpt.agi(dd,"hgug4112a.db",raw.data=TRUE,
               WRITE.html=FALSE,REPORT=FALSE)
@

Be aware that there may be non-control replicated probes that 
interrogate the same gene at the same 
location, and that the very same gene might be interrogated 
by another probe at a different chromosomal location. 


\section{Background Correction and Normalization between arrays}

To make direct comparisons of data coming from different chips it is important to remove 
sources of variation of a non-biological nature that may exist between arrays. Systematic 
non-biological differences between chips become apparent in several obvious ways, 
especially in labelling and in hybridization, and bias the relative measures on any two 
chips when we want to quantify the differences in treatment between two samples. Normalization 
is the attempt to compensate for systematic technical differences between chips, 
in order to see more clearly the systematic biological differences between samples. First the 
data are background corrected to produce a Background-Subtracted Signal. The 
Background Signal Used depends on the scanner settings for the type of background 
method and the settings for the spatial detrend. Usually, the Background Signal Used is the 
sum of the Local Background Signal plus the Spatial Detrending Surface Value computed by 
the scanner software. For the background correction we use the \Rfunction{backgroundCorrect} 
function of the \Rpackage{limma} package with options "half" or "normexp". This function is designed 
to produce positive corrected intensities: any intensity which is less than 0.5 
is reset to 0.5. In addition, an offset value (normally 50) is used. This offset
adds a constant to the intensities before log-transforming, so that the log ratios are shrunk 
towards zero at the lower intensities. After background correction, data are normalized 
between arrays using the \Rpackage{limma} function \Rfunction{normalizeBetweenArrays} 
with options "quantile" or "vsn".
	
For the foreground signal, the user can choose 
between the "MeanSignal" and the "ProcessedSignal", 
and between the "BGMedianSignal" and the "BGUsed" 
for the background signal that may be used 
in the background correction. The user may want 
to have a look at different plots of the intensities 
(density plots, etc.) in order to decide which signal 
they want to use in their analysis. The "MeanSignal" 
is the raw mean signal of the feature. The 
"ProcessedSignal" is the signal processed by the AFE. 
The "BGMedianSignal" is the median local background 
signal. The "BGUsed" depends on the scanner 
settings for the type of background method and the 
setting for the spatial detrend; usually it is the 
sum of the local background plus the spatial 
detrending surface value computed by 
the AFE software. The limma function \Rfunction{backgroundCorrect} 
is used for the background correction. 
This function is designed to produce positive intensities: 
any intensity which is less than 0.5 is reset 
to 0.5. Additionally, a constant (normally 50) 
is used as an offset that is added to the 
intensities before the log transformation. The effect 
of the offset addition is to shrink log ratios towards zero 
at the lower intensities, thus reducing the 
variability of the log-ratios for low intensity spots. The 
optimal choice for the offset is the one which makes 
the variability of the log-ratios as constant as possible 
across the range of intensity values 
(Smyth, G. in the BioC mailing list). If the "half" method is chosen for
the background correction, the method subtracts 
the chosen BACKGROUND signal from the 
chosen FOREGROUND signal and produces positive corrected 
intensities according to the "half" method. 
If the "normexp" method is selected, then a convolution 
of normal and exponential distributions is fitted 
to the foreground intensities using the background 
intensities as a covariate, and the expected signal given 
the observed foreground becomes the corrected intensity. 
See the \Rpackage{limma} user guide for details.  
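
As a worked illustration of the "half" correction described above (a sketch of the rule, 
not the exact \Rpackage{limma} code), a spot with foreground intensity $F$ and background 
intensity $B$, corrected with an offset, becomes
\[
F_{\mathrm{corrected}} \;=\; \max(F - B,\; 0.5) \;+\; \mathrm{offset},
\]
so that all corrected intensities are strictly positive before the log2 transformation.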


<<BGandNorm, fig=FALSE>>=
ddNORM=BGandNorm(dd,BGmethod="half",NORMmethod="quantile",
                 foreground="MeanSignal",background="BGMedianSignal",
                 offset=50,makePLOTpre=FALSE,makePLOTpost=FALSE)
@ 
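
Other combinations of the options listed above can be called in the same way. For instance, 
a not-run illustration using "normexp" background correction with "vsn" normalization 
(the "vsn" option requires the vsn package to be installed):

\begin{Sinput}
#  NOT RUN # 

> ddNORMvsn=BGandNorm(dd,BGmethod="normexp",NORMmethod="vsn",
                 foreground="MeanSignal",background="BGMedianSignal",
                 offset=50,makePLOTpre=FALSE,makePLOTpost=FALSE)

#  NOT RUN # 
\end{Sinput}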

If we set the variables makePLOTpre = TRUE 
and makePLOTpost = TRUE, 
a density plot, a boxplot differentiating 
negative controls from the rest of the signals, 
an MA plot identifying the different kinds of features, and a 
Relative Log Expression (RLE) plot \cite{Bold3}
are constructed using the data before and 
after normalization, respectively. The same 
plots can be obtained by calling the respective 
Agi4x44PreProcess plotting functions. 
See section 12.   

Usually, we prefer to normalize the 
data before filtering probes out. Most 
of the probes that are going to be 
filtered out are those that 
are not distinguishable from the background 
signal. Normally these are probes 
that are not expressed in the biological system under 
study and they will be filtered 
out, but it is interesting to keep 
their signal values during the normalization 
so that it uses as much information as possible.

\section{Filtering Probes}

The Agilent Feature Extraction software provides 
for each feature a flag that identifies 
different quantification errors of the signal. 
The quantification flags can be used to 
filter out signals that do not reach a minimum 
criterion of quality established by the 
user. The data are filtered at the feature level 
according to the following criteria.

a) To keep features within the dynamic range of the scanner:
for a given feature xi across all the samples, we demand that at least p \% 
of the spots of that feature in at least one experimental condition have 
a quantification flag denoting that the signal is distinguishable from the 
background. The same criterion is applied independently for the "IsFound" 
flag and for the saturation of the signal. 

b) To keep features that are of good quality: for each feature we filter 
out those that have more than y \% of the replicates in at least one
experimental condition with a flag indicating the presence of outliers. 

The function returns an RGList in which the filtered features have been eliminated.

In order to allow the tracking of features that may have been 
filtered out from the original raw data, 
the following files are given:   


RawDataNOCtrl.txt: contains all the features included in the array 
once the internal controls were removed. Internal controls were removed 
prior to any pre-processing step.

IsNOTWellAboveBG.txt: contains the features that were filtered 
out because they were not distinguishable from the local background 
signal. A Boolean flag indicates whether a feature is WellAboveBackground 
(Flag = 1) or not (Flag = 0). A feature gets Flag = 1 if it IsPosAndSignif 
and, additionally, its gBGSubSignal is greater than 2.6*g(r)BG\_SD.

IsPosAndSignif is a Boolean flag, established via a 2-sided t-test,
that indicates whether the mean signal of a feature is greater than the corresponding 
background. 1 indicates that the feature is positive and significantly above background.

IsNOTFound.txt: contains the features that were filtered out because 
they were NOT FOUND. A feature is considered Found if two conditions 
are true: 1) the difference between the feature signal and the local background 
signal is more than 1.5 times the local background noise, and 2) the spot 
diameter is at least 0.30 times the nominal spot diameter. A Boolean variable 
is used to flag found features. 1 = IsFound.

IsSaturated.txt: contains the probes that are saturated. A feature is saturated 
if 50 \% of the pixels in the feature are above the saturation threshold. 1 = Saturated.

IsFeatNonUnifOL.txt: contains the features that are considered a Non-Uniformity 
Outlier. A feature is non-uniform if the pixel noise of the feature exceeds a threshold 
established for a uniform feature. 1 indicates that the feature is a non-uniformity outlier.

IsFeatPopnOL.txt: contains the features that are considered a Population Outlier. 
A feature is a population outlier if its signal is less than a lower threshold or exceeds 
an upper threshold determined using a multiplier (1.42) times the interquartile range 
of the population. 1 indicates that the feature is a population outlier.

IsNOTWellAboveNEG.txt: in addition, for each feature we can demand a minimum 
signal value that has to be reached by at least p \% of the replicates of the feature 
in one of the experimental conditions. The minimum limit is established as the Mean 
of the Negative Controls + 1.5*(Std. dev. of the Negative Controls). Normally, after filtering 
by the WellAboveBG and IsFound criteria, all the probes are well above the negative controls.

In addition, in all the files indicated above we have added the ACCNUM, GENE SYMBOL,
ENTREZID reference and gene DESCRIPTION that map to each manufacturer probe 
code in the corresponding annotation package.
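
These tracking files can be inspected directly in R; for instance (a not-run illustration, 
assuming the files have been written to the working directory as tab-delimited text):

\begin{Sinput}
#  NOT RUN # 

> notAboveBG=read.delim("IsNOTWellAboveBG.txt")
> dim(notAboveBG)

#  NOT RUN # 
\end{Sinput}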

In the filter.probes function, which filtering steps are carried 
out is controlled by the following logical variables: 

control, wellaboveBG, isfound, wellaboveNEG, sat, 
PopnOL, NonUnifOL and nas. These 
logical variables are set to TRUE if we want to 
accomplish the corresponding filtering step: removing controls, 
keeping features well above background, keeping features that are found, 
keeping features well above the negative controls, removing saturated features, 
removing population outliers, removing non-uniformity 
outliers and removing NAs, respectively. 

The variables that control the filtering process are: 

limISF: for a given feature xi across samples, the minimum \% of spots 
for that feature that is demanded to remain in an experimental condition with an 
isfound-FLAG = 1 (Is Found). 
  
limNEG: for a given feature xi across samples, the minimum \% of spots for that 
feature that is demanded to remain in an experimental condition with an intensity above the limit 
established for the negative controls (Mean + 1.5 x SD).

limSAT: for a given feature xi across samples, the minimum \% of spots for that
feature that is demanded to remain in an experimental condition with a saturation-FLAG = 0 
(Non Saturated). 
  
limPopnOL: for a given feature xi across samples, the minimum \% of spots for that 
feature that can be seen in an experimental condition with a PopnOL-FLAG = 1 (Is Pop OL). 

limNonUnifOL: for a given feature xi across samples, the minimum \% of spots for 
that feature that can be seen in an experimental condition with a NonUnifOL-FLAG = 1 (Is Non Unif OL). 

limNAS: for a given feature xi across samples, the minimum \% of NA spots for that 
feature that is demanded to remain in an experimental condition. 


<<filter.probes, fig=FALSE>>=
ddFILT=filter.probes(ddNORM,
                control=TRUE,
                wellaboveBG=TRUE,
                isfound=TRUE,
                wellaboveNEG=TRUE,
                sat=TRUE,
                PopnOL=TRUE,
                NonUnifOL=TRUE,
                nas=TRUE,
                limWellAbove=75,
                limISF=75,
                limNEG=75,
                limSAT=75,
                limPopnOL=75,
                limNonUnifOL=75,
                limNAS=100,
                makePLOT=FALSE,annotation.package="hgug4112a.db",flag.counts=TRUE,targets)
@

<<>>=
dim(ddFILT)
@ 

If we set the variable makePLOT = TRUE, a density plot, a boxplot, and a Relative Log 
Expression (RLE) plot are constructed using the filtered data. The same plots can be 
obtained by calling the respective Agi4x44PreProcess plotting functions. See section 12.  


\section{Summarizing}

Normally, the Agilent 4 x 44 chips contain a set of non-control probes that are 
replicated up to ten times. These probes are spread over the chip and allow 
measuring the chip reproducibility in terms of the coefficient of variation (\%CV), 
in such a way that a lower CV indicates a better reproducibility of the array. 

These replicated probes can be seen as a sub-sampling or pseudo-replication 
of the same experimental unit, i.e. for the same sample and array (confounded) 
a given 60-mer sequence is replicated. The same probes within the array are 
correlated and their variability should be mainly due to differences in labelling, 
hybridization and array location. For the eventual statistical analysis we can leave 
the replicated probes as they are and analyze them independently of each other. 
This has the inconvenience of producing different results for the same probe, which 
eventually have to be condensed into a single one to make a decision about 
the differential expression of the gene that the probe is interrogating. We could also take
the sub-sampling into account through a statistical model that includes a term for 
the replication of the probes. The strategy adopted in Agi4x44PreProcess is to produce, 
for each set of replicated probes, a unique probe value obtained by computing the 
median of the intensities of the probes belonging to the replicated probe set. This has 
been implemented in the summarize.probe function. This function uses an RGList as 
an input and produces another RGList where each set of replicated non-control 
probes has been collapsed into a single value. Normally, 
the input RGList is the "filtered data", 
but other RGLists can be used as inputs. This summarization process could affect 
the estimated prior value and prior degrees of freedom for the residual variance 
of the genes when using eBayes in the limma package. However, 
since the number of replicated probes is extremely low in comparison to the total 
number of probes in the array, this effect is very small. 

This is an optional step that produces the processed data to be analyzed. 
If this step is not performed, the filtered data are the ones to be analyzed. 
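
The idea behind the summarization can be sketched as follows (a minimal illustration 
only, not the implementation used by summarize.probe, and assuming the filtered 
intensities are stored in ddFILT\$G):

\begin{Sinput}
#  NOT RUN # 

> # collapse replicated non-control probes by taking, probe by probe
> # and array by array, the median of the replicate intensities
> expr=ddFILT$G
> probe=ddFILT$genes$ProbeName
> summarized=apply(expr, 2, function(x) tapply(x, probe, median))

#  NOT RUN # 
\end{Sinput}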

<<summarize.probe,fig=FALSE>>=
ddPROC=summarize.probe(ddFILT, makePLOT=FALSE, targets)
@

If we set the variables makePLOT = TRUE  a density Plot,  a boxplot, a 
Relative Log Expression (RLE) plot, MVA plots and a hierarchical cluster 
plot are constructed using the summarized data.  The same plots can be obtained 
calling the respective Agi4x44PreProcess plotting functions. See section 12.  

\section{Creating an ExpressionSet object}

The build.eset function creates an instance of the \Robject{ExpressionSet} class 
from an RGList. Usually this function is applied to an \Robject{RGList} 
object containing the processed data, but certainly other RGList objects can be employed. 

<<build.eset,fig=FALSE>>=
esetPROC=build.eset(ddPROC, targets, makePLOT=FALSE,
                annotation.package="hgug4112a.db")
@ 

If we set the variable makePLOT = TRUE, a heatmap with the 
100 highest-variance genes, a hierarchical cluster with all the genes and a 
PCA plot are produced. The same plots can be obtained by calling the respective 
Agi4x44PreProcess plotting functions. See section 12. The following plots 
show the esetPROC data. 

The information contained in the ExpressionSet object can be written to a file, 
ProcessedData.txt, using the function write.eset. This function also writes 
the mappings of the Agilent PROBE ID to the ACCNUM, SYMBOL, 
ENTREZID and DESCRIPTION fields, using the corresponding annotation package. 

\begin{Sinput}

# NOT RUN

> write.eset(esetPROC,ddPROC,"hgug4112a.db",targets)

# NOT RUN

\end{Sinput}
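
Once the \Robject{ExpressionSet} has been built, the downstream statistical analysis 
can be carried out with other packages. As a sketch of the paired analysis discussed 
in section 2 (assuming the target file contains Treatment and Subject columns with 
exactly those names):

\begin{Sinput}
#  NOT RUN # 

> library("limma")
> Treatment=factor(targets$Treatment)
> Subject=factor(targets$Subject)
> # block by Subject and test the Treatment effect
> design=model.matrix(~Subject+Treatment)
> fit=eBayes(lmFit(esetPROC,design))
> topTable(fit,coef=ncol(design))

#  NOT RUN # 
\end{Sinput}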

\section{Mappings}

The function build.mappings creates a data.frame that contains
the PROBE IDs in rows and the "ACCNUM", "SYMBOL", "ENTREZID",
"DESCRIPTION", "GO.Id" and "GO.Terms" fields for each probe in columns. Mappings
are extracted from the corresponding annotation package.
Usually this function is applied to an ExpressionSet object containing
the processed data. 

\begin{Sinput}

# NOT RUN

> mappings=build.mappings(esetPROC,"hgug4112a.db")
> names(mappings)

# NOT RUN

\end{Sinput}

\section{GSEA outputs}

The function gsea.files generates the files "DataSet.gct" and "Phenotypes.cls" that are 
used by the Gene Set Enrichment Analysis tool (GSEA) \cite{GSEA}. 

\begin{Sinput}

# NOT RUN

> gsea.files(esetPROC,targets,"hgug4112a.db")

# NOT RUN

\end{Sinput}


\section{Plotting Functions}

We have implemented in Agi4x44PreProcess some diagnostic plots. 
These are functions to produce boxplots (BoxPlot and boxplotNegCtrl), 
density plots (plotDensity), MA plots (MVAplotMEDctrl and MVAplotMED),
Relative Log Expression plots (RLE), heatmaps (HeatMap), hierarchical clustering 
of samples (hierclus) and PCA plots (PCAplot). Let us see some examples:  

\subsection{BoxPlot}

\begin{center}
<<BoxPlot,eval=FALSE>>=
BoxPlot(log2(dd$G),"MeanSignal","green",
xlab="Samples",ylab="expression")
@ 
\end{center}

\subsection{boxplotNegCtrl}

\begin{center}
<<boxplotNegCtrl,fig=TRUE>>=
boxplotNegCtrl(dd,Log2=FALSE, channel="G")
@ 
\end{center}

The functions BoxPlot and boxplotNegCtrl construct a boxplot using the intensities 
of each sample. BoxPlot uses as input a matrix in log2 scale, whereas 
boxplotNegCtrl uses an RGList. In the latter case we have to pass to boxplotNegCtrl 
an argument that indicates whether the signal in the RGList on which the boxplot 
is based is in log2 scale (Log2=TRUE) or not (Log2=FALSE). The "channel" argument 
specifies on which signal the boxplot is based: if channel = "R", the data stored 
in dd\$R are used; if channel is missing or channel = "G", the data stored in dd\$G are used, 
as in the example above. In boxplotNegCtrl the gene signals and the signals of the 
negative controls are separated in the plot. This allows comparing the signals of the 
gene features with those of the negative controls. 
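
For instance, to look at the negative controls on the processed signal stored in dd\$R 
instead, one could call (not run):

\begin{center}
<<boxplotNegCtrlR,eval=FALSE>>=
boxplotNegCtrl(dd,Log2=FALSE, channel="R")
@ 
\end{center}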

\subsection{plotDensity}

This function creates a density plot with the intensities of the arrays.

\begin{center}
<<plotDensity,eval=FALSE>>=
plotDensity(log2(dd$G),"Density Plot example")
@ 
\end{center}


\subsection{MVAplotMEDctrl}

It creates an MA plot for each array using a synthetic array as a reference, i.e., the M 
value is computed for every spot as the difference between the spot in the 
array and the same spot averaged over the whole set of arrays. Every kind 
of feature is identified with a different colour. As in boxplotNegCtrl, 
we have to give the "channel" argument to specify on which 
signal the MA plot is based. The function also produces a short report giving 
information about how many spots of each kind (REPLICATED NON-CTRL, 
POSITIVE CTRL, NEGATIVE CTRL and STRUCTURAL) there are inside 
the chip. This function is normally applied to raw data. 


\begin{center}
<<MVAplotMEDctrl,eval=FALSE>>=

par(mfrow=c(2,2),ask=TRUE)	
MVAplotMEDctrl(dd,"MVA example",channel="G")
@ 
\end{center}

\subsection{MVAplotMED}

It creates an MA plot for each array using a "synthetic" array as a reference, but it does not 
distinguish the different sorts of spots. The function can be used to evaluate the performance 
of the normalization process. 


\begin{center}
<<MVAplotMED,eval=FALSE>>=
par(mfrow=c(2,2))
MVAplotMED(dd$G,"red","MVA example")
@ 
\end{center}


\subsection{RLE}

This function produces, for each sample, a boxplot that displays the Relative 
Log Expression (RLE) \cite{Bold3}. The RLE is computed for 
every spot in the array as the difference between the spot and the median 
of the same spot across all the arrays. As the majority of the spots are expected 
not to be differentially expressed, the plot should show boxplots centred on zero, 
all of them having approximately the same dispersion. An array showing 
greater dispersion than the others, or not centred at zero, could have quality problems. 


\begin{center}
<<RLE,fig=TRUE>>=
par(mfrow=c(1,1))	
RLE(log2(dd$G),"RLE example ","orange")

@ 
\end{center}


\subsection{HeatMap}

This function creates a heatmap using the \Rfunction{heatmap.2}
function of the \Rpackage{gplots} package. The plot is created for 
the number of highest-variance genes indicated in the "size" argument 
of the function. 


\begin{center}
<<HeatMap,fig=TRUE>>=
HeatMap(exprs(esetPROC),size=100,"100 High Var genes")
@ 
\end{center}


\subsection{hierclus}

This function makes a hierarchical cluster of the samples using 
the \Rfunction{hclust} function of the 
\Rpackage{stats} package. If the argument sel = TRUE, it selects the 
"size" highest-variance genes for the plot. 


\begin{center}
<<hierclus,fig=TRUE>>=

data(targets)
GErep=targets$GErep 
hierclus(exprs(esetPROC),GErep,methdis="euclidean",
        methclu="complete",sel=FALSE,size = 100)

@ 
\end{center}


\subsection{PCAplot}

This function makes a PCA plot of the sample space using the \Rfunction{plotPCA} 
function of the \Rpackage{affycoretools} package. 


\begin{center}
<<PCAplot,eval=FALSE>>=

data(targets)
PCAplot(esetPROC,targets)
@ 
\end{center}

\section{Parameter File}

It is convenient to write the settings that we want to use in the 
pre-processing in a parameter file. The parameter file 
can be a plain text file where the specific values for the variables 
used in Agi4x44PreProcess are defined. This file can be loaded into 
R using \Rfunction{source}, e.g. source("AGI4x44PreProcess.param.txt").
An example of this sort of file is provided below. 

\begin{Sinput}
# OVERALL parameters

        annotation.package="hgug4112a.db"
        foreground="MeanSignal"
        background="BGMedianSignal"

# READING THE Target File

        infile="targets.txt"

# READING THE DATA (RGList)

        makePLOT.rAg=TRUE

# PROBES REPLICATED & GENES REPLICATED

        makePROBES=FALSE
        makeGENES=FALSE

        raw.data=TRUE
        write.probes=TRUE
        WRITE.genes.html=TRUE
        REPORT.genes=TRUE

# NORMALIZATION (here the foreground and background are used)

        BGmethod="half"
        NORMmethod="quantile"
        offset=50
        makePLOTpre=TRUE
        makePLOTpost=TRUE

# FILTERING PROBES

        control=TRUE
        wellaboveBG=TRUE
        isfound=TRUE
        wellaboveNEG=TRUE
        sat=TRUE
        PopnOL=TRUE
        NonUnifOL=TRUE
        nas=TRUE
        limWellAbove=75.0
        limISF=75.0
        limNEG=75.0
        limSAT=75.0
        limPopnOL=75.0
        limNonUnifOL=75.0
        limNAS=100
        makePLOT.filt=TRUE
        flag.counts=FALSE

# SUMMARIZING PROBES

        makePLOT.summ=TRUE

# CREATING & WRITING EXPRESSIONSET

        makePLOT.eset=TRUE

# MAPPING

        makeMAPPINGS=TRUE

# GSEA output

        makeGSEA=TRUE
\end{Sinput}

\begin{Sinput}

# NOT RUN

> source("AGI4x44PreProcess.param.txt")

# NOT RUN

\end{Sinput}

Once this file has been loaded into R, we can write an R script containing all 
the Agi4x44PreProcess functions to carry out the pre-processing steps. 
This code can be read into R 
with \Rfunction{source}. An example of a typical script is given below.  

\begin{Sinput}

# NOT RUN

> library("Agi4x44PreProcess")
> library("hgug4112a.db")

> source("AGI4x44PreProcess.param.txt")

# reading target file (TXT).

> targets=read.targets(infile=infile)

# reading Agilent Feature Extraction data files (TXT).

> dd=read.AgilentFE(targets,makePLOT=makePLOT.rAg)
   
# PROBES REPLICATED 
	
> if(makePROBES){	
> CV.rep.probes(dd,annotation.package,
	foreground,raw.data,writeR=write.probes,targets) 
> }

# GENES REPLICATED 

> if(makeGENES){
> genes.rpt.agi(dd,annotation.package,raw.data,
	WRITE.html=WRITE.genes.html,REPORT=REPORT.genes)
> }

# NORMALIZATION 

> ddNORM=BGandNorm(dd,BGmethod,NORMmethod,
	foreground,background,
	offset,makePLOTpre,makePLOTpost)

# FILTERING PROBES 

> ddFILT=filter.probes(ddNORM,
control,
wellaboveBG,	
isfound,
wellaboveNEG,
sat,
PopnOL,
NonUnifOL,
nas,
limWellAbove,
limISF,
limNEG,
limSAT,
limPopnOL,
limNonUnifOL,
limNAS,
makePLOT=makePLOT.filt,annotation.package,flag.counts,targets)

# SUMMARIZING PROBES

> ddPROC=summarize.probe(ddFILT,makePLOT=makePLOT.summ,targets)

# CREATING EXPRESSIONSET OBJECT 

> esetPROC=build.eset(ddPROC,targets,makePLOT=makePLOT.eset,
	annotation.package)

# WRITING EXPRESSIONSET OBJECT: ProcessedData.txt 

> write.eset(esetPROC,ddPROC,annotation.package,targets) 

# MAPPING 

> if(makeMAPPINGS){ 
> mappings=build.mappings(esetPROC,annotation.package)
> }

# GSEA OUTPUTS 

> if(makeGSEA){	
> gsea.files(esetPROC,targets,annotation.package)
> } 

# NOT RUN

\end{Sinput}


\bibliographystyle{plain}
\bibliography{Agi4x44PreProcess}

\end{document}