\documentclass[a4paper]{article}
\usepackage[OT1]{fontenc}
\usepackage{Sweave}
\usepackage{url}
\usepackage{afterpage}
\usepackage{hyperref}
\usepackage{geometry}
\geometry{ hmargin=3cm, vmargin=2.5cm }
\usepackage{graphicx}
\begin{document}

% \VignetteIndexEntry{a4vignette}
% \VignetteDepends{ALL, MLP, nlcv}

\title{Using the \texttt{a4} package}
\author{Willem Talloen, Tobias Verbeke}

\maketitle

\tableofcontents
\pagebreak{}

<<config,echo=FALSE>>=
options(width = 50)
options(continue=" ")
options(prompt="R> ")
set.seed(123)
@

\section{Introduction}

The \texttt{a4} suite of packages is a suite for convenient analysis of Affymetrix microarray
experiments which supplements Goehlmann and Talloen (2010). The suite currently consists of several packages which are centered around
particular tasks:
\begin{itemize}
  \item \texttt{a4Preproc}: package for preprocessing of microarray data. Currently the only function
    in the package adds complementary annotation information to the ExpressionSet objects
    (in function \texttt{addGeneInfo}). Many of the subsequent analysis functions rely on the presence
    of such information.
  \item \texttt{a4Core}: package made to allow for easy interoperability with the \texttt{nlcv} package
    which is currently being developed on R-Forge at \url{http://r-forge.r-project.org/projects/nlcv}.
  \item \texttt{a4Base}: all basic functionality of the \texttt{a4} suite
  \item \texttt{a4Classif}: functionality for classification work that has been split off a.o. 
    in order to reduce \texttt{a4Base} loading time
  \item \texttt{a4Reporting}: a package which provides reporting functionality and defines
    \texttt{xtable}-methods that are foreseen for tables with hyperlinks to public gene annotation
    resources. 
\end{itemize}

This document provides an overview of the typical analysis workflow for such microarray experiments using 
functionality of all of the mentioned packages. 

\section{Preparation of the Data}
First we load the package \texttt{a4} and the example real-life data set \texttt{ALL}.
<<loadPackage>>=
library(a4)
require(ALL)
data(ALL, package = "ALL")
@

For illustrative purposes, simulated data sets can also be very valuable (but not used here).
<<prepareSimulatedData>>=
require(nlcv)
esSim <- simulateData(nEffectRows=50, betweenClassDifference = 5, nNoEffectCols = 5, withinClassSd = 0.2)
@

\subsection{ExpressionSet object}
The data are assumed to be in an expressionSet object. Such an object structure combines different sources
of information into a single structure, allowing easy data manipulation (e.g., subsetting, copying)
and data modelling.

The texttt{featureData} slot is typically not yet containing all relevant
 information about the genes.
This interesting extra gene information can be added using \texttt{addGeneInfo}.

<<prepareALL>>=
ALL <- addGeneInfo(ALL)
@

<<show ALL, eval=FALSE, echo=FALSE>>=
# The phenotypic data
head(pData(ALL)[, c(1:5, 13, 18, 21)])
# The gene expression data
head(exprs(ALL)[, 1:5])
# The feature data
fDat <- head(pData(featureData(ALL)))
fDat[,"Description"] <- substr(fDat[,"Description"], 1, 30)
fDat
@

\subsection{Some data manipulation}
The \texttt{ALL} data consists out of samples obtained from
two types of cells with very distinct expression profiles; B-cells and T-cells. To have a more subtle signal,
gene expression will also be compared between the BCR/ABL and the NEG group within B-cells only.
To this end, we create the expressionSet \texttt{bcrAblOrNeg} containing only B-cells with BCR/ABL or NEG.
  
<<Create bcrAblOrNeg>>=
Bcell <- grep("^B", as.character(ALL$BT))  # create B-Cell subset for ALL

subsetType <- "BCR/ABL"  # other subsetType can be "ALL/AF4"
bcrAblOrNegIdx <- which(as.character(ALL$mol) %in% c("NEG", subsetType))
bcrAblOrNeg <- ALL[, intersect(Bcell, bcrAblOrNegIdx)]
bcrAblOrNeg$mol.biol <- factor(bcrAblOrNeg$mol.biol)
@

\pagebreak{}
\section{Unsupervised data exploration}

Spectral maps are very powerful techniques to get an unsupervised picture of how the data look like.
A spectral map of the \texttt{ALL} data set shows that the B- and the T-subtypes cluster together along the
 x-axis (the first principal component).
 The plot also indicates which genes contribute in which way to this clustering. For example, the genes located
 in the same direction as the T-cell samples are higher expressed in these T-cells. Indeed, the two genes at the
  left (TCF7 and CD3D) are well known to be specifically expressed by T-cells (Wetering 1992, Krissansen 1986).
 
<<spectralMapALL, fig=TRUE>>=
spectralMap(object = ALL, groups = "BT")
  
  # optional argument settings
  #    plot.mpm.args=list(label.tol = 12, zoom = c(1,2), do.smoothScatter = TRUE),
  #    probe2gene = TRUE)
@

A spectral map of the \texttt{bcrAblOrNeg} data subset does not show a clustering of BCR/ABL or NEG cells.

<<spectralMapALLSubset, fig=TRUE>>=
spectralMap(object = bcrAblOrNeg, groups = "mol.biol", probe2gene = TRUE)
@

\section{Filtering}
The data can be filtered, for instance based on variance and intensity, in order to reduce
 the high-dimensionality.

<<filtering>>=
selBcrAblOrNeg <- filterVarInt(object = bcrAblOrNeg)
propSelGenes <- round((dim(selBcrAblOrNeg)[1]/dim(bcrAblOrNeg)[1])*100,1)
@

This filter selected \Sexpr{propSelGenes} \% of the genes
(\Sexpr{dim(selBcrAblOrNeg)[1]} of the in total \Sexpr{dim(bcrAblOrNeg)[1]} genes).

\pagebreak{}
\section{Detecting differential expression}

\subsection{T-test}

<<tTest, term=FALSE>>=
tTestResult <- tTest(selBcrAblOrNeg, "mol.biol")
@  
<<tTestHist, fig=TRUE, term=FALSE>>=
histPvalue(tTestResult[,"p"], addLegend = TRUE)
propDEgenesRes <- propDEgenes(tTestResult[,"p"])
@

Using an ordinary t-test, there are \Sexpr{sum(tTestResult[, "pBH"] < 0.1)} genes significant at a FDR of 10\%.
The proportion of genes that are trully differentially expressed is estimated to be around \Sexpr{propDEgenesRes}.

The toptable and the volcano plot show that three most significant probe sets all target \texttt{ABL1}.
This makes sense as the main difference between BCR/ABL and NEG cells is a mutation in this particular ABL gene.

<<tabTTest, results = tex, term = FALSE, keep.source=TRUE>>=
tabTTest <- topTable(tTestResult, n = 10)
xtable(tabTTest,
    caption="The top 5 features selected by an ordinary t-test.",
  label ="tablassoClass")
@

<<tTestVolcanoPlot, fig=TRUE, term = FALSE>>=
volcanoPlot(tTestResult, topPValues = 5, topLogRatios = 5)
@

\subsection{Limma for comparing two groups}
In this particular data set, the  modified t-test using \texttt{limmaTwoLevels} provides
 very similar results. This is because the sample size is relatively large.

<<limmaTwoLevels, fig=TRUE, term=FALSE>>=
limmaResult <- limmaTwoLevels(selBcrAblOrNeg, "mol.biol")
volcanoPlot(limmaResult)
# histPvalue(limmaResult)
# propDEgenes(limmaResult)
@

It is very useful to put lists of genes in annotated tables where the genes get hyperlinks to 
\href{http://www.ncbi.nlm.nih.gov/sites/entrez?db=gene}{EntrezGene}.

<<limma, term=FALSE>>=
tabLimma <- topTable(limmaResult, n = 10, coef = 2) # 1st is (Intercept)
@

<<annotationTableLimma, results = tex, fig = FALSE, echo = FALSE, term = FALSE, keep.source=TRUE>>=
  tabLimmaSel <- tabLimma[,c("SYMBOL", "logFC", "AveExpr","P.Value","adj.P.Val", "GENENAME" )]
  tabLimmaSel[, "GENENAME"] <- substr(tabLimmaSel[,"GENENAME"], 1, 38)
  dData <- data.frame(Gene = tabLimmaSel[, 1], tabLimmaSel[,-1],
      stringsAsFactors = FALSE,  row.names = NULL)
  hData <- data.frame(Gene = generateEntrezIdLinks(tabLimma[,"ENTREZID"]))
  tabAnnot <- annotationTable(displayData = dData, hrefData = hData)
  xTabAnnot <- a4Reporting::xtable(tabAnnot, digits = 2,
         caption = "Top differentially expressed genes between disabled anf functional p53 cell lines.")
  print(xTabAnnot, include.rownames = FALSE, floating = FALSE)
@

\subsection{Limma for linear relations with a continuous variable}
Testing for (linear) relations of gene expression with a (continuous) variable is typically done
using regression. A modified t-test approach improves the results by penalizing small slopes.
The modified regressions can be applied using \texttt{limmaReg}. 

<<limmaReg, term=FALSE>>=

@

\pagebreak{}
\section{Class prediction}

There are many classification algorithms with profound conceptual and methodological differences.
 Given the differences between the methods,there's probably no single classification method
  that always works best, but that certain methods perform better depending on the characteristics
  of the data.

On the other hand, these methods are all designed for the same purpose, namely maximizing
 classification accuracy. They should consequently all pick up (the same) strong biological signal when present,
 resulting in similar outcomes.

Personally, we like to apply four different approaches; PAM, RandomForest, 
 forward filtering in combination with various classifiers, and LASSO.

All four methods have the property that they search for the smallest set of genes while having
 the highest classification accuracy. The underlying rationale and algorithm is very different
 between the four approaches, making their combined use potentially complementary.

\subsection{PAM}
PAM (Tibshirani 2002) applies univariate and dependent feature selection.

<<PAM, fig=TRUE, results = tex, term=FALSE, keep.source=TRUE>>=
resultPam <- pamClass(selBcrAblOrNeg, "mol.biol")
plot(resultPam)
featResultPam <- topTable(resultPam, n = 15)
xtable(head(featResultPam$listGenes),
    caption = "Top 5 features selected by PAM.")
@

\subsection{Random forest}
Random forest with variable importance filtering (Breiman 2001, Diaz-Uriarte 2006) applies multivariate
 and dependent feature selection. Be cautious when interpreting its outcome, as the obtained results
 are unstable and sometimes overoptimistic.

<<randomForest, fig=TRUE, results = tex, keep.source=TRUE>>=
resultRF <- rfClass(selBcrAblOrNeg, "mol.biol")
plot(resultRF, which = 2)
featResultRF <- topTable(resultRF, n = 15)
xtable(head(featResultRF$topList),
    caption = "Features selected by Random Forest variable importance.")
  
@


<<plotTop2_3genesRf, fig=FALSE, echo=FALSE, keep.source=TRUE>>=
plotCombination2genes(probesetId1=rownames(featResultRF$topList)[1],
    probesetId2=rownames(featResultRF$topList)[2],
    object = selBcrAblOrNeg, groups = "mol.biol")
@

\subsection{Forward filtering with various classifiers}
Forward filtering in combination with various classifiers (like DLDA, SVM, random forest, etc.)
 apply an independent feature selection. The selection can be either univariate or multivariate
 depending on the chosen selection algorithm; we usually choose Limma as a univariate
 although random forest variable importance could also be used as a multivariate selection criterium.


<<loadNlcvTT, echo = FALSE, term = FALSE>>=
#  nlcvTT <- nlcv(selBcrAblOrNeg, classVar = 'mol.biol', 
#                  classdist = "unbalanced",
#                  nRuns = 10, fsMethod = "t.test", verbose = TRUE)
data(nlcvTT)
@

<<MCRPlot, fig = TRUE, term = FALSE, keep.source=TRUE>>=
mcrPlot_TT <- mcrPlot(nlcvTT, plot = TRUE, optimalDots = TRUE,
    layout = TRUE, main = "t-test selection")
@

<<tabmcrPlot, results = tex, echo = FALSE>>=
xtable(summary(mcrPlot_TT),rownames=TRUE,
    caption = "Optimal number of genes per classification method together with the 
    respective misclassification error rate (mean and standard deviation across all CV loops).",
    label = "tabmcrPlot_TT")
@

<<ScoresPlot, fig = TRUE, term = FALSE>>=
scoresPlot(nlcvTT, tech = "svm", nfeat = 2)
@


\subsection{Penalized regression}
LASSO (Tibshirani 2002) or elastic net (Zou 2005) apply multivariate and dependent feature selection.

<<lasso, fig=TRUE, term = FALSE, keep.source=TRUE>>=
resultLasso <- lassoClass(object = bcrAblOrNeg, groups = "mol.biol")
plot(resultLasso, label = TRUE,
    main = "Lasso coefficients in relation to degree of penalization.")
featResultLasso <- topTable(resultLasso, n = 15)
@

<<tabLasso, results = tex, echo = FALSE, term = FALSE, keep.source=TRUE>>=
lassoTable <- xtable(featResultLasso, label = "tablassoClass",
    caption = "Features selected by Lasso, ranked from largest to smallest penalized coefficient.")
print(lassoTable, include.rownames = FALSE)
@

<<plotTop2_3genesLasso, fig = FALSE, echo = FALSE, term = FALSE, keep.source=TRUE>>=
op <- par(mfrow=c(1,2))
  plotCombination2genes(geneSymbol1 = featResultLasso$topList[1, 1], 
    geneSymbol2 = featResultLasso$topList[2, 1],
    object = bcrAblOrNeg, groups = "mol.biol",
    main = "Combination of\nfirst and second gene", addLegend = TRUE, 
    legendPos = "topright")

  plotCombination2genes(geneSymbol1 = featResultLasso$topList[1, 1], 
      geneSymbol2 = featResultLasso$topList[3, 1],
    object = bcrAblOrNeg, groups = "mol.biol",
    main = "Combination of\nfirst and third gene", addLegend = FALSE)
par(op)
@

\afterpage{\clearpage}
\pagebreak{}
\subsection{Logistic regression}
Logistic regression is used for predicting the probability to belong to a certain class
 in binary classification problems.

<<LogisticRegression, fig=TRUE, term = FALSE>>=
logRegRes <- logReg(geneSymbol = "ABL1", object = bcrAblOrNeg, groups = "mol.biol")
@

The obtained probabilities can be plotted with \texttt{ProbabilitiesPlot}. A horizontal line
indicates the 50\% threshold, and samples that have a higher probability than 50\% are indicated
with blue dots. Apparently, using the expression of the gene \texttt{ABL1}, quite a lot of samples
predicted to with a high probability to be NEG, are indeed known to be NEG.  
 
<<LogisticRegressionPlot, fig=TRUE, term = FALSE, keep.source=TRUE>>=
probabilitiesPlot(proportions = logRegRes$fit, classVar = logRegRes$y,
    sampleNames = rownames(logRegRes), main = "Probability of being NEG")
@

<<LogisticRegressionPlotBars, fig=TRUE, term = FALSE, keep.source=TRUE>>=
probabilitiesPlot(proportions = logRegRes$fit, classVar = logRegRes$y, barPlot= TRUE,
    sampleNames = rownames(logRegRes), main = "Probability of being NEG")
@

\subsection{Receiver operating curve}
A ROC curve plots the fraction of true positives (TPR = true positive rate)
   versus the fraction of false positives (FPR = false positive rate) for a binary classifier
 when the discrimination threshold is varied. Equivalently, one can also plot
 sensitivity versus (1 - specificity).
 
<<ROC, fig=TRUE, term = FALSE>>=
ROCres <- ROCcurve(geneSymbol = "ABL1", object = bcrAblOrNeg, groups = "mol.biol")
@

\section{Visualization of interesting genes}

\subsection{Plot the expression levels of one gene}

Some potentially interesting genes can be visualized using \texttt{plot1gene}. 
Here the most significant gene is plotted.

<<plotProfile, fig = TRUE, term = FALSE, keep.source=TRUE>>=
plot1gene(probesetId = rownames(tTestResult)[1],
    object = selBcrAblOrNeg, groups = "mol.biol", legendPos = "topright")
@

There are some variations possible on the default \texttt{plot1gene} function.
 For example, the labels of x-axis can be changed or omitted. 

<<otherSampleIDsInPlot1gene, fig = TRUE, term = FALSE, keep.source=TRUE>>=
plot1gene(probesetId = rownames(tTestResult)[1], object = selBcrAblOrNeg,
    groups = "mol.biol", sampleIDs = "mol.biol", legendPos = "topright")
@

Another option is to color the samples by another categorical variable than used for ordering.

<<plot1gene2vars, fig = TRUE, term = FALSE, keep.source=TRUE>>=
plot1gene(probesetId = rownames(tTestResult)[1], object = selBcrAblOrNeg,
    groups = "mol.biol", colgroups = 'BT', legendPos = "topright")
@

The above graphs plot one sample per tickmark in the x-axis. This is very useful to explore the data as one can
 directly identify interesting samples. If it is not interesting to know which sample has which expression level,
 one may want to plot in the x-axis not the samples but the groups of interest.
It is possible to pass arguments to the boxplot function to custopmize the graph. For example
 the \texttt{boxwex} argument allows to reduce the width of the boxes in the plot.
 
<<boxPlot, fig = TRUE, term = FALSE, keep.source=TRUE>>=
boxPlot(probesetId = rownames(tTestResult)[1], object = selBcrAblOrNeg, boxwex = 0.3,
    groups = "mol.biol", colgroups = "BT", legendPos = "topright")
@

\subsection{Plot the expression levels of two genes versus each other}

<<plotTop2_3genesLasso, fig = TRUE, term = FALSE, keep.source=TRUE>>=
plotCombination2genes(geneSymbol1 = featResultLasso$topList[1, 1],
    geneSymbol2 = featResultLasso$topList[2, 1],
    object = bcrAblOrNeg, groups = "mol.biol",
    main = "Combination of\nfirst and second gene", addLegend = TRUE, 
    legendPos = "topright")
@

\subsection{Plot expression line profiles of multiple genes/probesets across samples}

Multiple genes can be plotted simultaneously on a graph using line profiles.
Each line reflects one gene and are colored differenly.
As an example, here three probesets that measure the gene \texttt{LCK}, found to be
  differentially expressed between B- and T-cells. Apparently, one probeset does not measure
  the gene appropriately.
  
<<profilesPlot, fig = TRUE, term = FALSE, keep.source=TRUE>>=
myGeneSymbol <- "LCK"
probesetPos <- which(myGeneSymbol == featureData(ALL)$SYMBOL)
myProbesetIds <- featureNames(ALL)[probesetPos]

profilesPlot(object = ALL, probesetIds = myProbesetIds,
    orderGroups = "BT", sampleIDs = "BT")
@

\afterpage{\clearpage}
\pagebreak{}
\subsection{Smoothscatter plots}
It may be of interest to look at correlations between samples.
As each dot represents a gene, there are typically many dots. It is therefore wise to color the
dots in a density dependent way.
 
<<plotComb2Samples, fig = FALSE, term = FALSE, keep.source=TRUE, eval=FALSE>>=
plotComb2Samples(ALL, "11002", "01003",
    xlab = "a T-cell", ylab = "another T-cell")
@

<<plotComb2Samples, fig = FALSE, term = FALSE, echo=FALSE, keep.source=TRUE>>=
png(filename="plotComb2Samples.png",width=500,height=500)
plotComb2Samples(ALL, "11002", "01003",
    xlab = "a T-cell", ylab = "another T-cell")
dev.off()
@
\begin{figure}[h]
  \begin{center}
    \includegraphics[width=.95\textwidth]{plotComb2Samples.png}
    \caption{Correlations in gene expression profiles between two T-cell samples (samples 11002 and 01003).}
    \label{fig:plotComb2Samples}
  \end{center}
\end{figure}

\afterpage{\clearpage}
\pagebreak{}

If there are outlying genes, one can label them by their gene symbol by specifying 
the expression intervals (X- or Y- axis or both) that contain the genes to be highlighted
using \texttt{trsholdX} and \texttt{trsholdY}.
 
<<plotComb2SamplesWithAnnotation, fig = FALSE, term = FALSE, keep.source=TRUE, eval=FALSE>>=
plotComb2Samples(ALL, "84004", "01003",
    trsholdX = c(10,12), trsholdY = c(4,6),
    xlab = "a B-cell", ylab = "a T-cell")
@

<<plotComb2SamplesWithAnnotation2, fig = FALSE, term = FALSE, echo=FALSE>>=
png(filename="plotComb2SamplesWithAnnotation.png",width=500,height=500)
plotComb2Samples(ALL,"84004", "01003",
    trsholdX = c(10,12), trsholdY = c(4,6),
    xlab = "a B-cell", ylab = "a T-cell")
dev.off()
@
\begin{figure}[h]
  \begin{center}
    \includegraphics[width=.95\textwidth]{plotComb2SamplesWithAnnotation.png}
    \caption{Correlations in gene expression profiles between a B-cell and a T-cell
    (samples 84004 and 01003). Some potentially interesting genes are indicated by
    their gene symbol.}
    \label{fig:plotComb2SamplesWithAnnotation}
  \end{center}
\end{figure}

\afterpage{\clearpage}
\pagebreak{}

One can also show multiple pairwise comparisons in a pairwise scatterplot matrix.

<<plotCombMultipleSamples, fig = FALSE, term = FALSE>>=
plotCombMultSamples(exprs(ALL)[,c("84004", "11002", "01003")])
# text.panel= function(x){x, labels = c("a B-cell", "a T-cell", "another T-cell")})
@

<<plotCombMultipleSamples2, fig = FALSE, term = FALSE, echo=FALSE>>=
png(filename="plotCombMultipleSamples.png", width=500, height=500)
plotCombMultSamples(exprs(ALL)[, c("84004", "11002", "01003")])
dev.off()
@
\begin{figure}[h]
  \begin{center}
    \includegraphics[width=.95\textwidth]{plotCombMultipleSamples.png}
    \caption{Correlations in gene expression profiles between a B-cell and two T-cell samples
     (respectively samples 84004, 11002 and 01003).}
    \label{fig:plotCombMultipleSamples}
  \end{center}
\end{figure}

\afterpage{\clearpage}
\pagebreak{}
\subsection{Gene lists of log ratios}

When analyzing treatments that are primarily interesting relative to a control treatment,
it may be of value to look at the log ratios of several treatments (in columns)
for a selected list of genes (in rows).
 
<<GeneLRlist, term = FALSE, keep.source=TRUE>>=
ALL$BTtype <- as.factor(substr(ALL$BT,0,1))
ALL2 <- ALL[,ALL$BT != 'T1']  # omit subtype T1 as it only contains one sample
ALL2$BTtype <- as.factor(substr(ALL2$BT,0,1)) # create a vector with only T and B

# Test for differential expression between B and T cells
tTestResult <- tTest(ALL, "BTtype", probe2gene = FALSE)
topGenes <- rownames(tTestResult)[1:20]

# plot the log ratios versus subtype B of the top genes 
LogRatioALL <- computeLogRatio(ALL2, reference = list(var="BT", level="B"))
a <- plotLogRatio(e = LogRatioALL[topGenes,], openFile = FALSE, tooltipvalues = FALSE,
    device = "pdf", filename = "GeneLRlist",
    colorsColumnsBy = "BTtype", 
    main = 'Top 20 genes most differentially between T- and B-cells',
    orderBy = list(rows = "hclust"), probe2gene = TRUE)
@

\begin{figure}[h]
  \begin{center}
    \includegraphics[width=.95\textwidth]{GeneLRlist.pdf}
    \caption{Log ratios of the 20 genes that are most differentially expressed between
     B-cell and two T-cells.}
    \label{fig:GeneLRlist}
  \end{center}
\end{figure}

The following example demonstrates how to display log ratios for four compounds
for which gene expression was measured on four timepoints.

<<plotLogRatioComplex, keep.source=TRUE>>=
  load(system.file("extdata", "esetExampleTimeCourse.rda", package = "a4"))
  logRatioEset <- computeLogRatio(esetExampleTimeCourse, within = "hours",
    reference = list(var = "compound", level = "DMSO"))

  # re-order
  idx <- order(pData(logRatioEset)$compound, pData(logRatioEset)$hours)
  logRatioEset <- logRatioEset[,idx]
  
  # plot LogRatioEset across all
  cl <- "TEST"
  compound <- "COMPOUND"
  shortvarnames <- unique(interaction(pData(logRatioEset)$compound, pData(logRatioEset)$hours))
  shortvarnames <- shortvarnames[-grep("DMSO", shortvarnames), drop=TRUE]
  
  plotLogRatio(e = logRatioEset, mx = 1, filename = "logRatioOverallTimeCourse.pdf",
      gene.fontsize = 8,
      orderBy = list(rows = "hclust", cols = NULL), colorsColumnsBy = c('compound'),
      within = "hours", shortvarnames = shortvarnames, exp.width = 1,
      main = paste("Differential Expression (trend at early time points) in", 
          cl, "upon treatment with", compound),
      reference = list(var = "compound", level = "DMSO"), device = 'pdf')
@

\begin{figure}[h]
  \begin{center}
    \includegraphics[width=.95\textwidth]{logRatioOverallTimeCourse.pdf}
    \caption{Log ratios for four compounds at four time points (for 20 genes).}
    \label{fig:logRatioOverallTimeCourse}
  \end{center}
\end{figure}


\afterpage{\clearpage}
\pagebreak{}
\section{Pathway analysis}

\subsection{Minus log p}
 
The MLP method is one method of pathway analysis that is 
commonly used by the a4 suite user base. Although the
method is explained in detail in the MLP package vignette
we briefly walk throught the analysis steps using the
same example dataset used in the preceding parts of the
analysis. In order to detect whether certain gene sets
are enriched in genes with low p values, we obtain the
vector of p values for the genes and the corresponding
relevant gene sets:
 
<<MLP, keep.source=TRUE>>=
require(MLP)
# create groups
labels <- as.factor(ifelse(regexpr("^B", as.character(pData(ALL)$BT))==1, "B", "T"))
pData(ALL)$BT2 <- labels

# generate p-values
limmaResult <- limmaTwoLevels(object =  ALL, group = "BT2")
pValues <- limmaResult@MArrayLM$p.value

pValueNames <- fData(ALL)[rownames(pValues), 'ENTREZID']
pValues <- pValues[,2]
names(pValues) <- pValueNames
pValues <- pValues[!is.na(pValueNames)]
@

<<geneSet, keep.source=TRUE>>=
geneSet <- getGeneSets(species = "Human", 
    geneSetSource = "GOBP", 
    entrezIdentifiers = names(pValues)
)
tail(geneSet, 3)
@

Next, we run the MLP analysis:
<<MLP, keep.source=TRUE>>=
mlpOut <- MLP(
    geneSet = geneSet, 
    geneStatistic = pValues, 
    minGenes = 5, 
    maxGenes = 100, 
    rowPermutations = TRUE, 
    nPermutations = 50, 
    smoothPValues = TRUE, 
    probabilityVector = c(0.5, 0.9, 0.95, 0.99, 0.999, 0.9999, 0.99999), 
    df = 9)   
@

The results can be visualized in many ways, but for Gene Ontology based
gene set definitions, the following graph may be useful:

<<GOgraph, fig=FALSE, results=hide>>=
  pdf(file = "GOgraph.pdf")
    plot(mlpOut, type = "GOgraph", nRow = 25)
  dev.off()
@

\begin{figure}
\includegraphics{GOgraph}
\end{figure}

\section{Software used}

<<sessionInfo, echo=FALSE, results=tex>>=
toLatex(sessionInfo())
@

\end{document}