%
% NOTE -- ONLY EDIT THE .Rnw FILE!!!  The .tex file is
% likely to be overwritten.
%
\documentclass[12pt]{article}

\usepackage{amsmath,pstricks}
\usepackage[authoryear,round]{natbib}
\usepackage{hyperref}


\textwidth=6.2in
\textheight=8.5in
%\parskip=.3cm
\oddsidemargin=.1in
\evensidemargin=.1in
\headheight=-.3in

\newcommand{\scscst}{\scriptscriptstyle}
\newcommand{\scst}{\scriptstyle}


\newcommand{\Rfunction}[1]{{\texttt{#1}}}
\newcommand{\Robject}[1]{{\texttt{#1}}}
\newcommand{\Rpackage}[1]{{\textit{#1}}}
\newcommand{\Rmethod}[1]{{\texttt{#1}}}
\newcommand{\Rfunarg}[1]{{\texttt{#1}}}
\newcommand{\Rclass}[1]{{\textit{#1}}}

\textwidth=6.2in

\bibliographystyle{plainnat} 
 
\begin{document}
%\setkeys{Gin}{width=0.55\textwidth}

\section{Introduction}

The interchange between Baggerly, Coombes and Neeley (BCN) and
Dressman, Potti and Nevins (DPN) in XXX is of broad interest
to cancer bioinformaticians.  We are indebted to Dressman and
colleagues for voluntarily making available much of the
data and analytic resources underlying the ``Integrated
genomic-based approach'' paper of 2007.  We are indebted to
Baggerly and colleagues for their meticulous approach to
re-assessing the paper in conjunction with the data.  

To review, BCN use the archive at \url{http://data.cgt.duke.edu/platinum.php}
along with the GEO submission of Bild et al. (XXX, GSE3156) to raise
a number of concerns regarding the main paper on targeting treatment
of ovarian cancer.  I select those that are
simplest to illustrate and interpret.

\section{Identifier scrambling}

\textbf{First}, the ``corrected RMA'' expression quantifications

\url{https://discovery.genome.duke.edu/express/resources/193/correctedplatinum_RMA.xls}, 

retrieved online May 8 2009, appear to have \textbf{mislabeled columns}.

This was apparently discovered through an attempt to understand
the difference between RMA preprocessed arrays, readily generated
using the  CEL files in

\url{https://discovery.genome.duke.edu/express/resources/1144/PlatinumJCO.zip},

and the ``corrected RMA'' data, which consists of RMA quantifications
that are additionally processed by ``sparse factor regression''
to remove artifacts.

Figure 1 shows how the mislabeling can be discovered and corrected;
full details are provided in the original supplements of BCN's
letter, but using a Bioconductor experimental data package,
\textit{dressCheck}, we can illustrate the problem readily.

<<ged,keep.source=TRUE>>=
library(dressCheck)
if (!exists("c119")) data(c119) # pure RMA on CEL files, with trimming of sample names
if (!exists("DrAsGiven")) data(DrAsGiven) # read of corrected platinum XLS to CSV
    # some names are not preserved between two images
setdiff(sampleNames(c119), sampleNames(DrAsGiven))
    # use the common ones
okn = intersect(sampleNames(DrAsGiven), sampleNames(c119))
    # the corrected XLS does not have all genes, so c119 needs trim
c119r = c119[ featureNames(DrAsGiven), ]
    # now demonstrate
allc1 = sapply(okn, function(i) cor(exprs(DrAsGiven)[,okn[1]],
   exprs(c119r[,i])))
png(file="corChk.png")
par(mfrow=c(2,2))
remap1 = which.max(allc1)
plot(exprs(DrAsGiven)[,okn[1]], exprs(c119r)[,okn[1]],
 xlab=paste("id", okn[1], "in RMA+SFR XLS"),
 ylab=paste("id", okn[1], "in pure RMA from CEL"),
 main="(a)")
plot(exprs(DrAsGiven)[,okn[1]], exprs(c119r)[,names(remap1)],
 xlab=paste("id", okn[1], "in RMA+SFR XLS"),
 ylab=paste("id", names(remap1), "in pure RMA from CEL"),
 main="(b)")
allc2 = sapply(okn, function(i) cor(exprs(DrAsGiven)[,okn[2]],
   exprs(c119r[,i])))
remap2 = which.max(allc2)
plot(exprs(DrAsGiven)[,okn[2]], exprs(c119r)[,okn[2]],
 xlab=paste("id", okn[2], "in RMA+SFR XLS"),
 ylab=paste("id", okn[2], "in pure RMA from CEL"),
 main="(c)")
plot(exprs(DrAsGiven)[,okn[2]], exprs(c119r)[,names(remap2)],
 xlab=paste("id", okn[2], "in RMA+SFR XLS"),
 ylab=paste("id", names(remap2), "in pure RMA from CEL"),
 main="(d)")
dev.off()
@

\includegraphics{corChk}

Panels (a) and (c) show the kind of relationship expected when
two arrays on different samples are compared.  Panels
(b) and (d) shows the kind of relationship expected when slightly
different preprocessing method are applied to the same
sample.  BCN did this exercise very systematically,
concluding that the 'corrected RMA' arrays 1024 and 1447
in the XLS
matche the CEL files (and thus the clinical data
series for) 2505 and 1578 respectively.  In their response, DPN acknowledge
the identifier scrambling and remark that mislabeling only
affected the publication of data on the web site, not
the analyses underlying the paper.

The unscrambling using correlation to CEL succeeds for
116/119 samples.  The unscrambled data are in
ExpressionSet
\texttt{corrp116}.

\section{Persistence of batch effects through sparse
factor regression corrections}

Once the corrected quantifications are properly relabeled
they can be associated with the clinical data
appropriately.  They can also be associated with
processing information in the CEL files.  The
\texttt{corrp116} ExpressionSet includes this
information.

<<lk116>>=
library(chron)
if (!exists("corrp116")) data(corrp116)
dt = table(chron(corrp116$rundate))
cdt = chron(as.numeric(names(dt)))
names(dt) = cdt
dt
@

We will test for differential expression by batch.
<<lkte,fig=TRUE>>=
library(limma)
corrp116$cdate = factor(chron(corrp116$rundate))
des = model.matrix(~cdate, pData(corrp116))
if (!exists("f1")) f1 = lmFit(corrp116, des)
ef1 = eBayes(f1)
options(digits=4)
tt = topTable(ef1,2:16)[,-c(2:16)]
bigtt = topTable(ef1,2:16,n=1000)[,-c(2:16)]
mm = max(bigtt[,5])
tops = tt[,1]
par(mfrow=c(2,2))
plot(chron(corrp116$rundate), exprs(corrp116)[tops[1],],
   xlab="array rundate", ylab=tops[1])
plot(chron(corrp116$rundate), exprs(corrp116)[tops[2],],
   xlab="array rundate", ylab=tops[2])
plot(chron(corrp116$rundate), exprs(corrp116)[tops[3],],
   xlab="array rundate", ylab=tops[3])
plot(chron(corrp116$rundate), exprs(corrp116)[tops[4],],
   xlab="array rundate", ylab=tops[4])
@

The maximum adjusted $p$-value (BH FDR) in the top 1000
genes is \Sexpr{round(mm,5)}, so there is statistical
evidence for an association between mean expression
and run date for many genes.

The display shows that this association can take various
forms.  For the top two panels, there is indication
of a downward trend for all samples in late 2002,
early 2003.  The bottom left panel suggests an
increasing trend in mean in the same period, and
the bottom right panel shows that outliers can
be identified in tests of batch effects.

Baggerly et al show that there is an association
between survival time of sample and run date
(and I do so as well in the short letter).

The basic upshot of this section: It is probably
incorrect to dismiss the possibility of confounding
of expression-survival associations with run batch.
We don't know how to adjust for this in a way that
is completely reliable.  If we take the standard
epidemiological approach of introducing a factor
for batch in a survival regression model, we may
be over-adjusting.  But the retort in the DPN rebuttal
to BCN, asserting that batch effects cannot be present
because sparse factor regression corrects for them,
is incorrect.

\section{Non-reconstructibility of E2F3-survival
relationship}

The short letter is pretty clear on this.


\end{document}