%
% NOTE -- ONLY EDIT THE .Rnw FILE!!!  The .tex file is
% likely to be overwritten.
%
\documentclass[12pt]{article}

\usepackage{amsmath,pstricks}
\usepackage[authoryear,round]{natbib}
\usepackage{hyperref}


\textwidth=6.2in
\textheight=8.5in
%\parskip=.3cm
\oddsidemargin=.1in
\evensidemargin=.1in
\headheight=-.3in

\newcommand{\scscst}{\scriptscriptstyle}
\newcommand{\scst}{\scriptstyle}


\newcommand{\Rfunction}[1]{{\texttt{#1}}}
\newcommand{\Robject}[1]{{\texttt{#1}}}
\newcommand{\Rpackage}[1]{{\textit{#1}}}
\newcommand{\Rmethod}[1]{{\texttt{#1}}}
\newcommand{\Rfunarg}[1]{{\texttt{#1}}}
\newcommand{\Rclass}[1]{{\textit{#1}}}

\textwidth=6.2in

\bibliographystyle{plainnat} 
 
\begin{document}
%\setkeys{Gin}{width=0.55\textwidth}

\title{Reproducibility of Dressman JCO 2007}
\author{VJ Carey}
\maketitle

In the light of recent high-level challenges to reproducibility
of microarray studies (Ioannidis 2009 and others)
the dispute between Baggerly and Coombes (BC) and Dressman, Potti and Nevins (DPN) in
J Clin Oncology, 2008; 26(7):1186-1187, is of broad interest.  

In the 2007 paper that is the subject of BC's correspondence, 
Dressman et al. (JCO 25(5):517-25) studied 119
ovarian cancers with respect to responsiveness to platinum therapy and
presented analyses showing association of Src and E2F3 pathway
activation with survival time in platinum non-responsive tumors.
BC discovered a number of clerical errors in the published archive
associated with the 2007 paper, corrected the errors to the best
of their ability, extracted information to group arrays into chronologically
ordered batches, and presented graphics and analyses indicating that
expression distributions of some genes, and
survival time distributions, varied systematically across batches.
In analyses of the pathway activation:survival associations that
adjust for batch effects, pathway activation was no longer statistically significant.
One way of reading BC's contribution is as an identification of a published
association that is based, at least in part, on uncontrolled confounding.  It happens; the
title given to the correspondence is consistent with this gentle
reading: ``Run batch effects potentially compromise the usefulness of
genomic signatures for ovarian cancer''.

The response by DPN is vehement but has two main substantive components.
First, they argue that confounding could not be present because RMA preprocessing
was followed by sparse factor regression corrections that would eliminate artifacts
like batch effects.  Second, they argue that BC have no basis challenging reproducibility
of the original paper because BC do not `follow the methods outlined in [the] original report.''

The first response indicates excessive trust in complex statistical modeling 
to remove confounding effects in
the analysis of observational data.  Although there are indications that sparse factor
regression can ameliorate batch effects, there is no reason to believe that it
can eliminate them completely, and BC provide evidence, using Dressman's published
quantifications, which incorporate the benefits of sparse factor regression, that
batch effects remain.  See Figure 1(a) below, which shows clear structural
variation in both location and spread of RPS11, an 
element of the Src pathway signature, across
batches.  The expression measures are those used by Dressman et al., and
thus are denoted RMA+SFR (RMA followed by sparse factor regression).

The second response by DPN is flatly incorrect.  BC do follow the methods outlined in the
original report with one minor exception.  Because coefficients used to score tumors
for pathway activation have not been published, not by the original authors of
the cell-line based method for decisionmaking on pathway activation (Bild et al., 2006),
and not by Dressman et al., BC use Bild's published data to derive their own coefficients.
The derivation seems straightforward enough, and in fact allows rather close reproduction
of Dressman's Figure 2B.

I was able to approximately reproduce Dressman's original Figure 2B myself, independently
of code disseminated by BC in the supplement to their letter -- see Figure 1(b) below.
This figure is an excellent sanity check -- it shows that one of the associations
published in the original data is approximately recoverable using published archival data.
BC had a similar display


<<do1,fig=TRUE, echo=FALSE>>=
psurv = function (x, digits = max(options()$digits - 4, 3), ...) 
{
    saveopt <- options(digits = digits)
    on.exit(options(saveopt))
    if (!inherits(x, "survdiff")) 
        stop("Object is not the result of survdiff")
    if (!is.null(cl <- x$call)) {
    }
    omit <- x$na.action
    if (length(omit)) {
    }
    if (length(x$n) == 1) {
        z <- sign(x$exp - x$obs) * sqrt(x$chisq)
        temp <- c(x$obs, x$exp, z, signif(1 - pchisq(x$chisq, 
            1), digits))
        names(temp) <- c("Observed", "Expected", "Z", "p")
        print(temp)
    }
    else {
        if (is.matrix(x$obs)) {
            otmp <- apply(x$obs, 1, sum)
            etmp <- apply(x$exp, 1, sum)
        }
        else {
            otmp <- x$obs
            etmp <- x$exp
        }
        df <- (sum(1 * (etmp > 0))) - 1
        temp <- cbind(x$n, otmp, etmp, ((otmp - etmp)^2)/etmp, 
            ((otmp - etmp)^2)/diag(x$var))
        dimnames(temp) <- list(names(x$n), c("N", "Observed", 
            "Expected", "(O-E)^2/E", "(O-E)^2/V"))
        uu <- 1 - pchisq(x$chisq, df)
    }
    uu
}
library(dressCheck)
library(chron)
data(DrAsGiven)
data(srcWts)
data(corrp116)

an = as.numeric
plot(an(exprs(corrp116["213350_at",]))~chron(corrp116$rundate), main="(a)",
xlab="array run date", ylab="RMA+SFR expression of RPS11")

giv = DrAsGiven[intersect(featureNames(DrAsGiven), names(srcWts)),]
srcWtsL = srcWts[featureNames(giv)]
sco = t(exprs(giv))%*%srcWtsL
sdys = 1*(sco>median(sco))
par(mfrow=c(2,2))
library(survival)
#with(pData(DrAsGiven), plot(survfit(Surv(Survival, X0...alive...1...dead)~sdys,
# subset=response.0.NR..1.CR==0),col=c("blue", "yellow"), lwd=3,
# xlab="Months", ylab="survival (%)", main="(b)"))
#with(pData(DrAsGiven), d1 <<- survdiff(Surv(Survival, X0...alive...1...dead)~sdys,
# subset=response.0.NR..1.CR==0))
#text(37,.05, paste("logrank p=", round(psurv(d1),3)))
corr = corrp116[intersect(featureNames(corrp116), names(srcWts)),]
srcWtsL = srcWts[featureNames(corr)]
sco = t(exprs(corr))%*%srcWtsL
sdys = 1*(sco>median(sco))
with(pData(corrp116), plot(survfit(Surv(Survival, dead)~sdys,
 subset=CR==0),col=c("blue", "yellow"), lwd=3,
 xlab="Months", ylab="survival (%)", main="(a)"))
with(pData(corrp116), d2 <<- survdiff(Surv(Survival, dead)~sdys,
 subset=CR==0))
text(37,.05, paste("logrank p=", round(psurv(d2),3)))
data(e2f3Wts)
corr = corrp116[intersect(featureNames(corrp116), names(e2f3Wts)),]
eWtsL = e2f3Wts[featureNames(corr)]
sco = t(exprs(corr))%*%eWtsL
edys = 1*(sco>median(sco))
with(pData(corrp116), d1 <<- survdiff(Surv(Survival, dead)~edys,
 subset=CR==0))
with(pData(corrp116), plot(survfit(Surv(Survival, dead)~edys,
 subset=CR==0),col=c("blue", "yellow"), lwd=3,
 xlab="Months", ylab="survival (%)", main="(c)"))
text(37,.05, paste("logrank p=", round(psurv(d1),3)))
with(pData(corrp116), d2 <<- survdiff(Surv(Survival, dead)~edys,
 subset=CR==1))

with(pData(corrp116), plot(survfit(Surv(Survival, dead)~edys,
 subset=CR==1),col=c("blue", "yellow"), lwd=3,
 xlab="Months", ylab="survival (%)", main="(d)"))
text(57,.05, paste("logrank p=", round(psurv(d2),3)))

with(pData(corrp116), m2 <<- survreg(Surv(Survival, dead)~sdys+poly(chron(rundate),2),
 subset=CR==0))
summary(m2)

@

\end{document}