% % NOTE -- ONLY EDIT THE .Rnw FILE!!! The .tex file is % likely to be overwritten. % \documentclass[12pt]{article} \usepackage{amsmath,pstricks} \usepackage[authoryear,round]{natbib} \usepackage{hyperref} \textwidth=6.2in \textheight=8.5in %\parskip=.3cm \oddsidemargin=.1in \evensidemargin=.1in \headheight=-.3in \newcommand{\scscst}{\scriptscriptstyle} \newcommand{\scst}{\scriptstyle} \newcommand{\Rfunction}[1]{{\texttt{#1}}} \newcommand{\Robject}[1]{{\texttt{#1}}} \newcommand{\Rpackage}[1]{{\textit{#1}}} \newcommand{\Rmethod}[1]{{\texttt{#1}}} \newcommand{\Rfunarg}[1]{{\texttt{#1}}} \newcommand{\Rclass}[1]{{\textit{#1}}} \textwidth=6.2in \bibliographystyle{plainnat} \begin{document} %\setkeys{Gin}{width=0.55\textwidth} \title{Reproducibility of Dressman JCO 2007} \author{VJ Carey} \maketitle In the light of recent high-level challenges to reproducibility of microarray studies (Ioannidis 2009 and others) the dispute between Baggerly and Coombes (BC) and Dressman, Potti and Nevins (DPN) in J Clin Oncology, 2008; 26(7):1186-1187, is of broad interest. But it seems that neither the editors of JCO nor the rebuttalists read the arguments of BC with much care. In preparing an invited chapter on reproducible research in a forthcoming monograph on cancer bioinformatics, I decided to look closely at the archive generated by BC at \url{http://bioinformatics.mdanderson.org/Supplements/ReproRsch-Ovary/} to see if a simple characterization of the dispute, perhaps with resolution, might be possible. The headline given to BC's letter was ``Run batch effects potentially compromise the usefulness of genomic signatures for ovarian cancer''. This shows a focus on the confounding of array batch with gene expression measures and survival -- see Figures 1(a) and (b) below, which respectively show a clear association, nonlinear in form, between mean RPS11 expression and run date, and a difference in survival distributions for early and late batches among platinum non-responders. The expression measures are the ones published by Dressman at the supplementary data web site for the 2007 paper, and thus employ the sparse factor regression corrections that are asserted by DPN to remove the confounding. Confounding happens. DPN are within their rights to hope that statistical modeling will remove it, but BC (and my Figure 1(a)) show that residual confounding was present after the sparse factor regression corrections. BC show that specific forms of adjusting for batch effects eliminate the significance of association between, e.g., Src activation and survival among platinum non-responsive patients; I will examine this more closely below. Because we are not sure what the absolutely right adjustment for confounding would be, the terminology of the title (``batch effects potentially compromise usefulness'') may be called for. BC have many more arguments, however, spread out over seven supplementary documents and hundreds of primary and derived data files. These arguments do not concern batch effects and confounding, but instead describe complete non-reproducibility of findings reported in Dressman's original paper. DPN assert that BC have no right to dispute reproducibility because BC do not ``repeat'' Dressman's original methods. The fact is that BC do ``repeat'' Dressman's original methods with one exception -- derivation of coefficients for pathway activation scoring, which is such a simple task that neither Bild nor Dressman troubled to publish their specific coefficients. Figure 1(c) shows that it is possible to obtain a very close approximation to Dressman's Figure 2B using the data from the supplemental web site, coupled with a simple linear algebraic operation to derive pathway scoring coefficients from Bild's cell-line archive. When I obtained Figure 1(c) (BC have a similar display at p.5 of supplemental document ovca7.pdf, but based on RMA alone) I felt there might be some vindication at hand for Dressman's original paper. I then used the same methods in an attempt to reconstruct Figure 2C, involving the E2F3 pathway. Figure 1(d) shows no association between E2F3 and survival among platinum non-responders; Figure 1(e) shows an association but only among platinum responders. These are qualitatively similar to findings reported by BC, p51 of ovca7.pdf. %We do not use sparse factor regression or %stochastic shotgun search. These are not needed to reproduce the key %results of interest. Only the sparse factor regression-corrected %quantifications and the clinical data are %needed. By using Dressman's posted quantifications, BC (and I) \textit{do} meet DPN's standard for reproducing Dressman's original analyses. And part of Dressman's original publication is reproducible. But other parts are not. One important part that was reconstructed (Src pathway activation association with survival in platinum non-responders) appears to be affected by residual confounding. When a parametric (Weibull) model relating survival time to Src pathway activation is elaborated with a quadratic term (2 d.f.) for calendar time of array batch, the significance of the pathway effect is altered from an initial $p=0.035$ to an adjusted $p=0.47$. It may be of interest to note that in my reanalysis the effect of E2F3 activation remains significant ($p < 10^{-4}$) after the quadratic adjustment for calendar time of array batch, but this is only so for the platinum responders. In this group the test by Dressman was reported to be completely insignificant (Figure 2E of the original paper.) There is no question that reasonable investigation of the data published in conjunction with the 2007 paper does \textit{not} allow even approximate reconstruction of some of the key points of the paper. Clerical errors in the published archive have already been admitted by DPN and some may remain. If the findings of this letter are dependent on mistakes in the archive used, the investigations of this letter can be repeated and the non-reconstructibility claim can be reevaluated. However, any thorough re-evaluation of this study must consider both the problem of reconstructibility (really a computational concern regarding soundness of data representations and availability of analysis algorithms) and the problem of substantive reproducibility (an inferential condition satisfied by studies that will, when conducted independently in other settings, yield scientifically consistent interpretations). The confounding unearthed by BC is a special characteristic of the datasets generated by Dressman et al., and if batch effects are playing a role in the apparent association of survival and Src activation, the inferences are extremely likely to be non-reproducible in the inferential sense, because no validating experiment will be likely to possess the same relationships between batch, expression and survival as found in the Dressman data. Dressman and colleagues are to be commended for making so much information related to their paper available in the supplemental data archives. Their harsh response to BC, which includes the claim that BC's analysis is ``egregiously flawed'', is completely uncalled for. Many of the analyses conducted by BC were undertaken charitably, to find patterns that are consistent with the claims of the paper but that were not reconstructible in detail when the avenues suggested by the paper were first explored. Furthermore, much of the analytic effort expended by BC (and independently by me) relates to properly correcting the clerical errors that remain in the online supplements to this day. How to improve the reliability of highly complex analyses of genome-scale data remains an open question. When certain choices of data representation and software interface are made, complex analyses involving multiple data sources can be expressed in a small number of highly platform-independent scripts. To illustrate this idea, this letter and all related data, calculations and graphics, can be obtained and concretely reconstructed from conventionally organized source codes using the Bioconductor package \textit{dressCheck}, part of the Experimental data series at www.bioconductor.org. This approach exposes my work to scrutiny by external analysts, who can see and criticize, correct, or reuse every computation that leads to each graphic and $p$-value cited in this letter. BC exposed themselves to the same level of scrutiny in their supplemental archive. The work they have done to help clarify the role of pathway activation in platinum non-responsive tumors, and to show how to conduct careful and verifiable criticism of high-level publications involving genome-scale data analysis, is deserving of thanks, not approbation. %computation is accompanied by source code that can be run in any reasonably %modern version of the open source statistical system R. \begin{figure} <>= psurv = function (x, digits = max(options()$digits - 4, 3), ...) { saveopt <- options(digits = digits) on.exit(options(saveopt)) if (!inherits(x, "survdiff")) stop("Object is not the result of survdiff") if (!is.null(cl <- x$call)) { } omit <- x$na.action if (length(omit)) { } if (length(x$n) == 1) { z <- sign(x$exp - x$obs) * sqrt(x$chisq) temp <- c(x$obs, x$exp, z, signif(1 - pchisq(x$chisq, 1), digits)) names(temp) <- c("Observed", "Expected", "Z", "p") print(temp) } else { if (is.matrix(x$obs)) { otmp <- apply(x$obs, 1, sum) etmp <- apply(x$exp, 1, sum) } else { otmp <- x$obs etmp <- x$exp } df <- (sum(1 * (etmp > 0))) - 1 temp <- cbind(x$n, otmp, etmp, ((otmp - etmp)^2)/etmp, ((otmp - etmp)^2)/diag(x$var)) dimnames(temp) <- list(names(x$n), c("N", "Observed", "Expected", "(O-E)^2/E", "(O-E)^2/V")) uu <- 1 - pchisq(x$chisq, df) } uu } library(dressCheck) library(chron) library(survival) data(DrAsGiven) data(srcWts) data(corrp116) an = as.numeric pdf(file="twox3.pdf", width=8, height=5) par(mfrow=c(2,3)) plot(an(exprs(corrp116["213350_at",]))~chron(corrp116$rundate), main="(a)", xlab="array run date", ylab="RMA+SFR expression of RPS11") CC = cut(chron(corrp116$rundate),2) with(pData(corrp116), d0 <<- survdiff(Surv(Survival, dead)~CC, subset=CR==0)) with(pData(corrp116), plot(survfit(Surv(Survival, dead)~CC, subset=CR==0),col=c("blue", "yellow"), lwd=3, xlab="Months", ylab="survival (%)", main="(b)")) text(37,.03, paste("logrank p=", round(psurv(d0),3))) #giv = DrAsGiven[intersect(featureNames(DrAsGiven), names(srcWts)),] #srcWtsL = srcWts[featureNames(giv)] #sco = t(exprs(giv))%*%srcWtsL #sdys = 1*(sco>median(sco)) #with(pData(DrAsGiven), plot(survfit(Surv(Survival, X0...alive...1...dead)~sdys, # subset=response.0.NR..1.CR==0),col=c("blue", "yellow"), lwd=3, # xlab="Months", ylab="survival (%)", main="(b)")) #with(pData(DrAsGiven), d1 <<- survdiff(Surv(Survival, X0...alive...1...dead)~sdys, # subset=response.0.NR..1.CR==0)) #text(37,.05, paste("logrank p=", round(psurv(d1),3))) corr = corrp116[intersect(featureNames(corrp116), names(srcWts)),] srcWtsL = srcWts[featureNames(corr)] sco = t(exprs(corr))%*%srcWtsL sdys = 1*(sco>median(sco)) with(pData(corrp116), plot(survfit(Surv(Survival, dead)~sdys, subset=CR==0),col=c("blue", "yellow"), lwd=3, xlab="Months", ylab="survival (%)", main="(c)")) with(pData(corrp116), d2 <<- survdiff(Surv(Survival, dead)~sdys, subset=CR==0)) text(37,.05, paste("logrank p=", round(psurv(d2),3))) data(e2f3Wts) corr = corrp116[intersect(featureNames(corrp116), names(e2f3Wts)),] eWtsL = e2f3Wts[featureNames(corr)] sco = t(exprs(corr))%*%eWtsL edys = 1*(sco