%\VignetteIndexEntry{Polyfit}
\documentclass[12pt]{article}

\textwidth=6.2in
\textheight=8.5in
\oddsidemargin=0.2in
\evensidemargin=0.2in
\headheight=0in
\headsep=0in

\begin{document}
\SweaveOpts{concordance=TRUE}

\title{Polyfit Vignette}
\author{Conrad Burden}
\date{\today}
\maketitle

Polyfit~\cite{Burden14} is an add-on to the negative-binomial 
based package {\tt DESeq}~\cite{Anders10} 
for two-class detection of differential expression.  Its purpose is to  
ensure the p-value distribution is close to uniform over the interval [0, 1] for 
the subset of genes satisfying the null hypothesis of no differential expression.  
The first component of Polyfit is the function 
{\tt pfNbinomTest} which replaces the function {\tt nbinomTest} in {\tt DESeq}.  
Its purpose is to smooth point singularities (or `flagpoles'), 
particularly one at  
$p = 1$, in the p-value distribution caused by calculating 
calculating p-values from a discrete distribution.  The output 
from this function should then be passed to the second 
component, the function {\tt levelPValues}.  Its purpose is 
to apply a variant of the Storey-Tibshirani procedure~\cite{Storey03} to shift the 
p-values so that those corresponding to the null hypothesis have 
a unform distribution, and to calculate corresponding q-values 
(or `adjusted p-values') for controlling errors via the false 
discovery rate.

To load and attach Polyfit, type 

<<>>=
library(Polyfit)
@
at the R prompt.  edgeR and DESeq are dependencies and will be 
automatically loaded.  

\section*{Removing the flagpoles}

When calculating p-values, DESeq assumes as a null hypothesis 
that the total number of counts $K_A$ and $K_B$ summed over replicates in 
each of conditions A and B is distributed 
according to a negative binomial distribution with parameters estimated 
from the data.  The distribution of $K_A$, conditonal on the observed 
value $k_A + k_B$ of the sum of counts in both conditions is thus a 
discrete distribution.  DESeq calculates p-values as the sum 
of probabilities from this distribution less than or equal to the 
probability of the observed counts $(k_A, k_B)$ 
(see Fig.~\ref{fig:polyfitFig1}).  This method 
invariably causes a spikes in the histogram of generated p-values, 
the most notable one being at $p = 1$ from those observed counts 
which happen to hit the mode of the null hypothesis distribution.  

\begin{figure}[t]
\begin{center}
\includegraphics[width=0.9\textwidth]{polyfitFig1.pdf}
\caption{Calculation of p-values by the DESeq functions 
{\tt nbinomTest} (left) and 
the replacement Polyfit function {\tt pfNbinomTest} 
(right).}  
\label{fig:polyfitFig1}
\end{center}
\end{figure}

Because it needs to fit a smooth polynomial to the p-value
histogram, Polyfit redistributes the spikes by replacing the 
discrete distribution with a ``squared off'' continuous distribution, as 
shown in the right half of Fig.~\ref{fig:polyfitFig1}.  If the data 
are generated according to the postulated null hypothesis, the p-values 
will then have a uniform distribution on $[0, 1]$.

In the following example we use simulated data generated by the 
DESeq function {\tt makeExampleCountDataSet()}.  To generate 
p-values with the flagploes removed, replace the DESeq function 
{\tt nbinomTest} with its Polyfit replacement {\tt pfNbinomTest}.  
<<>>=
cds <- makeExampleCountDataSet()
cds <- estimateSizeFactors( cds )
cds <- estimateDispersions( cds )
nbT <- nbinomTest( cds, "A", "B" )
pValuesDESeq <- nbT$pval # <-- Original DESeq code
nbTPolyfit <- pfNbinomTest( cds, "A", "B" )
pValuesPFDESeq <- nbTPolyfit$pval # <-- Polyfit replacement
@
Histograms of the resulting p-values are shown in Fig.~\ref{fig:polyfitFig2}.  

<<label=fig2plot,include=FALSE,echo=FALSE>>=
oldpar <- par(mfrow=c(1,2))
hist(pValuesDESeq,breaks=seq(0,1,by=0.01), 
       		xlab="P-value", main="DESeq")
hist(pValuesPFDESeq,breaks=seq(0,1,by=0.01), 
 					xlab="P-value", main="polyfit-DESeq")
par(oldpar)
@

\begin{center}
\begin{figure}
<<label=fig2,fig=TRUE,echo=FALSE>>=
<<fig2plot>>
@
\caption{Histogram of p-values generated by {\tt nbinomTest} 
and {\tt pfNbinomTest}}
\label{fig:polyfitFig2}
\end{figure}
\end{center}

\section*{Levelling the p-value histogram}

Because the parameters of the negative binomial distribution for 
each gene must be estimated from the available count data, p-values 
reported by DESeq for those genes which are not differentially 
expressed may not be uniformly distributed.  A fairly extreme case is 
shown in Fig.~\ref{fig:polyfitFig4}(a) generated from DESeq using 
synthetic data after the flagpole has been removed as described above.   

\begin{figure}[t]
\begin{center}
\includegraphics[width=0.9\textwidth]{polyfitFig4.pdf}
\caption{(a) Example p-value spectrum calculated by DESeq for synthetic 
data RNA-seq with 15\% genes up- or down-regulated after removal of 
`flagpoles' (taken from ref.~\cite{Burden14}).  The shaded 
histogram is the 85\% of transcripts which are unregulated.  
(b) Schematic representation of the Storey-Tibshirani procedure 
for estimating false discovery rates, assuming correctly 
calculated p-values.  
(c) Schematic representation of the analogous Polyfit procedure.  
(TP = true positives, FP = false positives, FN = false 
negatives and TN = true negatives at a specified significance 
point $\alpha$.)}
\label{fig:polyfitFig4}
\end{center}
\end{figure}

If p-values have been calculated exactly, the Storey-Tibshirani procedure 
calculates q-values, that is, estimates of the false discovery rate, 
essentially by fitting a uniform distribution to the right hand part 
of the p-value histogram (see Fig.~\ref{fig:polyfitFig4}(b)).  
Polyfit replaces the uniform distribution fit with a polynomial fit 
(see Fig.~\ref{fig:polyfitFig4}(c)), and estimates p-values and 
q-values for each gene by the formulae
$$
\mbox{corrected p-value} = \frac{\mbox{FP}}{\mbox{FP} + \mbox{TN}} \: , \qquad \mbox{corrected q-value} = \frac{\mbox{FP}}{\mbox{FP} + \mbox{TP}} \: .  
$$

The function {\tt levelPValues} generates the levelled p-values, estimated 
q-values calculated from the adapted Storey-Tibshirani procedure and, 
for comparison, also reports q-values calculated from the levelled p-values using 
the Benjamini-Hochberg procedure.  

The following code calculates the corrected p-values and q-values from 
the nominal p-values generated in the DESeq example above:
<<>>=
lP <- levelPValues(pValuesPFDESeq)
outTable <- cbind(origPval= pValuesPFDESeq, 
                  levelledPval=lP$pValueCorr, 
                  levelledQval=lP$qValueCorr, 
                  BH_Qval=lP$qValueCorrBH)
head(outTable)
@

If the option {\tt plot=TRUE} is used a diagnostic plot in the format of
Fig.~\ref{fig:polyfitFig5} showing the p-value distribution before and 
after levelling is produced.  The top left hand panel plots 
estimates of the fraction ($\pi_0$) of genes not DE obtained by 
fitting a quadratic to the nominal p-value historgam (without flagpoles) 
over the interval $[\lambda, 1]$. Beneath this is a density plot of 
obtained estimates $\hat\pi_0$.  The optimal $\lambda$ 
(the red dot) is obtained by choosing the $\hat\pi_0(\lambda)$ 
closest to the mode of the $\hat\pi_0$ density.  The mode is also 
indicated by the dotted line in the top left panel. 
The original and corrected p-value histograms are shown on the right, 
together with optimally fitted quadratic (upper plot) and its image after 
correction (lower plot).  The red part of the quadratic and its image below 
correspond to the interval over which the quadratic is fitted. 

<<label=fig5plot,include=FALSE,echo=FALSE>>=
lP <- levelPValues(pValuesPFDESeq, plot=TRUE)
@

\begin{center}
\begin{figure}[h]
<<label=fig5,fig=TRUE,echo=FALSE>>=
<<fig5plot>>
@
\caption{Diagnostic plot produced by {\tt levelPValues}}
\label{fig:polyfitFig5}
\end{figure}
\end{center}

\begin{thebibliography}{3}

\bibitem{Anders10}
{Anders, S.\ and Huber, W.\ Differential expression analysis 
for sequence count data. 
{\it Genome Biology}, 11(10):R106 (2010)
}

\bibitem{Burden14}
{Burden, C.J., Qureshi, S.\ and Wilson, S.R.
Error estimates for the analysis of differential expression from RNA-seq count data. 
{\it PeerJ PrePrints 2:e400v1}, {\tt http://dx.doi.org/10.7287/peerj.preprints.400v1} 
(2014)
}

\bibitem{Storey03}
{Storey, J.\ and Tibshirani, R.
Statistical significance for genomewide studies.
{\it Proceedings of the National Academy of Science}, 
100(16):9440--9445 (2003)
}

\end{thebibliography}

\end{document}