%\VignetteKeywords{runAbsoluteCN}
%\VignetteEngine{knitr::knitr}
%\VignetteDepends{PureCN}
%\VignettePackage{PureCN}
%\VignetteIndexEntry{PureCN}

\documentclass{article}

<<style-knitr, eval=TRUE, echo=FALSE, results="asis">>=
BiocStyle::latex()
@ 

\title{PureCN: Estimating tumor purity, ploidy, LOH and SNV status using hybrid
capture NGS data}
\author{Markus Riester}

\begin{document}

\maketitle
\tableofcontents

\section{Background}

\Biocpkg{PureCN} is a purity and ploidy aware copy number caller for cancer
samples inspired by the \software{ABSOLUTE} algorithm \cite{Carter2012}.  It
was designed for hybrid capture sequencing data, especially with medium-sized
targeted gene panels without matching normal samples in mind.

It can be used to enhance existing segmentation and normalization algorithms.
If the correct purity and ploidy solution was identified, \Biocpkg{PureCN} can
also help in classifying variants as germline vs. somatic or clonal vs.
sub-clonal.

\Biocpkg{PureCN} was further designed to integrate well with industry standard
pipelines \cite{McKenna2010}, but it is straightforward to generate input data
from other pipelines.

<<load-purecn, echo=FALSE, message=FALSE>>=
library(PureCN)
@

\section{Limitations}

\Biocpkg{PureCN} currently assumes a completely diploid normal genome. It tries to
detect the sex of samples by calculating the coverage ratio of chromosomes X
and Y and will then remove any non-diploid chromosomes (i.e. XY for males and Y
for females)\footnote{Loss of Y chromosome (LOY) can result in wrong female
calls, especially in high purity samples or if LOY is in both tumor and
contaminating normal cells.  Since homozygous germline SNPs on the X
chromosome are removed, this will not affect purity/ploidy selection. A
future version might warn users in this case when germline SNPs are
provided.`}. Non-diploid chromosomes provide only limited information for
purity and ploidy selection, but are of interest for copy number calling.
Future versions might support full copy number calling on the XY
chromosomes. 

\Biocpkg{PureCN} currently also assumes a mostly clonal genome. For most
clinical samples, this is reasonable, but very heterogenous samples are likely
not possible to call without manual curation. Due to the lack of signal, manual
curation is also often necessary in low purity samples or very quiet genomes. 

In the absence of matched normals, the software currently requires some normal
contamination to infer germline genotypes. Since cell lines are rarely matched,
\Biocpkg{PureCN} will likely not work well with cell line data.  

\section{Basic input files}

The algorithm first needs to calculate copy number log-ratios of tumor vs.
normal control.  The for this calculation required coverage data needs to be
provided in \software{GATK DepthOfCoverage} format:

\begin{footnotesize}
\begin{verbatim}
    Target  total_coverage  average_coverage    Sample1_total_cvg   Sample1_mean_cvg   
    Sample1_granular_Q1 Sample1_granular_median Sample1_granular_Q3 Sample1_._above_15
    chr1:69091-70009    0   0   0   0   1   1   1   0
    chr1:367659-368598  6358    9.25121153819759    6358    9.25121153819759    1   7   13  11.6
    chr1:621096-622035  6294    9.16910019318401    6294    9.16910019318401    1   7   12  9.5
\end{verbatim}
\end{footnotesize}

\Biocpkg{PureCN} will only use data from the columns "Target",
"total\_coverage", and "average\_coverage", all other columns are optional. If
\software{GATK} is not available, then we refer to the \CRANpkg{ExomeCNV}
\cite{Sathirapongsasuti2011} package and documentation, which provides scripts
for generating coverage files from BAM files. 

Germline SNPs and somatic mutations are expected in a single VCF file. This VCF
should contain a DB info flag for dbSNP membership. VCF files generated by
\software{MuTect} \cite{Cibulskis2013} should work well and in general require
no post-processing. \Biocpkg{PureCN} can handle \software{MuTect} VCF files
generated in both single and matched normal mode. If a matched normal is
available, then somatic status information is currently expected in a SOMATIC
info flag in the VCF.

\section{Example data}

We now load a few example files:
<<example_files, message=FALSE, warning=FALSE, results='hide'>>=

gatk.normal.file <- system.file("extdata", "example_normal.txt",
    package="PureCN") 
gatk.normal2.file <- system.file("extdata", "example_normal2.txt", 
    package="PureCN") 
gatk.normal.files <- c(gatk.normal.file, gatk.normal2.file) 
gatk.tumor.file <- system.file("extdata", "example_tumor.txt", 
    package="PureCN") 
vcf.file <- system.file("extdata", "example_vcf.vcf", package="PureCN")
@

To obtain gene level copy number calls, we need an exon-to-gene mapping file.
This is provided together with GC content via the gc.gene.file argument. The
expected format:

\begin{verbatim}
    Target    gc_bias    Gene
    chr1:69091-70009    0.427638737758433    OR4F5
    chr1:367659-368598    0.459574468085106    OR4F29
    chr1:621096-622035    0.459574468085106    OR4F3
\end{verbatim}

If \software{GATK} is available, then the \software{GCContentByInterval} tool
can be used to generate the GC bias column. Otherwise, several
\software{Bioconductor} annotation packages provide GC content information. 

<<example_files2>>=
gc.gene.file <- system.file("extdata", "example_gc.gene.file.txt", 
    package="PureCN")
@

\section{GC bias}

The algorithm works best when the coverage data is GC normalized. The steps
below will not GC normalize, since the example data is already normalized.  The
function \Rfunction{correctCoverageBias}, which borrows normalization code from
the \Biocpkg{TitanCNA} package \cite{Ha2014}, can be used to correct for GC
bias.  We recommend correcting for GC bias, storing the corrected coverage files
and then do the following steps on GC corrected data. All other assay-specific
biases such as mappability or exon length are mitigated by using a control
sample (GC bias is commonly library-specific).

<<gccorrect>>=
gatk.normal.file <- system.file("extdata", "example_normal.txt", 
    package = "PureCN")
gc.gene.file <- system.file("extdata", "example_gc.gene.file.txt", 
    package = "PureCN")
coverage <- correctCoverageBias(gatk.normal.file, gc.gene.file, "example_normal_loess.txt")
@

\section{Pool of Normals}

\subsection{Selection of normals for log-ratio calculation}

For calculating copy number log-ratios of tumor vs. normal, \Biocpkg{PureCN}
requires coverage from a process-matched normal sample.  Using a normal that
was sequenced using a similar, but not identical assay, rarely works, since
differently covered genomic regions result in too many log-ratio outliers. This
section describes how to identify a good process-matched normal in case no
matched normal is available or in case the matched normal has low or uneven
coverage.
 
The \Rfunction{createNormalDatabase} function builds a database of coverage
files:

<<normaldb>>=
normalDB <- createNormalDatabase(gatk.normal.files)
@

Internally, this function determines the sex of the samples and trains a PCA
that is later used for clustering a tumor file with all normal samples in the
database. This clustering is performed by the \Rfunction{findBestNormal} function:

<<normaldbpca>>=
# get the best normal
gatk.best.normal.file <- findBestNormal(gatk.tumor.file, normalDB)
@

This function can also return multiple normal files that can be averaged into a
single pool:  

<<normaldbpcapool>>=
# get the best 2 normals and average them
gatk.best.normal.files <- findBestNormal(gatk.tumor.file, normalDB, 
    num.normals=2)
pool <- poolCoverage(lapply(gatk.best.normal.files, readCoverageGatk),
     remove.chrs=c('chrX', 'chrY'))
@

Pooling is only recommended when the coverage in normals is significantly lower
than in tumor.  Otherwise the PCA will typically do a good job in selecting a
normal with decent coverage and similar biases compared to tumor. But it is
worth experimenting with different strategies.  When pooling, it might be also
necessary to remove certain principal components, most likely the first one, to
avoid that coverage is the only selection criteria. The
\Rfunction{plotBestNormal} function might be helpful in finding a good strategy.
Note that this example removes sex chromosomes; if the normal database contains
a sufficient number of samples with matching sex, \Rfunction{findBestNormal}
will return only normal samples with matching sex.

\subsection{Artifact filtering}

It is important to remove as many artifacts as possible, since low ploidy
solutions are typically punished more by artifacts than high ploidy solutions.
High ploidy solutions are complex and usually find ways of explaining artifacts
reasonably well. A large selection of normal files is helpful for the
identification and removal of artifacts. We can for example use the normal
coverage data to estimate the expected variance in coverage per exon. This will
help the segmentation algorithm to filter noise. This step is optional, but
recommended with the default segmentation function and large pool of normals.

<<exonweightfile1>>=
exon.weight.file <- "exon_weights.txt"
createExonWeightFile(gatk.tumor.file, gatk.normal.files, exon.weight.file)
@

This function calculates exon-level copy number log-ratios for all normal
samples provided in the \Rcode{gatk.normal.files} argument.  Assuming that all
normal samples are in general diploid, an unusual high variance in log-ratio
is indicative of an exon with either common germline alterations or frequent
artifacts; high or low copy number log-ratios in those exons are unlikely
measuring somatic copy number events. 

For the log-ratio calculation, we provide a coverage file that is used as tumor
in the log-ratio calculation. The corresponding \Rcode{gatk.tumor.file}
argument can also be an array of coverage files, in which case the exon
coverage variance is averaged over all provided tumor files. As long as the
coverage of the tumor files is even, only 1 or 2 tumor files are necessary,
however. It is also safe to use a high coverage normal file instead of a tumor
file in this step:

<<exonweightfile2>>=
#createExonWeightFile(gatk.normal.files[1], gatk.normal.files[-1], 
#   "exon_weights.txt")
@

We can also use the pool of normals to find SNPs with biased allelic fractions
in low quality regions (very significantly different from 0.5 for heterozygous
SNPs). We do not do this here for simplicity. This step is recommended for
unpaired samples and optional for paired samples. If a matching normal is
provided, most variants in low quality regions with biased allelic fractions are
automatically ignored. Note that this is not meant to model non-reference bias
leading in expected allelic ratio only slightly below 0.5 (typically around
0.48). This bias is typically not strong enough to influence selection of SNV
states.

<<snpblacklist>>=
#mutect.normal.files <- dir("poolofnormals", pattern="vcf$", full.names=TRUE) 
#snp.blacklist <- createSNPBlacklist(mutect.normal.files)
#write.csv(snp.bl[[1]], file="SNP_blacklist.csv")
#write.csv(snp.bl[[2]], file="SNP_blacklist_segmented.csv", row.names=FALSE, 
#   quote=FALSE)
@

\subsection{Artifact filtering without a pool of normals}

By default, \Biocpkg{PureCN} will exclude exons with coverage below 15X from
segmentation. For SNVs, the same 15X cutoff is applied. \software{MuTect}
applies more sophisticated artifact tests and flags suspicious variants. If
\software{MuTect} was run in matched normal mode, then both potential artifacts
and germline variants are rejected, that means we cannot just filter by the
PASS/REJECT \software{MuTect} flags. The \Rfunction{filterVcfMuTect} function
optionally reads the \software{MuTect} stats file and will keep germline
variants, while removing potential artifacts.  Without the stats file,
\Biocpkg{PureCN} will use only the filters based on read depths as defined in
\Rfunction{filterVcfBasic}.  Both functions are automatically called by
\Biocpkg{PureCN}, but can be easily modified and replaced if necessary.

\section{Recommended run}

Finally, we can run \Biocpkg{PureCN} with all that information:

<<runpurecn>>=
ret <-runAbsoluteCN(gatk.normal.file=pool, gatk.tumor.file=gatk.tumor.file, 
    vcf.file=vcf.file, sampleid='Sample1', gc.gene.file=gc.gene.file, 
#args.filterVcf=list(snp.blacklist=snp.blacklist, stats.file=mutect.stats.file), 
    args.segmentation=list(exon.weight.file=exon.weight.file), 
    post.optimize=FALSE, plot.cnv=FALSE)
@

The post.optimize flag will increase the runtime by a lot, but might be worth it
if the copy number profile is not very clean. This is recommended with lower
coverage datasets or if there are significant capture biases. For high quality
whole exome data, this is typically not necessary. Note the uncommented line,
which is recommended, but not used here for simplicity. The
\Rfunction{segmentationPSCBS} function requires the \CRANpkg{PSCBS} package
\cite{Olshen2011} and sometimes gives better results than the default when the
coverage data is relatively noisy, especially for whole exome data with a large
number of heterozygous SNPs. The \Rcode{plot.cnv} argument is set to
\Rcode{FALSE} for this vignette, but generates often helpful additional plots
provided by the segmentation function.

We also need to create a few output files:

<<createoutput>>=
file.rds <- 'Sample1_PureCN.rds'
saveRDS(ret, file=file.rds)
pdf('Sample1_PureCN.pdf', width=10, height=12)
plotAbs(ret, type='all')
dev.off()
@

The first plot is an overview plot which shows the purity and ploidy local
optima, sorted by final likelihood score after fitting both copy number and
allelic fractions:

<<figureexample1, fig.show='hide'>>=
plotAbs(ret, type="overview")
@
\incfig{figure/figureexample1-1}{0.75\textwidth}{Overview.}
{The colors visualize the copy number fitting score from low (blue) to high
(red). The numbers indicate the ranks of the local optima.}

We now look at the main plots of the maximum likelihood solution in more detail.

<<figureexample2, fig.show='hide'>>=
plotAbs(ret, 1, type="hist")
@
\incfig{figure/figureexample2-1}{0.75\textwidth}{Log-ratio histogram.}

Figure~\ref{figure/figureexample2-1} displays a histogram of tumor vs. normal
copy number log-ratios for the maximum likelihood solution (number 1 in
Figure~\ref{figure/figureexample1-1}). The height of a bar in this plot is
proportional to the fraction of the genome falling into the particular
log-ratio copy number range. The vertical dotted lines and numbers visualize
the, for the given purity/ploidy combination, expected log-ratios for all
integer copy numbers from 0 to 7. It can be seen that most of the log-ratios of
the maximum likelihood solution align well to expected values for copy numbers
of 0, 1 and 2. 

<<figureexample3, fig.show='hide'>>=
plotAbs(ret, 1, type="BAF")
@
\incfig{figure/figureexample3-1}{0.95\textwidth}{B-allele frequency plot.}{Each
dot is a (predicted) germline SNP.}

Germline variant data are informative for calculating integer copy number,
because unbalanced maternal and paternal chromosome numbers in the tumor portion
of the sample lead to unbalanced germline allelic fractions.  Equations for
calculating expected allelic fractions are given in the Supplemental Information
of the \Biocpkg{PureCN} manuscript.  Figure~\ref{figure/figureexample3-1} shows
the allelic fractions of predicted germline SNPs.  In the middle panel, the
corresponding copy number log-ratios are shown. The lower panel displays the
calculated integer copy numbers, corrected for purity and ploidy.

<<figureexample4, fig.show='hide'>>=
plotAbs(ret, 1, type="AF")
@
\incfig{figure/figureexample4-1}{0.95\textwidth}{Allele fraction plots.}{Each
dot is again a (predicted) germline SNP. This plot normally also shows somatic
mutations in two additional panels. This toy example contains only germline 
SNPs however.}

Finally, Figure~\ref{figure/figureexample4-1} provides more insight into how
well the variants fit the expected values. The left panel shows the correlation
of expected and observed allelic fractions. The expected value is determined by
the most likely state. High ploidy solutions have a lot of states, so this
correlation is expected to be good for high ploidy solutions.  The right panel
plots the observed allelic fractions against copy number. The labels show the
expected values for all called states; 2m1 would be diploid, heterozygous, 2m2
diploid, homozygous.
\clearpage

\section{Manual curation}

For prediction of SNV status (germline vs. somatic, sub-clonal vs. clonal,
homozygous vs. heterozygous), it is important that both purity and ploidy are
correct. We provide functionality for curating results:

<<curationfile>>=
createCurationFile(file.rds) 
@

This will generate a CSV file, in which the correct purity and ploidy values
can be manually entered. It also contains a column "Curated", which should be
set to TRUE, otherwise the file will be overwritten when re-run.


Then in R, the correct solution (closest to the combination in the CSV file)
can be loaded with the \Rfunction{readCurationFile} function:

<<readcurationfile>>=
ret <- readCurationFile(file.rds)
@

This function has various handy features, but most importantly it will re-order
the local optima so that the curated purity and ploidy combination is ranked
first. This means \Rcode{plotAbs(ret,1,type="hist")} would show the plot for
the curated purity/ploidy combination, for example.

The default curation file will list the maximum likelihood solution:

<<curationfileshow>>=
read.csv('Sample1_PureCN.csv')
@

\Biocpkg{PureCN} currently only flags samples with warnings, it does not mark
any samples as failed. The \Rcode{Failed} column in the curation file can be
used to manually flag samples for exclusion in downstream analyses.

\section{Custom segmentation}

By default, we will use \CRANpkg{DNAcopy} \cite{Venkatraman2007} to segment the
log-ratio. It is straightforward to replace the default with other methods and
the \Rfunction{segmentationCBS} function can serve as an example. The
\Rfunction{segmentationPSCBS} function is another example which uses the
\CRANpkg{PSCBS} package \cite{Olshen2011}.

It is also possible to provide already segmented data, which we however only
recommend when matched SNP6 data is available. Otherwise it is usually better to
customize the segmentation function as described above, since the algorithm then
has access to the raw log-ratio distribution. The expected file format for
already segmented copy number data is:

\begin{verbatim}
    ID  chrom   loc.start   loc.end num.mark    seg.mean
    Sample1   1   61723   5773942 2681    0.125406444072723
    Sample1   1   5774674 5785170 10  -0.756511807441712
\end{verbatim}

\section{Output}

The \Rcode{plotAbs()} call above will generate the main plots shown in the
manuscript. The R data file (file.rds) contains gene level copy number calls,
SNV status and LOH calls.  The purity/ploidy combinations are sorted by
likelihood and stored in \Rcode{ret\$results}.

<<output1>>=
names(ret)
head(ret$results[[1]]$gene.calls, 3)
@

This data.frame also contains gene level LOH information. The SNV posteriors:

<<output2>>=
head(ret$results[[1]]$SNV.posterior$beta.model$posteriors, 3)
@
    
This lists all posterior probabilities for all possible SNV states. M0 to M7 are
multiplicity values, i.e. the number of chromosomes harboring the mutation (e.g.
1 heterozygous, 2 homozygous if copy number C is 2). Columns with the ML prefix
indicate maximum likelihood estimates, e.g. ML.AR is the expected allelic ratio
of the most likely state, AR is the observed allelic ratio.  GERMLINE.CONTHIGH
and GERMLINE.CONTLOW are the two contamination states. The former are
homozygous germline SNPs that were not filtered out because reference alleles
from another individual were sequenced, resulting in allelic fractions smaller
than 1. The latter are non-reference alleles only present in the contamination. 

\section{Prediction of somatic status and cellular fraction}

The SNV posteriors above provide posterior probabilities for all possible
states.  The \Rfunction{predictSomatic} function adjusts these posterior
probabilities by excluding germline states that do not correspond to the
maternal and paternal chromosome numbers (since \Biocpkg{PureCN} only considers
heterozygous variants, SNPs are either from the maternal or paternal
chromosome). For predicted somatic mutations, this function also provides
cellular fraction estimates, i.e. the fraction of cells with mutation. Fractions
significantly below 1 indicate sub-clonality\footnote{This number can be above 1
when the observed allelic fraction is higher than expected for a clonal
mutation. This maybe due to random sampling, wrong copy number, sub-clonal copy
number events, or wrong purity/ploidy estimates.}.

<<output3>>=
head(predictSomatic(ret), 3)
@

Segment 1 in this toy example has clearly copy number 2 and LOH, so
heterozygous mutations are impossible and the posterior probabilities for
\Rcode{GERMLINE.M1} are set to 0.

\appendix

\bibliography{PureCN}

\section*{Session Info}
<<sessioninfo, results='asis', echo=FALSE>>=
toLatex(sessionInfo())
@

\end{document}