%\VignetteIndexEntry{Detection of de novo copy number alterations in case-parent trios}
%\VignetteDepends{oligoClasses, VanillaICE, MinimumDistance}
%\VignetteKeywords{MinimumDistance, copy number, SNP, case-parent trios, de novo}
%\VignettePackage{MinimumDistance}
\documentclass{article}
\usepackage{graphicx}
%\usepackage[utf8]{inputenc}
%\usepackage[T1]{fontenc}
\usepackage{natbib}
\usepackage{url}
\usepackage{amsmath}
\usepackage{amssymb}
\newcommand{\Rfunction}[1]{{\texttt{#1}}}
\newcommand{\Rmethod}[1]{{\texttt{#1}}}
\newcommand{\Rcode}[1]{{\texttt{#1}}}
\newcommand{\Robject}[1]{{\texttt{#1}}}
\newcommand{\Rpackage}[1]{{\textsf{#1}}}
\newcommand{\Rclass}[1]{{\textit{#1}}}
\newcommand{\R}{\textsf{R}}
\newcommand{\md}{\Rpackage{MinimumDistance}}
\newcommand{\F}{\mathrm{F}}
\newcommand{\M}{\mathrm{M}}
\newcommand{\Of}{\mathrm{O}}
\newcommand{\RR}{\mathrm{LR}}
\newcommand{\LR}{\mathrm{LR}}
\newcommand{\Blrr}{\mbox{\boldmath $R$}}
\newcommand{\Bbaf}{\mbox{\boldmath $B$}}
\newcommand{\logRratio}{$\log_2$ R ratio}
\newcommand{\lrrlong}{$\log_2$ R ratio}
\newcommand{\blrr}{\mbox{\boldmath $r$}}
%\newcommand{\bbaf}{\mbox{\boldmath $b$}}
\newcommand{\baf}{B allele frequency}
\newcommand{\bafs}{B allele frequencies}
\newcommand{\tsl}{trioSetList}
\newcommand{\ts}{trioSet object}
\newcommand{\mindist}{\ensuremath{\mbox{\boldmath $d$}}}
\DeclareMathOperator{\I}{\mathbb{I}}
\usepackage[margin=1in]{geometry}


\title{Detection of de novo copy number alterations in case-parent
  trios using the \R{} package \md{}}
\date{\today}

\author{Rob Scharpf}

\begin{document}
\maketitle

<<setup, echo=FALSE, results=hide>>=
options(prompt="R> ", continue=" ", device=pdf, width=65)
@

\begin{abstract}

  For the analysis of case-parent trio genotyping arrays, copy number
  variants (CNV) appearing in the offspring that differ from the
  parental copy numbers are often of interest (de novo CNV). This
  package defines a statistic, referred to as the minimum distance,
  for identifying de novo copy number alterations in the offspring. We
  smooth the minimum distance using the circular binary segmentation
  algorithm implemented in the Bioconductor package
  \Rpackage{DNAcopy}.  Trio copy number states are inferred from the
  maximum a posteriori probability of the segmented data,
  incorporating information from the log R ratios and B allele
  frequencies.  As both log R ratios and B allele frequencies are
  estimable from Illumina and Affymetrix arrays, this package supports
  de novo copy number inference in both platforms.

\end{abstract}


\section{Introduction}

There are numerous \R{} packages available from Bioconductor for
smoothing copy number alterations.  For example, the biocview
\texttt{CopyNumberVariants} in the 2.9 release of Bioconductor lists
27 packages.  For the analysis of germline diseases, hidden Markov
models have emerged as a popular tool as inference regarding the
latent copy number state incorporates information from both the
estimates of copy number (or relative copy number) and the allelic
frequencies, such as the B allele frequency \citep{Peiffer2006} or
genotype calls \citep{Colella2007, Wang2008, Scharpf2008}. For the
analysis of somatic cell diseases such as cancer, algorithms that
segment the genome into regions of constant copy number (referred to
here as segmentation algorithms) may be more preferable for the
detection of copy number aberrations (CNA) as a mixture of cells with
different copy numbers give rise to mosaic (non-integer) copy numbers.
Examples include circular binary segmentation implemented in the \R{}
package \Rpackage{DNAcopy} and the \Rpackage{GLAD}, both of which were
orginally developed for array CGH platforms
\cite{Olshen2004,Hupe2004,Venkat2007}.  One disadvantage of
segmentation algorithms is that inference regarding duplication and
deletions is not directly available.

More recently, HMMs and segmentation algorithms have been developed
for inferring common regions of copy number alterations in multiple
samples.  However, relatively few algorithms are available for
inferring copy number alterations, especially de novo copy number
alterations, in family-based study designs involving case-parent
trios. Instead, a common strategy has been a two-step approach of
identifying copy number alterations in the individual samples and then
comparing the results across samples to infer whether an alteration
observed in the offspring is de novo.  A disadvantage of the two-step
approach is that unadjusted sources of technical variation, such as
waves, can contribute to false positives.  The joint HMM implemented
in the PennCNV software is one of the few algorithms to date that
provides direct inference regarding de novo alterations in case parent
study designs.

This package develops an alternative framework for inferring regions
of de novo copy number alterations in case-parent trios.  Like the
PennCNV joint HMM, inference regarding de novo alterations integrates
information from both the log R ratios and B allele
frequencies. Differences in the two approaches are most apparent in
non-HapMap experimental datasets in which technical and experimental
sources of variation appear to contribute to a large number of false
positives. This vignette describes the analysis pipeline from
preprocessed and normalized estimates of copy number and allele
frequencies to inference of de novo copy number alterations.  The
workflow is illustrated using a publicly available HapMap trio and a
publicly available oral cleft trio, each assayed on the Illumina
610quad array.

\section{Data input}

<<registerBackend>>=
library(oligoClasses)
library(VanillaICE)
library(SummarizedExperiment)
library(MinimumDistance)
foreach::registerDoSEQ()
@

\subsection{Reading and organizing the annotation on the markers}

We require that marker-level annotation is represented as a
\Rclass{GRanges}-derived class.  To read the plain text annotation
file, we use the \Rfunction{fread} function provided in the
\Rpackage{data.table} package.  In addition, we define an indicator
for whether the marker is polymophic using the 'Intensity Only' flag.
The \Rclass{SnpGRanges} class created from the \Robject{fgr} object in
the following code-chunk ensures that this binary indicator is created
and can be reliably accessed.

<<FeatureAnnotation>>=
library(data.table)
extdir <- system.file("extdata", package="VanillaICE")
features <- suppressWarnings(fread(file.path(extdir, "SNP_info.csv")))
fgr <- GRanges(paste0("chr", features$Chr), IRanges(features$Position, width=1),
               isSnp=features[["Intensity Only"]]==0)
fgr <- SnpGRanges(fgr)
names(fgr) <- features[["Name"]]
@

Ideally, one should include the genome build and the chromosome
lengths appropriate to the build. Here, we extract the metadata on the
chromosomes using the BSgenome Bioconductor package for the hg18
build.  Finally, we sort the \Robject{fgr} object such that the
chromosomes are ordered by their \texttt{seqlevels} and the markers
are ordered by their genomic position along the chromosome.

<<seqinfo>>=
library(BSgenome.Hsapiens.UCSC.hg18)
sl <- seqlevels(BSgenome.Hsapiens.UCSC.hg18)
seqlevels(fgr) <- sl[sl %in% seqlevels(fgr)]
seqinfo(fgr) <- seqinfo(BSgenome.Hsapiens.UCSC.hg18)[seqlevels(fgr),]
fgr <- sort(fgr)
@

\subsection{Organizing the marker-level summaries}


The abbreviated plain text files included with this package that
contain the log R ratios and B allele frequencies from 2 trios are
listed below.


<<sourceFiles>>=
files <- list.files(extdir, full.names=TRUE, recursive=TRUE, pattern="FinalReport")
@

We parse these files into a specific format such that all downstream
steps for copy number estimation no longer depend on the format of the
source files.  At this point, we encapsulate the names of the source
files ('sourcePaths'), the location of where we intend to store the
parsed data ('parsedPath'), and the genomic marker annnotation created
in the preceding section ('rowRanges') in a single object called an
\Rclass{ArrayViews}.

<<ArrayViews>>=
##
## Where to keep parsed files
##
parsedDir <- "ParsedFiles"
if(!file.exists(parsedDir)) dir.create(parsedDir)
views <- ArrayViews(rowRanges=fgr, sourcePaths=files, parsedPath=parsedDir)
show(views)
@

Because the format of the source files depends on upstream software,
we read one file and store information regarding its parsing so that
subsequent files can be parsed in a similar fashion. We use the
\Rfunction{fread} to read in the first file.

<<fread>>=
## read the first file
dat <- fread(files[1], skip="[Data]")
head(dat,n=3)
@

Next, we select which columns we plan to keep. Again, the required
data for downstream processing is the name of the SNP identifier, the
log R ratios, and B allele frequencies.

<<select_columns>>=
## information to store on the markers
select_columns <- match(c("SNP Name", "Allele1 - AB", "Allele2 - AB",
                          "Log R Ratio", "B Allele Freq"), names(dat))
@

We also specify the order in which we will store the marker-level
summaries by matching the \Rfunction{rownames} of the \Robject{views}
object with the names of the markers in the source file:

<<order_of_markers>>=
index_genome <- match(names(fgr), dat[["SNP Name"]])
@

Similar to the parameter classes defined in \Rpackage{Rsamtools}, we
encapsulate the information for parsing the columns and rows of the
source files in a class.  In addition, we specify which variable names
in the source file refers to log R ratios ('cnvar'), B allele
frequences ('bafvar'), and genotypes ('gtvar').

<<scan_params>>=
scan_params <- CopyNumScanParams(index_genome=index_genome,
                                 select=select_columns,
                                 cnvar="Log R Ratio",
                                 bafvar="B Allele Freq",
                                 gtvar=c("Allele1 - AB", "Allele2 - AB"))
@


The \Rfunction{parseSourceFile} will parse a single file in the
\Robject{views} object (by default, the first file) according to the
parameters for reading the data in the \Robject{scan\_params} object
and write the output to the \Rfunction{parsedPath} directory. In
particular, the \Rfunction{parseSourceFile} returns \Rclass{NULL}.

<<parseSourceFile>>=
parsedPath(views)
parseSourceFile(views[, 1], scan_params)
@

To apply the function \Rfunction{parseSourceFile} to all arrays in the
\Robject{views} object, one can use the functional \Rfunction{sapply}.

<<applyParseSourceFile>>=
invisible(sapply(views, parseSourceFile, param=scan_params))
@

Apart from confirming their existance, the user should not have a need
to directly access the parsed files.  Utilities for querying these
files are provided through the \Robject{views} object.
<<list_parsed_files>>=
head(list.files(parsedPath(views)), n=3)
@

\subsection{Accessors for the parsed data}

The preference for writing the parsed data to disk rather than keeping
the data in RAM is simply that the latter does not scale to projects
involving thousands of samples. For the former, slices of the parsed
data easily be accessed from the \Rfunction{parsedPath} directory via
methods defined for the \Rclass{ArrayViews} class.  For example, one
can use accessors for the low-level summaries directly:
\Rfunction{lrr}, \Rfunction{baf}, and \Rfunction{genotypes} for log R
ratios, B allele frequencies, and genotypes, respectively.  The user
has the option of either subsetting the views object or subsetting the
matrix returned by the accessor to extract the appropriate data
slice. In the following examples, we access data on the first 2
markers and sample indices 2-4.

<<Views>>=
lrr(views)[1:2, 2:4]
## or
lrr(views[1:2, 2:4])
## B allele frequencies
baf(views[1:2, 2:4])
## potentially masked by function of the same name in crlmm
VanillaICE::genotypes(views)[1:2, 2:4]
@

 More often, it is useful to extract the low-level data in a
 \Rclass{RangedSummarizedExperiment}-derived class such that meta-data on
 the samples remains bound to the columns of the assay data (log R
 ratios / B allele frequencies) and meta-data on the rows remains
 bound to the rows of the assay data. This is accomplished by applying
 the \Rfunction{MinDistExperiment} function to a \Robject{views}
 object, as described in the following section.

\section{Binding the genomic, marker, and individual-level data}

\subsection{Pedigree annotation}

The container for a single pedigree is the \Rclass{ParentOffspring}
class.

<<pedigree_hapmap>>=
ped_hapmap <- ParentOffspring(id = "hapmap", father="12287_03",
                              mother="12287_02",
                              offspring="12287_01",
                              parsedPath=parsedPath(views))
@

For families with multiple affected offspring, the argument to
offspring can be a character-vector with length greater than 1.  We
can store any number of \Rclass{ParentOffspring} objects in a list
using the \Rclass{ParentOffspringList} class. In particular, for the 2
trios provided in the \Rpackage{VanillaICE} package

<<pedigree_list>>=
ped_list <- ParentOffspringList(pedigrees=list(
                                  ParentOffspring(id = "hapmap", father="12287_03",
                                                  mother="12287_02",
                                                  offspring="12287_01",
                                                  parsedPath=parsedPath(views)),
                                  ParentOffspring(id = "cleft",
                                                  father="22169_03",
                                                  mother="22169_02",
                                                  offspring="22169_01",
                                                  parsedPath=parsedPath(views))))
pedigreeName(ped_list)
@

For a single pedigree, the \Rclass{MinDistExperiment} encapsulates the
annotation on the pedigree, the genomic annotation of the markers, and
the marker-level summary statistics.  Note, however, that the sample
identifiers in the pedigree are not the same as the file identifiers
used by default to create the \R{} object \Robject{views}.  To resolve
the naming issue, we read in separate file provided in the
\Rpackage{VanillaICE} package:

<<sample_data>>=
sample_info <- read.csv(file.path(extdir, "sample_data.csv"), stringsAsFactors=FALSE)
ind_id <- setNames(gsub(" ", "", sample_info$IndividualID), sample_info$File)
colnames(views) <- ind_id[gsub(".txt", "", colnames(views))]
@

\subsection{MinDistExperiment class}

The constructor function for \Rclass{MinDistExperiment} extracts from
the views object only the data relevant for the provided pedigree.

<<mindistexperiment>>=
me <- MinDistExperiment(views, pedigree=ped_list[[2]])
colnames(me)
me
@

\section{Detection of de novo copy number variants}
\label{smoothing}

A container for the various parameters used to segment and call de
novo copy number variants is provided by the \Rclass{MinDistParam}. A
constructor function of the same name provides a default
parametrization that works well in most instances:

<<param_class>>=
params <- MinDistParam()
show(params)
@

The parameters in this class are organized according to function.  For
example, the argument \Robject{dnacopy} takes an argument of class
\Rclass{DNAcopyParam} that contains setting that are passed to the
\Rfunction{segment} function for the implementation of circular binary
segmentation in the \R{} package \Rpackage{DNAcopy}.  Changing the
parameters for the segmentation is most easily accomplished by a call
to the constructor.  E.g.,

<<dnacopy>>=
segment_params <- DNAcopyParam(alpha=0.01)
params <- MinDistParam(dnacopy=segment_params)
@

Several of the parameters relate to a priori assumptions of the
probability of a non-Mendelian transmission.  These a priori
assumptions are defined in PennCNV \cite{Wang2008} and encapsulated in
the \Rclass{PennParam} class.

<<penn_param>>=
penn_param <- PennParam()
show(penn_param)
@

\noindent Finally, parameters for the emission probabilities computed
by the \R{} package \Rpackage{VanillaICE} are encapsulated in the
\Rclass{EmissionParam} class.  Again, default values for the emission
probabilities will be created automatically as part of the
\Robject{params} object if none is specified.


\paragraph{Segmentation and posterior calls.}

%To keep the size of the \Rpackage{MinimumDistance} package small, the
%\Robject{trioSetList} object created in the previous section contained
%only 25 markers for each of the 22 autosomes.  While useful for
%illustrating construction of the \Rclass{TrioSetList} class, there are
%too few markers included in the example to illustrate the smoothing
%and posterior calling algorithm for detecting de novo copy number
%events. To demonstrate these steps, we load a \Robject{TrioSetList}
%object containing several thousand markers for chromosomes 7 and 22
%for 2 HapMap trios:
%
%<<loadTrioSetListExample>>=
%data(trioSetListExample)
%me <- as(trioSetList, "MinDistExperiment")
%@

For a given trio, the signed minimum absolute difference of the
offspring and parental \lrrlong{}s (\blrr{}) is defined as

\begin{eqnarray}
  \label{eq:distance} \mindist &\equiv& \left(\blrr_\Of - \blrr_\M\right)
\times \I_{[\left|\blrr_\Of - \blrr_\F\right| > \left|\blrr_\Of -
      \blrr_\M\right|]} + \left(\blrr_\Of - \blrr_\F\right) \times
  \I_{[\left|\blrr_\Of - \blrr_\F\right| \leq \left|\blrr_\Of - \blrr_\M\right|
    ]}.
\end{eqnarray}

If the offspring copy number changes within a minimum distance segment
(as determined by the segmentation of the offspring copy number), the
start and stop position of the minimum distance segments may be
edited.  The approach currently implemented is to define a new start
and stop position if a breakpoint for the offspring segmentation
occurs in the minimum distance interval. To illustrate, the following
diagram uses vertical dashes (\texttt{|}) to denote breakpoints:

\begin{verbatim}
1 ...--|--------------|--...     ## minimum distance segment (before editing)
2 ...----|--------|------...     ## segmenation of log R ratios for offspring

->

3 ...--|-|--------|---|--...     ## after editing
\end{verbatim}

\noindent In the above illustration, posterior calls are provided for
the 3 segments indicated in line 3 instead of the single segment in
line 1.  Two additional steps are therefore required: (1) segmentation
of the offspring log R ratios and (2) editing of the minimum distance
breakpoints when appropriate. The following codechunk, segments the
log R ratios for all chromosomes in the \Robject{me} object and edits
the ranges for the minimum distance when conflicts with the offspring
segmentation boundaries arise (as illustrated above).

<<computeMinimumDistance>>=
mdgr <- segment2(me, params)
@

For each minimum distance segment in which the mean minimum distance
is above a user-specificed cutoff in absolute value (specified in
terms of the median absolute deviations of the minimum distance), a
trio copy number state is assigned from the maximum a posteriori
estimate.

<<computeBayesFactor,results=hide>>=
## the threshold in terms of the number of median absolute deviations from zero
nMAD(params)
md_g <- MAP2(me, mdgr, params)
show(md_g)
@

\section{Inspecting, Filtering, and plotting de novo CNV inference}
\label{viz}

\subsection{Filtering}

There are several meta-data columns stored in the
\Rclass{GRanges}-derived summary useful for filtering the set of
genomic intervals. All the available parameters for filtering the
\Robject{fit} object are stored in the parameter class
\Rclass{FilterParam} of which an instance can be created by its
constructor of the same name without any arguments.

<<filter_param>>=
filter_param <- FilterParamMD()
show(filter_param)
@

To apply the default filter parameters to the \Robject{fit} object, we
use the \Rfunction{cnvFilter} function. The \Rfunction{cnvFilter}
function returns only the set of genomic ranges satisfying the
filters. For example,

<<cnvFilter>>=
cnvFilter(md_g, filter_param)
@

The helper functions \Rfunction{denovoHemizygous} and
\Rfunction{denovoHomozygous} create commonly used filters.

<<denovoFilters>>=
denovoHemizygous(md_g)
denovoHomozygous(md_g)
@

Equivalently, one could customize the filter parameteters to acheive
the same result:


<<cnvFilter_denovo>>=
select_cnv <- FilterParamMD(state=c("220", "221", "223"), seqnames="chr22")
cnvs <- cnvFilter(md_g, select_cnv)
cnvs
@

\subsection{Visualization}

This produces a somewhat unsatisfactory result in that there are 2
additional denovo hemizygous deletions nearby that likely comprise one
denovo segment.

<<denovo_hemizygous>>=
denovoHemizygous(md_g)
@

In the following code-chunk, we reduce the denovo hemizygous deletions
and update the posterior probabilities for the MAP estimates.  We use
a combination of lattice and grid to visualize the marker-level
summaries and copy number inference for the trio.

<<plotDenovo, fig=TRUE, width=9, height=6>>=
library(grid)
g2 <- reduce(denovoHemizygous(md_g), min.gapwidth=500e3)
post <- MAP2(me, g2, params)
g2 <- denovoHemizygous(post)
vps <- pedigreeViewports()
grid.params <-  HmmTrellisParam()
p <- plotDenovo(me, g2, grid.params)
pedigreeGrid(g=g2, vps=vps, figs=p)
@


\section{Acknowledgements}

Moiz Bootwalla contributed to early versions of this vignette.


\section{Session information}
<<sessionInfo, results=tex>>=
toLatex(sessionInfo())
@

\bibliography{MinimumDistance}{}
\bibliographystyle{plain}

\end{document}