%\VignetteIndexEntry{CopyNumber450k User's Guide}
%\VignetteDepends{minfiData}
%\VignetteDepends{CopyNumber450k}
%\VignetteDepends{CopyNumber450kData}
%\VignettePackage{CopyNumber450k}
\newcommand{\Rcode}[1]{{\texttt{#1}}}
\newcommand{\Rpackage}[1]{\textsf{#1}}
\newcommand{\software}[1]{\textsf{#1}}
\newcommand{\R}{\software{R}}

\documentclass[12pt]{article}

\usepackage[utf8]{inputenc}
\usepackage{authblk}

\title{The CopyNumber450k User's Guide:\\Calling CNVs from Illumina 450k
Methylation Arrays}
\author[1,3]{Simon Papillon-Cavanagh}
\author[2]{Jean-Philippe Fortin}
\author[3]{Nicolas De Jay}
\affil[1]{McGill University and Genome Quebec Innovation Centre, Montreal, QC, Canada}
\affil[2]{Department of Biostatistics, Johns Hopkins Bloomberg School of Public
Health, Baltimore, MD, USA}
\affil[3]{Department of Human Genetics, McGill University, Montreal, QC, Canada}
\SweaveOpts{highlight=TRUE, tidy=TRUE, keep.space=TRUE, keep.blank.space=FALSE, keep.comment=TRUE}

<<setup,echo=FALSE,results=hide,eval=TRUE>>=
options(keep.source=TRUE)
@

\begin{document}

\maketitle
\tableofcontents

\section{Introduction}

The \Rpackage{CopyNumber450k} package provides tools for calling copy number
variantions (CNVs) from Illumina 450k methylation arrays.
\Rpackage{CopyNumber450k} uses the sum of methylated and unmethylated alleles as
a proxy of DNA molecule abundance. Aberrant intensities is used as an indication
of putative CNVs. \Rpackage{CopyNumber450k} provides a set of user-friendly
methods to infer, plot and assess statistical significance of CNVs, as
described below.

\subsection{Loading libraries}
<<load>>=
library(CopyNumber450k)
library(CopyNumber450kData)
library(minfiData)
@

\section{Example Dataset}
The starting point of \Rpackage{CopyNumber450k} is an \Rcode{RGChannelSet}: an
object that must be created with the \Rpackage{minfi} package directly from the
raw IDAT files. To learn how to create such an object, please refer to the
detailed vignette of the \Rpackage{minfi} package. The \Rcode{RGChannelSet}
object is read into a \Rcode{CNV450kSet} object, which internally extracts the
needed data and annotation.
We strongly suggest that the reader consults \Rpackage{minfi}'s vignette on how
to load data and to use quality control plots to ensure that the data is of acceptable
quality \cite{minfi}.

\subsection{Data requirements}
In order to construct a \Rcode{CNV450kSet} object, the input sample set must
contain at least three samples whose associated \Rcode{Sample\_Group} phenotypic
data value is \Rcode{control}. A sample whose \Rcode{Sample\_Group} value is not
\Rcode{control} is considered to be a case sample and will be analyzed
and segmented.

In order to facillitate the use of this package, we provide 52 control samples
in the \Rpackage{CopyNumber450kData} R/Bioconductor package. We strongly suggest
that you use those controls if you do not have sufficient control samples in your data set.

<<loaddata>>=
# Load control data (n=52) from CopyNumber450kData
data(RGcontrolSetEx)

# Load example data (n=6) from minfiData
data(RGsetEx)
# In order to reduce example time, let's use only one sample
# and 30 controls (instead of 52). In real life situations, it is advised to
# use all the available controls.
RGsetEx <- RGsetEx[, 5]
RGcontrolSetEx <- RGcontrolSetEx[, sample(1:ncol(RGcontrolSetEx), 30)]
@

\subsection{Creating the object}
Most users will directly load their \Rcode{RGChannelSet} using the
\Rcode{CNV450kSet(RGset)} method. However, the \Rpackage{minfi} package
provides the \Rcode{combine} function, a helper method which merges two
\Rcode{RGChannelSet}s into a single object. This method may be used when using
the \Rpackage{CopyNumber450kData} control set in order to combine it with the
user's data set.

It is important to ensure that
samples have been assigned to relevant groups (i.e. control samples
have \Rcode{control} as \Rcode{Sample\_Group}).  Note that the
provided control samples in \Rpackage{CopyNumber450kData} are already properly
annotated.

Please note that in the following code example we will use only a subset of
probes to proceed. This is to be avoided in real life cases as it will
significantly reduce the resolution and will output unrealistic results by
artificially smoothing the segmentation.

<<createobject>>=
# Ensure that Sample_Group values are valid
head(pData(RGcontrolSetEx)$Sample_Group)
head(pData(RGsetEx)$Sample_Group)

# Combine both RGsets in a single RGset
RGset <- combine(RGcontrolSetEx, RGsetEx)

# Create the object
mcds <- CNV450kSet(RGset)

# In order to speed up example computation, we will randomly subset the
# probes used by CopyNumber450k. THIS SHOULD NEVER BE DONE AS IT SERVES
# ONLY FOR SPEEDING UP THE EXAMPLE.
mcds <- mcds[sample(1:nrow(mcds), 10000), ]
@ 

\subsection{SNP probes}
Single nucleotide polymorphisms (SNPs) affect probe hydridization affinity and
may therefore artificially skew intensity levels. Hence, it is useful to drop
the probes that contain or target SNPs. Although \Rpackage{minfi} offers a function
to drop SNP probes, it does so only for probes that directly target SNP CpGs. We
recommend that the \Rpackage{CopyNumber450k} \Rcode{dropSNPprobes} method be
used with our package. Since \Rcode{dropSNPprobes} drops probes that both target
and contain known SNPs, it leads to a more representative normalization and
segmentation output.

<<dropSNPprobes>>=
mcds <- dropSNPprobes(mcds, maf_threshold=0.01)
@

\section{Data Visualization and Quality Assessment}
\Rpackage{CopyNumber450k} offers quality assessment plots to identify potential
batch and normalization effects.

\subsection{Density Plot}
The density plot allows the user to observe differences in density
distributions between samples. Certain effects, such as array position on the
chip, can be clearly observed using the density plot (Figure
~\ref{fig:densityplot}). Additionnaly, we offer coloring features which can be
based on different sample phenotypic features. Please refer to the method
documentation for all possible groupings.

\begin{figure} 
\begin{center}
<<densityplot,fig=TRUE,results=hide>>=
plotDensity(mcds, main="Pre-normalization density plot",
        color.by="array.row")
@
\end{center} 
\caption{Global sample intensity by array position}
\label{fig:densityplot}
\end{figure}

\subsection{Principal Component Analysis (PCA) Plot}
A PCA plotting method, with the same coloring features as \Rcode{plotDensity} is
provided (Figure~\ref{fig:PCAplot}). This method may allow users to identify
confounding factors influencing data variability, such a chip bias or batch
effects.

\begin{figure} 
\begin{center}
<<PCAplot,fig=TRUE,results=hide>>=
plotPCA(mcds, main="Pre-normalization PCA plot",
        color.by="sample.group")
@
\end{center} 
\caption{PCA plot of sample intensities}
\label{fig:PCAplot}
\end{figure}

\section{Normalization}
Currently, \Rpackage{CopyNumber450k} offers two normalization procedures:
quantile normalization \cite{Bolstad:2003} and functional normalization
\cite{Fortin:2013}. Underlying quantile normalization is the assumption that the
distributions of signal intensities are similar across subjects\cite{Bolstad:2003}. Therefore, we
suggest that this method be used for datasets in which no global
differences in CNVs across samples are expected. 

\subsection{Quantile normalization}
The quantile normalization will match all the samples distribution and mostly
remove any batch effects \cite{Bolstad:2003}. However, in samples with gross
chromosomal aberrations, the signal may be lost. We suggest
that you use the quantile normalization in samples where you expect small
aberrations.

<<quannorm,results=hide>>=
mcds.q <- normalize(mcds, "quantile")
@
\subsection{Functional normalization}
In situations where large-scale differences are expected (e.g. in rare cases
of polyploidy), we recommend the functional normalization, a new method
developed specifically for the 450k array in which the similarity of
distributions between samples is not assumed \cite{Fortin:2013}. In this sense
functional normalization is more conservative than quantile normalization.

\begin{figure} 
\begin{center}
<<funnormDensity,fig=TRUE,results=hide>>=
mcds.f <- normalize(mcds, "functional")
plotDensity(mcds.f, main="Density plot of functional normalized data")
@
\end{center} 
\caption{Density plot of sample intensities after functional normalization}
\label{fig:funnormDensity}
\end{figure}

Compared to the previous \Rcode{densityPlot} (Figure~\ref{fig:densityplot}),
post-normalization density distributions are much more similar to one another and the
array position bias is removed (Figure~\ref{fig:funnormDensity}).

\section{Segmentation}
After normalization, it is necessary to bin probes together in order to identify
possibly amplified or deleted genomic segments. We use a binary circular
segmentation algorithm provided by the \Rpackage{DNAcopy} package \cite{dnacopy}.
After segmentation, each segment is further processed by comparing its median
intensity to a distribution inferred from the control samples. This allows us to
assess whether the segment intensity value is attributable to random variability
or if it is significantly different from what would be expected by chance.
Finally, each segment is annotated with the genes it contains.

For simplicity, let us assume that the assayed sample does not contain gross
aberrations and proceed with the object that has been normalized with the
quantile normalization (\Rcode{mcds.q}).

<<segmentation>>=
mcds.q <- segmentize(mcds.q)
@

\section{Results}
Provided that the segmentation has been performed, calling \Rcode{getSegments}
on the object returns a list containing the genomic segments in which each row
represents a genomic segment found in a sample.

\subsection{Text format}
The segments can be saved in a \Rcode{CSV}
format, by using the common \Rcode{write.csv} function that was redefined to
\Rcode{CNV450kSet} objects.

<<writecsv,eval=FALSE>>=
write.csv(mcds.q, file="segments.csv")
@


\subsection{A plot of the sample's genome}
In order to obtain a visual representation of the sample's genome,
\Rpackage{CopyNumber450k} offers the \Rcode{plotSample} function, which takes
the segmentized \Rcode{CNV450kSet} object and the index of the sample to plot as
input. In this example, since we randomly
subsetted the probes, the resolution will be too
low to find any interesting results. (Figure~\ref{fig:plotSample}).

\begin{figure} 
\begin{center}
<<plotSample,fig=TRUE,results=hide>>=
plotSample(mcds.q, 1, main="Genomic view of Sample 1")
@
\end{center} 
\caption{Genomic view of a sample}
\label{fig:plotSample}
\end{figure}

The user may specify a region of interest and plot only this
region to observe chromosomal breakpoints, as illustrated by the focal
amplification on chr1 in the assayed sample (Figure~\ref{fig:plotSamplefocal}).

\begin{figure} 
\begin{center}
<<plotSampleFocal,fig=TRUE,results=hide>>= 
plotSample(mcds.q, 1, chr="chr1",
        main="Zoom of chr1", ylim=c(-.25,.25))
@
\end{center} 
\label{fig:plotSamplefocal}
\end{figure}

\section{SessionInfo}
<<sessionInfo,results=tex>>=
toLatex(sessionInfo())
@
\bibliographystyle{plain}
\bibliography{biblio}
\end{document}