%\VignetteIndexEntry{VanillaICE Vignette}
%\VignetteKeywords{copy number, genotype, SNP}
%\VignettePackage{VanillaICE}
\documentclass{article}
\usepackage{amsmath}
\usepackage{graphicx}
\usepackage{natbib}
\usepackage{color}

\newcommand{\scscst}{\scriptscriptstyle}
\newcommand{\scst}{\scriptstyle}
\newcommand{\Rpackage}[1]{\textit{#1}}
\newcommand{\Rfunction}[1]{\texttt{#1}}
\newcommand{\Robject}[1]{\texttt{#1}}
\newcommand{\R}{\textsf{R}}
\newcommand{\hmmoptions}{\Robject{HmmOptions}}
\newcommand{\hmmparam}{\Robject{HmmParameter}}

\newcommand{\cne}{\widehat{\text{CN}}}
\newcommand{\gte}{\widehat{\text{GT}}}
\newcommand{\gtehom}{\widehat{\text{HOM}}}
\newcommand{\gtehet}{\widehat{\text{HET}}}
\newcommand{\pgte}{\text{S}_{\widehat{\text{\tiny GT}}}}
\newcommand{\pcne}{\text{S}_{\widehat{\text{\tiny CN}}}}
\newcommand{\pgtehom}{\text{S}_{\widehat{\text{\tiny HOM}}}}
\newcommand{\pgtehet}{\text{S}_{\widehat{\text{\tiny HET}}}}
\newcommand{\thom}{\text{HOM}}
\newcommand{\thet}{\text{HET}}
\newcommand{\bDelta}{\mbox{\boldmath $\Delta$}}
\newcommand{\real}{\mbox{$\mathbb R$}}      % real numbers
\newcommand{\bnu}{\mbox{\boldmath $\nu$}}
\newcommand{\ice}{\Rpackage{VanillaICE}}

\textwidth=6.2in
\textheight=8.5in
\oddsidemargin=.1in
\evensidemargin=.1in
\headheight=-.3in

\begin{document}
\title{\ice{}: Hidden Markov Models for the Assessment of Chromosomal
  Alterations using High-throughput SNP Arrays}
\author{Robert Scharpf}
\maketitle


<<setup, echo=FALSE>>=
options(width=80)
@ 

\begin{abstract}
  Chromosomal DNA is characterized by variation between individuals at
  the level of entire chromosomes (e.g. aneuploidy in which the
  chromosome copy number is altered), segmental changes (including
  insertions, deletions, inversions, and translocations), and changes
  to small genomic regions (including single nucleotide
  polymorphisms). A variety of alterations that occur in chromosomal
  DNA, many of which can be detected using high density single
  nucleotide polymorphism (SNP) microarrays, are linked to normal
  variation as well as disease and therefore of particular
  interest. These include changes in copy number (deletions and
  duplications) and genotype (e.g. the occurrence of regions of
  homozygosity).  Hidden Markov models (HMM) are particularly useful
  for detecting such abnormalities, modeling the spatial dependence
  between neighboring SNPs.  Here, we extend previous approaches that
  utilize HMM frameworks for inference in high throughput SNP arrays
  by integrating copy number, genotype calls, and the corresponding
  measures of uncertainty when available.  Using simulated and
  experimental data, we demonstrate how confidence scores control
  smoothing in a probabilistic framework.  
  
  % The goal of this vignette
  % is to provide a simple interface for fitting HMMs and plotting
  % functions to help visualize the predicted states alongside the
  % experimental data.
\end{abstract}

\section{Overview}

This vignette describes how to fit a hidden Markov model to locus-level
estimates of genotype or copy number. This vignette requires that you
have

\begin{itemize}
  
\item an absolute estimate of the \emph{total} copy number organized
  such that rows correspond to loci and columns correspond to samples
  
  and / or
    
\item a matrix of genotype calls  (1=AA, 2 = AB, 3= BB): rows correspond
  to loci and columns correspond to samples
    
\end{itemize}
  
\noindent Additional options can improve the HMM predictions include

\begin{itemize}
    
  \item a CRLMM / oligo confidence score of the genotype call
    
  \item standard errors of the copy number estimates
    
  \end{itemize}

  \noindent If using the \R{} package \Rpackage{crlmm}, see the the
  vignette copynumber.Rnw for locus-level estimation of copy number and
  suggestions for fitting a hidden Markov model to allele-specific
  estimates of copy number.  Note that several HMM implementations are
  now available for the joint analysis of copy number and genotype,
  including QuantiSNP \citep{Colella2007} and PennCNV \citep{Wang2007a}.

\paragraph{Citing this software.} 
% \bibitem{Scharpf2008}
Robert~B Scharpf, Giovanni Parmigiani, Jonathan Pevsner, and Ingo Ruczinski.
\newblock Hidden {M}arkov models for the assessment of chromosomal alterations
  using high-throughput {SNP} arrays.
\newblock {\em Annals of Applied Statistics}, 2(2):687--713, 2008.

\section{Simple Usage}
\label{sec:simpleUsage}


* Before beginning, verify that it is reasonable to assume integer copy
number by plotting the locus-level estimates as a function of the
physical position.


Assuming that a integer copy number hidden state is reasonable, we
create an object of class \Robject{oligoSnpSet} from the
\emph{simulated} data available provided in this package:

<<createLocusSet>>=
library(VanillaICE)
library(Biobase)
library(oligoClasses)
data(locusLevelData)
copynumber <- locusLevelData[["copynumber"]]/100
chromosome <- locusLevelData[["annotation"]][, "chromosome"]
position <- locusLevelData[["annotation"]][, "position"]
names(position) <- names(chromosome) <- rownames(locusLevelData[["annotation"]])
locusAnnotation <- data.frame(list(chromosome=chromosome, position=position), row.names=names(chromosome))
featuredata <- new("AnnotatedDataFrame", data=locusAnnotation, varMetadata=data.frame(labelDescription=colnames(locusAnnotation)))
locusset <- new("oligoSnpSet",
		copyNumber=copynumber,
		calls=locusLevelData[["genotypes"]],
		callsConfidence=locusLevelData[["crlmmConfidence"]],
		phenoData=annotatedDataFrameFrom(copynumber, byrow=FALSE),		
		featureData=featuredata,
		annotation=locusLevelData[["platform"]])
stopifnot(all(c("chromosome", "position") %in% varLabels(featuredata)))
locusset <- locusset[order(chromosome(locusset), position(locusset)), ]
@ 

The following components are required to fit the HMM:
\begin{enumerate}
\item initial state probabilities
\item emission probabilities
\item transition probabilities
\end{enumerate}

Several of the functions in the VanillaICE package are wrappers to
facilitate the specification of these arguments. In the following code
chunk, we assume the hidden states are hemizygous deletion (hemDel; copy
number = 1, probability of a homozygous genotype call is 0.999), normal
(copy number = 2, probability of a homozygous genotype calls is 0.7),
regions of homozygosity (ROH: copy number = 1, probability of a
homozygous genotype call is 0.999), and amplification (copy number
greater than 2).
  
<<joint.states>>=
joint.states <- c("hemDel", "normal", "ROH", "amplification")
copynumber.states <- log2(c(1, 2, 2, 3))
prob.genotype.is.homozygous <- c(0.999, 0.7, 0.999, 0.7)
names(prob.genotype.is.homozygous) <- joint.states
initialP <- (rep(1, length(joint.states)))/length(joint.states)
tau <- transitionProbability(chromosome=chromosome,
			     position=position,
			     TAUP=1e8)
log.sds <- VanillaICE:::robustSds(copyNumber(locusset))
@ 


Conditional on the underlying hidden state, we assume that the copy
number is independent of the genotype and we calculate the emission
probabilities of each separately.
<<emissionProbabilities>>=
emission.cn <- copynumberEmission(copynumber=log2(copyNumber(locusset)),
				  states=joint.states,
				  mu=copynumber.states,
				  sds=log.sds,
				  takeLog=FALSE,
				  verbose=TRUE)
emission.cn[1:5, , ]
emission.gt <- genotypeEmission(genotypes=calls(locusset),
				states=joint.states,
				probHomCall=prob.genotype.is.homozygous)
@ 

As the emission probabilities returned by the above functions are on the
log scale, we simply add the emission probabilties to obtain the joint
emission probabilities:
<<jointEmission>>=
emission.joint <- emission.gt + emission.cn
@ 

The sequence of states that maximizes the likelihood is obtained from
the \Robject{viterbi} algorithm:
<<fit1>>=
fit1 <- viterbi(initialStateProbs=log(initialP),
		emission=emission.joint,
		tau=tau[, "transitionPr"],
		arm=tau[, "arm"], 
		normalIndex=2)
(brks <- breaks(x=fit1, states=joint.states, position=tau[, "position"],
		chromosome=tau[, "chromosome"]))
rohBoundary <- as.integer(as.matrix(brks[brks$state == "ROH", ][c("start", "end")]))
@ 

A plot of the locus-level data with predicted states overlaid:

<<plotPredictions, fig=TRUE>>=
require(SNPchip)
gp <- plot(locusset[chromosome(locusset) == 1, ])
cns <- fit1
cns[cns == 3] <- 2
##show copy-neutral LOH by vertical lines
gp$abline.v <- TRUE ##plots vertical dashed lines
gp$col.predict[3] <- "white"
gp$hmm.ycoords <- c(0.7,0.9)
show(gp)
lines(position(locusset)[chromosome(locusset)==1], cns[chromosome(locusset) == 1, ])
abline(v=rohBoundary, col="royalblue", lty=2)
legend("topright", lty=c(2, 1), col=c("royalblue", "black"), legend=c("ROH boundary",
							     "copy number"), bty="n")
@ 

\section{Additional options}

\subsection{CRLMM confidence scores for the genotypes}

The HMM in Section \ref{sec:simpleUsage} ignores the CRLMM confidence
estimates that are available for the genotype calls. The following
platforms are currently supported:

<<supportedPlatforms>>=
path <- system.file("extdata", package="VanillaICE")
supportedPlatforms <- list.files(path)[grep("Conf", list.files(path))]
as.character(sapply(supportedPlatforms, function(x) strsplit(x, "Conf")[[1]][[1]]))
@ 

The emission probabilities for the genotypes are computed by calling the
\Robject{genotypeEmissionCrlmm} function.  In the simulated dataset used
for this vignette, two heterozygous genotype calls were located in the
middle of the ROH region.  

<<hetIndices>>=
hetIndices <- which(calls(locusset) == 2 & fit1 == 3)
confs <- round(1-exp(-callsConfidence(locusset)[hetIndices]/1000), 5)
@ 

\noindent The confidence scores for these two SNPs are \Sexpr{confs[1]}
and \Sexpr{confs[2]}.  Because these two heterozygous SNPs were
genotyped with high confidence, the HMM will be less likely to call the
region homozygous as evidenced by the higher emission probability for
the normal state at the above loci:

<<emissionCompare>>=
emission.crlmm <- genotypeEmissionCrlmm(genotypes=calls(locusset),
					conf=callsConfidence(locusset),
					annotation=annotation(locusset),
					pHetCalledHom=0.001,
					pHetCalledHet=0.995,
					pHomInNormal=0.8,
					pHomInRoh=0.999)
emission.crlmm[hetIndices, , ]
@ 

We recompute the HMM predictions as follows:

<<updateEmission>>=
joint.emission2 <- emission.cn
joint.emission2[, , c("normal", "amplification")] <- joint.emission2[, , c("normal", "amplification")] + emission.crlmm[, , "norm"]
joint.emission2[, , c("hemDel", "ROH")] <- joint.emission2[, , c("hemDel", "ROH")] + emission.crlmm[, ,"ROH"]
fit2 <- viterbi(initialStateProbs=log(initialP),
		emission=joint.emission2,
		tau=tau[, "transitionPr"],
		arm=tau[, "arm"], 
		normalIndex=2)
table(fit2)
breaks(x=fit2, states=joint.states, position=tau[, "position"],
       chromosome=tau[, "chromosome"])
@ 

\noindent Note that after incorporating the confidence scores, the
previously called ROH region is now called \emph{normal}.


The following two sections describe how to fit the HMM to copy
number-only or diallelic genotype-only data.

\subsection{Copy number only}

<<copyNumberOnly>>=
states <- 0:5
mus <- log2(c(0.05, 1:5))
emission.cn <- copynumberEmission(copynumber=log2(copyNumber(locusset)),
				  states=states,
				  mu=mus,
				  sds=log.sds,
				  takeLog=FALSE,
				  verbose=TRUE)
initialP <- log(rep(1/length(states), length(states)))
fit3 <- viterbi(initialStateProbs=initialP,
		emission=emission.cn,
		tau=tau[, "transitionPr"],
		arm=tau[, "arm"], 
		normalIndex=3)
breaks(x=fit3, states=states, position=tau[, "position"],
       chromosome=tau[, "chromosome"])
@ 

For homozygous deletion, I just specified a small number. An empirical
estimate of 'zero copy number' could be obtained from chromosome Y for
females.


\subsection{Genotypes only}

Note the hemizygous deletions are called 'ROH' when copy number is
ignored:
<<genotypesOnly>>=
states <- c("normal", "ROH")
prob.genotype.is.homozygous <- c(0.7, 0.999)
##without confidence scores
emission.gt <- genotypeEmission(genotypes=calls(locusset),
				states=states,
				probHomCall=prob.genotype.is.homozygous)
initialP <- log(rep(1/length(states), length(states)))
fit4 <- viterbi(initialStateProbs=initialP,
		emission=emission.gt,
		tau=tau[, "transitionPr"],
		arm=tau[, "arm"], 
		normalIndex=1)
breaks(x=fit4, states=states, position=tau[, "position"],
       chromosome=tau[, "chromosome"])
@ 


\subsection{Modeling assumptions}


The emission probabilities are calculated under the assumption that the
estimates, or a suitable transformation, are approximately
Gaussian. Hence, in the simulated data we check that the middle 90\% is
approximately Gaussian.

<<qq, fig=TRUE>>=
par(pty="s", las=1)
qqnorm(log2(copyNumber(locusset)), cex=0.6)
qqline(log2(copyNumber(locusset)))
abline(v=c(-1.645, 1.645))
@ 

\subsection{Copy number outliers}

When true uncertainty estimates for the copy number are not available,
copy number outliers are more likely to result in extreme emission
probabilities than can influence the HMM segmentation. There are several
approaches to reducing the influence of outliers on the HMM
segmentation: (i) improved uncertainty estimates (preferred-- see the
\Rpackage{crlmm} package), (ii) increase \Robject{TAUP} for the
transition probabilities, (iii) threshold the emission probabilities,
and (iv) threshold extreme values for the copy number estimates.  For
example of (iii), one could do

<<epsilon, eval=FALSE>>=
emission.cn[emission.cn[, , "normal"] < -10, , "normal"] <- -10
@ 

\Robject{copynumberEmission} estimates the scale parameter for the
Normal distribution from the supplied data using the median absolute
deviation (MAD). However, different standard deviations can be supplied
by the user with the argument \Robject{sds}. The supplied \Robject{sds}
must be of the same dimension as the copy number matrix. The following
discussion elaborates briefly on the default procedure used to estimate
the standard deviations.

In the example dataset, we have only one sample and no estimates of the
copy number uncertainty.  Therfore, we obtain a robust estimate of the
copy number standard deviation across SNPs and use this as a rough
estimate of the uncertainty.  As the log-transformed copy number
estimates are more nearly Gaussian, we calculate a robust estimate of
the standard deviation using the median absolute deviation (MAD) on the
log scale:

<<robustSds>>=
cn.sds <- VanillaICE:::robustSds(copyNumber(locusset), takeLog=TRUE)
dim(cn.sds)
@ 

When multiple samples are available (e.g., 10 or more), SNP-specific
estimates of the copy number uncertainty can be obtained by scaling an
estimate of the variability across samples by a sample-specific estimate
of noise.  In the following code chunk, we simulate a matrix of copy
number for 200 samples and then compute a robust SNP-specific estimate
of the standard deviation.

<<multipleSamples, eval=TRUE>>=
CT <- matrix(sample(copyNumber(locusset), 100*200, replace=TRUE), 100, 200)
sds <- VanillaICE:::robustSds(CT, takeLog=TRUE)
@ 

The \Robject{robustSds} function returns a SNP-specific matrix of
standard deviations provided that the copy number matrix has at least 3
samples.  The {\it preferred} approach when the sample size is small
(say, less than 10), is to develop SNP-specific estimates of the
variance on a larger reference set, such as HapMap, using the same
software, and then scale these estimates by a measure of the sample
variance.

\subsection{Missing genotype calls}

If any of the genotype calls are missing and missingness is not
independent of the underlying hidden state, one may specify the
probability of a missing genotype calls for each hidden state
(\texttt{probMissing}).  By default, the HMM will assume that missing
genotype calls are independent of the underlying hidden state.  In
particular, this assumption may not be reasonable for homozygous
deletions.  Again, the emph{preferred} approach is to use the confidence
scores provided by crlmm and the function
\Robject{genotypeEmissionCrlmm}.

\subsection{Transition probabilities}

The sequence of states that maximizes the likelihood is obtained by the
Viterbi algorithm.  Note that the argument \Robject{arm} to this
function is a factor indicating the chromosomal arms -- the Viterbi
algorithm is computed for independently for each chromosomal arm.  We
may scale the probability of transitioning between states by setting the
arguments \Robject{normal2altered}, \Robject{altered2normal}, and
\Robject{altered2altered}.  For example, to facilitate comparisons to
the Birdseye HMM \citep{Korn2008} one may pass the following arguments
to \Robject{viterbi}:

<<fit, eval=FALSE>>=
fit <- viterbi(initialStateProbs=log(initialP),
	       emission=emission,
	       tau=tau[, "transitionPr"],
	       arm=tau[, "arm"], 
	       normalIndex=2,
	       normal2altered=0.005,
	       altered2normal=0.5,
	       altered2altered=0.0025)
@ 

%\subsection{\emph{Smoothness}}

The \Robject{TAUP} argument to the function
\Robject{transitionProbability} together with the above arguments to the
\Robject{viterbi} function can be used to control the 'smoothness' of
the resulting predictions.  In particular, higher values of
\Robject{TAUP} encourages fewer jumps.

%\section{Appendix}
%
%\subsection{Confidence scores for genotype calls.}
%
%The \Rfunction{CRLMM} (in the R package \Rpackage{oligo}) provides
%confidence scores ($\pgte$) of the genotype estimates ($\gte$).  From
%269 HapMap samples assayed on the Affymetrix 50k Xba and Hind chips, we
%have a gold standard of the true genotype defined by the consensus of
%the HapMap centers.  We use kernal-based density estimates to obtain
%
%{\scriptsize
%\begin{eqnarray}
%\label{ingo2}
%f\left\{\ \pgtehom\ |\ \gtehom,\thom\ \right\},\
%f\left\{\ \pgtehom\ |\ \gtehom,\thet\ \right\},\
%f\left\{\ \pgtehet\ |\ \gtehet,\thom\ \right\},\ \mbox{~ and~}
%f\left\{\ \pgtehet\ |\ \gtehet,\thet\ \right\}
%\end{eqnarray}
%}
%
%\noindent separately for the Xba and Hind 50k chips. The first term in
%(\ref{ingo2}), for example, denotes the density of the scores when the
%genotype is correctly called homozygous ($\gtehom$) and the true
%genotype is homozygous ($\thom$). See \cite{Scharpf2008} for a more
%complete description of the methods. 

\section{Session Information}

The version number of R and packages loaded for generating the vignette
were:

<<echo=FALSE, results=tex>>=
toLatex(sessionInfo())
@  

\bibliography{ice}{}
\bibliographystyle{plain}


\end{document}