%\VignetteIndexEntry{VanillaICE Vignette} %\VignetteKeywords{copy number, genotype, SNP} %\VignettePackage{VanillaICE} \documentclass{article} \usepackage{amsmath} \usepackage{graphicx} \usepackage{natbib} \usepackage{color} \newcommand{\scscst}{\scriptscriptstyle} \newcommand{\scst}{\scriptstyle} \newcommand{\Rpackage}[1]{\textit{#1}} \newcommand{\Rfunction}[1]{\texttt{#1}} \newcommand{\Robject}[1]{\texttt{#1}} \newcommand{\R}{\textsf{R}} \newcommand{\hmmoptions}{\Robject{HmmOptions}} \newcommand{\hmmparam}{\Robject{HmmParameter}} \newcommand{\cne}{\widehat{\text{CN}}} \newcommand{\gte}{\widehat{\text{GT}}} \newcommand{\gtehom}{\widehat{\text{HOM}}} \newcommand{\gtehet}{\widehat{\text{HET}}} \newcommand{\pgte}{\text{S}_{\widehat{\text{\tiny GT}}}} \newcommand{\pcne}{\text{S}_{\widehat{\text{\tiny CN}}}} \newcommand{\pgtehom}{\text{S}_{\widehat{\text{\tiny HOM}}}} \newcommand{\pgtehet}{\text{S}_{\widehat{\text{\tiny HET}}}} \newcommand{\thom}{\text{HOM}} \newcommand{\thet}{\text{HET}} \newcommand{\bDelta}{\mbox{\boldmath $\Delta$}} \newcommand{\real}{\mbox{$\mathbb R$}} % real numbers \newcommand{\bnu}{\mbox{\boldmath $\nu$}} \newcommand{\ice}{\Rpackage{VanillaICE}} \textwidth=6.2in \textheight=8.5in \oddsidemargin=.1in \evensidemargin=.1in \headheight=-.3in \begin{document} \title{\ice{}: Hidden Markov Models for the Assessment of Chromosomal Alterations using High-throughput SNP Arrays} \author{Robert Scharpf} \maketitle <>= options(width=80) @ \begin{abstract} Chromosomal DNA is characterized by variation between individuals at the level of entire chromosomes (e.g. aneuploidy in which the chromosome copy number is altered), segmental changes (including insertions, deletions, inversions, and translocations), and changes to small genomic regions (including single nucleotide polymorphisms). A variety of alterations that occur in chromosomal DNA, many of which can be detected using high density single nucleotide polymorphism (SNP) microarrays, are linked to normal variation as well as disease and therefore of particular interest. These include changes in copy number (deletions and duplications) and genotype (e.g. the occurrence of regions of homozygosity). Hidden Markov models (HMM) are particularly useful for detecting such abnormalities, modeling the spatial dependence between neighboring SNPs. Here, we extend previous approaches that utilize HMM frameworks for inference in high throughput SNP arrays by integrating copy number, genotype calls, and the corresponding measures of uncertainty when available. Using simulated and experimental data, we demonstrate how confidence scores control smoothing in a probabilistic framework. % The goal of this vignette % is to provide a simple interface for fitting HMMs and plotting % functions to help visualize the predicted states alongside the % experimental data. \end{abstract} \section{Overview} This vignette describes how to fit a hidden Markov model to locus-level estimates of genotype or copy number. This vignette requires that you have \begin{itemize} \item an absolute estimate of the \emph{total} copy number organized such that rows correspond to loci and columns correspond to samples and / or \item a matrix of genotype calls (1=AA, 2 = AB, 3= BB): rows correspond to loci and columns correspond to samples \end{itemize} \noindent Additional options can improve the HMM predictions include \begin{itemize} \item a CRLMM / oligo confidence score of the genotype call \item standard errors of the copy number estimates \end{itemize} \noindent If using the \R{} package \Rpackage{crlmm}, see the the vignette copynumber.Rnw for locus-level estimation of copy number and suggestions for fitting a hidden Markov model to allele-specific estimates of copy number. Note that several HMM implementations are now available for the joint analysis of copy number and genotype, including QuantiSNP \citep{Colella2007} and PennCNV \citep{Wang2007a}. \paragraph{Citing this software.} % \bibitem{Scharpf2008} Robert~B Scharpf, Giovanni Parmigiani, Jonathan Pevsner, and Ingo Ruczinski. \newblock Hidden {M}arkov models for the assessment of chromosomal alterations using high-throughput {SNP} arrays. \newblock {\em Annals of Applied Statistics}, 2(2):687--713, 2008. \section{Simple Usage} \label{sec:simpleUsage} * Before beginning, verify that it is reasonable to assume integer copy number by plotting the locus-level estimates as a function of the physical position. Assuming that a integer copy number hidden state is reasonable, we create an object of class \Robject{oligoSnpSet} from the \emph{simulated} data available provided in this package: <>= library(VanillaICE) library(Biobase) library(oligoClasses) data(locusLevelData) copynumber <- locusLevelData[["copynumber"]]/100 chromosome <- locusLevelData[["annotation"]][, "chromosome"] position <- locusLevelData[["annotation"]][, "position"] names(position) <- names(chromosome) <- rownames(locusLevelData[["annotation"]]) locusAnnotation <- data.frame(list(chromosome=chromosome, position=position), row.names=names(chromosome)) featuredata <- new("AnnotatedDataFrame", data=locusAnnotation, varMetadata=data.frame(labelDescription=colnames(locusAnnotation))) locusset <- new("oligoSnpSet", copyNumber=copynumber, calls=locusLevelData[["genotypes"]], callsConfidence=locusLevelData[["crlmmConfidence"]], phenoData=annotatedDataFrameFrom(copynumber, byrow=FALSE), featureData=featuredata, annotation=locusLevelData[["platform"]]) stopifnot(all(c("chromosome", "position") %in% varLabels(featuredata))) locusset <- locusset[order(chromosome(locusset), position(locusset)), ] @ The following components are required to fit the HMM: \begin{enumerate} \item initial state probabilities \item emission probabilities \item transition probabilities \end{enumerate} Several of the functions in the VanillaICE package are wrappers to facilitate the specification of these arguments. In the following code chunk, we assume the hidden states are hemizygous deletion (hemDel; copy number = 1, probability of a homozygous genotype call is 0.999), normal (copy number = 2, probability of a homozygous genotype calls is 0.7), regions of homozygosity (ROH: copy number = 1, probability of a homozygous genotype call is 0.999), and amplification (copy number greater than 2). <>= joint.states <- c("hemDel", "normal", "ROH", "amplification") copynumber.states <- log2(c(1, 2, 2, 3)) prob.genotype.is.homozygous <- c(0.999, 0.7, 0.999, 0.7) names(prob.genotype.is.homozygous) <- joint.states initialP <- (rep(1, length(joint.states)))/length(joint.states) tau <- transitionProbability(chromosome=chromosome, position=position, TAUP=1e8) log.sds <- VanillaICE:::robustSds(copyNumber(locusset)) @ Conditional on the underlying hidden state, we assume that the copy number is independent of the genotype and we calculate the emission probabilities of each separately. <>= emission.cn <- copynumberEmission(copynumber=log2(copyNumber(locusset)), states=joint.states, mu=copynumber.states, sds=log.sds, takeLog=FALSE, verbose=TRUE) emission.cn[1:5, , ] emission.gt <- genotypeEmission(genotypes=calls(locusset), states=joint.states, probHomCall=prob.genotype.is.homozygous) @ As the emission probabilities returned by the above functions are on the log scale, we simply add the emission probabilties to obtain the joint emission probabilities: <>= emission.joint <- emission.gt + emission.cn @ The sequence of states that maximizes the likelihood is obtained from the \Robject{viterbi} algorithm: <>= fit1 <- viterbi(initialStateProbs=log(initialP), emission=emission.joint, tau=tau[, "transitionPr"], arm=tau[, "arm"], normalIndex=2) (brks <- breaks(x=fit1, states=joint.states, position=tau[, "position"], chromosome=tau[, "chromosome"])) rohBoundary <- as.integer(as.matrix(brks[brks$state == "ROH", ][c("start", "end")])) @ A plot of the locus-level data with predicted states overlaid: <>= require(SNPchip) gp <- plot(locusset[chromosome(locusset) == 1, ]) cns <- fit1 cns[cns == 3] <- 2 ##show copy-neutral LOH by vertical lines gp$abline.v <- TRUE ##plots vertical dashed lines gp$col.predict[3] <- "white" gp$hmm.ycoords <- c(0.7,0.9) show(gp) lines(position(locusset)[chromosome(locusset)==1], cns[chromosome(locusset) == 1, ]) abline(v=rohBoundary, col="royalblue", lty=2) legend("topright", lty=c(2, 1), col=c("royalblue", "black"), legend=c("ROH boundary", "copy number"), bty="n") @ \section{Additional options} \subsection{CRLMM confidence scores for the genotypes} The HMM in Section \ref{sec:simpleUsage} ignores the CRLMM confidence estimates that are available for the genotype calls. The following platforms are currently supported: <>= path <- system.file("extdata", package="VanillaICE") supportedPlatforms <- list.files(path)[grep("Conf", list.files(path))] as.character(sapply(supportedPlatforms, function(x) strsplit(x, "Conf")[[1]][[1]])) @ The emission probabilities for the genotypes are computed by calling the \Robject{genotypeEmissionCrlmm} function. In the simulated dataset used for this vignette, two heterozygous genotype calls were located in the middle of the ROH region. <>= hetIndices <- which(calls(locusset) == 2 & fit1 == 3) confs <- round(1-exp(-callsConfidence(locusset)[hetIndices]/1000), 5) @ \noindent The confidence scores for these two SNPs are \Sexpr{confs[1]} and \Sexpr{confs[2]}. Because these two heterozygous SNPs were genotyped with high confidence, the HMM will be less likely to call the region homozygous as evidenced by the higher emission probability for the normal state at the above loci: <>= emission.crlmm <- genotypeEmissionCrlmm(genotypes=calls(locusset), conf=callsConfidence(locusset), annotation=annotation(locusset), pHetCalledHom=0.001, pHetCalledHet=0.995, pHomInNormal=0.8, pHomInRoh=0.999) emission.crlmm[hetIndices, , ] @ We recompute the HMM predictions as follows: <>= joint.emission2 <- emission.cn joint.emission2[, , c("normal", "amplification")] <- joint.emission2[, , c("normal", "amplification")] + emission.crlmm[, , "norm"] joint.emission2[, , c("hemDel", "ROH")] <- joint.emission2[, , c("hemDel", "ROH")] + emission.crlmm[, ,"ROH"] fit2 <- viterbi(initialStateProbs=log(initialP), emission=joint.emission2, tau=tau[, "transitionPr"], arm=tau[, "arm"], normalIndex=2) table(fit2) breaks(x=fit2, states=joint.states, position=tau[, "position"], chromosome=tau[, "chromosome"]) @ \noindent Note that after incorporating the confidence scores, the previously called ROH region is now called \emph{normal}. The following two sections describe how to fit the HMM to copy number-only or diallelic genotype-only data. \subsection{Copy number only} <>= states <- 0:5 mus <- log2(c(0.05, 1:5)) emission.cn <- copynumberEmission(copynumber=log2(copyNumber(locusset)), states=states, mu=mus, sds=log.sds, takeLog=FALSE, verbose=TRUE) initialP <- log(rep(1/length(states), length(states))) fit3 <- viterbi(initialStateProbs=initialP, emission=emission.cn, tau=tau[, "transitionPr"], arm=tau[, "arm"], normalIndex=3) breaks(x=fit3, states=states, position=tau[, "position"], chromosome=tau[, "chromosome"]) @ For homozygous deletion, I just specified a small number. An empirical estimate of 'zero copy number' could be obtained from chromosome Y for females. \subsection{Genotypes only} Note the hemizygous deletions are called 'ROH' when copy number is ignored: <>= states <- c("normal", "ROH") prob.genotype.is.homozygous <- c(0.7, 0.999) ##without confidence scores emission.gt <- genotypeEmission(genotypes=calls(locusset), states=states, probHomCall=prob.genotype.is.homozygous) initialP <- log(rep(1/length(states), length(states))) fit4 <- viterbi(initialStateProbs=initialP, emission=emission.gt, tau=tau[, "transitionPr"], arm=tau[, "arm"], normalIndex=1) breaks(x=fit4, states=states, position=tau[, "position"], chromosome=tau[, "chromosome"]) @ \subsection{Modeling assumptions} The emission probabilities are calculated under the assumption that the estimates, or a suitable transformation, are approximately Gaussian. Hence, in the simulated data we check that the middle 90\% is approximately Gaussian. <>= par(pty="s", las=1) qqnorm(log2(copyNumber(locusset)), cex=0.6) qqline(log2(copyNumber(locusset))) abline(v=c(-1.645, 1.645)) @ \subsection{Copy number outliers} When true uncertainty estimates for the copy number are not available, copy number outliers are more likely to result in extreme emission probabilities than can influence the HMM segmentation. There are several approaches to reducing the influence of outliers on the HMM segmentation: (i) improved uncertainty estimates (preferred-- see the \Rpackage{crlmm} package), (ii) increase \Robject{TAUP} for the transition probabilities, (iii) threshold the emission probabilities, and (iv) threshold extreme values for the copy number estimates. For example of (iii), one could do <>= emission.cn[emission.cn[, , "normal"] < -10, , "normal"] <- -10 @ \Robject{copynumberEmission} estimates the scale parameter for the Normal distribution from the supplied data using the median absolute deviation (MAD). However, different standard deviations can be supplied by the user with the argument \Robject{sds}. The supplied \Robject{sds} must be of the same dimension as the copy number matrix. The following discussion elaborates briefly on the default procedure used to estimate the standard deviations. In the example dataset, we have only one sample and no estimates of the copy number uncertainty. Therfore, we obtain a robust estimate of the copy number standard deviation across SNPs and use this as a rough estimate of the uncertainty. As the log-transformed copy number estimates are more nearly Gaussian, we calculate a robust estimate of the standard deviation using the median absolute deviation (MAD) on the log scale: <>= cn.sds <- VanillaICE:::robustSds(copyNumber(locusset), takeLog=TRUE) dim(cn.sds) @ When multiple samples are available (e.g., 10 or more), SNP-specific estimates of the copy number uncertainty can be obtained by scaling an estimate of the variability across samples by a sample-specific estimate of noise. In the following code chunk, we simulate a matrix of copy number for 200 samples and then compute a robust SNP-specific estimate of the standard deviation. <>= CT <- matrix(sample(copyNumber(locusset), 100*200, replace=TRUE), 100, 200) sds <- VanillaICE:::robustSds(CT, takeLog=TRUE) @ The \Robject{robustSds} function returns a SNP-specific matrix of standard deviations provided that the copy number matrix has at least 3 samples. The {\it preferred} approach when the sample size is small (say, less than 10), is to develop SNP-specific estimates of the variance on a larger reference set, such as HapMap, using the same software, and then scale these estimates by a measure of the sample variance. \subsection{Missing genotype calls} If any of the genotype calls are missing and missingness is not independent of the underlying hidden state, one may specify the probability of a missing genotype calls for each hidden state (\texttt{probMissing}). By default, the HMM will assume that missing genotype calls are independent of the underlying hidden state. In particular, this assumption may not be reasonable for homozygous deletions. Again, the emph{preferred} approach is to use the confidence scores provided by crlmm and the function \Robject{genotypeEmissionCrlmm}. \subsection{Transition probabilities} The sequence of states that maximizes the likelihood is obtained by the Viterbi algorithm. Note that the argument \Robject{arm} to this function is a factor indicating the chromosomal arms -- the Viterbi algorithm is computed for independently for each chromosomal arm. We may scale the probability of transitioning between states by setting the arguments \Robject{normal2altered}, \Robject{altered2normal}, and \Robject{altered2altered}. For example, to facilitate comparisons to the Birdseye HMM \citep{Korn2008} one may pass the following arguments to \Robject{viterbi}: <>= fit <- viterbi(initialStateProbs=log(initialP), emission=emission, tau=tau[, "transitionPr"], arm=tau[, "arm"], normalIndex=2, normal2altered=0.005, altered2normal=0.5, altered2altered=0.0025) @ %\subsection{\emph{Smoothness}} The \Robject{TAUP} argument to the function \Robject{transitionProbability} together with the above arguments to the \Robject{viterbi} function can be used to control the 'smoothness' of the resulting predictions. In particular, higher values of \Robject{TAUP} encourages fewer jumps. %\section{Appendix} % %\subsection{Confidence scores for genotype calls.} % %The \Rfunction{CRLMM} (in the R package \Rpackage{oligo}) provides %confidence scores ($\pgte$) of the genotype estimates ($\gte$). From %269 HapMap samples assayed on the Affymetrix 50k Xba and Hind chips, we %have a gold standard of the true genotype defined by the consensus of %the HapMap centers. We use kernal-based density estimates to obtain % %{\scriptsize %\begin{eqnarray} %\label{ingo2} %f\left\{\ \pgtehom\ |\ \gtehom,\thom\ \right\},\ %f\left\{\ \pgtehom\ |\ \gtehom,\thet\ \right\},\ %f\left\{\ \pgtehet\ |\ \gtehet,\thom\ \right\},\ \mbox{~ and~} %f\left\{\ \pgtehet\ |\ \gtehet,\thet\ \right\} %\end{eqnarray} %} % %\noindent separately for the Xba and Hind 50k chips. The first term in %(\ref{ingo2}), for example, denotes the density of the scores when the %genotype is correctly called homozygous ($\gtehom$) and the true %genotype is homozygous ($\thom$). See \cite{Scharpf2008} for a more %complete description of the methods. \section{Session Information} The version number of R and packages loaded for generating the vignette were: <>= toLatex(sessionInfo()) @ \bibliography{ice}{} \bibliographystyle{plain} \end{document}