%\VignetteIndexEntry{VanillaICE Vignette} %\VignetteKeywords{copy number, genotype, SNP} %\VignettePackage{VanillaICE} \documentclass{article} \usepackage{amsmath} \usepackage{graphicx} \usepackage[numbers]{natbib} \usepackage{color} \newcommand{\scscst}{\scriptscriptstyle} \newcommand{\scst}{\scriptstyle} \newcommand{\Rpackage}[1]{\textit{#1}} \newcommand{\Rfunction}[1]{\texttt{#1}} \newcommand{\Robject}[1]{\texttt{#1}} \newcommand{\Rclass}[1]{\texttt{#1}} \newcommand{\R}{\textsf{R}} \newcommand{\hmmoptions}{\Robject{HmmOptions}} \newcommand{\hmmparam}{\Robject{HmmParameter}} \newcommand{\crlmm}{\Rpackage{crlmm}} \newcommand{\oligo}{\Rpackage{oligo}} \newcommand{\cne}{\widehat{\text{CN}}} \newcommand{\gte}{\widehat{\text{GT}}} \newcommand{\gtehom}{\widehat{\text{HOM}}} \newcommand{\gtehet}{\widehat{\text{HET}}} \newcommand{\pgte}{\text{S}_{\widehat{\text{\tiny GT}}}} \newcommand{\pcne}{\text{S}_{\widehat{\text{\tiny CN}}}} \newcommand{\pgtehom}{\text{S}_{\widehat{\text{\tiny HOM}}}} \newcommand{\pgtehet}{\text{S}_{\widehat{\text{\tiny HET}}}} \newcommand{\thom}{\text{HOM}} \newcommand{\thet}{\text{HET}} \newcommand{\bDelta}{\mbox{\boldmath $\Delta$}} \newcommand{\real}{\mbox{$\mathbb R$}} % real numbers \newcommand{\bnu}{\mbox{\boldmath $\nu$}} \newcommand{\ice}{\Rpackage{VanillaICE}} \textwidth=6.2in \textheight=8.5in \oddsidemargin=.1in \evensidemargin=.1in \headheight=-.3in \begin{document} \title{\ice{}: Hidden Markov Models for the Assessment of Chromosomal Alterations using High-throughput SNP Arrays} \author{Robert Scharpf} \maketitle <>= options(width=70) @ \begin{abstract} This package provides an implementation of a hidden Markov Model for high throughput SNP arrays. Users of this package should already have available locus-level estimates of copy number. Copy number estimates can be relative or absolute. \end{abstract} \section{Overview} This vignette requires that you have \begin{itemize} \item an absolute estimate of the \emph{total} copy number organized such that rows correspond to loci and columns correspond to samples and / or \item a matrix of genotype calls (1=AA, 2 = AB, 3= BB): rows correspond to loci and columns correspond to samples \end{itemize} \noindent Additional options that can improve the HMM predictions include \begin{itemize} \item a CRLMM confidence score of the genotype call \item standard errors of the copy number estimates \end{itemize} \noindent Other HMM implementations are available for the joint analysis of copy number and genotype, including QuantiSNP \citep{Colella2007} and PennCNV \citep{Wang2007a}. \paragraph{Data considerations.} The HMM implemented in this package is most relevant for heritable diseases for which integer copy numbers are expected. For somatic cell diseases such as cancer, we suggest circular binary segmentation, as implemented in the \R{} package DNAcopy \citep{Olshen2004}. \paragraph{Citing this software.} % \bibitem{Scharpf2008} Robert~B Scharpf, Giovanni Parmigiani, Jonathan Pevsner, and Ingo Ruczinski. \newblock Hidden {M}arkov models for the assessment of chromosomal alterations using high-throughput {SNP} arrays. \newblock {\em Annals of Applied Statistics}, 2(2):687--713, 2008. \section{Organizing the locus-level data} \label{sec:simpleUsage} This package includes simulated genotype and copy number data for 9165 SNPs. <>= library(VanillaICE) data(locusLevelData) @ \noindent The copy number estimates in the locusLevelData object were multiplied by 100 and saved as an integer. Verify that it is reasonable to assume integer copy number for the HMM by plotting the locus-level estimates as a function of the physical position. <>= par(las=1) plot(locusLevelData[["copynumber"]][, 1]/100, pch=".", ylab="copy number", log="y") abline(h=1:3, col="grey70") @ \noindent Next, create an object of class \Robject{oligoSnpSet} from the simulated data: %get rid of this <>= oligoSet <- new("oligoSnpSet", copyNumber=log2(locusLevelData[["copynumber"]]/100), call=locusLevelData[["genotypes"]], callProbability=locusLevelData[["crlmmConfidence"]], annotation=locusLevelData[["platform"]]) oligoSet <- oligoSet[!is.na(chromosome(oligoSet)), ] oligoSet <- oligoSet[chromosome(oligoSet) < 3, ] @ If confidence scores or inverse standard errors for the copy number estimates are available, these should be supplied to the \Robject{cnConfidence} slot in the \Robject{assayData}. For illustration, in the following code chunk we transform the copy number estimates to the log scale and calculate a robust estimate of the standard deviation across autosomes. If uncertainty estimates are not available for copy number, the HMM will calculate the median absolute deviation (MAD). See the the function \Robject{robustSds}. <>= sds <- sd(oligoSet) @ \noindent The inverse of the \Robject{sds} object can be assigned to the \Robject{cnConfidence} slot. <>= cnConfidence(oligoSet) <- 1/sds @ \section{Fitting the HMM} %Several scenarios are outlined for fitting the HMM. In general, the %following elements are required to fit the HMM: initial state %probabilities, emission probabilities, and transition probabilities. \subsection{Vanilla HMM} When jointly modeling the copy number and genotype data, we assume that the genotype estimates and copy number estimates are independent conditional on the underlying hidden state. The emission probabilities for the genotypes are then calculated using either (i) assumptions of the probability of observing a homozygous genotype call given the underlying state. Next we order the markers by chromosome and physical position. The \Rfunction{hmm} method will order the \Robject{oligoSnpSet} object automatically, so the following step is not required. <>= oligoSet <- order(oligoSet) hmmOpts <- hmm.setup(oligoSet, is.log=TRUE) @ \noindent The viterbi algorithm is used to obtain the most likely sequence of hidden states given the observed data. For efficiency, we return an object of class \Robject{RangedDataHMM} with genomic coordinates of the normal and altered regions. We also return the log-likelihood ratio (LLR) of the predicted sequence in an interval versus the null of normal copy number. For intervals with typical copy number (2) and percent heterozygosity (the 3rd state in the above codechunk), the LLR is zero. <>= ## first 2 chromosomes oligoSet <- oligoSet[chromosome(oligoSet) <= 2, ] fit.van <- hmm(oligoSet, hmmOpts) @ Next we plot the data for chromosome 1 and overlay the predictions from the hidden markov model. See Figure \ref{fig:chr1}. <>= library(RColorBrewer) cols <- brewer.pal(5, "YlOrBr")[2:5] chr1 <- oligoSet[chromosome(oligoSet)==1,] fit.chr1 <- fit.van[fit.van$chrom == 1, ] ##fit.chr1 <- fit.van[fit.van$chrom==1, ] isHet <- snpCall(chr1)==2 par(las=1) plot(position(chr1), copyNumber(chr1), pch=".", cex=2, col="royalblue", ylab="log2 copy number") points(position(chr1)[isHet], copyNumber(chr1)[isHet], col="red", pch=".", cex=2) abline(h=log2(1:3), col="grey70") sts <- start(fit.chr1); ends <- end(fit.chr1) xx <- range(c(sts,ends)) y <- c(-1,-1,-0.9,-0.9) polygon(x=c(xx, rev(xx)), y=y, col="white") for(i in 1:nrow(fit.chr1)){ polygon(x=c(sts[i], ends[i], ends[i], sts[i]), y=y, col=cols[fit.chr1$state[i]], border=cols[fit.chr1$state[i]]) } legend("topleft", fill=cols, legend=hmmOpts$states, bty="n") @ \begin{figure}[t] \includegraphics[width=0.9\textwidth]{VanillaICE-fig2} \caption{\label{fig:chr1} Plot of artificial data for one chromosome.} \end{figure} To find which markers are included in each genomic interval returned by the \Rfunction{hmm} method, one can use the \Rfunction{findOverlaps} method in the \Rpackage{oligoClasses}. This method returns the 'match matrix'. The first column in the match matrix are indices of the genomic intervals (rows) of the \Rclass{RangedDataHMM} object; the second column in the match matrix is a vector of indices for the markers in the \Robject{oligoSet} object. For example, the second interval in the \Robject{RangedDataHMM} object \Robject{fit.van} contains 102 markers. <>= fit.van[2, ] @ To find the names of the 102 markers that are included in this genomic interval, one could do the following <>= mm <- matchMatrix(findOverlaps(fit.van, oligoSet)) markersInRange <- featureNames(oligoSet)[mm[,1]==2] @ Multipanel displays can be useful for visualizing the low-level data for copy number alterations. We extend the \Rfunction{xyplot} method in the \R{} package lattice for two common use-cases: by-locus and by-sample. \paragraph{By locus.} To plot the genomic data for a set of ranges at a given locus, we provide an unevaluated code chunk below (our example contains only a single sample): <>= xyplot(cn ~ x | id, oligoSet, range=RangedDataObject, ylim=c(-0.5,4), panel=xypanel) @ \noindent Note that the default in the above command is to use a common x- and y-scale for each panel. To allow the x-axes to change for each panel, one could set the x-scales to 'free': <>= xyplot(cn ~ x | id, oligoSet, range=RangedDataObject, ylim=c(-0.5,4), scales=list(x="free"), panel=xypanel) @ \noindent The function \Rfunction{xypanel} provides default colors for annotating the plotting symbols by genotype and by whether the markers are polymorphic. The \Robject{RangedDataHMM} must be passed by the name \emph{range} to the \Rfunction{xyplot} method. \paragraph{By sample.} To plot the low-level data for multiple alterations occurring in a single sample, one can again pass a \Rclass{RangedDataHMM} object with name \emph{range} to the \Rfunction{xyplot}. For example, see Figure \ref{fig:xy}. The code for producing Figure \ref{fig:xy} is in the following code chunk. Note that the formula in the example below conditions on \emph{range} instead of \emph{id}. The conditioning variable for displaying multiple panels of a single sample must be called 'range'. We plot a 2 Mb window surrounding each of the alterations in the simulated data by specifying \texttt{frame=2e6}. <>= ranges.altered <- fit.van[state(fit.van) != 4, ] xy.example <- xyplot(cn ~ x | range, oligoSet, range=ranges.altered, frame=2e6, scales=list(x="free"), ylim=c(-0.5,4), panel=xypanel, cex=0.4, pch=21, border="grey", ylab=expression(log[2]("copy number"))) @ \noindent See \texttt{?xyplot} for additional details. <>= pdf("VanillaICE-xyplot.pdf", width=8, height=7) print(xy.example) dev.off() @ \begin{figure}[t] \includegraphics[width=\textwidth]{VanillaICE-xyplot} \caption{\label{fig:xy} The method \Rfunction{xyplot} is used to create a multi-panel display of alterations in a single sample. Each panel displays a single copy number alteration detected by the HMM that is boxed by a rectangle. The alteration is framed by specifying the number of basepairs to plot upstream and downstream of the alteration. Here, we used a frame of 2 Mb. Homozygous SNPs with diallelic geotypes `AA' and `BB' are shaded blue; SNPs with diallelic genotype call `AB' are shaded in red.} \end{figure} Note also that \texttt{scales} must be set to \emph{free} in the above call to \Rfunction{xyplot}. \subsection{ICE HMM} To compute emission probabilities that incorporate the \Rpackage{crlmm} genotype confidence scores, (i) set \Robject{ICE} to \texttt{TRUE} in the \Robject{hmm.setup} function and (ii) indicate which of the states are expected to be largely homozygous (\texttt{rohStates}). Note that this option is limited too a few platforms and the Affy 100k platforms is not one of them. <>= hmmOpts <- hmm.setup(oligoSet, ICE=TRUE, rohStates=c(FALSE, TRUE, TRUE, FALSE, FALSE)) res <- tryCatch(hmm(oligoSet, hmmOpts), error=function(e) "platform not supported") print(res) ## supported platforms VanillaICE:::icePlatforms() @ For the purpose of illustration, we assign 'genomewidesnp6' to the annotation slot since this platform is supported. <>= ann <- annotation(oligoSet) annotation(oligoSet) <- "genomewidesnp6" fit.ice <- hmm(oligoSet, hmmOpts) fit.ice annotation(oligoSet) <- ann @ <>= fit.chr1 <- fit.ice[chromosome(fit.ice)==1, ] widths <- width(fit.chr1) fit.chr1 <- fit.chr1[order(widths,decreasing=TRUE),] par(las=1) plot(position(chr1), copyNumber(chr1), pch=".", ylab="log2 copy number", xlab="physical position", cex=2, col="royalblue") points(position(chr1)[isHet], copyNumber(chr1)[isHet], col="red", pch=".", cex=2) abline(h=log2(1:3), col="grey70") sts <- start(fit.chr1); ends <- end(fit.chr1) xx <- range(c(sts,ends)) y <- c(-1,-1,-0.9,-0.9) polygon(x=c(xx, rev(xx)), y=y, col="white") for(i in 1:nrow(fit.chr1)){ polygon(x=c(sts[i], ends[i], ends[i], sts[i]), y=y, col=cols[state(fit.chr1)[i]], border=cols[state(fit.chr1)[i]]) } legend("topleft", fill=cols, legend=hmmOpts$states, bty="n") @ \subsection{Other options} \paragraph{Copy number.} A HMM for copy number only (e.g., if genotypes are ignored or are unavailable) can be fit as follows. <>= cnSet <- new("CopyNumberSet", copyNumber=log2(locusLevelData[["copynumber"]]/100), annotation=locusLevelData[["platform"]]) cnSet <- cnSet[chromosome(cnSet) <= 2 & !is.na(chromosome(cnSet)), ] hmmOpts <- hmm.setup(cnSet, is.log=TRUE) fit.cn <- hmm(cnSet, hmmOpts) @ \paragraph{Regions of homozygosity.} %\subsection{Region of homozygosity (ROH) HMM} A HMM for genotype-only data can be used to find long stretches of homozygosity. Note that hemizygous deletions are also identified as 'ROH' when copy number is ignored (as the biallelic genotypte call in a hemizygous deletions tends to be all homozygous calls). <>= snpSet <- new("SnpSet", call=locusLevelData[["genotypes"]], callProbability=locusLevelData[["crlmmConfidence"]], annotation=locusLevelData[["platform"]]) featureData(snpSet) <- addFeatureAnnotation(snpSet) snpSet <- snpSet[chromosome(snpSet) < 3, ] hmmOpts <- hmm.setup(snpSet) fit.gt <- hmm(snpSet, hmmOpts) @ \noindent A suggested visualization: <>= fit.chr1 <- fit.gt[chromosome(fit.gt)==1, ] widths <- width(fit.chr1) fit.chr1 <- fit.chr1[order(widths,decreasing=TRUE),] gt <- ifelse(snpCall(chr1) == 1 | snpCall(chr1) == 3, 1, 0) par(las=1) plot(position(chr1), jitter(gt, amount=0.05), pch=".", ylab="", xlab="physical position", ylim=c(-3, 1.2), yaxt="n") ##points(position(chr1)[isHet], copyNumber(chr1)[isHet,], pch=".", ylab="log2 copy number", xlab="physical position", cex=2, col="red") axis(side=2, at=c(0,1), labels=c("AB", "AA or BB"), cex.axis=0.7) sts <- start(fit.chr1); ends <- end(fit.chr1) xx <- range(c(sts,ends)) y <- c(-1,-1,-0.5,-0.5) polygon(x=c(xx, rev(xx)), y=y, col="white") for(i in 1:nrow(fit.chr1)){ polygon(x=c(sts[i], ends[i], ends[i], sts[i]), y=y, col=cols[fit.chr1$state[i]], border=cols[fit.chr1$state[i]]) } legend("bottomleft", fill=cols, legend=hmmOpts$states, bty="n") @ \section{Quality control} \subsection{Outliers} %% could remove this section and discuss Uniform-Gaussian mixture Copy number outliers can cause the HMM to become too jumpy. One approach to reduce the influence of outliers is to some \textit{light}-smoothing prior to fitting the HMM, as suggested in the \R{} package \texttt{DNAcopy}. For instance, one could identify outliers by some criteria and then average the outliers using the estimates from neigboring probes. Here, we use the defaults in \Robject{smooth.CNA}. <>= if(require("DNAcopy")){ ##create an outlier copyNumber(cnSet)[50] <- 10 copyNumber(cnSet)[45:55] cnaObj <- CNA(genomdat=copyNumber(cnSet), chrom=chromosome(cnSet), maploc=position(cnSet), data.type="logratio", sampleid=sampleNames(cnSet)) smoothed.cnaObj <- smooth.CNA(cnaObj) copyNumber(cnSet) <- matrix(smoothed.cnaObj[, "NA06993"], nrow(cnSet), 1) copyNumber(cnSet)[50] } @ \noindent One could also increase the value of \Robject{TAUP} in the viterbi algorithm to encourage a fit with fewer jumps. Note that with improved estimates of copy number uncertainty, many of these \textit{post-hoc} approaches for addressing outliers would be less critical. \subsection{Batch effects} \Rpackage{VanillaICE} can be used in conjunction with the \Rpackage{crlmm} package to reduce batch effects. See \citep{Scharpf2010} for details regarding the \Rpackage{crlmm} package. %\section{Alternatives} % %See the \texttt{crlmmDownstream} vignette located in %\texttt{inst/test} for an alternative approach that computes emission %probabilities directly on the bivariate normal log(A) versus log(B) %space \citep{Korn2008}. The estimation procedure for copy number in %\Rpackage{crlmm} is described elsewhere \citep{Scharpf2010}. \section{Session Information} The version number of R and packages loaded for generating the vignette were: <>= toLatex(sessionInfo()) @ \bibliography{ice}{} \bibliographystyle{plain} \end{document}