%\VignetteIndexEntry{Segmentation Overview} %\VignetteDepends{limma, DNAcopy, GLAD} %\VignetteKeywords{Segmentation, aCGH} %\VignettePackage{snapCGH} \documentclass[11pt,a4paper]{article} \title{snapCGH: Segmentation,\\ Normalization and Processing of aCGH Data\\ Users' Guide} \author{ML Smith, JC Marioni, TJ Hardcastle, NP Thorne} %\settocname{Table of Contents} \setlength{\oddsidemargin}{0.5cm} %margin \setlength{\evensidemargin}{0.5cm} %margin \setlength{\textwidth}{15.2cm} \addtolength\footskip{1cm} \pagestyle{plain} \newcommand{\Robject}[1]{{\texttt{#1}}} \newcommand{\Rfunction}[1]{{\texttt{#1}}} \newcommand{\Rpackage}[1]{{\textit{#1}}} \newcommand{\Rclass}[1]{{\textit{#1}}} \newcommand{\Rmethod}[1]{{\textit{#1}}} \newcommand{\Rfunarg}[1]{{\textit{#1}}} \usepackage{Sweave} \begin{document} \maketitle \section*{Citing snapCGH} If you have used \Rpackage{snapCGH} in your work please cite the package using the following:\\ Smith, M.L., Marioni, J.C., Hardcastle, T.J., Thorne, N.P.\\ \indent snapCGH: Segmentation, Normalization and Processing of aCGH Data Users' Guide,\\ \indent Bioconductor, 2006\\ \noindent If you make use of the function \Rfunction{runBioHMM}, please cite:\\ Marioni, J. C., Thorne, N. P., and Tavaré, S. (2006).\\ \indent BioHMM: a heterogeneous hidden Markov model for segmenting array CGH data.\\ \indent \textit{Bioinformatics} \textbf{22}, 1144 - 1146 \section*{Introduction} This document outlines some of the commands used to read in, investigate and subsequently segment array CGH data. The files analysed represent 2 breast cancer cell lines obtained from Jessica M Pole and Paul AW Edwards. <<1,results=hide>>= library(snapCGH) @ \Rpackage{snapCGH} is designed to be used in conjunction with \Rpackage{limma} and so it will automatically load that library before proceeding. In addition to \Rpackage{limma}, the following packages are also loaded: \Rpackage{GLAD}, \Rpackage{DNAcopy}, \Rpackage{tilingArray} and \Rpackage{aCGH}. Each of these impliments an alternative segmentation method that may be applied to the data.\\ \section*{Reading Data} We read in the samples and create the initial RG object using the following commands. <<2, results=hide>>= datadir <- system.file("testdata", package="snapCGH") targets <- readTargets("targets.txt", path=datadir) RG1 <- read.maimages(targets$FileName, path=datadir, source = "genepix") @ % % Positional information about the clones on the array (e.g. which chromosome and the position on the chromosome a clone if from) can be incorporated using the \Rfunction{readPositionalInfo} function. This function accepts either an \Robject{RGList} or \Robject{MAList} along with an argument specifying which array platform the data was produced on. Currently the only supported platforms are Agilent, Bluefuse and Nimblegen, but this should expand in the near future. If your platform isn't supported or \Rfunction{readPostitionalInfo} returns an error (e.g. if the a new version of the array output data becomes incompatible with the current function) then the positional information can be read in separately using the \Rfunction{read.clonesinfo} function. In order to do this it is necessary to create a clones info file. Such a file (which can be created using Excel and saved as a 'txt' file) must contain columns called Position and Chr which give the position along a chromosome (in Mb) and the chromosome to which a clone belongs. The X and Y chromosomes can be labelled either as X and Y or 23 and 24. Both instances are handled by \Rfunction{read.clonesinfo}. The clones info file must be ordered in the same way as the clones are ordered in the array output data. The second command adds information about the structure of the slide (blocks/rows/columns) to the \Robject{RG} object. Finally we read in a spot types file. This file contains information about the control status of particular spots on the array and allows specific spots to be highlighted in many of the plotting functions. The content of a spot types file is covered extensively within the \Rpackage{limma} manual. <<3, results=hide>>= RG1 <- read.clonesinfo("cloneinfo.txt", RG1, path=datadir) RG1$printer <- getLayout(RG1$genes) types <- readSpotTypes("SpotTypes.txt", path=datadir) RG1$genes$Status <- controlStatus(types, RG1) @ Commonly, when aCGH experiments are carried out the reference channel is dyed using Cy5 and the test channel is dyed using Cy3. This is the opposite way to expression data. In order to take this into account we need to specify which channel is the reference within our \Rclass{RGList}. To do this we create a design vector with each column corresponding to an array in the experiment. A value of 1 indicates that the Cy3 channel is the reference, whilst a value of -1 equates to Cy5 being the reference, as is the case in this example. <<3a>>= RG1$design <- c(-1,-1) @ We now proceed to use the function \Rfunction{backgroundCorrect} to remove the background intensity for each spot. In this example we have chosen the method 'minimum' which subtracts the background value from the forground. Any intensity which is zero or negative after the background subtraction is set equal to half the minimum of the positive corrected intensities for that array. For other background correction methods please see the appropriate help file. <<4>>= RG2 <- backgroundCorrect(RG1, method="minimum") @ %% Drop this step below?????? this should be a weight function Next, we normalise the data. Here we will carry out a (global) median normalisation. Other options for normalization methods are: \Rfunarg{none}, \Rfunarg{loess}, \Rfunarg{printtiploess}, \Rfunarg{composite} and \Rfunarg{robustspline}. The output of the normalization function is a new type of object called an \Rclass{MAList}. This is composed of the $log_{2}$ ratios, intensities, gene and slide layout information which it gleans from the \Robject{RG} object. If you feel that no normalisation is required then it is possible to skip this stage and proceed directly onto the next step maintaining the \Rclass{RGList}. <<5>>= MA <- normalizeWithinArrays(RG2, method="median") @ We are now ready to process the data with the purpose of segmenting the dataset into regions corresponding to sections of the genome where there are the same number of copy number gains or losses.\\ Firstly, we use the \Rfunction{processCGH} to 'tidy up' the \Rclass{MAList} object. The \Rfunarg{method.of.averaging} option defines how clones of the same type should be averaged. If this is specified the duplicates are removed following the averaging leaving only one occurence of each clone set. The \Rfunarg{ID} argument is used to indicate the name of the column containing a unique identifier for each clone type. It is this identifier that is used when averaging replicates. As mentioned above this function accepts both \Rclass{RGLists} and \Rclass{MALists}. <<6, results=hide>>= MA2 <- processCGH(MA,method.of.averaging=mean, ID = "ID") @ \section*{Segmentation} We are now ready to fit the segmentation method. For larger data sets this step can take a long time (several hours). In this example we call the homogeneous hidden Markov model available in the \Rpackage{aCGH} package. <<segmentation1,results=hide>>= SegInfo.Hom <- runHomHMM(MA2, criteria = "AIC") @ At the current time there are methods for calling four other segmentation algorithms in addition to the method shown above. Three of these are included as wrapper functions to methods available in other packages, specifically: \Rpackage{DNAcopy}, \Rpackage{GLAD} and \Rpackage{tilingArray}. These functions can be called using the following code: <<segmentation2,results=hide, eval = FALSE>>= SegInfo.GLAD <- runGLAD(MA2) SegInfo.DNAcopy <- runDNAcopy(MA2) SegInfo.TilingArray <- runTilingArray(MA2) @ The final method, called BioHMM, is a heterogeneous hidden Markov model and is maintained within \Rpackage{snapCGH}. It can be called using the following command. By default it incorporates the distance between clones into the model assigning a higher probability of state change to clones that are a larger distance apart on a chromosome. This option is control using the argument \Rfunarg{useCloneDists}. <<segmentation3, results=hide, eval = FALSE>>= SegInfo.Bio <- runBioHMM(MA2) @ We now deal with the fact that the segmentation methods sometimes have a tendency to fit states whose means are very close together. We overcome this problem by merging states whose means are within a given threshold. There are two different methods for carrying out the merging process. For more information on their differences please see the appropriate page in the helpfiles. <<segmentation4>>= SegInfo.Hom.merged <- mergeStates(SegInfo.Hom, MergeType = 1) @ We are now ready to use any of the plotting functions available in the library. \section*{Plotting Functions} The library comes with a variety of plotting functions that provide visual representations of the data at various stages of the analysis process. Firstly we will look at the \Rfunction{genomePlot} function. This function takes either an \Rclass{MAList} or a \Rclass{SegList} object (in this example we've used an \Rclass{MAList}) and plots the M-value for each gene against it's position on the genome. The \Rfunarg{array} argument indicates which array is plotted. This function utilizes the spot types data that was read in earlier to highlight specific genes of interest. <<label=plotting1, include=FALSE>>= genomePlot(MA2, array = 1) @ \begin{figure}[!ht] \begin{center} <<label=plotting2, fig=TRUE,echo=FALSE>>= <<plotting1>> @ \end{center} \end{figure} \pagebreak It is also possible to look at specific chromosomes, rather than the entire genome as in the previous example. Which particular chromosome is to be plotted is specified using the \Rfunarg{chrom.to.plot} argument. <<label = plotting3, include=FALSE>>= genomePlot(MA2, array = 1, chrom.to.plot = 8) @ \begin{figure}[!ht] \centering <<label=plotting4,fig=TRUE,echo=FALSE>>= <<plotting3>> @ \end{figure} \pagebreak %Here we cluster the samples. Again this function will except an %object of class \Rclass{MAList} or \Rclass{SegList}. It is possible %to specify which chromosomes to cluster using the \Rfunarg{vecchrom} %argument as well as defining the colours that represent areas of %amplification or deletion. Please see the help file for more details. %<<label=13,include=FALSE>>= %heatmapGenome(MA2) %@ %\begin{figure}[!ht] %\begin{center} %<<label=13a,fig=TRUE,echo=FALSE>>= %<<13>> %@ %\end{center} %\end{figure} %\pagebreak The \Rfunction{plotSegmentedGenome} function provides a visual representation of the observed M-values overlayed with the predicted states produced by the segementation algorithm. It requires a \Rclass{SegList} as input. <<label=14,include=FALSE>>= plotSegmentedGenome(SegInfo.Hom.merged, array = 1) @ \begin{figure}[!ht] \begin{center} <<label=14a,fig=TRUE,echo=FALSE>>= <<14>> @ \end{center} \end{figure} \pagebreak Using the argument \Rfunarg{chrom.to.plot} it is possible to specify individual chromosomes to plot. Additionally the function can accept more than one \Robject{SegList} allowing visual comparison between segmentation methods. The following example applies the DNAcopy algorithm to the data, merges it and then plots both that segmentation method and the homogeneous HMM on the same axis, coloring them blue and green respectively. <<label=15, results=hide>>= Seg.DNAcopy <- runDNAcopy(MA2) SegInfo.DNAcopy.merged <- mergeStates(Seg.DNAcopy) plotSegmentedGenome(SegInfo.DNAcopy.merged, SegInfo.Hom.merged, array = 1, chrom.to.plot = 1, colors = c("blue", "green")) @ \begin{figure}[!ht] \begin{center} <<label=15a,fig=TRUE,results = hide, echo=FALSE>>= <<15>> @ \end{center} \end{figure} \pagebreak \section*{Interactive Plotting Functions} \Rpackage{snapCGH} includes several interactive plotting functions to allow users to view data on a large scale and the focus in on particular areas of intrest. The first of these is \Rfunction{zoomGenome}. This function splits the screen in two horizontally and plots the same as \Rfunction{plotSegmentedGenome} in the upper half. It is then possible to click on any of the chromosomes displayed and the selected chromosome is plotted in the lower half of the graphics window. Clicking to either side of the plot will end the interactivity. In the example below we have plotted the first array in the \Robject{SegInfo.Hom.merged} object and then clicked on chromosome eight. <<label=iplotting1, eval=FALSE>>= zoomGenome(SegInfo.Hom.merged, array = 1) @ \begin{figure}[!ht] \centering \scalebox{1.0}[0.9]{\includegraphics{zoomGenome.jpg}} \end{figure} \pagebreak \Rfunction{zoomChromosome} is an equivalent function for focusing on specific areas in a specific chromosome. The layout is the same as in the \Rfunction{zoomGenome} but this time the upper plot is of a specified chromosome. Click in two location on this plot and the selected region of the chromosome will be plotted below. The region of focus is also indicated in the upper plot. Clicking twice again will change the focus region. This function accepts either an \Rclass{MAList} or multiple \Rclass{SegList} objects. <<label=iplotting2, eval=FALSE>>= zoomChromosome(SegInfo.Hom.merged, array = 1, chrom.to.plot = 8) @ \begin{figure}[!ht] \centering \includegraphics{zoomChromosome.jpg} \end{figure} \section*{Simulating Data} Simulated data is often used to assess the efficacy of segmentation methods so \Rpackage{snapCGH} also provides facilities for simulating aCGH data. None of the currently available simulation schemes take into account the spatial nature of aCGH data, they simply generate ordered $log_{2}$ ratios. However with the development2 of new segmentation methods, such as BioHMM, new simulation schemes are needed. In particular the simulation scheme available in \Rpackage{snapCGH} allows differentiation between tiled and non-tiled regions to emulated different array technologies. The function \Rfunction{simulateData} generates a \Rclass{SegList} of 22 chromosomes with the number of arrays specified by the argument \Rfunarg{nArrays}. The function has a mulitude of additional arguments for specifying the specific type of data you wish to simulate. These are documented in the appropriate helpfile. <<label=Simulation1>>= simulation <- simulateData(nArrays = 4) @ In the following example we run the homogeneous HMM and the DNAcopy algorithm on the simulated data. The function \Rfunction{compareSegmentations} can be used to evaluate the performance of a segmentation methods against the known truth of the simulated data. It returns true positive and false discovery rates for breakpoint detection in each supplied \Rclass{SegList}. The argument \Rfunarg{offset}, which can take an integer between 0 and 2, specifies how close the algorithm needs to be to a real breakpoint before it is scored. <<label=Simulation2, results=hide>>= Sim.HomHMM <- runHomHMM(simulation) Sim.DNAcopy <- runDNAcopy(simulation) rates <- compareSegmentations(simulation, offset = 0, Sim.HomHMM, Sim.DNAcopy) @ The output from \Rfunction{compareSegmentations} is a list containing two matrices. The first of these, \texttt{\$TPR}, contains the true positive rate, whilst the second, \texttt{\$FDR}, holds the false discovery rate. Both of these matrices are arranged such that a row represents a segmentation method and each column is an array. <<label=Simulation3>>= rates @ Shown below are boxplots of the \Robject{rates} object. \begin{figure}[!ht] \centering <<label=Simulation4, fig=TRUE>>= par(mfrow = c(1,2)) boxplot(rates$TPR ~ row(rates$TPR), col = c("red", "blue"), main = "True Positive Rate") boxplot(rates$FDR ~ row(rates$FDR), col = c("red", "blue"), main = "False Discovery Rate") @ \end{figure} \end{document}