%\VignetteIndexEntry{The \Rpackage{ChAMP} Package}
%\VignetteKeywords{Illumina DNA methylation normalisation array normalization performance }
%\VignettePackage{ChAMP}
%% http://www.bioconductor.org/help/course-materials/2010/AdvancedR/BuildPackage.pdf
\documentclass[11pt]{article}
\usepackage{Sweave}
\newcommand{\Rfunction}[1]{{\texttt{#1}}}
\newcommand{\Robject}[1]{{\texttt{#1}}}
\newcommand{\Rpackage}[1]{{\textit{#1}}}
\newcommand{\Rclass}[1]{{\textit{#1}}}
\usepackage{natbib}
\usepackage{url}

\title{The \Rpackage{ChAMP} Package}
\author{Morris TJ, Butcher L, Feber A, Teschendorff A, Yuan Tian, Chakravarthy A, Beck S}

\begin{document}
\SweaveOpts{concordance=TRUE}
\maketitle

\begin{figure}[!tpb]%logo
\centerline{\includegraphics[scale=1]{logo4.jpg}}%{system.file("extdata","logo4.jpg",package="ChAMP")}}
\end{figure}

\noindent The \Rpackage{ChAMP} package is a pipeline that integrates currently available 450k and EPIC analysis methods and also offers its own novel functionality. It utilises the data import, quality control and SWAN \citep{Maksimovic12} normalization functions offered by the \Rpackage{minfi} \citep{minfi} package. In addition, the \Rpackage{ChAMP} package includes the Peak Based Correction (PBC) method \citep{Dedeurwaerder11}, and the BMIQ normalization \citep{Teschendorff13} is set as the default method.

\noindent A number of other pipelines and packages are available for 450k analysis, including \Rpackage{IMA} \citep{Wang12IMA}, \Rpackage{minfi} \citep{minfi}, \Rpackage{methylumi} \citep{methylumi}, \Rpackage{RnBeads} \citep{rnbeads} and \Rpackage{wateRmelon} \citep{Pidsley13}. However, \Rpackage{ChAMP} includes several analysis options that are not available in other packages. First, \Rpackage{ChAMP} supports both the 450k and EPIC arrays. The package also contains multiple novel functions. The singular value decomposition (SVD) method \citep{Teschendorff09} allows an in-depth look at batch effects, and the ComBat method \citep{Johnson07} has been implemented to correct for batch effects related to slide number. For the identification of differentially methylated regions (DMRs), \Rpackage{ChAMP} offers the novel Probe Lasso method; another effective DMR detection method, Bumphunter \citep{Andrew1}, is also integrated and set as the default option. To deal with the problem of cell-type heterogeneity, we have included two functions, RefFreeEWAS \citep{Houseman1} and RefbaseEWAS \citep{Houseman2}, in \Rpackage{ChAMP}. Finally, \Rpackage{ChAMP} has an additional function to analyse 450k or EPIC data for copy number alterations \citep{Feber14}.
%mention shinymethy

\section{Installation}
\noindent It is essential that you have R already installed on your computer. \Rpackage{ChAMP} is a pipeline that utilises many packages currently available from CRAN and Bioconductor.
For all of the steps of the pipeline to work, make sure that you have upgraded Bioconductor to the newest version (3.2) and installed \Rpackage{minfi}, \Rpackage{DNAcopy}, \Rpackage{impute}, \Rpackage{marray}, \Rpackage{limma}, \Rpackage{preprocessCore}, \Rpackage{RPMM}, \Rpackage{sva}, \Rpackage{IlluminaHumanMethylation450kmanifest}, \Rpackage{plyr}, \Rpackage{GenomicRanges}, \Rpackage{RefFreeEWAS}, \Rpackage{qvalue}, \Rpackage{doParallel}, \Rpackage{bumphunter}, \Rpackage{quadprog}, \Rpackage{isva} and \Rpackage{wateRmelon}.\\
\\
This can be done in one go using the following commands:\\
<>=
source("http://bioconductor.org/biocLite.R")
biocLite(c('minfi', 'DNAcopy', 'impute', 'marray', 'limma',
           'preprocessCore', 'RPMM', 'sva',
           'IlluminaHumanMethylation450kmanifest', 'wateRmelon', 'isva',
           'quadprog', 'bumphunter', 'doParallel', 'qvalue',
           'RefFreeEWAS', 'GenomicRanges', 'plyr'))
@
Load the \Rpackage{ChAMP} package. \\
<>=
library(ChAMP)
@
\noindent It is then easiest to set your working directory to the folder containing your .idat files and sample sheet. There can only be one sample sheet in this folder. \\

\section{Test Dataset}
The package contains a test dataset of HumanMethylation450 data which can be used to test the functions available in \Rpackage{ChAMP}.\\
This can be loaded by pointing the directory to the test dataset:\\
<>=
testDir=system.file("extdata",package="ChAMPdata")
@
Alternatively, a pre-filtered, saved version can be loaded using:
<>=
data(testDataSet)
myLoad=testDataSet
@

\section{Full Pipeline}
Figure \ref{fig:pipeline} outlines the steps in the \Rpackage{ChAMP} pipeline. Each step can be run individually as a separate function. This allows integration with other analysis pipelines. It also enables the user to save the results of each step for future reference or further analysis. Alternatively the full pipeline can be run at once with one command:\\
<>=
champ.process(directory = testDir)
@
\noindent When running the full pipeline through the champ.process() function a number of parameters can be adjusted.\\
\begin{figure}[!tpb]%pipeline
\centerline{\includegraphics[scale = 0.55]{figure1.pdf}}%{system.file("extdata","figure1.pdf",package="ChAMP")}}
\caption{An overview of the \Rpackage{ChAMP} pipeline showing all major steps available in the pipeline. These steps can be run as a full pipeline or individually as functions combined with other packages. This vignette will explain in detail how to proceed.}
\label{fig:pipeline}
\end{figure}

\section{A note on computational requirements}
\noindent The ability to run the pipeline on a large number of samples depends somewhat on the memory available. The \Rpackage{ChAMP} pipeline runs 200 samples successfully on a computer with 8GB of memory. Beyond this it may be necessary to find a server/cluster to run the analysis on.

\noindent The champ.load() function uses the most memory. If you plan to run the analysis more than once it is recommended to run myLoad=champ.load() and save this list for future analyses. In this case, when the list of load objects is saved to the variable 'myLoad', you can simply run champ.process(fromIDAT=F).

\noindent In the champ.DMR() function, if the Bumphunter DMR detection method is selected, parallel processing can be used to accelerate the calculation. If your server or computer has multiple cores, you may specify more threads to run the function faster, though this will cost more memory (see the sketch below).
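\noindent The following chunk is a minimal sketch of such a parallelised run; note that the \Robject{cores} argument name is an assumption based on this version's conventions, so please check ?champ.DMR for the exact name of the parallelism option in your installation.
<<eval=FALSE>>=
# A sketch only: the 'cores' argument name is an assumption -- see ?champ.DMR.
myDMR <- champ.DMR(method = "Bumphunter",
                   cores  = 4)  # more threads speed up permutation, but use more memory
@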
To save the loaded data for future runs, as suggested above:
<>=
save(myLoad,file="currentStudyloadedData.RData")
load("currentStudyloadedData.RData")
champ.process(fromIDAT=FALSE)
@
%Add info on combining datasets...
\noindent Another option, if you have time or space constraints or if you are combining \Rpackage{ChAMP} with other analysis pipelines, is to run the analysis step by step and not use the champ.process() function. Each separate function is described in detail below, but the full pipeline run in a step-wise fashion would be:
<>=
myLoad <- champ.load(directory = testDir)
myNorm <- champ.norm()
champ.SVD()
batchNorm <- champ.runCombat()
limma <- champ.MVP()
myDMR <- champ.DMR()
myRefBase <- champ.refbase()
myRefFree <- champ.reffree()
champ.CNA()
@
\noindent We have also included a simulated EPIC dataset in the \Rpackage{ChAMPdata} package, which contains 16 simulated samples with manually modified CpGs and DMRs. We use this dataset to demonstrate our EPIC array pipeline, which is just as easy to use.
<>=
# myLoad <- champ.load(directory = testDir,arraytype="EPIC")
# We simulated EPIC data from beta values instead of .idat files,
# but users may use the above code to read .idat files directly.
# Here we start from the loaded data.
data(EPICSimData)
myNorm <- champ.norm(arraytype="EPIC")
champ.SVD()
batchNorm <- champ.runCombat()
myrefbase <- champ.refbase()
myreffree <- champ.reffree()
limma <- champ.MVP(arraytype="EPIC")
myDMR <- champ.DMR(arraytype="EPIC")
# champ.CNA(arraytype="EPIC")
# The champ.CNA() function calls for intensity data, which are not
# included in our simulated data.
@
\noindent Thus, it is easy to use \Rpackage{ChAMP} on EPIC datasets as well: users just need to set the 'arraytype' parameter to 'EPIC' in the champ.load(), champ.norm(), champ.MVP(), champ.DMR() and champ.CNA() functions, and \Rpackage{ChAMP} will automatically perform the analysis with the EPIC annotation.

\section{Description of \Rpackage{ChAMP} functions}
\noindent As previously mentioned, the user has the option to run each of the \Rpackage{ChAMP} functions individually to allow integration with other pipelines or to save the results of each step. Here each function is described in detail.

\subsection{Load}
\subsubsection{Load Data from idat files}
\noindent The champ.load() function utilises \Rpackage{minfi} to load data from .idat files. By default this loads data from the current working directory; this directory should contain all .idat files and a sample sheet. The sample sheet currently needs to be the .csv file that came with your results following hybridization. You can choose whether you want M-values or beta-values. For small datasets M-values are recommended \citep{Zhuang12}.

\noindent The \Rpackage{minfi} function used to load the data from the idat files automatically filters out the 65 SNP probes that were included on the chip as internal controls, which can be useful for identifying sample mix-ups. As such, before filtering for low-quality probes, a 450k bead array dataset will include 485,512 probes and an EPIC bead array dataset will include 867,531 probes.
<>=
myLoad$pd
@
%\subsubsection{Load Data from Genome Studio files}
\subsubsection{Filtering for failed probes}
\noindent By default \Rpackage{ChAMP} filters the data for detection p-value ($<$ 0.01). This utilises the \Rpackage{minfi} method for calculating the detection p-value, which differs from the method used in Genome Studio. A file failedSamples.txt is saved (see Figure \ref{fig:failed}) and also printed to the screen showing the fraction of failed probes per sample.
If any of these values is high ($>$0.05) you may want to consider removing that sample from the analysis and rerunning.

\noindent By default \Rpackage{ChAMP} will filter out probes with $<$3 beads in at least 5\% of samples per probe. This default can be changed with the filterBeads parameter, or the frequency can be adjusted with the beadCutoff parameter.
\begin{figure}[!ht]%failedProbes
\centerline{\includegraphics[scale = 1]{failedProbes.jpg}}%{system.file("extdata","densityPlot.jpg",package="ChAMP")}}
\caption{An example of the output showing the portion of probes with a detection p-value above the specified cutoff (default is 0.01) for each sample. Users may want to consider removing samples above 0.05.}
\label{fig:failed}
\end{figure}
<>=
myLoad=champ.load(directory = testDir, filterBeads=TRUE)
@
\subsubsection{Output}
\noindent The load function saves 3 quality control images (see Figures \ref{fig:density}, \ref{fig:mds} and \ref{fig:cluster}). The clustering image will not be saved if there are more than 65 samples in the dataset.
\begin{figure}[!ht]%densityPlot
\centerline{\includegraphics[scale = 0.55]{densityPlot.jpg}}%{system.file("extdata","densityPlot.jpg",package="ChAMP")}}
\caption{An example of a density plot showing the density of unnormalised beta values for all the samples that have been uploaded. They are coloured based on the Sample Group as defined in the sample sheet (Figure \ref{fig:sampleSheetexample}). This plot may identify potential outliers that have significantly different profiles.}
\label{fig:density}
\end{figure}
\begin{figure}[!ht]%sampleSheet
\centerline{\includegraphics[scale = 0.55]{sampleSheetexample.jpg}}%{system.file("extdata","sampleSheetexample.jpg",package="ChAMP")}}
\caption{An example of the sample sheet.}
\label{fig:sampleSheetexample}
\end{figure}
\begin{figure}[!ht]%mdsPlot
\centerline{\includegraphics[scale = 0.55]{mdsPlot.jpg}}%{system.file("extdata","mdsPlot.jpg",package="ChAMP")}}
\caption{An example of an MDS (multidimensional scaling) plot. This plot allows a visualisation of the similarity of samples based on the top 1000 most variable probes amongst all samples. The samples are coloured by Sample\_Group (as defined in the sample sheet, Figure \ref{fig:sampleSheetexample}).}
\label{fig:mds}
\end{figure}
\begin{figure}[!ht]%clusterPlot
\centerline{\includegraphics[scale = 0.55]{rawSampleCluster.jpg}}%{system.file("extdata","rawSampleCluster.jpg",package="ChAMP")}}
\caption{An example of a cluster plot. This offers a second way to visualise the similarity of samples based on all probes using hierarchical clustering.}
\label{fig:cluster}
\end{figure}
\subsubsection{Usage}
<>=
myLoad = champ.load(directory=testDir)
@

\subsection{Normalization for adjustment of type-2 bias}
There are several normalization methods available: BMIQ \citep{Teschendorff13}, SWAN \citep{Maksimovic12}, PBC \citep{Dedeurwaerder11} or NONE.\\
\noindent The default method is BMIQ. It will save three quality control images to the folder '/Normalization' for each sample (see Figures \ref{fig:type1}, \ref{fig:type2} and \ref{fig:check}). These images show the fit that the normalization is applying to each probe type.

\noindent After normalization a second set of quality control images will be saved, similar to pre-normalization (see Figures \ref{fig:density}, \ref{fig:mds} and \ref{fig:cluster}). The clustering image will not be saved if there are more than 65 samples in the dataset.
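\noindent As a minimal sketch, an alternative normalization method can be selected with an argument to champ.norm(); the argument name \Robject{norm} used below is an assumption based on this version's conventions (it has changed between releases), so please verify it with ?champ.norm.
<<eval=FALSE>>=
# A sketch only: the argument name is an assumption -- see ?champ.norm.
myNorm <- champ.norm(norm = "SWAN")  # one of "BMIQ" (default), "SWAN", "PBC", "NONE"
@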
\begin{figure}[!ht]%type1
\centerline{\includegraphics[scale = 0.55]{type1fit.jpg}}%{system.file("extdata","type1fit.jpg", package="ChAMP")}}
\caption{An example of the Type 1 fit with BMIQ.}
\label{fig:type1}
\end{figure}
\begin{figure}[!ht]%type2
\centerline{\includegraphics[scale = 0.55]{type2fit.jpg}}%{system.file("extdata","type2fit.jpg",package="ChAMP")}}
\caption{An example of the Type 2 fit with BMIQ.}
\label{fig:type2}
\end{figure}
\begin{figure}[!ht]%check
\centerline{\includegraphics[scale = 0.55]{checkBMIQ.jpg}}%{system.file("extdata","checkBMIQ.jpg",package="ChAMP")}}
\caption{An overview of both the Type 1 and Type 2 fit achieved with BMIQ.}
\label{fig:check}
\end{figure}
\subsubsection{Usage}
<>=
myNorm=champ.norm()
@

\subsection{SVD for identification of components of variation (batch effects)}
\noindent The singular value decomposition (SVD) method implemented by Teschendorff \citep{Teschendorff09} for 27k data is used to identify the most significant components of variation. These components of variation would ideally be biological factors of interest, but they may be technical variation (batch effects). To get the most from this analysis it is useful to include as much information as possible. If samples have been loaded from .idat files then the 18 internal controls on the bead chip (including bisulphite conversion efficiency) are included, as well as any study details defined in the sample sheet (well position, sentrix id, batch etc.) (Figure \ref{fig:sampleSheetexample}). Additional phenotype/study information (age, gender, batch) can be included in a separate file (set studyInfo=TRUE) (Figure \ref{fig:studyInfo}). This must be a .txt file saved as studyInfo.txt, and the first column must be labelled ``Sample\_Name'' with the sample names as used in the sample sheet. These new covariates will be considered categorical by default. If any are continuous it is necessary to include a vector in the parameter infoFactor=c() that defines them: TRUE = categorical, FALSE = continuous. These settings will print to the screen after the SVD analysis has been run to confirm they were correctly assigned.

\noindent It is important to realise that SVD \emph{does not} manipulate the data, but analyses it to compute p-values based on the significance of each component. The result is a heatmap (saved as SVDsummary.pdf) of the top 6 principal components correlated to the information provided (Figure \ref{fig:SVD}). The darker colours represent a lower p-value, indicating a larger component of variation. If it becomes clear from this SVD analysis that the largest components of variation are technical factors (batch effects), then it is worth considering the experimental design and implementing other normalization methods that may help remove technical variation. ComBat is included in this pipeline to remove variation related to chip number, but it or other methods may be implemented independently to remove other sources of technical variation revealed in the SVD analysis.
\begin{figure}[!ht]%study info
\centerline{\includegraphics[scale=0.9]{studyInfo.jpg}}%{system.file("extdata","studyInfo.jpg",package="ChAMP")}}
\caption{An example of the study info file. This file must include the first column ``Sample\_Name'' and be identical to the first column of the sample sheet (Figure \ref{fig:sampleSheetexample}).
It should be saved as studyInfo.txt and can include any additional study information, including gender, age, batch, cell type, patient ID etc.}
\label{fig:studyInfo}
\end{figure}
\begin{figure}[!ht]%SVD
\centerline{\includegraphics[scale = 0.55]{SVDsummary.pdf}}%{system.file("extdata","SVDsummary.pdf",package="ChAMP")}}
\caption{An example of the SVD output. This heatmap shows all the components of variation that have been included in the analysis. The first 18 are internal controls on the bead chip representing potential technical variation. The remaining covariates are obtained from data in the sample sheet (Figure \ref{fig:sampleSheetexample}) and the study info file (Figure \ref{fig:studyInfo}).}
\label{fig:SVD}
\end{figure}
\subsubsection{Usage}
<>=
champ.SVD()
@

\subsection{Correction for batch effects related to bead chip using ComBat}
\noindent This function implements the ComBat normalization method \citep{Johnson07} that was developed for microarrays. The \Rpackage{sva} R package is used to implement this function. ComBat is specifically defined in the \Rpackage{ChAMP} package to correct for batch effects related to the slide (Sentrix\_ID). More advanced users can implement ComBat using the \Rpackage{sva} documentation to adjust for other batch effects.
\subsubsection{Procedure}
\noindent ComBat uses an empirical Bayes method to correct for technical variation. If ComBat were applied directly to beta values, zeros could be introduced, as beta values have a range between 0 and 1. For this reason \Rpackage{ChAMP} logit-transforms beta values before the ComBat adjustment and then computes the reverse logit transformation afterwards. If the user has chosen to use M-values no logit transformation will be done. If the dataset only includes data from one slide the adjustment will abort. As the CNA method utilises intensity values rather than beta or M-values, the ComBat adjustment will be repeated for that function. The ComBat function can be a time-consuming step in the pipeline if you have a large number of samples.
\subsubsection{Usage}
<>=
batchNorm=champ.runCombat()
@

\subsection{Calling MVPs - Methylation Variable Positions}
\noindent This function implements the \Rpackage{limma} package to calculate the p-value for differential methylation using a linear model. At present, the function can only compute differential methylation between two groups. The two groups are determined by the Sample\_Group column in the sample sheet. If more than two groups are present the function will calculate the differential methylation between the first two unique groups in the file. As the function is running it will print which two groups the contrast is being computed for and will then print out how many MVPs were found for the given p-value. All probes are saved in a file ``MVP\_ALL.txt'' which can be filtered for a p-value of interest. The filename also includes the two groups compared and the method used to adjust the p-values (for example ``MVP\_ALL\_CvsT\_BHadjust.txt''). The file includes the annotation for each probe, the average beta (for the sample group) and the delta beta for the two groups used in the comparison. In addition, a bedfile is saved with only the significant MVPs. This may be useful for downstream pathway analysis. It is planned that a future version of \Rpackage{ChAMP} will include the ability to compare multiple groups. However, a more advanced user can run \Rpackage{limma} with other settings and use the output (including all probes and their p-values) for the DMR hunter.
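\noindent The following chunk is a minimal sketch of adjusting the significance cutoff and the p-value adjustment method; the argument names \Robject{adjPVal}, \Robject{adjust.method} and \Robject{bedFile} are assumptions based on this version's conventions and should be checked against ?champ.MVP.
<<eval=FALSE>>=
# A sketch only: argument names are assumptions -- see ?champ.MVP.
limma <- champ.MVP(adjPVal       = 0.05,  # cutoff for significant MVPs
                   adjust.method = "BH",  # Benjamini-Hochberg adjustment
                   bedFile       = TRUE)  # also save significant MVPs as a bedfile
@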
%bedfile
\subsubsection{Usage}
<>=
limma=champ.MVP()
head(limma)
@

\subsection{DMR Hunter - Bumphunter and ProbeLasso}
\noindent This function computes and returns a data frame of probes binned into discrete differentially methylated regions (DMRs), with an accompanying p-value. There are two algorithms integrated to do this job: Bumphunter and ProbeLasso. The Bumphunter algorithm first groups all probes into small clusters, then applies a random permutation method to estimate candidate DMRs. This method is very user-friendly and does not rely on the output of any previous function. The permutation step in Bumphunter may be somewhat slow, but users may assign more cores to accelerate it by running more threads in parallel. The result of the Bumphunter algorithm is a data frame containing all detected DMRs, annotated with their length, cluster, number of CpGs and so on. In addition, Bumphunter will return the annotation of all CpGs contained in each DMR.\\

\noindent For the ProbeLasso method, the final data frame is distilled from a \Rpackage{limma} output of probes and their association statistics. It additionally writes three informative images as pdfs: 1) a box plot of probe spacing for the input data as a function of genomic feature/CGI relation (see Figure \ref{fig:features}); 2) a cumulative quantile plot illustrating how the user-specified parameters affect the sizes of all windows employed for DMR-calling (see Figure \ref{fig:dmr}); and 3) a dot plot with size-scaled dots to further illustrate the resulting window sizes for each genomic feature that follows from the user-specified parameters (see Figure \ref{fig:radius}).\\

\noindent Differentially methylated regions (DMRs) are extended, discrete segments of the genome that show a quantitative alteration in DNA methylation levels between two groups. A number of different approaches to the identification of DMRs have been implemented, most notably ``windows'' in which differential methylation is sought over a stretch of DNA of pre-defined - and often fixed - size. This approach is utilitarian and effective but is confounded by the fluctuating CpG density throughout a genome; it is further limited when using microarrays, in that coverage is often incomplete and non-uniform. Consequently, DMR identification is likely to be biased towards regions of densely tiled probes. To compensate for this, the ProbeLasso DMR Hunter is based on a feature-oriented dynamic window (``lasso''), which aims to capture neighbouring, significant probes and bundle them into DMRs.\\

\subsubsection{Procedure}
\noindent The Bumphunter algorithm relies on no previous output. First, Bumphunter clusters all probes into small clusters; the cluster size and the distances between CpGs can be specified by the user. Bumphunter then filters out all clusters that contain too few CpGs; the default threshold is 7, but this can also be specified by the user. After filtering, Bumphunter selects candidate DMRs based on each probe's differential t-statistic between the two groups of samples and its location within a cluster. Finally, a random permutation technique is used to assess each candidate DMR, and a p-value is returned. For more information, users may consult the \Rpackage{bumphunter} package.
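\noindent The chunk below is a minimal sketch of the procedure above with its thresholds made explicit; the argument names (\Robject{maxGap}, \Robject{minProbes}, \Robject{B} and \Robject{cores}) are assumptions based on typical champ.DMR()/\Rpackage{bumphunter} interfaces and should be verified with ?champ.DMR.
<<eval=FALSE>>=
# A sketch only: argument names are assumptions -- see ?champ.DMR.
myDMR <- champ.DMR(method    = "Bumphunter",
                   maxGap    = 300, # max gap (bp) between CpGs within one cluster
                   minProbes = 7,   # discard clusters with fewer CpGs (default 7)
                   B         = 250, # number of random permutations
                   cores     = 4)   # parallel threads for the permutation step
@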
\begin{figure}[!ht]
\centerline{\includegraphics[scale=1]{DMRsample.pdf}}%{system.file("extdata","radius.jpg",package="ChAMP")}}
\caption{An example of a DMR detected by the Bumphunter algorithm via the champ.DMR() function. This DMR is located on chromosome 7. In total there are 49 CpGs in this cluster, but the actual DMR runs from index 18 to the end of the cluster, making this a DMR of 32 probes. The DMR probes are marked as blue circles. Cases and controls are plotted separately in red and black, and a loess line is drawn through each group's points. Clearly, the last part of the cluster is a DMR between cases and controls.}
\label{fig:DMRsample}
\end{figure}
\noindent For the ProbeLasso algorithm, because the ProbeLasso DMR Hunter takes into account probe spacing, the first step to identify DMRs is to calculate a dataset-specific record of probe spacing. Although it is tempting to calculate probe spacing based on all probes on the Infinium 450k or EPIC BeadChip, the Probe Lasso DMR Hunter requires input for a number of user-specified parameters (e.g., filtering of X-chromosome probes and putatively polymorphic probes) that can truncate the dataset; additionally, probe failures are inevitable and will differ between experiments. As such, each dataset is bestowed a unique set of probes, which underscores the need for an experiment-specific catalogue of probe spacing. Once the dataset has been defined, the nearest neighbouring probe is calculated for all available probes; these data are then partitioned into one of 28 different categories comprising information on the genomic feature (1st exon, 3$\prime$UTR, 5$\prime$UTR, body, intergenic region, TSS200, or TSS1500) and that feature's relation (if any) to a nearby CpG island (island, shore, shelf, or not associated).

\noindent Figure \ref{fig:features} illustrates the wildly variable probe spacing as a function of genomic feature on the Infinium 450k BeadChip. Data were derived from an experiment in which the X-chromosome was omitted, polymorphic probes were included, and $>$98\% of probes were detected across all samples.
\begin{figure}[!ht]
\centerline{\includegraphics[scale=0.25]{probeFeatures.jpg}}%{system.file("extdata","probeFeatures.jpg",package="ChAMP")}}
\caption{A study-specific plot showing the number of probes found in each feature. Note the different spacing on the 450k or EPIC array between neighbouring probes for different features (1st exon, 3$\prime$UTR, 5$\prime$UTR, gene body, intergenic region, transcription start sites) and their relation to the nearest CpG island (in the island, in a shore, in a shelf, not associated). Intergenic regions, 3$\prime$UTRs and gene bodies with no CGI relation are very sparsely spaced, whereas probes in the 1st exon and TSSs are very closely spaced.}
\label{fig:features}
\end{figure}
\noindent The user then specifies a maximum (or minimum) lasso size (bp) and the Probe Lasso DMR Hunter determines which feature/CGI category fulfils this criterion first and what quantile of the relevant feature/CGI category it corresponds to. This quantile is applied to all feature/CGI categories to define feature/CGI category-specific lasso sizes (see Figures \ref{fig:dmr} and \ref{fig:radius}). Note that a minimum lasso size would rarely be appropriate.
\begin{figure}[!ht]
\centerline{\includegraphics[scale=0.25]{DMR.jpg}}%{system.file("extdata","DMR.jpg",package="ChAMP")}}
\caption{Defining the lasso for each feature type.
This illustrates how a user-specified minimum lasso of 10bp sets the quantile for all feature/CGI relation-specific lassos to $\sim$48\%. These lassos are then cast out (centred on each probe) to capture a minimum number of significant probes. Probes that fulfil this are set aside.}
\label{fig:dmr}
\end{figure}
\begin{figure}[!ht]
\centerline{\includegraphics[scale=0.25]{radius.jpg}}%{system.file("extdata","radius.jpg",package="ChAMP")}}
\caption{This image shows the lasso radius for each feature.}
\label{fig:radius}
\end{figure}
\noindent Next, operating solely on the significant probes in the dataset, an appropriately-sized lasso is thrown around each probe; if the lasso captures a user-specified number of significant probes (including itself), that probe (and the probes captured by its lasso) is set aside - in essence, a ``mini-DMR''. Next, the ProbeLasso DMR Hunter attempts to close up gaps between neighbouring mini-DMRs whose lasso boundaries are either overlapping or less than a user-specified distance apart (e.g., 1000bp), effectively defining a ``final DMR''. The coordinates for each DMR are calculated by throwing an appropriately-sized lasso around each probe in the DMR and taking the minimum and maximum genomic coordinates.

\noindent Finally, the ProbeLasso DMR Hunter pulls back all probes within the DMR coordinates and returns them as a data frame containing the genomic annotation and association statistics of each MVP. Additionally, a p-value for each DMR is calculated using Stouffer's method. This method combines the p-values of individual probes within a DMR by weighting them according to the underlying correlation structure of methylation scores between probes; p-values of probes whose methylation scores are correlated are down-weighted, while independent probes are not penalised. The concept of the probe lasso is visualized as a cartoon in Figure \ref{fig:lasso}.
\begin{figure}[!ht]
\centerline{\includegraphics[scale=0.5]{lasso.jpg}}%{system.file("extdata","lasso.jpg",package="ChAMP")}}
\caption{The details of calling a DMR with the ProbeLasso method. This shows the application of feature/CGI relation-specific lassos to the dataset. In the above example, six probes are `successful' in capturing the minimum number of probes (e.g., 3) in their lasso; however, because there is less than (the user-specified) 1kb separating the first and last `successful' probes, the DMR boundaries are defined between these (+/- their lassos!). Note that `unsuccessful' probes are now included in the DMR.}
\label{fig:lasso}
\end{figure}
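\noindent The chunk below is a minimal sketch of how the lasso behaviour described above might be tuned; the argument names (\Robject{lassoStyle}, \Robject{lassoRadius}, \Robject{minSigProbesLasso} and \Robject{minDmrSep}) are assumptions based on this era of the package and should be verified with ?champ.DMR.
<<eval=FALSE>>=
# A sketch only: argument names are assumptions -- see ?champ.DMR.
myDMR <- champ.DMR(resultFiles       = limma,
                   method            = "ProbeLasso",
                   lassoStyle        = "max",  # treat lassoRadius as a maximum size
                   lassoRadius       = 2000,   # lasso size (bp) that sets the quantile
                   minSigProbesLasso = 3,      # significant probes a lasso must capture
                   minDmrSep         = 1000)   # max gap (bp) when merging mini-DMRs
@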
\subsubsection{Output}
%DMR output file
\begin{figure}[!ht]
\centerline{\includegraphics[scale=0.5]{DMRoutput.jpg}}%{system.file("extdata","DMRoutput.jpg",package="ChAMP")}}
\caption{This is an example of the champ.DMR() output from the ProbeLasso algorithm that is saved if the dataset has produced a DMR list. The columns show the probeID, adjusted p-value, chromosome, map info, chromosome arm, feature relation, SNP allele frequency for the forward strand for the selected population (pol.af.f), SNP allele frequency for the reverse strand for the selected population (pol.af.r), the distance in base pairs to the next nearest significant probe (nrst.probe), the size in base pairs of the lasso used to capture this DMR (lasso.radii), the DMR number (dmr.no), the DMR start location (dmr.start), the DMR end location (dmr.end), the DMR size (dmr.size) and the p-value of significance for the DMR (dmr.p).}
\label{fig:DMRoutput}
\end{figure}
%bedfile
%DMR output file
\begin{figure}[!ht]
\centerline{\includegraphics[scale=0.3]{Bumphunteroutput.png}}%{system.file("extdata","DMRoutput.jpg",package="ChAMP")}}
\caption{This is an example of the champ.DMR() output from the Bumphunter algorithm that is saved if the dataset has produced a DMR list. The columns show each DMR's chromosome, start and end sites, average beta value, cluster index, start and end CpG indices, the number of CpG probes in the DMR, the total number of probes in the cluster, and the p-value calculated for the DMR.}
\label{fig:Bumphunteroutput}
\end{figure}
%bedfile
\subsubsection{Usage}
<>=
lasso <- champ.DMR(resultFiles=limma,method="ProbeLasso",arraytype="450K")
bump <- champ.DMR(method="Bumphunter",arraytype="450K")
if(!is.null(lasso))
{
    head(lasso)
}
@

\subsection{Copy Number Alterations}
\noindent This function uses the HumanMethylation450 data to identify copy number alterations \citep{Feber14}. The function utilises the intensity values for each probe to count copy number and determine if copy number alterations are present. Copy number is determined using the \Rpackage{DNAcopy} package.
\subsubsection{Procedure}
\noindent Intensity values are quantile normalized and, by default, the ComBat normalization is run to correct for batch effects related to the slide before the copy number is calculated.

\noindent Feber \citep{Feber14} compared the results obtained using this method to copy number data from Illumina CytoSNP arrays and Affymetrix SNP 6.0 arrays and found that using this method for 450k data was effective in identifying regions of gain and loss.
\begin{figure}[!ht]
\centerline{\includegraphics[scale = 0.55]{CNAtext.jpg}}%{system.file("extdata","CNAtext.jpg",package="ChAMP")}}
\caption{This is an example of the champ.CNA() output that is saved for each sample. The columns show the sample name, chromosome, segment start, segment end, the number of probes in the segment and the segment mean. Typically 0.3 is used as a cut-off for the segment mean to call gains and losses.}
\label{fig:cnatext}
\end{figure}
\begin{figure}[!ht]
\centerline{\includegraphics[scale = 0.55]{CNAimage.jpg}}%{system.file("extdata","CNAimage.jpg",package="ChAMP")}}
\caption{This image shows all the data for a single sample. Chromosomes are shown alternating green and black.}
\label{fig:cna}
\end{figure}
\subsubsection{Usage}
<>=
CNA=champ.CNA()
@

\subsection{Cell type heterogeneity correction}
\noindent While many methylation samples are collected from whole blood, their methylation values are actually a combination of signals from various cell types. This cell-type-specific methylation may obscure true epigenetic signals and true relationships between CpG sites and phenotypes, or may itself be the cause of some diseases.
Thus, many methods have been developed to deal with the cell heterogeneity problem, such as RefbaseEWAS and RefFreeEWAS, both of which are incorporated in \Rpackage{ChAMP}.\\

\noindent The RefbaseEWAS method is similar to regression calibration: it infers changes in the distribution of white blood cells between different subpopulations (e.g. cases and controls) using DNA methylation signatures, in combination with a previously obtained external validation set consisting of signatures from purified leukocyte samples.\\

\noindent We integrate another cell-type heterogeneity correction method, RefFreeEWAS, in \Rpackage{ChAMP}. The new RefFreeEWAS method is similar to non-negative matrix factorization, with additional constraints and additional utilities. This function is specifically suitable for tissue datasets such as placenta, saliva, adipose or tumour tissue. The method is closely related to surrogate variable analysis (SVA) and, like SVA, relies on a simple projection based on singular value decomposition (SVD). In SVA, the residuals of a linear model are decomposed into a factor-analytic structure and the factors are used subsequently in a regression model, with iteration resulting in a final set of surrogate variables.\\

\subsubsection{Procedure}
\noindent We have prepared two purified-cell-type methylation references, one for 27k and the other for 450k. Users may use either of them for cell heterogeneity correction, even for the EPIC bead array, because the first step of the function is to extract the CpG sites common to the reference dataset and the input beta-value dataset. Quadratic programming is then applied to estimate the cell proportions of each sample, after which linear regression is applied to correct for heterogeneity based on each sample's cell proportions.\\

\noindent Note that although the RefbaseEWAS method returns fairly accurate cell proportions, it can ONLY be applied to whole blood samples, which contain the cell types present in the reference. The champ.refbase() function returns a cell-type heterogeneity-corrected beta matrix and the cell-type-specific proportions of each sample.\\

\noindent On the other hand, the RefFreeEWAS method requires two parameters in the champ.reffree() function. One is the covariate matrix of the input dataset, which may be a matrix or a vector containing the phenotypes of each sample. The other important parameter is the number of latent variables K, which represents the number of latent cell types mixed into the dataset. If users do not know the correct number of latent variables (cell types), they may ignore this parameter, and champ.reffree() will apply Random Matrix Theory from the \Rpackage{isva} package to estimate the number of latent variables.\\

\subsubsection{Usage}
<>=
refbase <- champ.refbase()
reffree <- champ.reffree()
@

\section{Further analysis}
\noindent Additional options for downstream analysis are available using other packages. It is recommended that DMRs be investigated using pathway analysis software and ENCODE data. In addition, data from previously published 450k datasets can be downloaded using \Rpackage{marmal-aid} and combined with study data for meta-analysis.\\

\bibliographystyle{natbib}
\bibliography{450k}

\end{document}