--- title: "derfinder advanced details and usage" author: "L Collado-Torres" date: "`r doc_date()`" package: "`r pkg_ver('derfinder')`" output: BiocStyle::html_document vignette: > %\VignetteIndexEntry{derfinder advanced details and usage} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- `r Biocpkg('derfinder')` advanced details and usage ====================================== ```{r vignetteSetup, echo=FALSE, message=FALSE, warning = FALSE} ## Track time spent on making the vignette startTime <- Sys.time() ## Bib setup library('knitcitations') ## Load knitcitations with a clean bibliography cleanbib() cite_options(hyperlink = 'to.doc', citation_format = 'text', style = 'html') # Note links won't show for now due to the following issue # https://github.com/cboettig/knitcitations/issues/63 ## Write bibliography information bibs <- c(knitcitations = citation('knitcitations'), derfinder = citation('derfinder')[1], BiocStyle = citation('BiocStyle'), knitr = citation('knitr')[3], rmarkdown = citation('rmarkdown'), brainspan = RefManageR::BibEntry(bibtype = 'Unpublished', key = 'brainspan', title = 'Atlas of the Developing Human Brain [Internet]. Funded by ARRA Awards 1RC2MH089921-01, 1RC2MH090047-01, and 1RC2MH089929-01.', author = 'BrainSpan', year = 2011, url = 'http://developinghumanbrain.org'), originalder = citation('derfinder')[2], R = citation(), IRanges = citation('IRanges'), devtools = citation('devtools'), testthat = citation('testthat'), GenomeInfoDb = citation('GenomeInfoDb'), GenomicRanges = citation('GenomicRanges'), ggplot2 = citation('ggplot2'), biovizBase = citation('biovizBase'), bumphunter = citation('bumphunter')[1], TxDb.Hsapiens.UCSC.hg19.knownGene = citation('TxDb.Hsapiens.UCSC.hg19.knownGene'), AnnotationDbi = citation('AnnotationDbi'), BiocParallel = citation('BiocParallel'), derfinderHelper = citation('derfinderHelper')[1], GenomicAlignments = citation('GenomicAlignments'), GenomicFeatures = citation('GenomicFeatures'), GenomicFiles = citation('GenomicFiles'), Hmisc = citation('Hmisc'), qvalue = citation('qvalue'), Rsamtools = citation('Rsamtools'), rtracklayer = citation('rtracklayer'), S4Vectors = citation('S4Vectors'), bumphunterPaper = RefManageR::BibEntry(bibtype = 'article', key = 'bumphunterPaper', title = 'Bump hunting to identify differentially methylated regions in epigenetic epidemiology studies', author = 'Jaffe, Andrew E and Murakami, Peter and Lee, Hwajin and Leek, Jeffrey T and Fallin, M Daniele and Feinberg, Andrew P and Irizarry, Rafael A', year = 2012, journal = 'International Journal of Epidemiology') ) write.bibtex(bibs, file = 'derfinderAdvRef.bib') bib <- read.bibtex('derfinderAdvRef.bib') ## Assign short names names(bib) <- names(bibs) ``` # Overview This vignette explains in more detail the relationship between the different functions in `r Biocpkg('derfinder')` `r citep(bib[['derfinder']])` as well as add-on packages `r Biocpkg('derfinderPlot')` ([vignette](http://lcolladotor.github.io/derfinderPlot/)), `r Biocpkg('derfinderHelper')` ([vignette](http://lcolladotor.github.io/derfinderHelper/)) and `r Biocpkg('regionReport')` ([vignette](http://lcolladotor.github.io/regionReport/)). The vignette also includes some bash scripts for running `r Biocpkg('derfinder')` in a cluster, although there are probably other ways to do so using only R scripts. For example, by using `BatchJobsParam()` from `r Biocpkg('BiocParallel')`. This vignette assumes that you have read through the introductory vignette. Lets start by loading the `r Biocpkg('derfinder')` package. ```{r 'start', message=FALSE} ## Load libraries library('derfinder') ``` If you explore the package, you will notice that all exported functions use camel case name and have aliases with underscore names. For example, `analyzeChr()` and `analyze_chr()`. You should use whichever naming style you prefer. Regrettably, implementing the same flexibility at the argument naming level is not straightforward. # Advanced arguments Just like it is shown in the introductory vignette, you can find more information about the advanced arguments using `advancedArg()`. ```{r 'advancedArg'} ## URLs to advanced arguemtns sapply(c('analyzeChr', 'loadCoverage'), advancedArg, browse = FALSE) ## Set browse = TRUE if you want to open them in your browser ``` These arguments are options we expect not all users will want to change and which would otherwise make the help pages confusing to them. However, all arguments are documented in [roxygen2](http://cran.r-project.org/web/packages/roxygen2/index.html) style in the source code. Furthermore, note that using the `...` argument allows you to specify some of the documented arguments. For example, you might want to control the `maxClusterGap` from `findRegions()` in the `analyzeChr()` call. # Non-human data If you are working with data from an organism that is not _Homo sapiens_, then set the global options defining the `species` and the `chrsStyle` used. For example, if you are working with _Arabidopsis Thaliana_ and the _NCBI_ naming style, then set the options using the following code: ```{r 'exampleNameStyle', eval = FALSE} ## Set global species and chrsStyle options options(species = 'arabidopsis_thaliana') options(chrsStyle = 'NCBI') ## Then proceed to load and analyze the data ``` Internally `r Biocpkg('derfinder')` uses `extendedMapSeqlevels()` to use the appropriate chromosome naming style given a species in all functions involving chromosome names. Further note that the argument `subject` from `analyzeChr()` is passed to `bumphunter::annotateNearest(subject)`. So if you are using a genome different from __hg19__ remember to provide the appropriate annotation data or simply use `analyzeChr(runAnnotation = FALSE)`. You might find the discussion [Using bumphunter with non-human genomes](https://support.bioconductor.org/p/62781/) useful. # Advanced loading data ## Controlling loading from BAM files If you are loading data from BAM files, you might want to specify some criteria to decide which reads to include or not. For example, your data might have been generated by a strand-specific protocol. You can do so by specifying the arguments of `scanBamFlag()` from `r Biocpkg('Rsamtools')`. You can also control whether to include or exclude bases with `CIGAR` string `D` (deletion from the reference) by setting the advanced argument `drop.D = TRUE` in your `fullCoverage()` or `loadCoverage()` call. ## Unfiltered base-level coverage Note that in most scenarios, the `fullCov` object illustrated in the introductory vignette can be large in memory. When making plots or calculating the region level coverage, we don't need the full information. In such situations, it might pay off to create a smaller version by loading only the required data. This can be achieved using the advanced argument `which` to `fullCoverage()` or `loadCoverage()`. However, it is important to consider that when reading the data from BAM files, a read might align partially inside the region of interest. By default such a read would be discarded and thus the base-level coverage would be lower than what it is in reality. The advanced argument `protectWhich` extends regions by 30 kbp (15 kbp each side) to help mitigate this issue. We can illustrate this issue with the example data from `r Biocpkg('derfinder')`. First, we load in the data and generate some regions of interest. ```{r 'runExample', bootstrap.show.output=FALSE} ## Find some regions to work with example('loadCoverage', 'derfinder') example('getRegionCoverage', 'derfinder') ``` Next, we load the coverage again using `which` but without any padding. We can see how the coverage is not the same by looking at the maximum coverage for each sample. ```{r 'loadWhich'} ## Illustrate reading data from a set of regions test <- loadCoverage(files = files, chr = '21', cutoff = NULL, which = regions, protectWhich = 0, fileStyle = 'NCBI') ## Some reads were ignored and thus the coverage is lower as can be seen below: sapply(test$coverage, max) - sapply(genomeDataRaw$coverage, max) ``` When we re-load the data using some padding to the regions, we find that the coverage matches at all the bases. ```{r 'loadWhich2'} ## Illustrate reading data from a set of regions test2 <- loadCoverage(files = files, chr = '21', cutoff = NULL, which = regions, protectWhich = 3e4, fileStyle = 'NCBI') ## Adding some padding to the regions helps get the same coverage identical(sapply(test2$coverage, max), sapply(genomeDataRaw$coverage, max)) ## A more detailed test reveals that the coverage matches at every base all(mapply(function(x, y) { identical(x, y) }, test2$coverage, genomeDataRaw$coverage)) ``` How much padding you need to use will depend on your specific data set, and you might be comfortable getting approximately the same coverage values for the sake of greatly reducing the memory resources needed. ## Input files in a different naming style If you are under the case where you like to use a specific chromosome naming style but the raw data files use another one, you might need to use the `fileStyle` argument. For example, you could be working with _Homo sapiens_ data and your preferred naming style is _UCSC_ (chr1, chr2, ..., chrX, chrY) but the raw data uses _NCBI_ style names (1, 2, ..., X, Y). In that case, use `fullCoverage(fileStyle = 'NCBI')` or `loadCoverage(fileStyle = 'NCBI')` depending if you are loading one chromosome or multiple at a time. ## Loading data in chunks If you prefer to do so, `fullCoverage()` and `loadCoverage()` can load the data of a chromosome in chunks using [GenomicFiles](http://www.bioconductor.org/packages/release/bioc/html/GenomicFiles.html). This is controlled by whether you specify the `tilewidth` argument. Notice that you might run into slight coverage errors near the borders of the tiles for the same reason that was illustrated previously when loading specific regions. This approach is not necessarily more efficient and can be significantly time consuming if you use a small `tilewidth`. # Flow charts ## DER analysis flow chart The following figure illustrates how most of `r Biocpkg('derfinder')`'s functions interact when performing a base-level differential expression analysis by calculating base-level F-statistics.
DER flow
Flow chart of the different processing steps (black boxes) that can be carried out using `r Biocpkg('derfinder')` and the functions that perform these actions (in red). Input and output is shown in green boxes. Functions in blue are those applied to the results from multiple chromosomes (`mergeResults()` and `derfinderReport`). `r Biocpkg('regionReport')` functions are shown in orange while `r Biocpkg('derfinderPlot')` functions are shown in dark purple. Purple dotted arrow marks functions that require unfiltered base-level coverage. ## `analyzeChr()` flow chart
analyzeChr flow
This figure shows in more detail the processing flow in `analyzeChr()`, which is the main function for identifying candidate differentially expressed regions (DERs) from the base-level F-statistics. Many fine-tunning arguments can be passed to `analyzeChr()` to feed into the other functions. For example, you might want to use a smaller `chunksize` when pre-processing the coverage data (the default is 5 million): specially if you have hundreds of samples. Another useful argument is `scalefac` (by default it's 32) which controls the scaling factor to use before the log2 transformation. Furthermore, you might want to specify `maxClusterGap` to control the maximum gap between two regions before they are considered to be part of the same cluster. ## `regionMatrix()` flow chart
regionMatrix flow
The above figure shows the functions used internally by `regionMatrix()` and processing steps. Overall, it is much simpler than `analyzeChr()`. # Functions that use multiple cores Currently, the following functions can use multiple cores, several of which are called inside `analyzeChr()`. * `calculatePvalues()`: 1 core per chunk of data to process. * `calculateStats()`: 1 core per chunk of data to process. * `coerceGR()`: 1 core per chromosome. This function is used by `createBw()`. * `coverageToExon()`: 1 core per strand, then 1 core per chromosome. * `loadCoverage()`: up to 1 core per tile when loading the data with [GenomicFiles](http://www.bioconductor.org/packages/release/bioc/html/GenomicFiles.html). Otherwise, no parallelization is used. * `fullCoverage()`: 1 core per chromosome. In general, try to avoid using more than 10 cores as you might reach your maximum network speed and/or hard disk input/output seed. For the case described in `loadCoverage()`, you can specify how many cores to use per chromosome for the tiles using the `mc.cores.load` argument effectively resulting in `mc.cores` times `mc.cores.load` used (otherwise it's `mc.cores` squared). * `getRegionCoverage()`: 1 core per chromosome. * `regionMatrix()`: 1 core per chromosome. All parallel operations use `SnowParam()` from `r Biocpkg('BiocParallel')` when more than 1 core is being used. Otherwise, `SerialParam()` is used. Note that if you prefer to specify other types of parallelization you can do so by specifying the `BPPARAM.custom` advanced argument. Because `SnowParam()` requires `R` to load the necessary packages on each worker, the key function `fstats.apply()` was isolated in the `r Biocpkg('derfinderHelper')` package. This package has much faster loading speeds than `r Biocpkg('derfinder')` which greatly impacts performance on cases where the actual step of calculating the F-statistics is fast. You may prefer to use `MulticoreParam()` described in the `r Biocpkg('BiocParallel')` vignette. In that case, when using these functions use `BPPARAM.custom = MulticoreParam(workers = x)` where `x` is the number of cores you want to use. Note that in some systems, as is the case of the cluster used by `r Biocpkg('derfinder')`'s developers, the system tools for assessing memory usage can be misleading, thus resulting in much higher memory loads when using `MulticoreParam()` instead of the default `SnowParam()`. # Project organization For each project, we recommend the following organization. ``` ProjectDir |-derCoverageInfo |-derAnalysis |---analysis01 |---analysis02 ``` That is, a main project directory with two initial directories. One for storing the coverage data, and one for storing each analysis: you might explore different models, cutoffs or other parameters. You can then use `fullCoverage()`, save the result and also save the filtered coverage information for each chromosome separately. Doing so will result in the following structure. ``` ProjectDir |-derCoverageInfo |---Chr1Cov.Rdata |---Chr2Cov.Rdata ... |---ChrYCov.Rdata |---fullCov.Rdata |-derAnalysis |---analysis01 |---analysis02 ``` Next, you can use `analyzeChr()` for each of the chromosomes of a specific analysis (say _analysis01_). Doing so will create several Rdata files per chromosome as shown below. `bash` scripts can be useful if you wish to submit one cluster job per chromosome. In general, you will use the same model and group information for each chromosome, so saving the information can be useful. ``` ProjectDir |-derCoverageInfo |---Chr1Cov.Rdata |---Chr2Cov.Rdata ... |---ChrYCov.Rdata |---fullCov.Rdata |-derAnalysis |---analysis01 |-----models.Rdata |-----groupInfo.Rdata |-----chr1/ |-------chunksDir/ |-------logs/ |-------annotation.Rdata |-------coveragePrep.Rdata |-------fstats.Rdata |-------optionsStats.Rdata |-------regions.Rdata |-------timeinfo.Rdata |-----chr2/ ... |-----chrY/ |---analysis02 ``` Then use `mergeResults()` to pool together the results from all the chromosomes for a given analysis (here _analysis01_). ``` ProjectDir |-derCoverageInfo |---Chr1Cov.Rdata |---Chr2Cov.Rdata ... |---ChrYCov.Rdata |---fullCov.Rdata |-derAnalysis |---analysis01 |-----logs/ |-----fullAnnotatedRegions.Rdata |-----fullFstats.Rdata |-----fullNullSummary.Rdata |-----fullRegions.Rdata |-----fullTime.Rdata |-----optionsMerge.Rdata |-----chr1/ |-------chunksDir/ |-------logs/ |-------annotation.Rdata |-------coveragePrep.Rdata |-------fstats.Rdata |-------optionsStats.Rdata |-------regions.Rdata |-------timeinfo.Rdata |-----chr2/ ... |-----chrY/ |---analysis02 ``` Finally, you might want to use `derfinderReport()` from `r Biocpkg('regionReport')` to create a HTML report of the results. # bash scripts For interacting between `bash` and `R` we have found quite useful the [getopt](http://cran.r-project.org/web/packages/getopt/index.html) package. Here we include an example `R` script that is controlled by a `bash` script which submits a job for each chromosome to analyze for a given analysis. The two files, _derfinderAnalysis.R_ and _derAnalysis.sh_ should live under the _derAnalysis_ directory. ``` ProjectDir |-derCoverageInfo |---Chr1Cov.Rdata |---Chr2Cov.Rdata ... |---ChrYCov.Rdata |---fullCov.Rdata |-derAnalysis |---derfinder-Analysis.R |---derAnalysis.sh |---analysis01 |---analysis02 ``` Then, you can simply use: ```bash cd /ProjectDir/derAnalysis sh derAnalysis.sh analysis01 ``` to run `analyzeChr()` on all your chromosomes. ## derfinder-analysis.R ```{r 'derfinder-analysis', eval = FALSE} ## Run derfinder's analysis steps with timing info ## Load libraries library("getopt") ## Available at https://github.com/lcolladotor/derfinder library("derfinder") ## Specify parameters spec <- matrix(c( 'DFfile', 'd', 1, "character", "path to the .Rdata file with the results from loadCoverage()", 'chr', 'c', 1, "character", "Chromosome under analysis. Use X instead of chrX.", 'mcores', 'm', 1, "integer", "Number of cores", 'verbose' , 'v', 2, "logical", "Print status updates", 'help' , 'h', 0, "logical", "Display help" ), byrow=TRUE, ncol=5) opt <- getopt(spec) ## Testing the script test <- FALSE if(test) { ## Speficy it using an interactive R session and testing test <- TRUE } ## Test values if(test){ opt <- NULL opt$DFfile <- "/ProjectDir/derCoverageInfo/chr21Cov.Rdata" opt$chr <- "21" opt$mcores <- 1 opt$verbose <- NULL } ## if help was asked for print a friendly message ## and exit with a non-zero error code if (!is.null(opt$help)) { cat(getopt(spec, usage=TRUE)) q(status=1) } ## Default value for verbose = TRUE if (is.null(opt$verbose)) opt$verbose <- TRUE if(opt$verbose) message("Loading Rdata file with the output from loadCoverage()") load(opt$DFfile) ## Make it easy to use the name later. Here I'm assuming the names were generated using output='auto' in loadCoverage() eval(parse(text=paste0("data <- ", "chr", opt$chr, "CovInfo"))) eval(parse(text=paste0("rm(chr", opt$chr, "CovInfo)"))) ## Just for testing purposes if(test) { tmp <- data tmp$coverage <- tmp$coverage[1:1e6, ] library("IRanges") tmp$position[which(tmp$pos)[1e6 + 1]:length(tmp$pos)] <- FALSE data <- tmp } ## Load the models load("models.Rdata") ## Load group information load("groupInfo.Rdata") ## Run the analysis with lowMemDir analyzeChr(chr=opt$chr, coverageInfo=data, models=models, cutoffFstat=1e-06, cutoffType="theoretical", nPermute=1000, seeds=seq_len(1000), maxClusterGap=3000, groupInfo=groupInfo, subject="hg19", mc.cores=opt$mcores, lowMemDir=file.path(tempdir(), paste0("chr", opt$chr) , "chunksDir")), verbose=opt$verbose, chunksize=1e5) ## Done if(opt$verbose) { print(proc.time()) print(sessionInfo(), locale=FALSE) } ``` Remember to modify the the script to fit your project. ## derAnalysis.sh ```bash #!/bin/sh ## Usage # sh derAnalysis.sh analysis01 # Directories MAINDIR=/ProjectDir WDIR=${MAINDIR}/derAnalysis DATADIR=${MAINDIR}/derCoverageInfo # Define variables SHORT='derA-01' PREFIX=$1 # Construct shell files for chrnum in 22 21 Y 20 19 18 17 16 15 14 13 12 11 10 9 8 X 7 6 5 4 3 2 1 do echo "Creating script for chromosome ${chrnum}" if [[ ${chrnum} == "Y" ]] then CORES=2 else CORES=6 fi chr="chr${chrnum}" outdir="${PREFIX}/${chr}" sname="${SHORT}.${PREFIX}.${chr}" cat > ${WDIR}/.${sname}.sh <