--- title: "derfinder advanced details and usage" author: "L Collado-Torres" date: "`r doc_date()`" package: "`r pkg_ver('derfinder')`" output: BiocStyle::html_document vignette: > %\VignetteIndexEntry{derfinder advanced details and usage} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- `r Biocpkg('derfinder')` advanced details and usage ====================================== ```{r vignetteSetup, echo=FALSE, message=FALSE, warning = FALSE} ## Track time spent on making the vignette startTime <- Sys.time() ## Bib setup library('knitcitations') ## Load knitcitations with a clean bibliography cleanbib() cite_options(hyperlink = 'to.doc', citation_format = 'text', style = 'html') # Note links won't show for now due to the following issue # https://github.com/cboettig/knitcitations/issues/63 ## Write bibliography information bibs <- c(knitcitations = citation('knitcitations'), derfinder = citation('derfinder')[1], BiocStyle = citation('BiocStyle'), knitr = citation('knitr')[3], rmarkdown = citation('rmarkdown'), brainspan = RefManageR::BibEntry(bibtype = 'Unpublished', key = 'brainspan', title = 'Atlas of the Developing Human Brain [Internet]. Funded by ARRA Awards 1RC2MH089921-01, 1RC2MH090047-01, and 1RC2MH089929-01.', author = 'BrainSpan', year = 2011, url = 'http://developinghumanbrain.org'), originalder = citation('derfinder')[2], R = citation(), IRanges = citation('IRanges'), devtools = citation('devtools'), testthat = citation('testthat'), GenomeInfoDb = citation('GenomeInfoDb'), GenomicRanges = citation('GenomicRanges'), ggplot2 = citation('ggplot2'), biovizBase = citation('biovizBase'), bumphunter = citation('bumphunter')[1], TxDb.Hsapiens.UCSC.hg19.knownGene = citation('TxDb.Hsapiens.UCSC.hg19.knownGene'), AnnotationDbi = citation('AnnotationDbi'), BiocParallel = citation('BiocParallel'), derfinderHelper = citation('derfinderHelper')[1], GenomicAlignments = citation('GenomicAlignments'), GenomicFeatures = citation('GenomicFeatures'), GenomicFiles = citation('GenomicFiles'), Hmisc = citation('Hmisc'), qvalue = citation('qvalue'), Rsamtools = citation('Rsamtools'), rtracklayer = citation('rtracklayer'), S4Vectors = citation('S4Vectors'), bumphunterPaper = RefManageR::BibEntry(bibtype = 'article', key = 'bumphunterPaper', title = 'Bump hunting to identify differentially methylated regions in epigenetic epidemiology studies', author = 'Jaffe, Andrew E and Murakami, Peter and Lee, Hwajin and Leek, Jeffrey T and Fallin, M Daniele and Feinberg, Andrew P and Irizarry, Rafael A', year = 2012, journal = 'International Journal of Epidemiology') ) write.bibtex(bibs, file = 'derfinderAdvRef.bib') bib <- read.bibtex('derfinderAdvRef.bib') ## Assign short names names(bib) <- names(bibs) ``` # Overview This vignette explains in more detail the relationship between the different functions in `r Biocpkg('derfinder')` `r citep(bib[['derfinder']])` as well as add-on packages `r Biocpkg('derfinderPlot')` ([vignette](http://lcolladotor.github.io/derfinderPlot/)), `r Biocpkg('derfinderHelper')` ([vignette](http://lcolladotor.github.io/derfinderHelper/)) and `r Biocpkg('regionReport')` ([vignette](http://lcolladotor.github.io/regionReport/)). The vignette also includes some bash scripts for running `r Biocpkg('derfinder')` in a cluster, although there are probably other ways to do so using only R scripts. For example, by using `BatchJobsParam()` from `r Biocpkg('BiocParallel')`. This vignette assumes that you have read through the introductory vignette. Lets start by loading the `r Biocpkg('derfinder')` package. ```{r 'start', message=FALSE} ## Load libraries library('derfinder') ``` If you explore the package, you will notice that all exported functions use camel case name and have aliases with underscore names. For example, `analyzeChr()` and `analyze_chr()`. You should use whichever naming style you prefer. Regrettably, implementing the same flexibility at the argument naming level is not straightforward. # Advanced arguments Just like it is shown in the introductory vignette, you can find more information about the advanced arguments using `advancedArg()`. ```{r 'advancedArg'} ## URLs to advanced arguemtns sapply(c('analyzeChr', 'loadCoverage'), advancedArg, browse = FALSE) ## Set browse = TRUE if you want to open them in your browser ``` These arguments are options we expect not all users will want to change and which would otherwise make the help pages confusing to them. However, all arguments are documented in [roxygen2](http://cran.r-project.org/web/packages/roxygen2/index.html) style in the source code. Furthermore, note that using the `...` argument allows you to specify some of the documented arguments. For example, you might want to control the `maxClusterGap` from `findRegions()` in the `analyzeChr()` call. # Non-human data If you are working with data from an organism that is not _Homo sapiens_, then set the global options defining the `species` and the `chrsStyle` used. For example, if you are working with _Arabidopsis Thaliana_ and the _NCBI_ naming style, then set the options using the following code: ```{r 'exampleNameStyle', eval = FALSE} ## Set global species and chrsStyle options options(species = 'arabidopsis_thaliana') options(chrsStyle = 'NCBI') ## Then proceed to load and analyze the data ``` Internally `r Biocpkg('derfinder')` uses `extendedMapSeqlevels()` to use the appropriate chromosome naming style given a species in all functions involving chromosome names. Further note that the argument `subject` from `analyzeChr()` is passed to `bumphunter::annotateNearest(subject)`. So if you are using a genome different from __hg19__ remember to provide the appropriate annotation data or simply use `analyzeChr(runAnnotation = FALSE)`. You might find the discussion [Using bumphunter with non-human genomes](https://support.bioconductor.org/p/62781/) useful. # Advanced loading data ## Controlling loading from BAM files If you are loading data from BAM files, you might want to specify some criteria to decide which reads to include or not. For example, your data might have been generated by a strand-specific protocol. You can do so by specifying the arguments of `scanBamFlag()` from `r Biocpkg('Rsamtools')`. You can also control whether to include or exclude bases with `CIGAR` string `D` (deletion from the reference) by setting the advanced argument `drop.D = TRUE` in your `fullCoverage()` or `loadCoverage()` call. ## Unfiltered base-level coverage Note that in most scenarios, the `fullCov` object illustrated in the introductory vignette can be large in memory. When making plots or calculating the region level coverage, we don't need the full information. In such situations, it might pay off to create a smaller version by loading only the required data. This can be achieved using the advanced argument `which` to `fullCoverage()` or `loadCoverage()`. However, it is important to consider that when reading the data from BAM files, a read might align partially inside the region of interest. By default such a read would be discarded and thus the base-level coverage would be lower than what it is in reality. The advanced argument `protectWhich` extends regions by 30 kbp (15 kbp each side) to help mitigate this issue. We can illustrate this issue with the example data from `r Biocpkg('derfinder')`. First, we load in the data and generate some regions of interest. ```{r 'runExample', bootstrap.show.output=FALSE} ## Find some regions to work with example('loadCoverage', 'derfinder') example('getRegionCoverage', 'derfinder') ``` Next, we load the coverage again using `which` but without any padding. We can see how the coverage is not the same by looking at the maximum coverage for each sample. ```{r 'loadWhich'} ## Illustrate reading data from a set of regions test <- loadCoverage(files = files, chr = '21', cutoff = NULL, which = regions, protectWhich = 0, fileStyle = 'NCBI') ## Some reads were ignored and thus the coverage is lower as can be seen below: sapply(test$coverage, max) - sapply(genomeDataRaw$coverage, max) ``` When we re-load the data using some padding to the regions, we find that the coverage matches at all the bases. ```{r 'loadWhich2'} ## Illustrate reading data from a set of regions test2 <- loadCoverage(files = files, chr = '21', cutoff = NULL, which = regions, protectWhich = 3e4, fileStyle = 'NCBI') ## Adding some padding to the regions helps get the same coverage identical(sapply(test2$coverage, max), sapply(genomeDataRaw$coverage, max)) ## A more detailed test reveals that the coverage matches at every base all(mapply(function(x, y) { identical(x, y) }, test2$coverage, genomeDataRaw$coverage)) ``` How much padding you need to use will depend on your specific data set, and you might be comfortable getting approximately the same coverage values for the sake of greatly reducing the memory resources needed. ## Input files in a different naming style If you are under the case where you like to use a specific chromosome naming style but the raw data files use another one, you might need to use the `fileStyle` argument. For example, you could be working with _Homo sapiens_ data and your preferred naming style is _UCSC_ (chr1, chr2, ..., chrX, chrY) but the raw data uses _NCBI_ style names (1, 2, ..., X, Y). In that case, use `fullCoverage(fileStyle = 'NCBI')` or `loadCoverage(fileStyle = 'NCBI')` depending if you are loading one chromosome or multiple at a time. ## Loading data in chunks If you prefer to do so, `fullCoverage()` and `loadCoverage()` can load the data of a chromosome in chunks using [GenomicFiles](http://www.bioconductor.org/packages/release/bioc/html/GenomicFiles.html). This is controlled by whether you specify the `tilewidth` argument. Notice that you might run into slight coverage errors near the borders of the tiles for the same reason that was illustrated previously when loading specific regions. This approach is not necessarily more efficient and can be significantly time consuming if you use a small `tilewidth`. # Flow charts ## DER analysis flow chart The following figure illustrates how most of `r Biocpkg('derfinder')`'s functions interact when performing a base-level differential expression analysis by calculating base-level F-statistics.