---
title: "An Introduction to MMAPPR2"
author:
- name: Kyle Johnsen
  affiliation: Brigham Young University, Provo, UT
package: MMAPPR2
output: 
  BiocStyle::html_document:
    toc_float: true
  BiocStyle::pdf_document: default
abstract: |
  Instructions on mapping mutations from forward genetic screens using MMAPPR2.
vignette: |
  %\VignetteIndexEntry{An Introduction to MMAPPR2}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}  
---

```{bash, echo = FALSE}
# Cooking show code to account for cases when samtools is not installed
echo '#!/bin/bash' > /tmp/samtools
chmod 755 /tmp/samtools
```

```{r, echo = FALSE}
dataDir <- system.file('extdata', package = 'MMAPPR2data')
WTpileupFile <- file.path(dataDir, 'exwt.plp')
MTpileupFile <- file.path(dataDir, 'exmut.plp')
samtoolsScript <- file('/tmp/samtools', "a")
writeLines(c(
  'if [[ ${@:$#} == *"wt.bam"* ]];',
  'then',
  paste('cat', WTpileupFile),
  'else',
  paste('cat', MTpileupFile),
  'fi'
  ), samtoolsScript)
close(samtoolsScript)

origPath <- Sys.getenv('PATH')
Sys.setenv(PATH = paste(origPath, '/tmp', sep = ':')) 
```


# Getting Started

## Data
You'll need BAM file(s) for your wild-type pool, BAM file(s) for your mutant pool, and the
reference genome for your species in fasta format. We recommend that each pool contain at least 
20 individuals to ensure a good number of recombinations to measure.

## Installing Dependencies
MMAPPR2 depends on two system tools to function: Samtools and VEP. Both must be installed and in the PATH to be found by the appropriate functions. 

### Installing Samtools
Instructions to install samtools can be found at https://github.com/samtools/samtools and installation instructions are in the INSTALL file included with samtools. 

### Installing VEP
You'll need Ensembl VEP, which you can install like this, replacing `my_species` with
your species (e.g., `danio_rerio`):

```{r installVEP, eval=FALSE}
git clone https://github.com/Ensembl/ensembl-vep.git
cd ensembl-vep
perl INSTALL.pl -a ac -s {my_species}
```

This installs the most recent VEP and allows you to create a cache for your desired species, which is what MMAPPR2 expects by default.
If you depart from the installation shown here, or if things don't go smoothly, see [Ensembl's instructions](http://www.ensembl.org/info/docs/tools/vep/script/vep_download.html#installer)
and make sure any differences are accounted for in the
[`VEPFlags`](#configure-vepflags-object) object.

*Note:* If you have any trouble installing VEP, using
[their Docker image](http://www.ensembl.org/info/docs/tools/vep/script/vep_download.html#docker)
may save you a lot of hassle.

*Note:* We have found that R sometimes has issues finding VEP, especially when perlbrew is used. If you encounter errors at the path to your perl installation to the .Rprofile file. For example:

```{r, eval = FALSE}
Sys.setenv(PATH=paste("/Path/to/Perlbrew", Sys.getenv("PATH"), sep=":"))
```

# Basic Use

## Setting Parameters

For our example, we will use just the golden gene from the GRCz11 zebrafish reference genome.

Here we also configure the VEPFlags object to use our example fasta and GTF files.
[See below for more info.](#configure-vepflags-object)

 **Make sure your reference genome is the same you'll use with VEP! This will be the most recent assembly available on Ensembl unless you customize. You should use the same genome in aligning your sequencing data as well.**

```{r param}
BiocParallel::register(BiocParallel::MulticoreParam())  ## see below for explanation of BiocParallel
library(MMAPPR2, quietly = TRUE)
library(MMAPPR2data, quietly = TRUE)
library(Rsamtools, quietly = TRUE)

# This is normally configured automatically:
vepFlags <- ensemblVEP::VEPFlags(flags = list(
    format = 'vcf',  # <-- this is necessary
    vcf = FALSE,  # <-- as well as this
    species = 'danio_rerio',
    database = FALSE,  # <-- these three arguments allow us to run VEP offline,
    fasta = goldenFasta(),  # <-â•¯|  which you probably won't need
    gff = goldenGFF(),  # <------â•¯
    filter_common = TRUE,
    coding_only = TRUE  # assuming RNA-seq data
))

param <- MmapprParam(refFasta = goldenFasta(),
                     wtFiles = exampleWTbam(),
                     mutFiles = exampleMutBam(),
                     species = 'danio_rerio',
                     vepFlags = vepFlags,  ## optional
                     outputFolder = tempOutputFolder())  ## optional
```

## Running MMAPPR2
With parameters set, running the pipeline should be as simple as the following:
```{r mmappr}
mmapprData <- mmappr(param)
```

The MMAPPR2 pipeline can also be run a step at a time:
```{r mmappr-steps}
md <- new('MmapprData', param = param) ## calculateDistance() takes a MmapprData object
postCalcDistMD <- calculateDistance(md)
postLoessMD <- loessFit(postCalcDistMD)
postPrePeakMD <- prePeak(postLoessMD)
postPeakRefMD <- peakRefinement(postPrePeakMD)
postCandidatesMD <- generateCandidates(postPeakRefMD)
outputMmapprData(postCandidatesMD)
```

If the pipeline fails midway, the `MmapprData` object is saved, which you can then load
and use for inspection and debugging:
```{r recover-md}
## Contents of output folder:
cat(paste(system2('ls', outputFolder(param(mmapprData)), stdout = TRUE)), sep = '\n')

mdFile <- file.path(outputFolder(param(mmapprData)), 'mmappr_data.RDS')
md <- readRDS(mdFile)
md
```

## Results
If everything goes well you should be able to track down your mutation using the `candidates` slot of your `MmapprData`
object or by looking at files in the output folder:
```{r results}
head(candidates(mmapprData)$`18`, n=2)

outputTsv <- file.path(outputFolder(param(mmapprData)), '18.tsv')
cat(paste(system2('head', outputTsv, stdout = TRUE)), sep = '\n')
```


# Advanced Configuration

## Configure VEPFlags Object
MMAPPR2 uses the `r BiocStyle::Biocpkg("ensemblVEP")` Bioconductor package to predict the effect of variants in the peak region. 
To customize this process, you'll need to configure a `VEPFlags` object. Look at [Ensembl's website for script options](http://uswest.ensembl.org/info/docs/tools/vep/script/vep_options.html). You can configure the `VEPFlags` object like this:
```{r vepFlags}
library(ensemblVEP, quietly = TRUE)
vepFlags <- VEPFlags(flags = list(
    ### DEFAULT SETTINGS
    format = 'vcf',  # <-- this is necessary
    vcf = FALSE,  # <-- as well as this
    species = 'danio_rerio',
    database = FALSE,
    cache = TRUE,
    filter_common = TRUE,
    coding_only = TRUE  # assuming RNA-seq data
    ### YOU MAY FIND THESE INTERESTING:
    # everything = TRUE  # enables many optional analyses, such as Polyphen and SIFT
    # per_gene = TRUE  # will output only the most severe variant per gene
    # pick = TRUE  # will output only one consequence per variant
))
```

## BiocParallel Configuration
MMAPPR2 simply uses the default `bpparam` registered. You can change this (for example, if `r BiocStyle::Biocpkg("BiocParallel")` isn't working correctly) with the `BiocParallel::register` command. For example:
```{r bpparam}
library(BiocParallel, quietly = TRUE)
register(SerialParam())
register(MulticoreParam(progressbar=TRUE))
registered()
```

The last param registered becomes the default.

## Reference Genome
The variant calling step requires a `BiocStyle::Biocpkg("gmapR")` `GmapGenome`, which is normally automatically generated from the `refFasta` parameter. If for some reason you want to generate your own, the process is like this:

```{r refGenome, eval=FALSE}
refGenome <- gmapR::GmapGenome(goldenFasta(), name='slc24a5', create=TRUE)
```



## Whole-genome Sequencing (WGS)
MMAPPR2, like its predecessor, was designed for and tested using RNA-Seq data. However, the principles at work should still apply for WGS data.


-----------------
# Session Info
```{r sessionInfo}
sessionInfo()
```

```{r, echo = FALSE}
Sys.setenv(PATH=origPath)
```