---
title: Introduction to TVTB
author:
    -   name: KÃ©vin Rue-Albrecht
        email: kevinrue67@gmail.com
        affiliation:
            - Department of Medicine, Imperial College London, UK
            - Nuffield Department of Medicine, University of Oxford, UK
date: "`r doc_date()`"
package: "`r pkg_ver('TVTB')`"
abstract:
    The VCF Tool Box (`r Biocpkg("TVTB")`) provides S4 classes and methods
    to filter,
    summarise and visualise genetic variation data stored in VCF files
    pre-processed by the Ensembl Variant Effect Predictor (VEP).
    In particular, the package extends the *FilterRules* class
    (`r Biocpkg("S4Vectors")` package) to define news classes of
    filter rules applicable to the various slots of `VCF` objects.
    A Shiny web-application, the Shiny Variant Explorer (*tSVE*),
    provides a convenient interface to demonstrate
    those functionalities integrated in a programming-free environment.
vignette: >
    %\VignetteIndexEntry{Introduction to TVTB}
    %\VignetteEncoding{UTF-8}
    %\VignetteEngine{knitr::rmarkdown}
output:
    BiocStyle::html_document:
        toc_float: true
bibliography:
    TVTB.bib
editor_options: 
  chunk_output_type: console
---

```{r optChunkDefault, include=FALSE}
stopifnot(
    requireNamespace("pander")
)
library(knitr)
opts_chunk$set( # Set these first, to make them default
    message = FALSE,
    warning=FALSE,
    error=FALSE
)
optChunkDefault <- opts_chunk$get()
```

# Introduction {#Introduction}

The VCF Tool Box (`r Biocpkg("TVTB")`) offers S4 classes and methods
to filter, summarise and visualise genetic variation data
stored in VCF files pre-processed by the
[Ensembl Variant Effect Predictor](http://www.ensembl.org/info/docs/tools/vep)
(VEP) [@RN1].
An [RStudio/Shiny](http://shiny.rstudio.com) web-application,
the Shiny Variant Explorer (*tSVE*),
provides a convenient interface to demonstrate those
functionalities integrated in a programming-free environment.

Currently, major functionalities in the `r Biocpkg("TVTB")` package include:

**A class to store recurrent parameters of genetic analyses**

* List of reference homozygote, heterozygote, and alternate homozygote
    genotype encodings
* Key of the INFO field where Ensembl VEP predictions are stored
    in the VCF file
* Suffix of the INFO fields where calculated data will be stored
* List of genomic ranges to analyse and visualise
* Parameters for parallel calculations (using `r Biocpkg("BiocParallel")`)

**Genotype counts and allele frequencies**

* Calculated from the data of an `ExpandedVCF` objects
    (*i.e.* bi-allelic records)
* Stored in INFO fields defined by the above suffixes
    + overall counts and frequencies (*i.e.* across all samples)
    + counts and frequencies within level(s) of phenotype(s)

**Classes of VCF filter rules**

* Filter rules applicable to the `fixed` slot of an `VCF` object
* Filter rules applicable to the `info` slot of an `VCF` object
* Filter rules applicable to Ensembl VEP predictions stored in a
    given INFO field
* A container for combinations of filter rules listed above
* Subset `VCF` objects using the above filter rules

# Installation {#Installation}

Instructions to install the VCF Tool Box are available [here](http://bioconductor.org/packages/TVTB/).

Once installed, the package can be loaded and attached as follows:

```{r library}
library(TVTB)
```

# Recurrent settings: TVTBparam {#TVTBparam}

Most functionalities in `r Biocpkg("TVTB")` require recurrent information
such as:

* Genotype encoding (homozygote reference, heterozygote, homozygote alternate),
* INFO key that contains the Ensembl VEP predictions in the VCF file,
* List of genomic ranges within which data must be summarised or visualised
    + affecting the genomic ranges to import from the VCF file 
* Suffixes of INFO keys where counts and frequencies of genotypes must be
    stored:
    + Counts and frequencies may be calculated for individual levels of
        selected phenotypes, in which case the data will be stored
        under INFO keys formed as `<phenotype>_<level>_<suffix>`,
    + Counts and frequencies across all samples are stored in INFO keys
        simply named `<suffix>`.
* Settings for parallel calculations.

\bioccomment{
INFO key suffixes should be considered carefully to avoid conflicts
with INFO keys already present in the `VCF` object.
}

To reduce the burden of repetition during programming, and to 
facilitate analyses using consistent sets of parameters,
`r Biocpkg("TVTB")` implements the `TVTBparam` class.
The `TVTBparam` class offer a container for parameters recurrently used
across the package.
A `TVTBparam` object may be initialised as follows:

```{r TVTBparamConstructor}
tparam <- TVTBparam(Genotypes(
    ref = "0|0",
    het = c("0|1", "1|0", "0|2", "2|0", "1|2", "2|1"),
    alt = c("1|1", "2|2")),
    ranges = GenomicRanges::GRangesList(
        SLC24A5 = GenomicRanges::GRanges(
            seqnames = "15",
            IRanges::IRanges(
                start = 48413170, end = 48434757)
            )
        )
    )
```

`TVTBparam` objects have a convenient summary view and accessor methods:

```{r TVTBparamShow}
tparam
```

In this example:

* **Genotypes**
    + Accessor: `genos(x)`
    + Class: `Genotypes`
    + Homozygote reference genotype is encoded `"0|0"`.
    + Counts of reference genotypes are stored in INFO keys suffixed with
        `"REF"`.
    + Heterozygote genotypes are encoded as
        `"0|1"`, `"1|0"`, `"0|2"`, `"2|0"`, `"1|2"`, and `"2|1"`.
    + Counts of heterozygote genotypes are stored in INFO keys suffixed with
        `"HET"`.
    + Homozygote alternate  genotype is encoded `"1|1"`.
    + Counts of alternate genotypes are stored in INFO keys suffixed with
        `"ALT"`.
* **Genomic ranges**
    + Accessor: `ranges(x)`
    + Class: `GRangesList`
    + A gene-coding region on chromosome `"15"`.
* **Alternate allele frequency**
    + Accessor: `aaf(x)`
    + Calculated values will be stored under INFO keys suffixed with `"AAF"`.
* **Minor allele frequency**
    + Accessor: `maf(x)`
    + Calculated values will be stored under INFO keys suffixed with `"MAF"`.
* **Ensembl VEP predictions**
    + Accessor: `vep(x)`
    + Information will be imported from the INFO field `"CSQ"`.
* **Parallel calculations**
    + Accessor: `bp(x)`
    + Class: `BiocParallelParam`
    + Serial evaluation (*i.e.* do not run parallel code)
* **VCF scan parameters**
    + Accessor: `svp(x)`
    + Class: `ScanVcfParam`
    + `which` slot automatically populated with `reduce(unlist(ranges(x)))`

Default values are provided for all slots except genotypes, as these may vary
more frequently from one data set to another (*e.g.* phased, unphased,
imputed).

# Data import {#Import}

## Genetic variants {#ImportVariants}

Functionalities in `r Biocpkg("TVTB")` support
`CollapsedVCF` and `ExpandedVCF` objects
(both extending the virtual class `VCF`) of the 
`r Biocpkg("VariantAnnotation")` package.

Typically, `CollapsedVCF` objects are produced by the
`r Biocpkg("VariantAnnotation")` `readVcf` method after parsing a VCF file,
and `ExpandedVCF` objects result of the
`r Biocpkg("VariantAnnotation")` `expand` method applied to
a `CollapsedVCF` object.

Any information that users deem relevant for the analysis may be
imported from VCF files and stored in
`VCF` objects passed to `r Biocpkg("TVTB")` methods.
However, to enable the key functionalities of the package,
the slots of a `VCF` object should include
at least the following information:

* `fixed(x)`
    + fields `"REF"` and `"ALT"`.
    
* `info(x)`
    + field `<vep>`: where `<vep>` stands for the INFO key where
        the Ensembl VEP predictions are stored in the `VCF` object.
    
* `geno(x)`
    + `GT`: genotypes.

* `colData(x)`: phenotypes.

## List of genomic ranges {#ImportGRangesList}

In the near future, `r Biocpkg("TVTB")` functionalities are expected to
produce summary statistics and plots faceted by meta-features,
each potentially composed of multiple genomic ranges.

For instance, burden tests may be performed on a set of transcripts,
considering only variants in their respective sets of exons.
The `r Biocpkg("GenomicRanges")` `GRangesList` class is an ideal container
in example,
as each `GRanges` in the `GRangesList` would represent a transcript,
and each element in the `GRanges` would represent an exon.

Furthermore, `TVTBparam` objects may be supplied as the `param`
argument of the `r Biocpkg("VariantAnnotation")` `readVcf`. In this case, the
`TVTBparam` object is used to import only variants overlapping
the relevant genomic regions.
Moreover, the `readVcf` method also ensured that the `vep` slot of the
`TVTBparam` object is present in the header of the VCF file.

```{r TVTBparamAsScanVcfParam}
svp <- as(tparam, "ScanVcfParam")
svp
```

## Phenotypes {#ImportPhenotypes}

Although `VCF` objects may be constructed without attached phenotype data,
phenotype information is critical to interpret and compare genetic variants
between groups of samples
(*e.g.* burden of damaging variants in different phenotype levels).

`VCF` objects accept phenotype information
(as `r Biocpkg("S4Vectors")` `DataFrame`) in the `colData` slot.
This practice has the key advantage of keeping phenotype and genetic
information synchronised through operation such as subsetting and re-ordering,
limiting workspace [entropy](https://en.wikipedia.org/wiki/Software_entropy)
and confusion.



## Example {#ImportExample}

An `ExpandedVCF` object that contains the minimal data necessary for the rest
of the vignette can be created as follows:

**Step 1:** Import phenotypes

```{r importPhenotypes}
phenoFile <- system.file(
    "extdata", "integrated_samples.txt", package = "TVTB")
phenotypes <- S4Vectors::DataFrame(
    read.table(file = phenoFile, header = TRUE, row.names = 1))
```

**Step 2:** Define the VCF file to parse

```{r TabixFile}
vcfFile <- system.file(
    "extdata", "chr15.phase3_integrated.vcf.gz", package = "TVTB")
tabixVcf <- Rsamtools::TabixFile(file = vcfFile)
```

**Step 3:** Define VCF import parameters

```{r svpTVTBparam}
VariantAnnotation::vcfInfo(svp(tparam)) <- vep(tparam)
VariantAnnotation::vcfGeno(svp(tparam)) <- "GT"
svp(tparam)
```

Of particular interest in the above chunk of code:

* The `TVTBparam` constructor previously populated the `which` slot of `svp`
    with "reduced" (*i.e.* non-overlapping)
    genomic ranges defined in the `ranges` slot.
* Only the INFO key defined in the `vep` slot will be imported
* Only the matrix of called genotypes will be imported

**Step 4:** Import and pre-process variants

```{r readExpandVcf, message=FALSE}
# Import variants as a CollapsedVCF object
vcf <- VariantAnnotation::readVcf(
    tabixVcf, param = tparam, colData = phenotypes)
# Expand into a ExpandedVCF object (bi-allelic records)
vcf <- VariantAnnotation::expand(x = vcf, row.names = TRUE)
```

Of particular interest in the above chunk of code, the `readVcf` method is
given:

* `TVTBparam` parameters, invoking the corresponding method signature
* phenotypes
    + The `rownames` of those phenotypes defines the sample identifiers that
        are queried from the VCF file.
    + Those phenotypes are stored in the `colData` slot of the resulting
        `VCF` object.

The result is an `ExpandedVCF` object that includes variants
in the targeted genomic range(s) and samples:

```{r vcfShow, echo=FALSE}
vcf
```

# Adding allele frequencies {#Frequencies}

Although interesting figures and summary tables may be obtained
as soon as the first `ExpandedVCF` object is created
(see section [Summarising Ensembl VEP predictions](#Summarise)),
those methods may benefit from information added to additional INFO keys
after data import, either manually by the user, or through various
methods implemented in the `r Biocpkg("TVTB")` package.

## Adding overall frequencies {#FrequenciesOverall}

For instance, the method `addOverallFrequencies` uses the
reference homozoygote (*REF*), heterozygote (*HET*),
and homozygote alternate (*ALT*) genotypes defined in the `TVTBparam` object
stored in the `VCF` metadata
to obtain the count of each genotype in an `ExpandedVCF` object.
Immediately thereafter, the method uses those counts to calculate
alternate allele frequency (*AAF*) and minor allele frequency (*MAF*).
Finally, the method stores the five calculated values
(*REF*, *HET*, *ALT*, *AAF*, and *MAF*)
in INFO keys defined by suffixes also declared in the `TVTBparam` object.

```{r addOverallFrequencies}
initialInfo <- colnames(info(vcf))
vcf <- addOverallFrequencies(vcf = vcf)
setdiff(colnames(info(vcf)), initialInfo)
```

Notably, the `addOverallFrequencies` method is synonym to the `addFrequencies`
method missing the argument `phenos`:

```{r addFrequenciesOverall}
vcf <- addFrequencies(vcf = vcf, force = TRUE)
```

\bioccomment{
The optional argument `force = TRUE` is used here
to overwrite INFO keys created in the previous chunk of code.
}

## Adding frequencies within phenotype level(s) {#FrequenciesPhenoLevel}

Similarly, the method `addPhenoLevelFrequencies` obtains the count of each
genotype in samples associated with given level(s) of given phenotype(s),
and stores the calculated values in INFO keys defined as 
`<pheno>_<level>_<suffix>`, with suffixes defined in the `TVTBparam` object
stored in the `VCF` metadata.

```{r addPhenoLevelFrequencies}
initialInfo <- colnames(info(vcf))
vcf <- addPhenoLevelFrequencies(
    vcf = vcf, pheno = "super_pop", level = "AFR")
setdiff(colnames(info(vcf)), initialInfo)
```

Notably, the `addPhenoLevelFrequencies` method is synonym
to the `addFrequencies`
method called with the argument `phenos` given as a list where `names` are
phenotypes, and values are `character` vectors of levels to
process within each phenotype:

```{r addFrequenciesPhenoLevel}
initialInfo <- colnames(info(vcf))
vcf <- addFrequencies(
    vcf,
    list(super_pop = c("EUR", "SAS", "EAS", "AMR"))
)
setdiff(colnames(info(vcf)), initialInfo)
```

In addition, the `addFrequencies` method can be given a `character` vector
of phenotypes as the `phenos` argument, in which case frequencies are
calculated for _all_ levels of the given phenotypes:

```{r addFrequenciesPhenotype}
vcf <- addFrequencies(vcf, "pop")
head(grep("^pop_[[:alpha:]]+_REF", colnames(info(vcf)), value = TRUE))
```

# Filtering variants {#Filter}

Although `VCF` objects are straightforward to subset
using either indices and row names
(as they inherit from the `r Biocpkg("SummarizedExperiment")` 
`RangedSummarizedExperiment` class),
users may wish to identify variants that pass combinations of criteria based on
information in their `fixed` slot, `info` slot, and Ensembl VEP predictions,
a non-trivial task due to those pieces of information being stored in
different slots of the `VCF` object, and the *1:N* relationship
between variants and EnsemblVEP predictions.

## Definition of VCF filter rules {#FilterDefine}

To facilitate the definition of VCF filter rules, and their application to
`VCF` objects, `r Biocpkg("TVTB")` extends the `r Biocpkg("S4Vectors")`
`FilterRules` class in four new classes of filter rules:

```{r filterRulesMotivation, echo=FALSE, results='asis'}
motivations <- readRDS(
    system.file("misc", "motivations_rules.rds", package = "TVTB")
)
pander::pandoc.table(
    motivations,
    paste(
        "Motivation for each of the new classes extending `FilterRules`,
        to define VCF filter rules."
    ),
    style = "multiline",
    split.cells = c(15, 45),
    split.tables = c(Inf)
)
```

Note that `FilterRules` objects themselves are applicable to `VCF` objects,
with two important difference from the above specialised classes:

* Expressions must explicitely refer to the `VCF` slots
* As a consequence, a single expression can refer to fields from different
    `VCF` slots, for instance:

```{r FilterRules}
S4Vectors::FilterRules(list(
    mixed = function(x){
        VariantAnnotation::fixed(x)[,"FILTER"] == "PASS" &
            VariantAnnotation::info(x)[,"MAF"] >= 0.05
    }
))
```

Instances of those classes may be initialised as follows:

**VcfFixedRules**

```{r VcfFixedRules}
fixedR <- VcfFixedRules(list(
    pass = expression(FILTER == "PASS"),
    qual = expression(QUAL > 20)
))
fixedR
```

**VcfInfoRules**

```{r VcfInfoRules}
infoR <- VcfInfoRules(
    exprs = list(
        rare = expression(MAF < 0.01 & MAF > 0),
        common = expression(MAF > 0.05),
        mac_ge3 = expression(HET + 2*ALT >= 3)),
    active = c(TRUE, TRUE, FALSE)
)
infoR
```

The above code chunk illustrates useful features of `FilterRules`:

* `FilterRules` are initialised in an active state by default
    (evaluating an _inactive_ rule returns `TRUE` for all items)
    The `active` argument may be used to initialise specific filter rules in
    an inactive state.
* A single rule `expression` (or `function`) may refer to multiple columns of
    the relevant slot in the `VCF` object.
* Rules may include calculations, allowing filtering on values not necessarily
    stored as such in any slot of the `VCF` object.

**VcfVepRules**

```{r VcfVepRules}
vepR <- VcfVepRules(exprs = list(
    missense = expression(Consequence %in% c("missense_variant")),
    CADD_gt15 = expression(CADD_PHRED > 15)
    ))
vepR
```

**VcfFilterRules**

`VcfFilterRules` combine VCF filter rules of  different types
in a single object.

```{r VcfFilterRules}
vcfRules <- VcfFilterRules(fixedR, infoR, vepR)
vcfRules
```

This vignette offers only a brief peek into the utility and flexibility of
VCF filter rules. More (complex) examples are given in a separate
vignette, including filter rules using functions and pattern matching.
The documentation of the `r Biocpkg("S4Vectors")` package---where the parent
class `FilterRules` is defined---can also be a source of inspiration.

## Control of VCF filter rules {#FilterControl}

As the above classes of *VCF filter rules* inherit
from the `r Biocpkg("S4Vectors")` `FilterRules` class,
they also benefit from its accessors and methods.
For instance, VCF filter rules can easily be toggled
between active and inactive states:

```{r deactivateFilterRules}
active(vcfRules)["CADD_gt15"] <- FALSE
active(vcfRules)
```

A separate vignette describes in greater detail the use of classes
that contain *VCF filter rules*.

## Evaluation of VCF filter rules {#FilterEval}

Once defined, the above filter rules can be applied to `ExpandedVCF` objects,
in the same way as `FilterRules` are evaluated in a given environment
(see the `r Biocpkg("S4Vectors")` documentation):

```{r evalFilterRules}
summary(eval(expr = infoR, envir = vcf))
summary(eval(expr = vcfRules, envir = vcf))
summary(evalSeparately(expr = vcfRules, envir = vcf))
```

\bioccomment{
Note how the inactive rules (_i.e._ `CADD`, `pass`) returns `TRUE` for all
variants, irrespective of the filter result.
}

# Visualising data {#Visualise}
## Visualise INFO data by phenotype level on a genomic axis

Let us show the alternate allele frequency (AAF) of common variants, 
estimated in each super-population, in the context of the transcripts
ovelapping the region of interest.

In the `MAF` track:

* the points represent the MAF in each super-population on a common Y axis.
* the heatmap represents the MAF as the color intensity, given a row for each
    super-population.

```{r plotMAFCommon}
plotInfo(
        subsetByFilter(vcf, vcfRules["common"]), "AAF",
        range(GenomicRanges::granges(vcf)),
        EnsDb.Hsapiens.v75::EnsDb.Hsapiens.v75,
        "super_pop",
        zero.rm = FALSE
    )
```

Alternatively, the minor allele frequency (MAF) of missense variants
(as estimated from the entire data set) may be visualised in the same manner.
However, due to the nature of those variants, the `zero.rm` argument may be
set to `TRUE` to hide all data points showing a MAF of `0`; thereby
variants actually detected in each super-population are emphasised
even at low frequencies.

```{r plotMAFrare}
plotInfo(
        subsetByFilter(vcf, vcfRules["missense"]), "MAF",
        range(GenomicRanges::granges(vcf)),
        EnsDb.Hsapiens.v75::EnsDb.Hsapiens.v75,
        "super_pop",
        zero.rm = TRUE
    )
```

# Pairwise comparison of INFO data between phenotype levels

Using the `r CRANpkg("GGally")` `ggpairs` method,
let us make a matrix of plots for common variants, showing:

* in the _lower_ triangle, pairwise scatter plots of the alternate allele
    frequency estimated each super-population
* on the _diagonal_, the density plot for each super-population shown on the
    diagonal
* in the _upper_ triangle, the correlation[^1] value

[^1]: Pearson, by default

```{r pairsInfo}
pairsInfo(subsetByFilter(vcf, vcfRules["common"]), "AAF", "super_pop")
```

Note that the ellipsis `...` allows a high degree of customisation,
as it passes additional arguments to the underlying `ggpairs` method.

# A taste of future...

This section presents upcoming features.

## Summarising Ensembl VEP predictions {#Summarise}

As soon as genetic and phenotypic information are imported
into an `ExpandedVCF` object,
or after the object was extended with additional information,
the scientific value of the data may be revealed by
a variety of summary statistics and graphical representations.
This section will soon present several ideas *being implemented* in
`r Biocpkg("TVTB")`, for instance:

* Count of discrete Ensembl VEP predictions
    + by phenotype level
    + by genomic feature affected (*i.e.* transcript)
* Distribution of continuous Ensembl VEP predictions
    + by phenotype level
    + by genomic feature affected (*i.e.* transcript)

# Acknowledgement

Dr. Stefan GrÃ¤f and Mr. Matthias Haimel
for advice on the VCF file format and the Ensembl VEP script.
Prof. Martin Wilkins for his trust and support.
Dr. Michael Lawrence for his helpful code review and suggestions.

Last but not least, the amazing collaborative effort of the `rep("many",n)`
[Bioconductor](http://bioconductor.org) developers whose hard work
appears through the dependencies of this package.

# Session info {#SessionInfo}

Here is the output of `sessionInfo()` on the system on which this document was
compiled:

```{r sessionInfo, echo=FALSE}
sessionInfo()
```

# References {#References}

[R]: http://r-project.org
[RStudio]: http://www.rstudio.com/