---
title: "VCFArray: DelayedArray objects with on-disk/remote VCF backend"
author:
- name: Qian Liu
  affiliation: Roswell Park Comprehensive Cancer Center, Buffalo, NY
- name: Martin Morgan
  affiliation: Roswell Park Comprehensive Cancer Center, Buffalo, NY
date: "last edit: 08/23/2018"
output:
    BiocStyle::html_document:
        toc: true
        toc_float: true
package: VCFArray
vignette: >
  %\VignetteIndexEntry{VCFArray: DelayedArray objects with on-disk/remote VCF backend}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r options, eval=TRUE, echo=FALSE}
options(showHeadLines=3)
options(showTailLines=3)
```
```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

# Introduction 
[VCFArray][] is a _Bioconductor_ package that represents VCF files as
objects derived from the [DelayedArray][] package and `DelayedArray`
class. It converts data entries from VCF file into a
`DelayedArray`-derived data structure. The backend VCF file could
either be saved on-disk locally or remote as online resources. Data
entries that could be extracted include the fixed data fields (REF,
ALT, QUAL, FILTER), information field (e.g., AA, AF...), and the
individual format field (e.g., GT, DP...). The array data generated
from fixed/information fields are one-dimensional `VCFArray`, with the
dimension being the length of the variants. The array data generated
from individual `FORMAT` field are always returned with the first
dimension being `variants` and the second dimension being
`samples`. This feature is consistent with the assay data saved in
`SummarizedExperiment`, and makes the `VCFArray` package interoperable
with other established _Bioconductor_ data infrastructure.

[VCFArray]: https://bioconductor.org/packages/VCFArray
[DelayedArray]: https://bioconductor.org/packages/DelayedArray

# Installation

1. Download the package. 

```{r getPackage, eval=FALSE}
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("VCFArray")
```
The development version is also available to download from Github. 
```{r getDevel, eval=FALSE}
BiocManager::install("Bioconductor/VCFArray")
```

2. Load the package into R session.
```{r Load, message=FALSE}
library(VCFArray)
```

# VCFArray

## VCFArray constructor

To construct a `VCFArray` object, 4 arguments are needed: `file`,
`vindex` and `name`, and `pfix`. The `file` argument could take either
a character string (VCF file name), or `VcfFile` object, or a
`RangedVcfStack` object. `name` argument must be specified to indicate
which data entry we want to extract from the input file. It's
case-sensitive, and must be consistent with the names from VCF header
file.  `vindex` argument will only be used to indicate the file path
of the index file if it does not exist. `pfix` is used to spefify the
category that the `name` field belongs to. **NOTE** that the `pfix`
needs to be provided specifically when there are same `name` in
multiple categories, otherwise, error will return.

The `vcfFields()` method takes the VCF file path, `VcfFile` object or
`RangedVcfStack` object as input, and returns a CharacterList with all
available VCF fields within specific categories. Users should consult
the `fixed`, `info` and `geno` category for available data entries
that could be converted into `VCFArray` instances. The data entry
names can be used as input for the `name` argument in `VCFArray`
constructor.

```{r avail, message=FALSE}
args(VCFArray)
fl <- system.file("extdata", "chr22.vcf.gz", package = "VariantAnnotation")
library(VariantAnnotation)
vcfFields(fl)
```

Since the index file for our vcf file already exists, the `vindex`
argument would not be needed (which is the most common case for
on-disk VCF files). So we can construct the `VCFArray` object for the
`GT` data entry in the provided VCF file with arguments of `file` and
`name` only.

```{r constructor}
VCFArray(file = fl, name = "GT")
```

We can also construct a `VCFArray` object with the `file` argument
being a `VcfFile` object.

```{r constructor2}
vcf <- VariantAnnotation::VcfFile(fl)
VCFArray(file = vcf, name = "DS")
```

The `file` argument could also take `RangedVcfStack` object. Note that
an ordinary `VcfStack` object without `Range` information could not
be used to construct a `VCFArray`.

```{r rgstack}
extdata <- system.file(package = "GenomicFiles", "extdata")
files <- dir(extdata, pattern="^CEUtrio.*bgz$", full=TRUE)[1:2]
names(files) <- sub(".*_([0-9XY]+).*", "\\1", basename(files))
seqinfo <- as(readRDS(file.path(extdata, "seqinfo.rds")), "Seqinfo")
stack <- GenomicFiles::VcfStack(files, seqinfo)
gr <- as(GenomicFiles::seqinfo(stack)[rownames(stack)], "GRanges")
## RangedVcfStack
rgstack <- GenomicFiles::RangedVcfStack(stack, rowRanges = gr)  
rgstack
```

Here we choose the `name = SB`, which returns a 3-dimensional
`VCFArray` object, with the first 2 dimensions correspond to variants
and samples respectively.

```{r constructor3}
vcfFields(rgstack)$geno
VCFArray(rgstack, name = "SB")
```

As the vignette title suggest, the backend VCF file could also be
remote files. Here we included an example of representing VCF file of
chromosome 22 from the 1000 Genomes Project (Phase 3). **NOTE that for
a remote VCF file, the `vindex` argument must be specified.** Since
this VCF files is relatively big, and it takes longer time, we only
show the code here without evaluation.

```{r remote, eval=FALSE}
chr22url <- "https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chr22.phase3_shapeit2_mvncall_integrated_v5b.20130502.genotypes.vcf.gz"
chr22url.tbi <- paste0(chr22url, ".tbi")
va <- VCFArray(chr22url, vindex =chr22url.tbi, name = "GT")
```

## VCFArray methods

`VCFArray` represents VCF files as `DelayedArray` instances. It has
methods like `dim`, `dimnames` defined, and it inherits array-like
operations and methods from `DelayedArray`, e.g., the subsetting
method of `[`.     
**NOTE** that for 1-dimensional `VCFArray` objects that are generated
from the fixed / information data field of VCF file, `drop = FALSE`
should always be used with `[` subsetting to ensure `VCFArray` object
as returned value.

### slot accessors 
`seed` returns the `VCFArraySeed` of the `VCFArray` object, which
  includes information about the backend VCF file, e.g., the vcf file
  path, index file path, name of the data entry (with a prefix of
  category), dimension and etc.
  
```{r, seedAccessor}
va <- VCFArray(fl, name = "GT")
seed(va)
```

`vcffile` returns the `VcfFile` object corresponding to the backend
  VCF file.
```{r, vcffileAccessor}
vcffile(va)
```

### `dim()` and `dimnames()`

The `dimnames(VCFArray)` returns an unnamed list, with the length of
each element being the same as return from `dim(VCFArray)`.

```{r, dims}
va <- VCFArray(fl, name = "GT")
dim(va)
class(dimnames(va))
lengths(dimnames(va))
```

### `[` subsetting

`VCFArray` instances can be subsetted, following the usual _R_
conventions, with numeric or logical vectors; logical vectors are
recycled to the appropriate length.

```{r, subsetting}
va[1:3, 1:3]
va[c(TRUE, FALSE), ]
```

### Some numeric calculation

Numeric calculations could be evaluated on `VCFArray` objects.

```{r, numeric}
ds <- VCFArray(fl, name = "DS")
log(ds+5)
```

# Internals: VCFArraySeed 

The `VCFArraySeed` class represents the 'seed' for the `VCFArray`
object. It is not exported from the [VCFArray][] package. Seed objects
should contain the VCF file path, and are expected to satisfy the
“seed contract” of [DelayedArray][], i.e. to support `dim()` and
`dimnames()`.

```{r, VCFArraySeed}
seed <- VCFArray:::VCFArraySeed(fl, name = "GT", pfix = NULL)
seed
path(vcffile(seed))
```

The seed can be used to construct a `VCFArray` instance.

```{r, VCFArray-from-VCFArraySeed}
(va <- VCFArray(seed))
```

The `DelayedArray()` constructor with `VCFArraySeed` object as inputs
will return the same content as the `VCFArray()` constructor over the
same `VCFArraySeed`.

```{r, da}
da <- DelayedArray(seed)
class(da)
all.equal(da, va)
```


# sessionInfo()

```{r}
sessionInfo()
```