---
title: Annotating Genomic Ranges
author: 
- name: Valerie Obenchain
  affiliation: Fred Hutchinson Cancer Research Center, 1100 Fairview Ave. N., P.O. Box 19024, Seattle, WA, USA 98109-1024
  email: maintainer@bioconductor.org
output:
  BiocStyle::html_document
date: 24 April 2018
vignette: >
  %\VignetteIndexEntry{Annotating Genomic Ranges}
  %\VignetteEngine{knitr::rmarkdown}
---

# Version Info
```{r, echo=FALSE, results="hide", warning=FALSE}
suppressPackageStartupMessages({
library('annotation')
})
```
<p>
**R version**: `r R.version.string`
<br />
**Bioconductor version**: `r BiocInstaller::biocVersion()`
<br />
**Package version**: `r packageVersion("annotation")`
</p>

# Background

Bioconductor can import diverse sequence-related file types, including fasta,
fastq, BAM, VCF, gff, bed, and wig files, among others. Packages support common
and advanced sequence manipulation operations such as trimming, transformation,
and alignment.  Domain-specific analyses include quality assessment, ChIP-seq,
differential expression, RNA-seq, and other approaches. Bioconductor includes
an interface to the Sequence Read Archive (via the
[SRAdb](/packages/release/bioc/html/SRAdb.html) package).

This workflow walks through the annotation of a generic set of ranges with
Bioconductor packages. The ranges can be any user-defined region of interest or
can be from a public file.

# Data Preparation

## Human hg19

As a first step, data are put into a GRanges object so we can take
advantage of overlap operations and store identifiers as
metadata columns.

The first set of ranges are variants from a dbSNP Variant Call Format (VCF) 
file. This file can be downloaded from the ftp site at NCBI
[ftp://ftp.ncbi.nlm.nih.gov/snp/](ftp://ftp.ncbi.nlm.nih.gov/snp/) and
imported with readVcf() from the VariantAnnotation package. Alternatively,
the file is available as a pre-parsed VCF object in the AnnotationHub.

```{r, echo=FALSE}
suppressPackageStartupMessages(library(annotation))
```


The Hub returns a VcfFile object with a reference to the file on disk.

```{r}
hub <- AnnotationHub()
```

Query the Hub for clinvar VCF files build GRCh37:

```{r}
mcols(query(hub, "clinvar.vcf", "GRCh37"))[,"sourceurl", drop=FALSE]
```

Retrieve one of the files:

```{r}
fl <- query(hub, "clinvar.vcf", "GRCh37")[[1]]
```

Read the data into a VCF object:

```{r}
vcf <- readVcf(fl, "hg19")
dim(vcf)
```

Overlap operations require that seqlevels and the genome of the objects match.
Here the VCF seqlevels are modified to match the TxDb.


```{r}
txdb_hg19 <- TxDb.Hsapiens.UCSC.hg19.knownGene
head(seqlevels(txdb_hg19))
seqlevels(vcf)
seqlevels(vcf) <- paste0("chr", seqlevels(vcf))
```

For this example we'll annotate chromosomes 3 and 18:

```{r}
seqlevels(vcf, pruning.mode="coarse") <- c("chr3", "chr18")
seqlevels(txdb_hg19) <- c("chr3", "chr18")
```

Sanity check to confirm we have matching seqlevels.
```{r}
intersect(seqlevels(txdb_hg19), seqlevels(vcf))
```

The genomes already match so no change is needed.

```{r}
unique(genome(txdb_hg19))
unique(genome(vcf))
```


The GRanges in a VCF object is extracted with 'rowRanges()'.
```{r}
gr_hg19 <- rowRanges(vcf)
```

## Mouse mm10

The second set of ranges is a user-defined region of chromosome 4 in mouse.
The idea here is that any region, known or unknown, can be annotated with
the following steps.

Load the TxDb package and keep only the standard chromosomes.
```{r}
txdb_mm10 <- keepStandardChromosomes(TxDb.Mmusculus.UCSC.mm10.ensGene)
```
We are creating the GRanges from scratch and can specify the seqlevels 
(chromosome names) to match the TxDb. 
```{r}
head(seqlevels(txdb_mm10))
gr_mm10 <- GRanges("chr4", IRanges(c(4000000, 107889000), width=1000))
```

Now assign the genome.
```{r}
unique(genome(txdb_mm10))
genome(gr_mm10) <- "mm10"
```

# Location in and Around Genes

locateVariants() in the VariantAnnotation package annotates ranges
with transcript, exon, cds and gene ID's from a TxDb. Various
extractions are performed on the TxDb (exonsBy(), transcripts(), 
cdsBy(), etc.) and the result is  overlapped with the ranges. An appropriate
GRangesList can also be supplied as the annotation. Different variants
such as 'coding', 'fiveUTR', 'threeUTR', 'spliceSite', 'intron', 
'promoter',  and 'intergenic' can be searched for by passing the appropriate 
constructor as the 'region' argument. See ?locateVariants for details.

```{r}
loc_hg19 <- locateVariants(gr_hg19, txdb_hg19, AllVariants())
table(loc_hg19$LOCATION)
loc_mm10 <- locateVariants(gr_mm10, txdb_mm10, AllVariants()) 
table(loc_mm10$LOCATION)
```

# Annotate by ID

The ID's returned from locateVariants() can be used in select() to map 
to ID's in other annotation packages.


```{r}
cols <- c("UNIPROT", "PFAM")
keys <- na.omit(unique(loc_hg19$GENEID))
head(select(org.Hs.eg.db, keys, cols, keytype="ENTREZID"))
```

The 'keytype' argument specifies that the mouse TxDb contains 
Ensembl instead of Entrez gene id's.


```{r}
keys <- unique(loc_mm10$GENEID)
head(select(org.Mm.eg.db, keys, cols, keytype="ENSEMBL"))
```

# Annotate by Position

Files stored in the AnnotationHub have been pre-processed into
ranged-based R objects such as a GRanges, GAlignments and VCF.
The positions in our GRanges can be overlapped with the ranges in
the AnnotationHub files. This allows for easy subsetting 
of multiple files, resulting in only the ranges of interest.

Create a 'hub' from AnnotationHub and filter the files based
on organism and genome build.

```{r}
hub <- AnnotationHub()
hub_hg19 <- subset(hub, 
                  (hub$species == "Homo sapiens") & (hub$genome == "hg19"))
length(hub_hg19)
```

Iterate over the first 3 files and extract ranges that overlap 'gr_hg19'.
```{r, echo=FALSE}
ov_hg19 <- lapply(1:3, function(i) subsetByOverlaps(hub_hg19[[i]], gr_hg19))
```
```{r}
ov_hg19 <- lapply(1:3, function(i) subsetByOverlaps(hub_hg19[[i]], gr_hg19))
```

Inspect the results.
```{r} 
names(ov_hg19) <- names(hub_hg19)[1:3]
lapply(ov_hg19, head, n=3)
```
Annotating the mouse ranges in the same fashion is left 
as an exercise.

# Annotating Variants

<h4 id=amino-acid-coding-changes">Amino acid coding changes</h4>
For the set of dbSNP variants that fall in coding regions, amino
acid changes can be computed. The output contains one line for each 
variant-transcript match which can result in multiple lines for
each variant.


```{r}
head(predictCoding(vcf, txdb_hg19, Hsapiens), 3)
```

```{r sess}
sessionInfo()
```

# Exercises

Exercise 1: VCF header and reading data subsets.

VCF files can be large and it's often the case that only a subset of 
variables or genomic positions are of interest. The scanVcfHeader() 
function in the VariantAnnotation package retrieves header information 
from a VCF file. Based on the information returned from scanVcfHeader() 
a ScanVcfParam() object can be created to read in a subset of data from 
a VCF file. 
*  Use scanVcfHeader() to inspect the header information in the
   'chr22.vcf.gz' file in VariantAnnotation package.
*  Select a few 'info' or 'geno' variables and create a ScanVcfParam object.
*  Use the ScanVcfParam object as the 'param' argument to readVcf()
   to read in a subset of data.
Note that the header() accessor operates on VCF objects in the R
workspace. Try header(vcf) on the dbSNP file from AnnotationHub. 

Exercise 2: Annotate the mouse ranges in 'gr_mm10' with AnnotationHub files.
*  Create a new 'hub' and filter on organism.
*  Isolate the files for the appropriate genome build and perform overlaps. 
 
Exercise 3: Annotate a gene range from Saccharomyces Scerevisiae.
*  Load TxDb.Scerevisiae.UCSC.sacCer3.sgdGene and extract the
   gene ranges. (Hint: use transcriptsBy() and range()).
*  Isolate the range for gene "YBL086C".
*  Create a new 'hub' from AnnotationHub and filter by organism.
   (You should see >= 39 files.)
*  Select the files for 'sacCer3' and perform overlaps.

<p class="back_to_top">[ <a href="#top">Back to top</a> ]</p>