---
title: "Using Bioconductor's Annotation Libraries"
author: 
  - name: "Marc Carlson"
  - name: "Jianhua Zhang"
  - name: "Manvi Yaduvanshi"
    affiliation: "Vignette translation from Sweave to R Markdown / HTML" 
date: "`r format(Sys.time(), '%d %B, %Y')`"
output:
  BiocStyle::html_document
vignette: >
  %\VignetteIndexEntry{Using Bioconductor's Annotation Libraries}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
  %\VignetteIndexEntry{Using Data Packages} 
  %\VignetteDepends{hgu95av2.db, GO.db} 
  %\VignetteKeywords{Annotation} 
  %\VignettePackage{annotate}
---

# Overview

The Bioconductor project maintains a rich body of annotation data
assembled into R libraries. The purpose of this vignette is to discuss
the structure, contents, and usage of these annotation data libraries.
Executable code is provided as examples.

# Contents

Bioconductor's annotation data libraries are constructed by assembling
data collected from various public data repositories using
Bioconductor's `r Biocpkg("AnnotationDbi")` package and distributed as regular R
libraries that can be installed and loaded in the same way an R library
is installed/loaded. Each annotation library is an independent unit that
can be used alone or in conjunction with other annotation libraries.
Platform specific libraries are a group annotation libraries assembled
specifically for given platforms (e. g. Affymetrix HG_U95Av2).
`org.XX.eg.db` are libraries containing data assembled at genome level
for specific organisms such as human, mouse, fly, or rat.
KEGG.db^[Deprecated in Bioconductor 3.13] and `r Biocpkg("GO.db")` are source
specific libraries containing generic data for various genomes.

Each annotation library, when installed, contains a sqlite database
contained within the `extdata` along with a `man` subdirectory filled
with documentation about the data. The data can be accessed using the
standard methods that would work for the classic environment objects
(hash table with key-value pairs) and act as if they were simple
associations of annotation values to a set of keys. For each of these
emulated environment objects (which we will refer to as mappings), there
is a corresponding help file in the `man` directory with detail
descriptions of the data file and usage. In addition to the traditional
access to these data, these databases can also be accessed directly by
using DBI interfaces which allow for powerful new combinations of these
data.

Each platform specific library creates a series of these mapping objects
named by following the convention of package name plus mapping name. The
package name is in lower case letters and the mapping names are in
capital letters. When a given mapping maps platform specific keys to
annotation data, only the name of the annotation data is used for the
name of the mapping. Otherwise, the mapping names have a pattern of key
name and value name joined by a "2" in between. For example,
`hgu95av2ENTREZID` maps probe ids on an Affymetrix human genome U95Av2
chip to EntrezGene IDs while `hgu95av2GO2PROBE` maps Gene Ontology IDs
to probe IDs. Names of the mappings available in a platform specific
data package are not listed here to save space but are easily accessible
as shown later in the section for usage.

Genome level annotation libraries are named in the form of org.Xx.yy.db
where Xx represents an abbreviation of the genus and species. Each of
the organism wide genome annotation packages is based upon some type of
widely used gene based identifier (such as an Entrez Gene id) that is
mapped onto all the other features in the package. The yy part of the
name corresponds to this designation, where eg means a package is an
entrez gene package and sgd is a package based on the sgd database etc.
In many cases the org packages will contain more different kinds of
information that the platform based ones, since not all types of
information are as widely sought after.

The KEGG.db library contains mappings between ids such as Entrez Gene
IDs and *GO* to *KEGG* pathway ids and thus also to pathway names. The
`r Biocpkg("GO.db")` library maintain the directed acyclic graph structure of the
original data from Gene Ontology Consortium by providing mappings of GO
ids to their direct parents or children for each of the three categories
(molecular function, cellular component, and biological process).
Mappings between Entrez Gene and *GO* ids are also available to
complement the `r Biocpkg("GO.db")` package. These mappings are found within the
organism wide packages mentioned above. These mappings are provided with
evidence code that specifies the type of evidence that supports the
annotation of a gene to a particular *GO* term.

# Usage

All the annotation libraries can be obtained from https://www.bioconductor.org.
To illustrate their usages, we use the library for Affymetrix HG_U95Av2 chip 
`r Biocpkg("hgu95av2.db")` as an example for platform specific data packages
and the `r Biocpkg("GO.db")` library for non-platform specific data packages. We
assume that [R](www.r-project.org) and Bioconductor's `r Biocpkg("Biobase")` and
`r Biocpkg("annotation")`.

## Package installation {#package-installation .unnumbered}

Download libraries `r Biocpkg("hgu95av2.db")` and  `r Biocpkg("GO.db")`  with
`BiocManager::install()`.

Typing `library(library name)` in an R session will load the library into
R. For example,

```{r loadLibs, message=FALSE}
library("annotate")
library("hgu95av2.db")
library("GO.db")
```

## Documentations {#documentations .unnumbered}

Each library contains documentation for the library in general and each
of the individual mapping objects contained by the library. Two
documents at the library level can be accessed by typing a library
basename proceeded by a question mark (e. g. `?hgu95av2`) and the library
basename followed by a pair of brackets (e. g. `hgu95av2()`),
respectively. The former explains what the package is and details how a
user can get more information, while the latter lists all the mappings
contained by a library and provides information on the total number of
keys within each of the maps contained by the library and how many of
these keys are annotated. In addition, the latter will indicate the
sources for the information provided by the package as well as the date
that these sources claim to have last been updated.

The documentation for a given mapping object can be accessed by typing
the name of a mapping object proceeded by a question mark (e. g.
`hgu95av2GO`). The resulting documentation provides detail explanations
to the mapping object, data source used to build the object, and example
code for accessing annotation data.

## Accessing annotation data within a library {#accessing-annotation-data-within-a-library .unnumbered}

Annotation data of a given library are stored as mapping objects in the
form of key (items to be annotated) and value (annotation for an key
item) pairs. Each mapping object provides annotation for keys for a
particular subject reflected by the name of the object. For example,
`hgu95av2GO` annotates probes on the HGU95Av2 chip with ids
of the Gene Ontology terms the probes correspond to.

The name of an mapping object consists of package basename
(`r Biocpkg("hgu95av2.db")`) and mapping name *GO* to avoid confusion when multiple
libraries are loaded to the system at the same time. Data contained by
an mapping can be accessed easily using Bioconductor's existing
functions. For example, the following code stores all the keys contained
by the `hgu95av2GO` mapping object to variable `temp` and displays the
first five keys on the screen:

```{r getGO}
as.list(hgu95av2GO[5])
```

To obtain annotation for a given set of keys, one may use the `mget`
function. Suppose we have run an experiment using the HG_U95Av2 chip and
found three genes represented by Affymetrix probe ids *738_at*,
*40840_at*, and *41668_r_at* interesting. To get the names of genes the
three probe ids corresponding to, we do:

```{r showmget}
mget(c("738_at", "40840_at", "41668_r_at"), hgu95av2GENENAME)
```

Similarly, identifiers of Gene Ontology terms corrsponding to the three
probes can be obtained as shown below:

```{r moremget}
temp <- mget(c("41561_s_at", "40840_at", "41668_r_at"), hgu95av2GO)
```

In this case, the function `mget` returns a list of pre-defined S4
objects containing data for the ids, ontology, and evidence code of Gene
Ontology terms corresponding to the three keys. The following code shows
how to access the GO id, evidence code and ontology of the Gene Ontology
term corresponding to probe id *40840_at*:

```{r gettingGO}
temp <- get("738_at", hgu95av2GO)
names(temp)
temp[["GO:0008253"]][["Evidence"]]
temp[["GO:0008253"]][["Ontology"]]
```

As shown above, probe *40840_at* can be annotated by three Gene Ontology
terms identified by *GO\:0005829*, *GO\:0008253*, and *GO\:0016787*. The
evidence code for *GO\:0008253* is *TAS* (traceable author statement) and
it belongs to ontology *MF* (molecular function).

## Accessing annotation data across libraries {#accessing-annotation-data-across-libraries .unnumbered}

Often, data available in a given data package alone may not be
sufficient and need to be sought across packages. Bioconductor's
annotation data packages are linked by common public data identifiers to
allow traverse between packages. Using the example above, we
know that probe id *738_at* are annotated by three Gene Ontology ids
*GO\:0005829*, *GO\:0008253*, and *GO\:0016787*. The Gene Ontology terms
for various Gene Ontology ids, however, are stored in another package
named `r Biocpkg("GO.db")`. As package `r Biocpkg("hgu95av2.db")` and
`r Biocpkg("GO.db")` are linked by *GO* ids, one can annotate probe id *738_at*
with Gene Ontology terms by linking data in the two packages using *GO* id as
shown below:

```{r}
mget(names(get("738_at", hgu95av2GO)), GOTERM)
```

It turns out that probe id *738_at* (corresponding to *GO\:0008253*, and
*GO\:0016787*) has molecular function (MF) *5'-nucleotidase activity* and
*hydrolase activity*.

# Session Information

The version number of R and packages loaded for generating the vignette
were:

```{r echo=FALSE}
sessionInfo()
```