---
title: "Introduction to gep2pep"
date: 
author: "Francesco Napolitano"
output:
  rmarkdown::html_vignette:
    toc: true
vignette: >
  %\VignetteIndexEntry{Introduction to gep2pep}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include=FALSE}
library(knitr)
opts_chunk$set(collapse = TRUE)
```

## About gep2pep

Pathway Expression Profiles (PEPs) are based on the expression of all
the pathways (or generic gene sets) belonging to a collection under a
given experimental condition, as opposed to individual
genes. `gep2pep` supports the conversion of gene expression profiles
(GEPs) to PEPs and performs enrichment analysis of both pathways and
conditions.

`gep2pep` creates a local repository of gene sets, which can also be
imported from the MSigDB database [1]. The local repository is in the
`repo` format. When a GEP, defined as a ranked list of genes, is
passed to `buildPEPs`, the stored database of pathways is used to
convert the GEP to a PEP and permanently store the latter.

One type of analysis that can be performed on PEPs and that is
directly supported by `gep2pep` is the Drug-Set Enrichment Analysis
(DSEA, see reference below). It finds pathways that are consistently
dysregulated by a set of drugs, as opposed to a background of other
drugs. Of course PEPs may refer to non-pharmacological conditions
(genetic perturbations, disease states, etc.) for analogous
analyses. See the `CondSEA` function.

A complementary approach is that of finding conditions under which a
set of pathways is consistently UP- or DOWN-regulated. This is the
pathway-based version of the Gene Set Enrichment Analysis (GSEA). As
an application example, this approach can be used to find drugs
mimicking the dysregulation of a gene by looking for drugs
dysregulating the pathways involving the gene (this has been published
as the `gene2drug` tool [3]). See the `PathSEA` function.


### Toy and real-world examples

This vignette uses toy data, as real data can be computationally
expensive. Connectivity Map data [4] (drug induced gene expression
profiles) pre-converted to PEPs can be downloaded from
http://dsea.tigem.it in the `gep2pep` format. At the end of this
vignette, a precomputed example is reported in which that data is
used.


## Creating a repository of pathways

In order to use the package, it must be loaded as follows (the
GSEABase package will also be used in this vignette to access gene set
data):

```{r}
library(gep2pep)
suppressMessages(library(GSEABase))
```

The [MSigDB](http://software.broadinstitute.org/gsea/msigdb) is a
curated database of gene set collections. The entire database can be
downloaded as a single XML file and used by `gep2pep`. The following
commented code would import the database once downloaded:

```{r}
## db <- importMSigDB.xml("msigdb_v6.1.xml")
```

However, for this vignette a small excerpt will be used. Gep2pep uses
a slight variation of the `BroadCollection` type used by MSigDB,
named `CategorizedCollection`. Thus a conversion is necessary:

```{r}
db <- loadSamplePWS()
db <- as.CategorizedCollection(db)
```

The database is in `GSEABase::GeneSetCollection` format and
includes 30 pathways, each of which is included in one of 3 different
collections. Following MSigDB conventions, each collection is
identified by a "category" and "subCategory" fields. But the
CategorizedCollection type allows them to be arbitrary strings, as
opposed to MSigDB categories. This allows for the creation of custom
collection with categories. `gep2pep` puts together into a
single collection identifier using `makeCollectionIDs`.


```{r}
colltypes <- sapply(db, collectionType)
cats <- sapply(colltypes, attr, "category")
subcats <- sapply(colltypes, attr, "subCategory")
print(cats)
print(subcats)
makeCollectionIDs(db)
```

In order to build a local `gep2pep` repository containing pathway
data, `createRepository` is used:

```{r, collapse=TRUE}
repoRoot <- file.path(tempdir(), "gep2pep_data")
rp <- createRepository(repoRoot, db)
```

The repository is in `repo` format (see later in this document how to
access the repository directly). However, knowing `repo` is not
necessary to use `gep2pep`. The following lists the contents of the
repository, loads the `GeneSetCollection` object containing all the
TFT database gene sets and finally shows the description of the
pathway "E47_01":

```{r}
rp
TFTsets <- rp$get("c3_TFT_sets")
TFTsets
description(TFTsets[["E47_01"]])
```

`rp$get` is a `repo` command, see later in this vignette.

## Creating Pathway Expression Profiles

Pathway Expression Profiles (PEPs) are created from Gene Expression
Profiles (GEPs) using pathway information from the
repository. GEPs must be provided as a matrix with rows corresponding
to genes and columns corresponding to conditions
(*conditions*). Genes and conditions must be specified through row
and column names respectively. The values must be ranks: for each
condition, the genes must be ranked from that being most UP-regulated
(rank 1) to that being most DOWN-regulated (rank equal to the number
of rows of the matrix).

One well known database that can be obtained in this format is for
example the Connectivty Map. A small excerpt (after further
processing) is included with the `gep2pep`. The excerpt must be
considered as a dummy example, as it only includes 500 genes for 5
conditions. It can be loaded as follows:

```{r}
geps <- loadSampleGEP()
dim(geps)
geps[1:5, 1:3]
```

The GEPs can be converted to PEPs using the `buildPEPs` function. They
are stored as repository items by the names "category_subcategory".

```{r, collapse=TRUE}
buildPEPs(rp, geps)
```

Each PEP is composed of an Enrichment Score (ES) -- p-value (PV) pair
associated to each pathway. ESs and PVs are stored in two separated
matrices. For each condition, the p-value reports wether a pathway
is significantly dysregulated and the sign of the corresponding ES
indicates the direction (UP- or DOWN-regulation).

```{r}
loadESmatrix(rp, "c3_TFT")[1:3, 1:3]
loadPVmatrix(rp, "c3_TFT")[1:3, 1:3]
```

## Performing Analysis

### Performing Condition-Set Enrichment Analysis (CondSEA).

Suppose the stored PEPs correspond to pharmacological
perturbations. Then `gep2pep` can perform Drug-Set Enrichment Analysis
(DSEA, see Napolitano et al., 2016, Bioinformatics). It finds pathways
that are consistently dysregulated by a set of drugs, as opposed to a
background of other drugs. Of course PEPs may refer to
non-pharmacological conditions (genetic perturbations, disease states,
etc.) for analogous analyses (Condition-Set Enrichment Analysis,
CondSEA). Given a set `pgset` of drugs of interest, CondSEA (which in
this case is a DSEA) is performed as follows:

```{r, collapse=TRUE}
pgset <- c("(+)_chelidonine", "(+/_)_catechin")
psea <- CondSEA(rp, pgset)
```

The result is a list of of 2 elements, named "CondSEA" and "details",
the most important of which is the former. Per-collection results can
be accessed as follows:

```{r}
getResults(psea, "c3_TFT")
```

In this dummy example the statistical background is made of only 3
GEPs (we added 5 in total), thus, as expected, there are no
significant p-values. For the c3_MIR collection, the pathway most
UP-regulated by the chosen set of two drugs is M5012, while the most
DOWN-regulated is M18759. They are respectively described as:

```{r}
sets <- rp$get("c3_MIR_sets")
wM5012 <- which(sapply(sets, setIdentifier)=="M5012")
wM18759 <- which(sapply(sets, setIdentifier)=="M18759")

description(sets[[wM5012]])

description(sets[[wM18759]])
```

The analysis can be exported in XLS format as follows:

```{r, eval=FALSE}
exportSEA(rp, psea)
```

Performing `CondSEA` using Pathway Expression Profiles derived from
drug-induced gene expression profiles yields Drug Set Enrichment
Analysis (DSEA [2]).


### Performing Pathway-Set Enrichment Analysis (PathSEA)

A complementary approach to CondSEA is Pathway-Set Enrichment Analysis
(PathSEA). PathSEA searches for conditions that consistently
dysregulate a set of pathways. It can be seen as a pathway-based
version of the popular Gene Set Enrichment Analysis (GSEA). The
PathSEA is run independently in each pathway collection.

```{r, collapse=TRUE}
pathways <- c("M11607", "M10817", "M16694",         ## from c3_TFT
              "M19723", "M5038", "M13419", "M1094") ## from c4_CGN
w <- sapply(db, setIdentifier) %in% pathways
subdb <- db[w]

psea <- PathSEA(rp, subdb)
```

```{r}
getResults(psea, "c3_TFT")
```

PathSEA results are analogous to those of CondSEA, but
condition-wise. A set of pathways con also be obtained starting from
a gene of interest, for example:

```{r}
pathways <- gene2pathways(rp, "FAM126A")
pathways
```

Using a gene to obtain the pathways and performing `PathSEA` with
drug-induced Pathway Expression Profiles yields "gene2drug" analysis,
see the following reference:

* Napolitano F. et al, gene2drug: a Computational Tool for
  Pathway-based Rational Drug Repositioning, bioRxiv (2017) 192005;
  doi: https://doi.org/10.1101/192005



## A real-world example

Precomputed Pathway Expression Profiles of the Connectivity Map data
in the gep2pep format can be downloaded, unpacked and opened as
follows:

```{r, eval=F}
download.file("http://dsea.tigem.it/data/Cmap_MSigDB_v6.1_PEPs.tar.gz",
  "Cmap_MSigDB_v6.1_PEPs.tar.gz")

untar("Cmap_MSigDB_v6.1_PEPs.tar.gz")

rpBig <- openRepository("Cmap_MSigDB_v6.1_PEPs")
```

Using these data, two kinds of analysis can be performed:

1. Drug Set Enrichment Analysis, which looks for commond pathways
   shared by a set of drugs.

2. Gene2drug analysis, which looks for drugs dysregulating a gene of
   interest.

The analyses below are not built at runtime with this document and
could become outdated.


### Drug Set Enrichment Analysis (DSEA)

Drug Set Enrichment Analysis for a set of HDAC inhibitors using the
Gene Ontology collections can be performed as follows:

```{r, eval=F}
csea <- CondSEA(rpBig, c("scriptaid", "trichostatin_a", "valproic_acid",
                      "vorinostat", "hc_toxin", "bufexamac"),
                collections=c("C5_BP", "C5_MF", "C5_CC"))
## [16:41:40] Working on collection: C5_BP
## [16:41:42] Common conditions removed from bgset
## [16:41:42] Row-ranking collection
## [16:41:48] Computing enrichments
## [16:41:58] done
## [16:41:58] Working on collection: C5_MF
## [16:41:58] Row-ranking collection
## [16:42:00] Computing enrichments
## [16:42:02] done
## [16:42:02] Working on collection: C5_CC
## [16:42:02] Row-ranking collection
## [16:42:03] Computing enrichments
## [16:42:04] done
```

The following code retrieves information about the top 10 pathways
ranked by CondSEA in GO-MF.

```{r, eval=F}
library(GSEABase)
setids <- sapply(loadCollection(rpBig, "C5_MF"), setIdentifier)
MFresults <- getResults(csea, "C5_MF")
w <- match(rownames(MFresults)[1:10], setids)
top10 <- loadCollection(rpBig, "C5_MF")[w]
sapply(top10, setName)
##  [1] "GO_TRANSCRIPTION_FACTOR_ACTIVITY_PROTEIN_BINDING"
##  [2] "GO_TRANSCRIPTION_COACTIVATOR_ACTIVITY"           
##  [3] "GO_PHOSPHATIDYLCHOLINE_1_ACYLHYDROLASE_ACTIVITY" 
##  [4] "GO_RETINOIC_ACID_RECEPTOR_BINDING"               
##  [5] "GO_PRE_MRNA_BINDING"                             
##  [6] "GO_N_ACETYLTRANSFERASE_ACTIVITY"                 
##  [7] "GO_CYTOSKELETAL_PROTEIN_BINDING"                 
##  [8] "GO_PEPTIDE_N_ACETYLTRANSFERASE_ACTIVITY"         
##  [9] "GO_ACETYLTRANSFERASE_ACTIVITY"                   
## [10] "GO_HYDROGEN_EXPORTING_ATPASE_ACTIVITY"           
```

Note that 2 main effects of HDAC inhibitors have been correctly
identified: regulation of transcription, and alteration of the
acetylation/deacetylation homeostasis. The full analysis can be
exported to the Excel format with:

```{r, eval=F}
exportSEA(rpBig, csea)
```

### Gene2drug analysis

A Gene2drug analysis can be performed starting by a gene of interest,
for example the TFEB gene. Pathways including the gene are found as
follows:

```{r, eval=F}
pws <- gene2pathways(rpBig, "TFEB")
```

The following code runs the PathSEA analysis on the pathways involving
TFEB. Also in this case the analysis is performed on Gene Ontology
collections. Note that a warning is thrown as the GO-CC category has
no annotation for TFEB (this is ok).

```{r, eval=F}
psea <- PathSEA(rpBig, pws, collections=c("C5_BP", "C5_MF", "C5_CC"))
## Warning: [17:17:13] There is at least one selected collections for
## which no pathway has been provided
## [17:17:13] Removing pathways not in specified collections
## [17:17:13] Working on collection: C5_BP
## [17:17:13] Common pathway sets removed from bgset
## [17:17:15] Column-ranking collection
## [17:17:22] Computing enrichments
## [17:17:29] done
## [17:17:29] Working on collection: C5_MF
## [17:17:29] Common pathway sets removed from bgset
## [17:17:29] Column-ranking collection
## [17:17:30] Computing enrichments
## [17:17:32] done
```

Thus the top 10 drugs causing (or mimicking) TFEB upregulation are:

```{r, eval=F}
getResults(psea, "C5_BP")[1:10,]
##                      ES           PV
## loperamide    0.7324720 1.504075e-11
## proadifen     0.7278256 2.079448e-11
## hydroquinine  0.7220082 3.110434e-11
## bepridil      0.6904276 2.616027e-10
## clomipramine  0.6891810 2.839879e-10
## alexidine     0.6741085 7.574142e-10
## digitoxigenin 0.6737685 7.741670e-10
## lanatoside_c  0.6651556 1.342553e-09
## helveticoside 0.6642112 1.425479e-09
## ouabain       0.6631157 1.527949e-09
```

Note that the top drug, loperamide, has been demonstrated to induce
TFEB translocation at very low concentrations [3]. As before, the full
analysis can be exported to the Excel format with:

```{r, eval=F}
exportSEA(rpBig, psea)
```

## Advanced access to the repository

`gep2pep` users don't need to understand how to interact with a
`gep2pep` repository, however it can be useful in some
cases. `gep2pep` repositories are in the `repo` format (see the `repo`
package), so they can be accessed as any other `repo`
repository. However item tags should not be changed, as they are used
by `gep2pep` to identify data types. Each `gep2pep` repository always
contains a special *project* item including repository and session
information, which can be shown as follows:

```{r}
rp$info("gep2pep repository")
```

Poject name and description can be provided when creating the
repository (`createRepository`), or edited with `rp$set("gep2pep
repository", newname="my project name", description="my project
description")`. The repository will also contain an item for each
pathway collection, and possibly an item for each corresponding PEP
collection, as in this example:

```{r}
rp
```

In order to look at the space that the repository is using, the
following command can be used:

```{r}
rp$info()
```


`put`, `set` and other `repo` commands can be used to alter repository
contents directly, however this could leave the repository in an
inconsistent state. The following code checks a repository for
possible problems:

```{r, collapse=TRUE}
checkRepository(rp)
```

The last check (summary of commond conditions) ensures that the same
conditions have been computed for all the pathway collections, which
is however not mandatory.


## Further documentation

Additional methodological help can be found at:

1. http://dsea.tigem.it
2. http://gene2drug.tigem.it
3. Open access papers [2] and [3].


```{r include=FALSE}
unlink(repoRoot, TRUE)
```


## References

[1] Subramanian A. et al. Gene set enrichment analysis: A
    knowledge-based approach for interpreting genome-wide expression
    profiles. PNAS 102, 15545-15550 (2005).
    
[2] Napolitano F. et al, Drug-set enrichment analysis: a novel tool to
    investigate drug mode of action. Bioinformatics 32, 235-241 (2016).

[3] Napolitano F. et al, gene2drug: a Computational Tool for
    Pathway-based Rational Drug Repositioning, bioRxiv (2017) 192005;
    doi: https://doi.org/10.1101/192005

[4] Lamb, J. et al. The Connectivity Map: Using Gene-Expression
    Signatures to Connect Small Molecules, Genes, and Disease. Science
    313, 1929-1935 (2006).


<hr/>
<small><i>
Built with `gep2pep` version `r packageVersion("gep2pep")`
</i></small>