---
title: "Topology-based pathway analysis of RNA-seq data"
author:
    - name: Ivana Ihnatova
      affiliation: Institute of Biostatistics and Analyses, 
                   Masarykova University Brno
      email: ihnatova@iba.muni.cz
    - name: Ludwig Geistlinger
      affiliation: School of Public Health, City University of New York
      email: ludwig.geistlinger@sph.cuny.edu
package: ToPASeq
abstract: > 
    The _ToPASeq_ package implements methods for topology-based pathway analysis
    of RNA-seq data. This includes Topological Analysis of Pathway Phenotype 
    Association (TAPPA; Gao and Wang, 2007), PathWay Enrichment Analysis (PWEA; 
    Hung et al., 2010), and the Pathway Regulation Score (PRS; Ibrahim et al., 
    2012).
output:
  BiocStyle::html_document:
    toc: true
    toc_depth: 2
vignette: >
  % \VignetteIndexEntry{Topology-based pathway analysis of RNA-seq data}
  % \VignetteEngine{knitr::rmarkdown}
---

```{r setup, echo=FALSE}
suppressPackageStartupMessages({ 
    library(ToPASeq)
    library(EnrichmentBrowser)
    library(graphite)
    library(BiocStyle)
})
```

# Setup

**Note:** the `r Biocpkg("ToPASeq")` package currently undergoes a major rework due to 
the change of the package maintainer. It is recommended to use the topology-based methods 
implemented in the `r Biocpkg("EnrichmentBrowser")` or the `r Biocpkg("graphite")`
package instead.

We start by loading the package.

```{r lib}
library(ToPASeq)
```

# Preparing the data

For RNA-seq data, we consider transcriptome profiles of four primary human
airway smooth muscle cell lines in two conditions: control and treatment with
dexamethasone
[Himes et al., 2014](https://doi.org/10.1371/journal.pone.0099625).

We load the `r Biocpkg("airway")` dataset 

```{r loadAirway}
library(airway)
data(airway)
```

For further analysis, we only keep genes that are annotated to an ENSEMBL gene ID. 
```{r processAirway}
airSE <- airway[grep("^ENSG", rownames(airway)),]
dim(airSE)
assay(airSE)[1:4,1:4]
```

# Differential expression

The `r Biocpkg("EnrichmentBrowser")` package incorporates established 
functionality from the `r Biocpkg("limma")` package for differential expression 
analysis.
This involves the `voom` transformation when applied to RNA-seq data.
Alternatively, differential expression analysis for RNA-seq data can also be
carried out based on the negative binomial distribution with `r Biocpkg("edgeR")`
and `r Biocpkg("DESeq2")`.

This can be performed using the function `EnrichmentBrowser::deAna`
and assumes some standardized variable names:

- **GROUP** defines the sample groups being contrasted,
- **BLOCK** defines paired samples or sample blocks, as e.g. for batch effects.

For more information on experimental design, see the
[limma user's guide](https://www.bioconductor.org/packages/devel/bioc/vignettes/limma/inst/doc/usersguide.pdf),
chapter 9.

For the airway dataset, the **GROUP** variable indicates whether the cell 
lines have been treated with dexamethasone (1) or not (0).

```{r pdataAirway}
airSE$GROUP <- ifelse(airway$dex == "trt", 1, 0)
table(airSE$GROUP)
```

Paired samples, or in general sample batches/blocks, can be defined via a
**BLOCK** column in the `colData` slot.
For the airway dataset, the sample blocks correspond to the four different cell
lines.

```{r pdataAirway2}
airSE$BLOCK <- airway$cell
table(airSE$BLOCK)
```

For RNA-seq data, the `deAna` function can be used to carry out differential 
expression analysis between the two groups either based on functionality from
*limma* (that includes the `voom` transformation), or
alternatively, the frequently used *edgeR* or *DESeq2*
package. Here, we use the analysis based on *edgeR*.

```{r deAirway}
library(EnrichmentBrowser)
airSE <- deAna(airSE, de.method="edgeR")
rowData(airSE, use.names=TRUE)
```


# Pathway analysis

Pathways are typically represented as graphs, where the nodes are genes and edges 
between the nodes represent interaction between genes. 

The `r Biocpkg("graphite")` package provides pathway collections from major 
pathway databases such as KEGG, Biocarta, Reactome, and NCI.

Here, we retrieve human KEGG pathways.  

```{r pwys}
library(graphite)
pwys <- pathways(species="hsapiens", database="kegg")
pwys
```

As the airway dataset uses ENSEMBL gene IDs, but the nodes of the pathways are 
based on NCBI Entrez Gene IDs,

```{r nodes}
nodes(pwys[[1]])
```

we first map the gene IDs in the airway dataset from ENSEMBL to ENTREZ IDs. 

```{r mapIDs}
airSE <- idMap(airSE, org="hsa", from="ENSEMBL", to="ENTREZID")
```

Next, we define all genes with adjusted _p_-value below 0.01 as differentially 
expressed, and collect their log2 fold change for further analysis.

```{r genes}
all <- names(airSE)
de.ind <- rowData(airSE)$ADJ.PVAL < 0.01
de <- rowData(airSE)$FC[de.ind]
names(de) <- all[de.ind]
```

This results in 2,426 DE genes - out of 11,780 genes in total. 

```{r nrGenes}
length(all)
length(de)
```

## Pathway Regulation Score (PRS)

The Pathway Regulation Score (PRS) incorporates the pathway topology by weighting
the indiviudal gene-level log2 fold changes by the number of downstream DE genes. 
The weighted absolute fold changes are summed across the pathway and statistical 
significance is assessed by permutation of genes. 
[Ibrahim et al., 2012](https://doi.org/10.1089/cmb.2011.0182)   
    

```{r prs}
res <- prs(de, all, pwys[1:100], nperm=100)
head(res)
```

Corresponding gene weights (number of downstream DE genes) can be obtained for a
pathway of choice via
```{r prsWeights}
ind <- grep("ErbB signaling pathway", names(pwys))
weights <- prsWeights(pwys[[ind]], de, all)
weights
```

Inspecting the genes with maximum number of downstream DE genes
```{r maxWeight}
weights[weights == max(weights)]
```
reveals important upstream regulators including several G protein subunits 
such as subunit beta 2 (Entrez Gene ID 
[2783](https://www.ncbi.nlm.nih.gov/gene/?term=2783)).