---
title: "diffUTR"
author:
- name: Pierre-Luc Germain
  affiliation:
  - D-HEST Institute for Neurosciences, ETH Zürich
  - Laboratory of Statistical Bioinformatics, University Zürich
package: diffUTR
output:
  BiocStyle::html_document
abstract: |
  Showcases the use of the diffUTR package for streamlining analyses of 
  differential exon usage and differential 3' UTR usage.
vignette: |
  %\VignetteIndexEntry{1_diffUTR}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include=FALSE}
library(BiocStyle)
```

# Introduction

## Differential exon usage

`diffUTR` wraps around three methods for different exon usage analysis:

* The `r BiocStyle::Biocpkg("DEXSeq")` package (see `?DEXSeqWrapper`)
* An improved version of `r BiocStyle::Biocpkg("limma")`'s `diffSplice` method 
(see `?diffSpliceWrapper`)
* `r BiocStyle::Biocpkg("edgeR")`'s `diffSpliceDGE` method 
(see `?diffSpliceDGEWrapper`)

All three wrappers have been designed to use the same input (the 
`RangedSummarizedExperiment` object created by `countFeatures` -- see below) 
and produce highly similar outputs, so that they can all be used with the same 
downstream plotting functions (Figure 1A).

Based on various benchmarks (e.g. 
[Sonenson et al. 2016](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-015-0862-3);
see also [our paper](https://doi.org/10.1186/s12859-021-04114-7)), 
`r BiocStyle::Biocpkg("DEXSeq")` is the most accurate of the three methods, 
and should therefore be the method of choice. It is however very slow and does 
not scale well to larger sample sizes. We therefore suggest our improved 
_diffSplice2_ method (`?diffSpliceWrapper`) when 
`r BiocStyle::Biocpkg("DEXSeq")` is not an option.

```{r echo = FALSE, fig.cap = "A: Overview of the diffUTR workflow. B: Bin creation scheme."}
knitr::include_graphics(system.file('docs', 'figure1.svg', package = 'diffUTR'))
```

## Differential 3' UTR usage

A chief difficulty in analyzing 3' UTR usage is that most UTR variants are not 
cataloged in standard transcript annotations, limiting the utility of standard 
transcript-level quantification based on reference transcripts. `diffUTR` 
leverages the differential exon usage methods for differential 3' UTR analysis 
using alternative poly-adenylation site databases to create additional UTR 
bins (Figure 1B).

<br/><br/>

# Getting started

## Package installation

Install the package with:

```{r, eval=FALSE}
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("diffUTR")
```

We load diffUTR as well as the `r BiocStyle::Biocpkg("SummarizedExperiment")` 
package on which it depends:

```{r}
suppressPackageStartupMessages({
  library(diffUTR)
  library(SummarizedExperiment)
})
```

## Obtaining gene annotations

Prior to using `diffUTR` you will need a gene annotation. This can either be 
provided as a `gtf` file (with relatively standard formatting), or as an 
`r BiocStyle::Biocpkg("ensembldb")` object. For example, an `EnsDb` object for 
the latest mouse annotation can be fetched as follows:

```{r, eval=FALSE}
library(AnnotationHub)
ah <- AnnotationHub()
# obtain the identifier of the latest mouse ensembl annotation:
ahid <- rev(query(ah, c("EnsDb", "Mus musculus"))$ah_id)[1]
ensdb <- ah[[ahid]]
```

This `ensdb` object could then be directly passed to the `prepareBins` function
 (see below).

For the purpose of this vignette, we will use a reduced annotation containing 
only 100 mouse genes:

```{r}
data(example_gene_annotation, package="diffUTR")
g <- example_gene_annotation
head(g)
```

# Workflow for differential exon usage (DEU) analysis

## Preparing the annotation

Because exons partially overlap, they must be disjoined in non-overlapping bins
for the purpose of analysis. This is handled by the `prepareBins` function:

```{r}
# If you know that your data will be stranded, use
# bins <- prepareBins(g, stranded=TRUE)
# Otherwise use
bins <- prepareBins(g)
```

## Counting reads in bins

Counting reads (or fragments) overlapping bins is done using the 
`countFeatures` function, with a vector of paths to `bam` files as input. Under
 the hood, this calls the `featureCounts` function of the 
 `r BiocStyle::Biocpkg("Rsubread")` package, so any argument of that function 
 can be passed to `countFeatures`. For example:

```{r, eval=FALSE}
bamfiles <- c("path/to/sample1.bam", "path/to/sample2.bam", ...)
rse <- countFeatures(bamfiles, bins, strandSpecific=2, nthreads=6, isPairedEnd=FALSE)
```

Note that `strandSpecific` should be set correctly (i.e. to 0 if the data is 
unstranded, and to 1 or 2 if stranded -- see `?Rsubread::featureCounts`), and
`isPairedEnd=TRUE` should be used if the data is paired-end.

For the purpose of this vignette, we load a pre-computed object containing data
 from Whipple et al. (2020):
```{r}
data(example_bin_se, package="diffUTR")
rse <- example_bin_se
rse
```

The output of `countFeatures` is a RangedSummarizedExperiment object (see 
`r BiocStyle::Biocpkg("SummarizedExperiment")`) containing the samples' counts 
(as well as normalized assays) across bins, as well as all the bin information 
in the `rowRanges(se)`:

```{r}
head(rowRanges(rse))
```
In this example only a subset of information was retained, but the original 
file stores any information present in the annotation (e.g. transcripts, 
biotype, etc.)

## Differential analysis

For the purpose of this example, we will use the improved diffSplice method:

```{r}
rse <- diffSpliceWrapper(rse, design = ~condition)
```

The `~condition` formula indicates that the samples should be split according 
to the `group` column of `colData(rse)`. Alternatively to the formula inferace,
a `model.matrix` can be directly passed. For more complex models involving 
several terms, you will have to specify the coefficient to test using the 
`coef` argument.

The bin-wise results of the the differential analysis have been saved in the 
`rowData` of the object:

```{r}
head(rowData(rse))
```

The gene-wise aggregation has been saved in the object's metadata:
```{r}
perGene <- metadata(rse)$geneLevel
head(perGene)
```

The results are described in more detail below.

<br/><br/>


# Workflow for differential 3' UTR usage analysis

## Obtaining alternative poly-adenlyation sites and preparing the bins

The preparation of the bins is done as in the standard DEU case, except that an
 additional argument should be given providing the alternative poly-adenylation
 (APA) sites. These will be used to break and extend UTRs into further bins. 
APA sites can be provided as a `r BiocStyle::Biocpkg("GenomicRanges")` object 
or the path to a `bed` file containing the coordinates. For mouse, human and 
C. elegans, an atlas of poly-A sites can be downloaded from 
https://polyasite.unibas.ch/atlas 
(\link[Herrmann et al. 2020]{https://doi.org/10.1093/nar/gkz918}). You can 
download the mouse file:

```{r, eval=FALSE}
download.file("https://polyasite.unibas.ch/download/atlas/2.0/GRCm38.96/atlas.clusters.2.0.GRCm38.96.bed.gz",
              destfile="apa.mm38.bed.gz")
bins <- prepareBins(g, "apa.mm38.bed.gz")
# (Again, if you know that your data will be stranded, use `stranded=TRUE`)
```

Unfortunately, the [polyASite](https://polyasite.unibas.ch/atlas) database 
does not contain APA sites for the rat. A somewhat older similar database, 
[PolyA_DB](https://exon.apps.wistar.org/PolyA_DB/v3/), does include the rat, 
however with an obsolete annotation; for convenience we lifted it over to Rno6, 
and made it available in the package data:

```{r}
data(rn6_PAS)
# bins <- prepareBins(g, rn6_PAS)
```

If there is no APA sites database that includes your species of interest, you 
can still perform differential UTR analysis, however you will be limited to 
bins defined by the UTRs ends included in the gene annotation.

Note that if the gene annotation and the APA sites use different chromosome 
notation styles, you can enforce a given notation using the `chrStyle` argument
of `prepareBins`. Beware however that there is no internal check that the genome
builds match -- be sure that they do!

For the purpose of this example, we'll just use the same `bins` as in the first
 example.


## Counting and differential analysis:

Counting reads in bins works as in the standard DEU case:

```{r, eval=FALSE}
bamfiles <- c("path/to/sample1.bam", "path/to/sample2.bam", ...)
se <- countFeatures(bamfiles, bins, strandSpecific=2, nthreads=6, isPairedEnd=TRUE)
```

Differential analysis also works in the same way as for DEU:

```{r}
rse <- diffSpliceWrapper( rse, design = ~condition )
```

By default, this will look for changes in _any_ bin type. To look specifically 
for changes in certain types of bins, you can perform the gene-level 
aggregation using different filters. This can be done using the 
`geneLevelStats()` function, which accepts the arguments `excludeTypes` and 
`includeTypes` to decide which bin types to include in the gene-level estimates.
For example:

```{r}
# we can then only look for changes in CDS bins:
cds <- geneLevelStats(rse, includeTypes="CDS", returnSE=FALSE)
head(cds)
# or only look for changes in UTR bins:
utr <- geneLevelStats(rse, includeTypes="UTR", returnSE=FALSE)
head(utr)
# or only look for changes in bins that _could be_ UTRs:
# geneLevelStats(rse, excludeTypes=c("CDS","non-coding"), returnSE=FALSE)
```

<br/><br/>

# Exploring the results

## Top genes

`plotTopGenes()` provides gene-level statistic plots (similar to a 'volcano 
plot'), which come in two variations in terms of aggregated effect size 
(x-axis). For standard DEU analysis, we take the weighted average of absolute 
bin-level coefficients, using the bins' `-log10(p-value)` as weights, to 
produce gene-level estimates of effect sizes. For differential 3' UTR usage, 
where bins are expected to have consistent directions (i.e. lengthening or 
shortening of the UTR) and where their size is expected to have a strong impact 
on biological function, the signed bin-level coefficients are weighted both by 
both size and significance to produce (signed) gene-level estimates of effect 
sizes.

```{r}
plotTopGenes(rse)
```

By default, the size of the points reflects the relative expression of the 
genes, and the colour the relative expression of the significant bins with 
respect to the gene (density ratio). This enables the rapid identification of 
changes occuring in highly-expressed genes, as well as discount changes 
happening in bins which tend not to be included in transcripts from that gene 
(low density ratio). Note that it is possible to use filter by density ratio 
during gene-level aggregation (see `?geneLevelStats`).

The product is a `ggplot` object, and can be manipulated as such.

By default, the gene-level statistics computed use all bin types. However, 
`plotTopGenes` also accepts directly the gene-level stats, as outputted by
`geneLevelStats` (see above), enabling the visualization of top genes 
separately for CDS and UTRs, for instance:

```{r}
# `utr` being the output of the above
# geneLevelStats(rse, includeTypes="UTR", returnSE=FALSE)
plotTopGenes(utr, diffUTR = TRUE)
```

Here we also indicated that we are in differential UTR mode, to use signed and
width-weighted aggregation.

Note that when there are too many gene labels to be plotted nearby, `ggrepel` 
will omit them and produce the warning above. To plot them nevertheless, you
can pass the argument `max.overlaps` with a higher value (see 
`?ggrepel::geom_text_repel`).


## Gene profiles

The package also offers two functions for plotting bin-level profiles for a 
specific gene. Before doing so, we generate normalized assays if this was not 
already done:

```{r}
rse <- addNormalizedAssays(rse)
```

`deuBinPlot` provides plots similar to those produced by 
`r BiocStyle::Biocpkg("DEXSeq")` and `r BiocStyle::Biocpkg("limma")`, but 
offering more flexibility. They can be plotted as overall bin statistics, per 
condition, or per sample, and can plot various types of values. Importantly, 
since all data and annotation is contained in the object, these can easily be 
included in the plots, mapping for instance to colour or shape:

```{r}
deuBinPlot(rse,"Jund")
deuBinPlot(rse,"Jund", type="condition", colour="condition")
deuBinPlot(rse,"Jund", type="sample", colour="condition") # shows separate samples
```

And since the output is a `ggplot`, we can modify it at will:
```{r}
library(ggplot2)
deuBinPlot(rse,"Jund",type="condition",colour="condition") + 
  guides(colour = guide_legend(override.aes = list(size = 3))) +
  scale_colour_manual(values=c(CTRL="darkblue", LTP="red"))
```

Finally, `geneBinHeatmap` provides a compact, bin-per-sample heatmap 
representation of a gene, allowing the simultaneous visualization of various 
information: 

```{r}
geneBinHeatmap(rse, "Smg6")
```

Bins that are overlapping multiple genes will be flagged as such (here there 
are none).

We found these representations particularly useful to prioritize candidates 
from differential bin usage analyses. For example, many genes show differential 
usage of bins which are generally not included in most transcripts of that gene
(low count density), and are therefore less likely to be relevant (note that 
it is possible to use filter by density ratio during gene-level aggregation --
see `?geneLevelStats`). On the other hand, some genes show massive expression 
of a single bin which might (or might not) represent a different, unannotated 
transcript.

### Overlaying with transcripts

The `r BiocStyle::Biocpkg("ggbio")` package has functionalities enabling to plot
bin-level information along side transcript tracks. While wrappers for this are 
not included in the `diffUTR` package so as to keep dependencies low, the 
function below would be one way of using this with `rse` objects from `diffUTR` 
as well as the `EnsDb` object used to create the bins:

```{r}
#' binTxPlot
#'
#' @param se A bin-wise SummarizedExperiment as produced by
#' \code{\link{countFeatures}} and including bin-level tests (i.e. having been
#' passed through one of the DEU wrappers such as
#' \code{\link{diffSpliceWrapper}} or \code{\link{DEXSeqWrapper}})
#' @param ensdb The `EnsDb` which was used to create the bins
#' @param gene The gene of interest
#' @param by The colData column of `se` used to split the samples
#' @param assayName The assay to plot
#' @param removeAmbiguous Logical; whether to remove bins that are
#' gene-ambiguous (i.e. overlap multiple genes).
#' @param size Size of the lines
#' @param ... Passed to `plotRangesLinkedToData`
#'
#' @return A `ggbio` `Tracks`
#' @importFrom AnnotationFilter GeneNameFilter GeneIdFilter
#' @importFrom ensembldb getGeneRegionTrackForGviz
#' @importFrom ggbio plotRangesLinkedToData autoplot
#' @export
binTxPlot <- function(se, ensdb, gene, by=NULL, assayName=c("logNormDensity"),
                      removeAmbiguous=TRUE, size=3, threshold=0.05, ...){
  w <- diffUTR:::.matchGene(se, gene)
  se <- sort(se[w,])
  if(removeAmbiguous) se <- se[!rowData(se)$geneAmbiguous,]
  if(length(w)==0) return(NULL)
  if(!is.null(by)) by <- match.arg(by, colnames(colData(se)))
  assayName <- match.arg(assayName, assayNames(se))
  if(rowData(se)$gene[[1]]==gene){
    filt <- GeneIdFilter(gene)
  }else{
    filt <- GeneNameFilter(gene)
  }
  gr <- ggbio::autoplot(ensdb, filt)
  gr2 <- rowRanges(se)
  if(!is.null(by)){
    sp <- split(seq_len(ncol(se)), se[[by]])
    for(f in names(sp))
      mcols(gr2)[[f]] <- rowMeans(assays(se)[[assayName]][,sp[[f]],drop=FALSE])
    y <- names(sp)
  }else{
    mcols(gr2)[[assayName]] <- rowMeans(assays(se)[[assayName]])
    y <- assayName
  }
  ggbio::plotRangesLinkedToData(gr2, stat.y=y, stat.ylab=assayName, 
                                annotation=list(gr),
                                size=size, ...) + ggtitle(gene)
}

# example usage (not run):
# binTxPlot(rse, ensdb, gene="Arid5a", by="condition")
```

(Additionally requires the `r BiocStyle::Biocpkg("ggbio")` package)

<br/><br/>
***

```{r}
sessionInfo()
```