---
title: "Splice event prediction and quantification from RNA-seq data"
author: "Leonard D Goldstein"
date: "`r BiocStyle::doc_date()`"
package: "`r BiocStyle::pkg_ver('SGSeq')`"
abstract: >
  *SGSeq* provides a framework for analyzing annotated and previously uncharacterized splice events from RNA-seq data. Input data must be provided as BAM files containing RNA-seq reads aligned against a reference genome. Exons and splice junctions are predicted from aligned reads and are assembled into a genome-wide splice graph. Splice events are identified from the graph and quantified using reads spanning event boundaries. This vignette provides an introduction to *SGSeq*, including splice event prediction, quantification, annotation and visualization.
vignette: >
  %\VignetteEncoding{UTF-8}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteIndexEntry{SGSeq}
  %\VignettePackage{SGSeq}
output: 
  BiocStyle::html_document:
    toc: true
bibliography: SGSeq.bib
---

```{r, echo = FALSE, results = 'hide'}
library(knitr)
opts_chunk$set(error = FALSE)
```

```{r style, echo = FALSE, results = 'asis'}
##BiocStyle::markdown()
```

# Preliminaries

```{r, message = FALSE}
library(SGSeq)
```

This vignette illustrates an analysis of paired-end RNA-seq data from four tumor and four normal colorectal samples, which are part of a data set published in [@Seshagiri:2012gr]. For the purpose of this vignette, we created BAM files that only include reads mapping to a single gene of interest (*FBXO31*).

When starting a new project, *SGSeq* requires information about the samples to be analyzed. This information can be provided as a *data.frame*, which must include columns *sample_name* (specificying a unique name for each sample) and *file_bam* (specifying the location of the BAM file). Function *getBamInfo* can be used to extract additional required information from BAM files, including paired-end status, median read length, median insert size and the total number of aligned reads. This information must be obtained once initially and can then be used for all subsequent analyses. It is essential that BAM files are generated using a splice-aware alignment program that generates the custom tag 'XS' indicating the direction of transcription for spliced reads. In the following, we work with a *data.frame* *si* previously generated from the original (complete) BAM files with function *getBamInfo*. 

```{r}
si
``` 

The following code block sets the correct BAM file paths in the sample information for this vignette.

```{r}
path <- system.file("extdata", package = "SGSeq")
si$file_bam <- file.path(path, "bams", si$file_bam)
``` 

# Transcript features and the *TxFeatures* class

We use the UCSC knownGene table as reference annotation, which is available as a *Bioconductor* annotation package `r Biocannopkg("TxDb.Hsapiens.UCSC.hg19.knownGene")`. We retain transcripts on chromosome 16, where the *FBXO31* gene is located, and change chromosome names in the annotation to match chromosome names in the BAM files.

```{r, message = FALSE}
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
txdb <- keepSeqlevels(txdb, "chr16")
seqlevelsStyle(txdb) <- "NCBI"
``` 

To work with transcript annotation in the *SGSeq* framework, we first extract exons and splice junctions from the *TxDb* object using function *convertToTxFeatures*. We only retain features overlapping the *FBXO31* gene (genomic coordinates of the *FBXO31* gene are stored in the *GRanges* object *gr*).

```{r}
txf_ucsc <- convertToTxFeatures(txdb)
txf_ucsc <- txf_ucsc[txf_ucsc %over% gr]
txf_ucsc
```

*SGSeq* makes extensive use of the *Bioconductor* infrastructure for genomic ranges [@Lawrence:2013hi]. The *TxFeatures* class shown above extends the *GRanges* class with additional columns. Column *type* can take values 

* *J* (splice junction)
* *I* (internal exon)
* *F* (first/5$^\prime$-terminal exon)
* *L* (last/3$^\prime$-terminal exon)
* *U* (unspliced).

Columns *txName* and *geneName* indicate the transcript and the gene that each feature derives from. Note that a feature can belong to more than one transcript. Accordingly these columns can store multiple values for each feature.

columns can be accessed using accessor functions named after the columns they access (e.g. use function *type* to obtain feature type). 

If transcript annotation is not available as a *TxDb* object, function *importTranscripts* can be used to import annotation in GFF format. Function *convertToTxFeatures* can construct *TxFeatures* from a *GRangesList* of exons grouped by transcript.

# Splice graph features and the *SGFeatures* class

Exons stored as *TxFeatures* can be overlapping (e.g. due to alternative splice sites). Overlapping exons can result in ambiguities when attempting to assign reads to individual exons. We therefore partition exonic regions into disjoint exon bins. Splice junctions and disjoint exon bins uniquely determine a genome-wide splice graph [@Heber:2002aa]. To store splice graph features, *SGSeq* implements the *SGFeatures* class. 

```{r}
sgf_ucsc <- convertToSGFeatures(txf_ucsc)
sgf_ucsc
``` 

Similar to *TxFeatures*, *SGFeatures* extends the *GRanges* class with additional columns. Column *type* for an *SGFeatures* object takes values

* *J* (splice junction)
* *E* (disjoint exon bin)
* *D* (splice donor site)
* *A* (splice acceptor site).

By convention, splice donor and acceptor sites correspond to exonic positions immediately upstream and downstream of the intron, respectively. Note that splice sites are redundant in the sense that they are determined by the splice junctions included in the *SGFeatures* object. When assigning read counts to each feature (see below), counts for exons and splice junctions are based on structurally compatible reads. In the case of splice donor and acceptor sites, counts indicate the number of reads that extend across the spliced boundary (i.e. overlapping the splice site, as well as the flanking intronic position). Splice sites are included in the *SGFeatures* object as their counts are subsequently used for splice variant quantification.  

*SGFeatures* includes additional columns not included in *TxFeatures*. *spliced5p* and *spliced3p* indicate whether exon bins have a mandatory splice at the 5$^\prime$ and 3$^\prime$ boundary, respectively. This information is used to determine whether a read is structurally compatible with an exon bin, as well as to determine whether an exon bin is consistent with an annotated transcript. 

Column *featureID* provides a unique identifier for each feature, while columnn *geneID* indicates the unique connected component of the splice graph a feature belongs to. 

Both *TxFeatures* and *SGFeatures* objects can be exported to BED files using function *exportFeatures*.

# Analysis based on annotated transcripts

We can now start analyzing the RNA-seq data at the *FBXO31* gene locus. We first perform an analysis based on annotated transcripts. The following example converts the transcript features into splice graph features and obtains counts of compatible RNA-seq reads for each feature and each sample. 

```{r}
sgfc_ucsc <- analyzeFeatures(si, features = txf_ucsc)
sgfc_ucsc
``` 

*analyzeFeatures* returns an object of class *SGFeatureCounts*, which extends the *RangedSummarizedExperiment* class from the `r Biocpkg("SummarizedExperiment")` package. *SGFeatureCounts* contains the sample information as *colData*, splice graph features as *rowRanges* and assays *counts* and *FPKM*, which store structurally compatible counts and FPKMs, respectively. Accessor functions *colData*, *rowRanges*, *counts* and *FPKM* can be used to access the data. For example, counts and FPKMs can be extracted from an *SGFeatureCounts* object as shown below.

```{r}
head(counts(sgfc_ucsc))
head(FPKM(sgfc_ucsc))
```

Compatible FPKMs for splice graph features can be visualized with function *plotFeatures*. *plotFeatures* generates a two-panel figure with a splice graph shown in the top panel and a heatmap illustrating expression levels of individual features in the bottom panel. For customization of *plotFeatures* output, see section [Visualization]. The plotting function invisibly returns a *data.frame* with information on splice graph features, including genomic coordinates.

```{r figure-1, fig.width=4.5, fig.height=4.5}
df <- plotFeatures(sgfc_ucsc, geneID = 1)
df
``` 

Note that the splice graph derived from annotated transcripts includes three alternative transcript start sites (TSSs). However, the heatmap indicates that the first TSS is not used in the samples in our data set.

# Analysis based on *de novo* prediction

Instead of relying on existing annotation, *SGSeq* can predict features from BAM files directly. The following code block predicts splice graph features with read evidence in our data set.

```{r}
sgfc_pred <- analyzeFeatures(si, which = gr)
``` 

For interpretability, we annotate predicted features with respect to transcripts included in the UCSC knownGene table. The *annotate* function assigns compatible transcripts to each feature and stores them in column *txName*. column *geneName* behaves transitively, meaning all features belonging to the same connected component of the splice graph (with identical *geneID*) have the same value for *geneName*. This behavior makes it easy to identify unannotated features (with empty *txName*) that belong to an annotated gene (non-empty *geneName*).

```{r}
sgfc_pred <- annotate(sgfc_pred, txf_ucsc)
``` 

Predicted splice graph features and compatible FPKMs can be visualized as previously. Splice graph features with missing annotation can be highlighted using argument *color_novel*.

```{r figure-2, fig.width=4.5, fig.height=4.5}
df <- plotFeatures(sgfc_pred, geneID = 1, color_novel = "red")
df
``` 

Note that most exons and splice junctions predicted from the RNA-seq data are consistent with transcripts in the UCSC knownGene table (shown in gray). However, in contrast to the previous figure, the predicted gene model does not include parts of the splice graph that are not expressed in our data set. Also, an unannotated exon (E3, shown in red) was discovered from the RNA-seq data, which is expressed in three of the four normal colorectal samples (N2, N3, N4). 

# Analysis of predicted splice variants

Instead of considering the complete splice graph of a gene, we can focus the analysis on individual splice events. In the *SGSeq* framework, the splice graph is a directed acyclic graph with nodes corresponding to transcript starts, ends and splice sites, and edges corresponding to disjoint exon bins and splice junctions, directed from 5$^\prime$ to the 3$^\prime$ end. A splice event is defined by a start node and an end node connected by two or more paths and no intervening nodes with all paths intersecting. *SGSeq* identifies splice events recursively from the graph, and estimates relative usage of splice variants based on compatible reads spanning the event boundaries. The following example identifies splice events from the splice graph and obtains representative counts for each splice variant.

```{r}
sgvc_pred <- analyzeVariants(sgfc_pred)
sgvc_pred
``` 

*analyzeVariants* returns an *SGVariantCounts* object. Similar to *SGFeatureCounts*, *SGVariantCounts* extends the *RangedSummarizedExperiment* class. *SGVariantCounts* contains sample information as *colData* and *SGVariants* as *rowRanges*. Assay *variantFreq* stores estimates of relative usage for each splice variant and sample. Accessor functions *colData*, *rowRanges* and *variantFreq* can be used to access the data. For example, estimates of relative usage can be extracted from an *SGVariantCounts* object as shown below.

```{r}
variantFreq(sgvc_pred)
```

Information on splice variants is stored in metadata columns and can be accessed as follows.

```{r}
mcols(sgvc_pred)
``` 

* *from* and *to* indicate the variant start and end node, respectively, *from* nodes are splice donor sites (*D*) or transcript starts (*S*), while *to* nodes are splice acceptor sites (*A*) or transcript ends (*E*)
* columns *type* and *featureID* describe the variant in terms of the splice graph features that make up the variant
* *segmentID* are unique identifiers labelling unbranched segments of the splice graph (internal use only)
* *closed5p* indicates whether the nodes belonging to a splice variant can be reached from nodes outside of the splice event exclusively through the *from* node
* *closed3p* indicates whether the nodes belonging to a splice variant can reach nodes outside of the splice event exclusively through the *to* node
* *eventID* and *variantID* are unique identifiers for each splice event and splice variant, respectively
* *featureID5p* and *featureID3p* indicate representative features that are used for obtaining relative usage estimates at the variant start and end, respectively
* *variantType* indicates whether a splice variant is consistent with a canonical splice event (for a list of possible values, see the manual page for *annotateSGVariants*)
* *variantName* provides a unique identifier for each splice variant that is intended to be more human-readable than the numeric identifier stored in *variantID* (for details, see the manual page for *makeVariantNames*)

Splice variants and estimates of relative usage can be visualized with function *plotVariants*.

```{r figure-3, fig.width=1.5, fig.height=4.5}
plotVariants(sgvc_pred, eventID = 1, color_novel = "red")
``` 

*plotVariants* generates a two-panel figure similar to *plotFeatures*. The splice graph in the top panel illustrates the selected splice event. In this example, the splice event consists of two splice variants that correspond to a skip or inclusion of the unannotated exon. The heatmap illustrates estimates of relative usage for each splice variant. We observe that samples N2, N3 and N4 show evidence for both transcripts that include the exon as well as transcripts that skip the exon. The remaining samples show little evidence for exon inclusion.

# Visualization

Functions *plotFeatures* and *plotVariants* support many options for customizing figures. Note that the splice graph in the top figure panel is plotted by function *plotSpliceGraph*, which can be called directly. 

*plotFeatures* includes multiple alternative arguments for selecting features to be displayed. The following code illustrates three different ways of selecting and plotting the splice graph and expression levels for *FBXO31* (Entrez ID 79791).

```{r, eval = FALSE}
plotFeatures(sgfc_pred, geneID = 1)
plotFeatures(sgfc_pred, geneName = "79791")
plotFeatures(sgfc_pred, which = gr)
```

By default, the heatmap generated by *plotFeatures* displays splice junctions. Alternatively, exon bins, or both exon bins and splice junctions can be displayed.

```{r, eval = FALSE}
plotFeatures(sgfc_pred, geneID = 1, include = "junctions")
plotFeatures(sgfc_pred, geneID = 1, include = "exons")
plotFeatures(sgfc_pred, geneID = 1, include = "both")
```

Argument *toscale* controls which parts of the gene model are drawn to scale. 

```{r, eval = FALSE}
plotFeatures(sgfc_pred, geneID = 1, toscale = "gene")
plotFeatures(sgfc_pred, geneID = 1, toscale = "exon")
plotFeatures(sgfc_pred, geneID = 1, toscale = "none")
```

Heatmaps allow the visualization of expression values summarized for splice junctions and exon bins. Alternatively, per-base read coverages and splice junction counts can be visualized with function *plotCoverage*.

```{r, figure-4, fig.width=4.5, fig.height=4.5}
par(mfrow = c(5, 1), mar = c(1, 3, 1, 1))
plotSpliceGraph(rowRanges(sgfc_pred), geneID = 1, toscale = "none", color_novel = "red")
for (j in 1:4) {
  plotCoverage(sgfc_pred[, j], geneID = 1, toscale = "none")
}
```

# Advanced use

Functions *analyzeFeatures* and *analyzeVariants* wrap multiple analysis steps for convenience. Alternatively, the functions performing individual steps can be called directly. For example, the previous analysis using *de novo* prediction can be performed as follows.

```{r}
txf <- predictTxFeatures(si, gr)
sgf <- convertToSGFeatures(txf)
sgf <- annotate(sgf, txf_ucsc)
sgfc <- getSGFeatureCounts(si, sgf)
sgv <- findSGVariants(sgf)
sgvc <- getSGVariantCounts(sgv, sgfc)
``` 

*predictTxFeatures* and *getSGFeatureCounts* can be run on individual samples (e.g. for distribution across a high-performance computing cluster). *predictTxFeatures* predicts features for each sample, merges features across samples and finally performs filtering and processing of predicted terminal exons. When using *predictTxFeatures* for individual samples, with predictions intended to be merged at a later point in time, run *predictTxFeatures* with argument *min_overhang = NULL* to suppress processing of terminal exons. Then predictions can subsequently be merged and processed with functions *mergeTxFeatures* and *processTerminalExons*, respectively.

# Session information

```{r}
sessionInfo()
``` 

# References