---
title: "CleanUpRNAseq: detecting and correcting gDNA contamination in RNA-seq data"
author:
- name: Haibo Liu
  affiliation: MCCB, UMass Chan Medical School, Worcester, USA
- name: Kai Hu
  affiliation: MCCB, UMass Chan Medical School, Worcester, USA
- name: Kevin O'Connor
  affiliation: MCCB, UMass Chan Medical School, Worcester, USA
- name: Michelle Kelliher
  affiliation: MCCB, UMass Chan Medical School, Worcester, USA
- name: Lihua Julie Zhu
  affiliation: MCCB, UMass Chan Medical School, Worcester, USA
date: "`r Sys.Date()`"
package: "CleanUpRNAseq"
output:
  BiocStyle::html_document:
    toc_float: true
link-citations: yes
bibliography: bibliography.bib
abstract: |
  Some RNA-seq data might suffer from genomic DNA (gDNA) contamination due to 
  carrying over of residual gDNA in RNA preparation into sequencing library. 
  The R package CleanUpRNAseq provides a set of functionalities to detect and 
  correct for gDNA contamination, thus facilitate better quantitation of 
  gene expression and differential expression analysis.
vignette: |
  %\VignetteIndexEntry{CleanUpRNAseq: detecting and correcting for DNA contamination\nin RNA-seq data}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---
```{r setup, include = FALSE}
options(timeout = Inf)
knitr::opts_chunk$set(
    eval = TRUE,
    message = FALSE,
    warning = FALSE,
    tidy = FALSE,
    fig.align = 'center'
)
```

# Introduction   
RNA-seq has become a state-of-art technology for studying gene expression
[@Wang2009]. However, due to improper RNA preparation and choice of some 
library preparation protocols, such as rRNA-depletion based RNA-seq protocol 
[@ONeil2013] and the [SMART-Seq](https://t.ly/sMZho) protocol, RNA-seq data 
might suffer from genomic DNA (gDNA) contamination, which skews quantitation
of gene expression and hinders differential gene expression analysis [@Li2022;@Verwilt2020;@Markou2021]. Some quality control tools have been 
developed to check the quality of RNA-seq data at the raw read and 
post-alignment levels, including [FastQC](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/),
RSeQC [@Wang2012], Qualimap [@GarcaAlcalde2012], RNA-SeQC/RNA-SeQC 2 
[@DeLuca2012;@Graubert2021], QoRTs [@Hartley2015], RNA-QC-Chain [@Zhou2018], 
and [RNAseqQC](https://cran.r-project.org/web/packages/RNAseqQC/index.html). 
Those post-alignment tools can report percentages of reads mapping to different 
genomic features, such as genes, exons, introns, intergenic regions, and rRNA 
exons. Thus, they can be used to roughly detect gDNA contamination. So far, [SeqMonk](https://www.bioinformatics.babraham.ac.uk/projects/seqmonk/) and the
gDNAx package are the only tool which can be used to both detect and correct
for gDNA contamination in RNA-seq data. However, SeqMonk is a Java-based GUI 
tool which makes it not available in most high performance computing cluster.
More importantly, seqMonk assumes a uniform distribution of reads derived from
gDNA contamination and its performance on correcting for gDNA contamination in 
RNA-seq data is not fully evaluated and peer reviewed. On the other hand, gDNAx
is an R/Bioconductor package and can be incorporated into automated RNA-seq
data analysis pipeline easily. However, gDNAx only simply removes reads mapping
to intergenic regions and introns, but not those mapping to exons, to mitigate
gDNA contamination. Thus, gDNAx's algorithm for correcting for gDNA 
contamination is not sophisticated. To this end, we developed the R pacakge `r Biocpkg("CleanUpRNAseq")`, which provides a full set of functions for detecting 
and correcting for gDNA contamination in RNA-seq data across all computing 
platforms.  


# Setting up
As for any other sequencing data analysis, users should first check the quality 
of raw RNA-seq sequencing data using a tool like [FastQC](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/) followed
by MultiQC [@Ewels2016]. Depending on the quality of the raw data, trimming low
quality bases and/or adpator sequences might be necessary to increase mapping
quality and rate. Subsequently, reads are mapped to the reference genome of 
interest using tools, such as STAR [@Dobin2012], HISAT2 [@Kim2019], 
GMAP [@Wu2016], and Subread [@Liao2013]. The resulting alignment files in the
BAM format are used for post-alignment quality control, including detection of
gDNA contamination. Meanwhile, if the RNA-seq library is *unstranded*, reads 
are mapped to the reference transcriptome using Salmon [@Patro2017] to get 
transcript-level abundance, counts, and effective lengths, since Salmon can
resolve reads mapped to multiple transcripts, which are imported into R using 
the `r Biocpkg("tximport")` to get gene-level counts, abundance and length 
information. However, if the library is *stranded*, reads are mapped to the 
reference transcriptome using Salmon twice: one using the designed strandedness
for the '--libType' option, the other using the opposite strandedness. See
Salmon library type discussion
<https://salmon.readthedocs.io/en/latest/library_type.html>. Then gene-level 
counts, abundance and length information are imported into R as above.


# How to run CleanUpRNAseq   
First, load the required packages, including CleanUpRNAseq. Then users can 
perform a step-by-step analysis of the RNA-seq data as below for more 
flexibility. For users' convenience, CleanUpRNAseq also offers two wrapper 
functions, create_diagnostic_plots and correct_for_contamination, for detecting 
and correcting for gDNA contamination in RNA-seq data. As for how to use these
wrappers, please refer to their function documentation.  

```{r load_package, eval = TRUE}
suppressPackageStartupMessages({
  library("CleanUpRNAseq")
  #devtools::load_all("../../CleanUpRNAseq")
  library("ggplotify")
  library("patchwork")
  library("ensembldb")
  library("utils")
})
```

## Step 1: Load an EnsDb package or prepare an EnsDb database     
Users can list all current EnsDb packages from `r Biocpkg("AnnotationHub")` and 
load the package of choice, if available, as follows. Here, an EnsDb package
for the human genome is loaded. It is crucial to use an EnsDb pacakge which
represents the genome annotation file (GTF) used for RNA-seq read alignment.

```{r load_ensdb, eval=FALSE}
suppressPackageStartupMessages({
  library("EnsDb.Hsapiens.v86")
})
hs_ensdb_sqlite <- EnsDb.Hsapiens.v86
```

Otherwise, users can easily prepare an EnsDb database from an Ensembl GTF file.
For all following steps, this option is adopted because the latest version of
human transcriptome has been used for read mapping by STAR and Salmon to 
quantify gene expression of the example RNA-seq data.

```{r create_ensdb}
options(timeout = max(3000, getOption("timeout")))
tmp_dir <- tempdir()
gtf <- system.file("extdata", "example.gtf.gz",
                    package = "CleanUpRNAseq")

hs_ensdb_sqlite <-
  ensDbFromGtf(
        gtf = gtf,
        outfile = file.path(tmp_dir, "EnsDb.hs.v110.sqlite"),
        organism = "Homo_Sapiens",
        genomeVersion = "GRCh38",
        version = 110
  )
```
 
## Step 2. Prepare SAF (simplified annotation format) files 
Potential DNA contamination exists if a significantly high portion of RNA-seq 
reads mapped to intergenic regions. `r Biocpkg("CleanUpRNAseq")` uses the 
*featureCounts* function from the `r Biocpkg("Rsubread")` package to 
quantify the percentage of reads mapped to different genomic features: genes,
exons, introns, intergenic regions, rRNA exons, and organelle genome(s). So 
users need to prepare SAF files for these genomic features. In addition, 
a BAM file is needed to provide the lengths of the chromosomes/scaffolds.

```{r prepare_saf}
bam_file <- system.file("extdata", "K084CD7PCD1N.srt.bam",
    package = "CleanUpRNAseq"
)
saf_list <- get_saf(
    ensdb_sqlite = hs_ensdb_sqlite,
    bamfile = bam_file,
    mitochondrial_genome = "MT"
)
```
## Step 3. Summarize reads mapped to different genomic features  
Reads mapped to different genomic features is summarized by using the
*featureCounts* function with the SAF files generated above as annotation.
`r Biocpkg("CleanUpRNAseq")` also quantifies the reads spanning exon-exon 
junctions and the reads mapped to exons of each gene using the GTF file as 
annotation. The junction read count table is used to determine the unexpressed, 
multiexonic genes, while the gene-level read count table is used for comparing
samples at the gene level. Here two downsampled BAM files are used for speeding
up the demonstration, while a precomputed R object in the .RData format is 
used for actual downstream analysis.

```{r load_data}
 tmp_dir <- tempdir()
 in_dir <- system.file("extdata", package = "CleanUpRNAseq")
 gtf.gz <- dir(in_dir, ".gtf.gz$", full.name = TRUE)
 gtf <- file.path(tmp_dir, gsub("\\.gz", "", basename(gtf.gz)))
 R.utils::gunzip(gtf.gz, destname= gtf,
                 overwrite = TRUE, remove = FALSE)

 in_dir <- system.file("extdata", package = "CleanUpRNAseq")
 BAM_file <- dir(in_dir, ".bam$", full.name = TRUE)
 salmon_quant_file <- dir(in_dir, ".sf$", full.name = TRUE)
 sample_name = gsub(".+/(.+?).srt.bam", "\\1", BAM_file)
 salmon_quant_file_opposite_strand <- salmon_quant_file
 col_data <- data.frame(sample_name = sample_name,
                        BAM_file = BAM_file,
                        salmon_quant_file = salmon_quant_file,
                        salmon_quant_file_opposite_strand =
                            salmon_quant_file_opposite_strand,
                        group = c("CD1N", "CD1P"))

 sc <- create_summarizedcounts(lib_strand = 0, colData = col_data)
 
## featurecounts 
 capture.output({counts_list <- summarize_reads(
     SummarizedCounts = sc,
     saf_list = saf_list,
     gtf = gtf,
     threads = 1,
     verbose = TRUE
 )}, file = tempfile())

## load salmon quant 
salmon_counts <- salmon_tximport(
     SummarizedCounts = sc,
     ensdb_sqlite = hs_ensdb_sqlite
)

```

## Step 4. Check DNA contamination  
Precomputed R object in the .RData format (*feature_counts_list.rda*, and
*salmon_quant.rda*) 
containing the *featureCounts* output and *Salmon quantification* output
imported by using the `tximport` function for a RNA-seq dataset of 8 samples
from two treatment groups are used for demo. GC-content for genes and
intergenic regions (*intergenic_GC.rda*, and *gene_GC.rda*) were also
precomputed and used for demo.

```{r plot_assignment_stat} 
#| mappingstat, fig.height = 6, fig.width = 6, 
#| fig.cap = "Read mapping status"
data("feature_counts_list")
data("salmon_quant")
data("intergenic_GC")
data("gene_GC")

lib_strand <- 0
col_data_f <- system.file("extdata", "example.colData.txt",
                          package = "CleanUpRNAseq")
col_data <- read.delim(col_data_f, as.is = TRUE)
## create fake bam files
tmp_dir <- tempdir()
bamfiles <- gsub(".+/", "", col_data$BAM_file)
null <- lapply(file.path(tmp_dir, bamfiles), file.create)
## create fake quant.sf files
quant_sf <- file.path(tmp_dir, gsub(".srt.bam$",
                                     "quant.sf",
                                     bamfiles))
null <- lapply(quant_sf, file.create)
col_data$BAM_file <- file.path(tmp_dir, bamfiles)
col_data$salmon_quant_file <- quant_sf

## pretend this is stranded RA=NA-seq data
col_data$salmon_quant_file_opposite_strand <- quant_sf

sc <- create_summarizedcounts(lib_strand, col_data)
sc$set_feature_counts(feature_counts_list)
sc$set_salmon_quant(salmon_quant)
sc$set_salmon_quant_opposite(salmon_quant)

p_assignment_stat <-plot_assignment_stat(SummarizedCounts = sc)
wrap_plots(p_assignment_stat)
```

```{r} 
#| readdistr, fig.height = 4, fig.width = 5, 
#| fig.cap = "Read distribution over different genomic features"
assigned_per_region <- get_region_stat(SummarizedCounts = sc)
p <- plot_read_distr(assigned_per_region)
p

```

```{r plot_sample_corr} 
#| samplecorr, fig.height = 8, fig.width = 8,
#| fig.cap = "Smoothed scatter plot showing pairwise sample correlation"
plot_sample_corr(SummarizedCounts = sc)
```


```{r plot_read_distr} 
#| exprdistr, fig.height = 8, fig.width = 5, 
#| fig.cap = "Expression distribution"
p_expr_distr <- plot_expr_distr(
    SummarizedCounts = sc,
    normalization = "DESeq2"
)
wrap_plots(p_expr_distr, ncol = 1, nrow = 3)
```

```{r percent_expressed_gene}
#| exprgene, fig.height = 3, fig.width = 3.5, out.width = "80%", 
#| fig.cap = "Percent of expressed genes"

plot_gene_content(
    SummarizedCounts = sc,
    min_cpm = 1,
    min_tpm =1
)
```


```{r pca_heatmap}
#| samplesimilarity, fig.height = 5, fig.width = 10,
#| fig.cap = "PCA score plot and heatmap showing sample similarity"

## DESeq2 exploratory analysis before correction
p<- plot_pca_heatmap(SummarizedCounts = sc,
                     silent = TRUE)
p$pca + as.ggplot(p$heatmap)
```

## Step 5. Correct for DNA contamination
If the libraries are unstranded, `r Biocpkg("CleanUpRNAseq")` uses a median 
per-base read coverage (median_PBRC) over non-zero count intergenic
regions to estimate per-base DNA contamination over exons of each gene of each 
sample, and corrects for gene-level DNA contamination by subtracting 
median_PBRC * average_gene_length from the Salmon count table of each gene of
each sample. 
```{r global_correction}
global_correction <- correct_global(SummarizedCounts = sc)
```

Alternatively, for unstranded RNA-seq data, `r Biocpkg("CleanUpRNAseq")` offers
a correction method leveraging GC-content bias on PCR amplification.
```{r GC_correction}
gc_correction <-
    correct_GC(
        SummarizedCounts = sc,
        gene_gc = gene_GC,
        intergenic_gc = intergenic_GC,
        plot = FALSE
    )
```
However, if the libraries are stranded, `r Biocpkg("CleanUpRNAseq")` considered 
as gene-level contamination the count table resulted from quantitation using the
opposite strandedness setting. By subtracting the DNA contamination of each gene 
in each sample from the count table resulted from quantitation using the actual 
strandedness setting, users can get contamination corrected count table. To this
end, use the `correct_stranded` function.


## Step 6. Check correction effect
After correcting for DNA contamination, we expect to see more comparable gene 
expression across samples.  

Boxplots, density plots and empirical cumulative distribution after correction
revealed gene expression across samples are more comparable.

# Session info
Here is the output of `sessionInfo()` on the system on which this document was 
compiled running pandoc `r rmarkdown::pandoc_version()`:
```{r sessionInfo, eval=TRUE, echo = FALSE}
sessionInfo()
```