---
title: "Overview of the scRNAseq dataset collection"
author: "Davide Risso"
date: "Created: May 25, 2016; Compiled: `r Sys.Date()`"
output:
  BiocStyle::html_document:
    toc: true
vignette: >
  %\VignetteIndexEntry{scRNAseq}
  %\VignetteEngine{knitr::rmarkdown}
bibliography: biblio.bib
---

```{r style, echo=FALSE, results='asis'}
BiocStyle::markdown()
```

# Raw data availability and accession codes

This package contains a collection of three publicly available single-cell RNA-seq datasets. The data were downloaded from NCBI's SRA or from EBI's ArrayExpress (see below for Accession numbers)

The dataset `fluidigm` contains 65 cells from [@pollen2014low], each sequenced at high and low coverage (SRA: SRP041736).

The dataset `th2` contains 96 T helper cells from [@mahata2014single] (ArrayExpress: E-MTAB-2512).

The dataset `allen` contains 379 cells from the mouse visual cortex. This is a subset of the data published in [@tasic2016adult] (SRA: SRP061902).

# Pre-processing and summary

SRA files were downloaded from the Sequence Read Archive and the SRA Toolkit was used to transform them to FASTQ. FASTQ files were downloaded from EMBL-EBI ArrayExpress.

Reads were aligned with TopHat (v. 2.0.11) [@trapnell2009tophat] to the appropriate reference genome (GRCh38 for human samples, GRCm38 for mouse). RefSeq mouse gene annotation (GCF_000001635.23_GRCm38.p3) was downloaded from NCBI on Dec. 28, 2014. RefSeq human gene annotation (GCF_000001405.28) was downloaded from NCBI on Jun. 22, 2015.

featureCounts (v. 1.4.6-p3) [@liao2013featurecounts] was used to compute gene-level read counts and Cufflinks (v. 2.2.0) [@trapnell2010transcript] was used to compute gene-leve FPKM's.

In parallel, reads were mapped to the transcriptome using RSEM (v. xx) [@li2011rsem] to compute read counts and TPM's.

FastQC (v. 0.10.1) (http://www.bioinformatics.babraham.ac.uk/projects/fastqc/) and Picard (v. 1.128) (http://broadinstitute.github.io/picard) were used to compute sample quality control (QC) metrics. (Picard's scripts `CollectRnaSeqMetrics`, `CollectAlignmentSummaryMetrics` and `CollectInsertSizeMetrics`).

Note that all the samples available in GEO and/or ArrayExpressed were included in the data object, including the samples that were excluded in the original publication because they did not pass QC.

# Data format and metadata

The package contains each dataset in the form of `SummarizedExperiment` objects. To illustrate the format of each dataset, we will use the `fluidigm` data.

```{r, message=FALSE}
library(scRNAseq)
data(fluidigm)
fluidigm
```

We can retrieve the gene expression measures by using the `assay` contstruct.

```{r}
head(assay(fluidigm)[,1:3])
```

`assay` will return the gene-level read counts. If we want to access the FPKM values, we need the following

```{r}
head(assay(fluidigm, 2)[,1:3])
```

Alternatively, we can use the `assays` accessor to get a list with both matrices.

```{r}
names(assays(fluidigm))
head(assays(fluidigm)$fpkm[,1:3])
```

Note that the all the protein-coding genes are included in the expression matrix, even if they are not expressed in these samples, hence filtering of the non-expressed genes should be performed before any statistical analysis.

```{r}
dim(fluidigm)
table(rowSums(assay(fluidigm))>0)
```

In addition to the gene expression levels, the object contains some useful QC information, as well as the available annotation of the samples. This information can be accessed through the `colData` accessor.

```{r}
colData(fluidigm)
```

The first columns are related to sample quality, while other fields include information on the samples, provided by the original authors in their GEO/SRA submission and/or as Supplementary files in the pubblication.

Finally, the object contains a list of `metadata` that provide additional information on the experiment.

```{r}
names(metadata(fluidigm))
metadata(fluidigm)$which_qc
```

In particular, in all datasets, the metadata list includes an element called `which_qc` that contains the names of the `colData` columns that relate to sample QC.

# ERCC spike-ins

The `th2` and `allen` datasets contain the expression of the ERCC spike-ins. Note that these are **included in the same matrices** as the endogenous genes, hence users must use caution to avoid when using the data, to avoid mistreat external spike-ins as endogenous genes. One may wish to split the datasets in two, e.g.:

```{r}
data(th2)
ercc_idx <- grep("^ERCC-", rownames(th2))
th2_endogenous <- th2[-ercc_idx,]
th2_ercc <- th2[ercc_idx,]

head(assay(th2_ercc)[,1:4])
```

# References