---
title: "Introduction to `scater`: Single-cell analysis toolkit for expression in R"
author: "Davis McCarthy"
date: "`r Sys.Date()`"
output:
    BiocStyle::html_document:
        toc: true
vignette: >
  %\VignetteIndexEntry{An introduction to the scater package}
  %\VignetteEngine{knitr::rmarkdown}
  \usepackage[utf8]{inputenc}
---


```{r knitr-options, echo=FALSE, message=FALSE, warning=FALSE}
## To render an HTML version that works nicely with github and web pages, do:
## rmarkdown::render("vignettes/vignette.Rmd", "all")
library(knitr)
opts_chunk$set(fig.align = 'center', fig.width = 6, fig.height = 5, dev = 'png')
library(ggplot2)
theme_set(theme_bw(12))
```

This document gives an introduction to and overview of the functionality of the
`scater` package.

The `scater` package is contains tools to help with the analysis of single-cell
transcriptomic data, with the focus on RNA-seq data. The package features:

* A data structure for single-cell expression data;
* Wrappers to [`kallisto`](http://pachterlab.github.io/kallisto/) for rapid
quantification of transcript abundance and tight integration with `scater`;
* Simple calculation of many quality control metrics from the expression data;
* Many tools for visualising scRNA-seq data, especially diagnostic plots
for quality control;
* Subsetting and many other methods for filtering out problematic cells and
features;
* Methods for identifying important experimental variables and normalising data
ahead of downstream statistical analysis and modeling.

Future versions of `scater` may also incorporate:

* Methods for identifying cells at different stages of the cell cycle;
* Differential testing methods for finding features with differences in
behaviour between experimental conditions.

To get up and running as quickly as possible, see the [Quick Start](#quickstart)
section below. For see the various in-depth sections on various aspects of the
functionality that follow.

# Quick Start
<a name="quickstart"></a>

Assuming you have a matrix containing expression count data summarised at the
level of some features (gene, exon, region, etc.), then we first need to form an
`SCESet` object containing the data. An `SCESet` object is the basic data
container that we use in `scater`.

Here we use the example data provided with the package, which gives us two 
objects, a matrix of counts and a dataframe with information about the cells we 
are studying:

```{r quickstart-load-data, message=FALSE, warning=FALSE}
library(scater, quietly = TRUE)
data("sc_example_counts")
data("sc_example_cell_info")
```

We use these objects to form an `SCESet` object containing all of the necessary
information for our analysis:

```{r quickstart-make-sceset, results='hide'}
pd <- new("AnnotatedDataFrame", data = sc_example_cell_info)
rownames(pd) <- pd$Cell
example_sceset <- newSCESet(countData = sc_example_counts, phenoData = pd)
```

Subsetting is very convenient with this class. For example, we can filter out
features (genes) that are not expressed in any cells:

```{r filter-no-exprs}
keep_feature <- rowSums(is_exprs(example_sceset)) > 0
example_sceset <- example_sceset[keep_feature,]
```

Now we have the expression data neatly stored in a structure that can be used
for lots of exciting analyses.

It is straight-forward to compute many quality control metrics:

```{r quick-start-calc-qc-metrics, eval=FALSE}
example_sceset <- calculateQCMetrics(example_sceset, feature_controls = 1:40)
```

Now you can play around with your data using the graphical user interface (GUI),
which opens an interactive dashboard in your browser!

```{r quick-start-gui, eval=FALSE}
scater_gui(example_sceset)
```

Many plotting functions are available for visualising the data:

* `plot`: a plot method exists for `SCESet` objects, which gives an overview
of expression across cells.
* `plotQC`: various methods are available for producing QC diagnostic plots.
* `plotPCA`: produce a principal components plot for the cells.
* `plotTSNE`: produce a t-distributed stochastic neighbour embedding (reduced dimension) plot for the cells.
* `plotDiffusionMap`: produce a diffusion map (reduced dimension) plot for the
cells.
* `plotMDS`: produce a multi-dimensional scaling plot for the cells.
* `plotReducedDim`: plot a reduced-dimension representation of the cells.
* `plotExpression`: plot expression levels for a defined set of features.
* `plotPhenoData`: plot cell metadata and QC metrics.
* `plotFeatureData`: plot feature metadata and QC metrics.

More detail on all plotting methods is given throughout the vignette below.

Visualisations can highlight features and cells to be filtered out, which can be
done easily with the subsetting capabilities of scater.

The QC plotting functions also enable the identification of important
experimental variables, which can be conditioned out in the data normalisation
step.

After QC and data normalisation (methods are available in `scater`), the dataset
is ready for downstream statistical modeling.


#  A note on terminology

The capabilities of `scater` are built on top of Bioconductor's `Biobase`, which
ties us to some specific terminology. Some of this terminology may be unfamiliar
or seem peculiar for the single-cell RNA-seq setting `scater` is designed for.

In Bioconductor terminology we assay numerous "features" for a number of
"samples".

Features, in the context of `scater`, correspond most commonly to genes or
transcripts, but could be any general genomic or transcriptomic regions (e.g.
exon) of interest for which we take measurements. Thus, a command such as
`featureNames`, returns the names of the features defined for an `SCESet`
object, which in typical `scater` usage could correspond to gene IDs. In much of
what follows, it may be more intuitive to mentally replace "feature" with "gene"
or "transcript" (depending on the context of the study) wherever "feature"
appears.

In the `scater` context, "samples" refer to individual cells that we have
assayed. This differs from common usage of "sample" in other contexts, where
we might usually use "sample" to refer to an individual subject, a biological
replicate or similar. A "sample" in this sense in `scater` may be referred to as
a "block" in the more classical statistical sense. Within a "block" (e.g.
individual) we may have assayed numerous cells. Thus, the function `sampleNames`,
when applied to an `SCESet` object returns the cell IDs.


#  The SCESet class and methods

In `scater` we organise single-cell expression data in objects of the `SCESet`
class. The class inherits the Bioconductor `ExpressionSet` class, which
provides a common interface familiar to those who have analyzed microarray
experiments with Bioconductor. The class requires three input files:

  1. `exprs`, a numeric matrix of expression values, where rows are
  features, and columns are cells
  2. `phenoData`, an `AnnotatedDataFrame` object, where rows are cells, and
  columns are cell attributes (such as cell type, culture condition, day
  captured, etc.)
  3. `featureData`, an `AnnotatedDataFrame` object, where rows are features
  (e.g. genes), and columns are feature attributes, such as biotype, gc content,
  etc.

For more details about other features inherited from Bioconductor's
`ExpressionSet` class, type `?ExpressionSet` at the R prompt.

The requirements for the `SCESet` class (as with other S4 classes in R and
Bioconductor) are strict. The idea is that strictness with generating a valid
class object ensures that downstream methods applied to the class will work
reliably. Thus, the expression value matrix *must* have the same number of
columns as the `phenoData` object has rows, and it must have the same number of
rows as the `featureData` dataframe has rows. Row names of the `phenoData`
object need to match the column names of the expression matrix. Row names of the
`featureData` object need to match row names of the expression matrix.

You can create a new `SCESet` object using count data as follows. In this case,
the `exprs` slot in the object will be generated as log2(counts-per-million)
using the `cpm` function from `edgeR`, with a "prior count" value of 1:

```{r sceset-make-sceset-counts-only}
pd <- new("AnnotatedDataFrame", data = sc_example_cell_info)
rownames(pd) <- pd$Cell
gene_df <- data.frame(Gene = rownames(sc_example_counts))
rownames(gene_df) <- gene_df$Gene
fd <- new("AnnotatedDataFrame", data = gene_df)
example_sceset <- newSCESet(countData = sc_example_counts, phenoData = pd,
                            featureData = fd)
example_sceset
```

We can also make an `SCESet` object with just a matrix of expression values
(no count data required) like this:

```{r sceset-make-sceset-exprs-only}
example2 <- newSCESet(exprsData = edgeR::cpm.default(sc_example_counts))
pData(example2)
fData(example2)
```

We have accessor functions to access elements of the `SCESet` object:

* `counts(object)`: returns the matrix of read counts. As you can see above, if
no counts are defined for the object, then the counts matrix slot is simpy
`NULL`.

```{r counts-accessor}
counts(example2)[1:3, 1:6]
```
* `exprs(object)`: returns the matrix of feature expression values. Typically these
should be log2(counts-per-million) values or log2(reads-per-kilobase-per-million-mapped),
appropriately normalised of course. The package will generally assume that
these are the values to use for expression.

```{r exprs-accessor}
exprs(example2)[1:3, 1:6]
```
* `is_exprs(object)`: returns a logical matrix indicating whether each feature
expression observation is above the defined `lowerDetectionLimit` (default is 0).
This can be determined on the count scale or the "expression" (i.e. `exprs(object)`)
scale.

```{r isexprs-accessor}
is_exprs(example2)[1:3, 1:6]
```

It is straight-forward to change the threshold by which we decide which
observations are expressed or not:

```{r sceset-define-is-exprs}
is_exprs(example2) <- calcIsExprs(example2, lowerDetectionLimit = 100,
                               exprs_data = "exprs")
is_exprs(example2)[1:3, 1:6]
```

If you later find some count data that you want to add to the `SCESet` object
(always worth checking down the back of the couch), then you can add it easily:

```{r sceset-add-count-data}
counts(example2) <- sc_example_counts
example2
counts(example2)[1:3, 1:6]
```

Handily, it is also easy to replace other data in slots of the `SCESet` object
using generic accessor and replacement functions.

```{r sceset-demo-replacement}
gene_df <- data.frame(Gene = rownames(sc_example_counts))
rownames(gene_df) <- gene_df$Gene
fd <- new("AnnotatedDataFrame", data = gene_df)
## replace featureData
fData(example_sceset) <- fd
## replace phenotype data
pData(example_sceset) <- pd
## replace expression data to be used
exprs(example_sceset) <- edgeR::cpm.default(counts(example_sceset),
                                            prior.count = 5, log = TRUE)
```

It is possible to get an overall view of the dataset by using the `plot` method
available for `SCESet` objects. This method plots the cumulative proportion of
each cell's library that is accounted for by the top highest-expressed features
(by default showing the cumulative proportion across the top 500 features).

```{r plot-sceset-default}
plot(example_sceset)
```

This gives an overall idea of differences in expression distributions for
different cells. It is used in the same way as per-sample boxplots are for
microarray or bulk RNA-seq data. Due to the large numbers of zeroes in
expression values for single-cell RNA-seq data, boxplots are not as useful, so
instead we focus on the contributions from the most expressed features for each
cell.

With this function, we can split up the cells based on phenoData variables to
get a finer-grained look at differences between cells. By default, the plot
method will try to use transcripts-per-million values for the plot, but if these
are not present in the `SCESet` object, then the values in `exprs(object)` will
be used. Other expression values can be used, specified with the `exprs_values`
argument

```{r plot-sceset-blocking}
plot(example_sceset, block1 = "Mutation_Status", block2 = "Treatment",
     colour_by = "Cell_Cycle", nfeatures = 300, exprs_values = "counts")
```

This sort of approach can help to pick up large differences in expression
distributions across different experimental blocks (e.g. processing batches or
similar.)


#  Plots of expression values

In `scater`, the `plotExpression` function makes it easy to plot expression
values for a subset of genes or features. This can be particularly useful when
investigating the some features identified as being of interest from differential
expression testing or other means.

```{r plot-expression}
plotExpression(example_sceset, rownames(example_sceset)[1:6],
               x = "Mutation_Status", exprs_values = "exprs")
```

This function uses `ggplot2`, making it easy to change the theme to whatever you
prefer. We can also show the median expression level per group on the plot and
show a violin plot to summarise the distribution of expression values:

```{r plot-expression-theme-bw}
plotExpression(example_sceset, rownames(example_sceset)[7:12],
               x = "Mutation_Status", exprs_values = "counts", colour = "Cell_Cycle",
               show_median = FALSE, show_violin = TRUE,  xlab = "Mutation Status",
               log = TRUE)
```


#  Quality control

The `scater` package puts a focus on aiding with quality control (QC) and
pre-processing of single-cell RNA-seq data before further downstream analysis.

We see QC as consisting of three distinct steps:

1. QC and filtering of features (genes)
2. QC and filtering of cells
3. QC of experimental variables

Following QC, we can proceed with data normalisation before downstream analysis
and modelling.

In the next few sections we discuss the QC and filtering capabilities available
in `scater`.

## Calculate QC metrics

To compute commonly-used QC metrics we have the function `calculateQCMetrics()`:

```{r calc-qc-metrics}
example_sceset <- calculateQCMetrics(example_sceset, feature_controls = 1:20)
varLabels(example_sceset)
```

More than one set of feature controls can be defined if desired.

```{r calc-qc-metrics-multi-feature-controls}
example_sceset <- calculateQCMetrics(
    example_sceset, feature_controls = list(controls1 = 1:20, controls2 = 500:1000),
    cell_controls = list(set_1 = 1:5, set_2 = 31:40))
varLabels(example_sceset)
```

### Cell-level QC metrics

This function adds the following columns to `pData(object)`:

* `total_counts`: total number of counts for the cell (aka 'library size')
* `log10_total_counts`: total_counts on the log10-scale
* `total_features`: the number of features for the cell that have expression above the
detection limit (default detection limit is zero)
* `filter_on_total_counts`: would this cell be filtered out based on its log10-total_counts
being (by default) more than 5 median absolute deviations from the median
log10-total_counts for the dataset?
* `filter_on_total_features`: would this cell be filtered out based on its total_features
being (by default) more than 5 median absolute deviations from the median
total_features for the dataset?
* `counts_feature_controls`: total number of counts for the cell that come from
(a set of user-defined) control features. Defaults to zero if no control features are
indicated.
* `counts_endogenous_features`: total number of counts for the cell that come from
endogenous features (i.e. not control features). Defaults to `total_counts` if no control features are
indicated.
* `log10_counts_feature_controls`: total number of counts from control features on
the log10-scale. Defaults to zero (i.e. log10(0 + 1), offset to avoid infinite
values) if no control features are indicated.
* `log10_counts_endogenous_features`: total number of counts from endogenous
features on the log10-scale. Defaults to zero (i.e. log10(0 + 1), offset to avoid
infinite values) if no control features are indicated.
* `n_detected_feature_controls`: number of defined feature controls that have
expression greater than the threshold defined in the object.
*`pct_counts_feature_controls`: percentage of all counts that come from the
defined control features. Defaults to zero if no control features are defined.

If we define multiple sets of feature controls, then the above will be supplied for
all feature sets, plus the set of all feature controls combined, as appropriate.

Furthermore, where "counts" appear in the above, the same metrics will also be
computed for "exprs", "tpm" and "fpkm" (if tpm and fpkm are present in the
`SCESet` object).

### Feature-level QC metrics

The function further adds the following columns to `fData(object)`:

* `mean_exprs`: the mean expression level of the gene/feature.
* `exprs_rank`: the rank of the feature's expression level in the cell.
* `total_feature_counts`: the total number of counts mapped to that feature across all
cells.
* `log10_total_feature_counts`: total feature counts on the log10-scale.
* `pct_total_counts`: the percentage of all counts that are accounted for by the
counts mapping to the feature.
* `is_feature_control`: is the feature a control feature? Default is `FALSE` unless
control features are defined by the user.
* `n_cells_exprs`: the number of cells for which the expression level of the feature
is above the detection limit (default detection limit is zero).

```{r list-fdata-qc}
names(fData(example_sceset))
```

As above, where "counts" appear in the above, the same metrics will also be
computed for "exprs", "tpm" and "fpkm" (if tpm and fpkm are present in the
`SCESet` object).


## Produce diagnostic plots for QC

Visualising the data and metadata in various ways can be very helpful for QC. We
have a suite of plotting functions to produce diagnostic plots for:

1. Plotting the most expressed features across the dataset.
2. Finding the most important principal components for a given cell phenotype or
metadata variable (from `pData(object)`).
3. Plotting a set of cell phenotype/metadata variables against each other and
calculating the (marginal) percentage of feature expression variance that they
explain.

These three QC plots can all be accessed through the function `plotQC` (we need
to make sure there are no features with zero or constant expression).


## QC and filtering of features

The first step in the QC process is filtering out unwanted features. We will
typically filter out features with very low overall expression, and any others
that plots or other metrics indicate may be problematic.

First we look at a plot that shows the top 50 (by default) most-expressed
features. By default, "expression" is defined using the feature counts (if
available), but $tpm$, $cpm$, $fpkm$ or the `exprs` values can be used instead,
if desired.

```{r plot-qc-expression, fig.height=7.5, fig.width=8.5}
keep_feature <- rowSums(is_exprs(example_sceset)) > 4
example_sceset <- example_sceset[keep_feature,]
## Plot QC
plotQC(example_sceset, type = "highest-expression")
p1 <- plotQC(example_sceset, type = "highest-expression", exprs_values = "exprs")
p2 <- plotQC(example_sceset, type = "highest-expression", exprs_values = "tpm")
multiplot(p1, p2, cols = 2)
```

As we can see in the output above, if the desired expression values are not present
in the `SCESet` object (e.g. there are no $tpm$ values, above), then `exprs(object)`
values will be used instead, with a warning. So in the above, we end up with the same
plot twice.

The `multiplot` function allows a very simple way to plot multiple `ggplot2`
plots on the same page. For more sophisticated possibilities for arranging
multiple `ggplot2` plots, check out the excellent [`cowplot`](http://cran.r-project.org/web/packages/cowplot/vignettes/introduction.html)
package, available on CRAN. If you have `cowplot` installed (highly
recommended), then `scater` will automatically use it to create particularly
attractive plots.

It can also be particularly useful to inspect the most-expressed features in
just the cell controls (for example blanks or bulk samples). Subsetting
capabilities for `SCESet` objects allow us to do this easily. In the previous
section, we defined two sets of cell controls in the call to
`calculateQCMetrics`. That function added the `is_cell_control` column to the
phenotype data of the `SCESet` object `example_sceset`, which indicates if a
cell is defined as a cell control across any of the cell control sets.

The `$` operator makes it easy to access the `is_cell_control` column and use it
to subset the `SCESet` as below. We can compare the most-expressed features in the
cell controls and in the cells of biological interest with this subsetting, as
shown below.

```{r plot-qc-expression-cell-controls, fig.height=7.5, fig.width=8.5}
p1 <- plotQC(example_sceset[, !example_sceset$is_cell_control],
             type = "highest-expression")
p2 <- plotQC(example_sceset[, example_sceset$is_cell_control],
       type = "highest-expression")
multiplot(p1, p2, cols = 2)
```

Another way to obtain an idea of the level of technical noise in the dataset is
to plot the frequency of expression (that is, number of cells with expression
for the gene above the defined threshold (default is zero)) against mean
expression expression level . A set of specific features to plot can be defined,
but need not be. By default, the function will look for defined feature controls
(as supplied to `calculateQCMetrics`). If feature controls are found, then these
will be plotted, if not then all features will be plotted.

```{r plot-qc-exprs-freq-vs-mean-default}
plotQC(example_sceset, type = "exprs-freq-vs-mean")
```

We can also plot just a subset of features with code like that below (plot not
shown):

```{r plot-qc-exprs-mean-vs-freq-defined-feature-set, results = 'hide', fig.show = FALSE}
feature_set_1 <- fData(example_sceset)$is_feature_control_controls1
plotQC(example_sceset, type = "exprs-freq-vs-mean", feature_set = feature_set_1)
```

Beyond these QC plots, we have a neat, general and flexible function for
plotting two feature metadata variables:

```{r plot-fdata}
plotFeatureData(example_sceset, aes(x = n_cells_exprs, y = pct_total_counts))
```

We can see that there is a small number of features that are ubiquitously
expressed expressed in all cells (`n_cells_exprs`) and account for a large
proportion of all counts observed (`pct_total_counts`; more than 0.5% of all
counts).

The subsetting of rows of `SCESet` objects makes it easy to drop unwanted
features.


## QC and filtering of cells

See `plotPhenoData` and other QC plots below. The subsetting of columns (which
correspond to cells) of `SCESet` objects makes it easy to drop unwanted cells.


### Plotting cell metadata variables

We also have neat functions to plot two cell metadata variables:

```{r plot-pdata}
plotPhenoData(example_sceset, aes(x = total_counts, y = total_features,
                                  colour = Mutation_Status))
```

```{r plot-pdata-cont-col, fig.show = FALSE}
plotPhenoData(example_sceset, aes(x = Mutation_Status, y = total_features,
                                  colour = log10_total_counts))
```

Note that ggplot aesthetics will work correctly (in general) for everything
except `colour` (`color`) and `fill`, which must be either columns of `pData`
or feature names (i.e. gene/transcript names).

These sorts of plots can be very useful for finding potentially problematic
cells.

```{r plot-pdata-col-gene-exprs}
plotPhenoData(example_sceset, aes(x = total_counts, y = total_features,
                                  colour = Gene_1000))
```

```{r plot-pdatacol-gene-exprs-2, fig.show = FALSE}
plotPhenoData(example_sceset, aes(x = pct_exprs_feature_controls,
                                  y = total_features, colour = Gene_0500))
```


```{r plot-pdatacol-gene-exprs-3, fig.show = FALSE}
plotPhenoData(example_sceset, aes(x = pct_exprs_feature_controls,
                                  y = pct_exprs_top_50_features,
                                  colour = Gene_0001))
```

The output of these functions is a `ggplot` object, which can be added to,
amended and altered. For example, if we don't like the legend position in the
`total_features` vs `total_counts` plot above, we can change it, and we could
also add a trend line for each group:

```{r plot-pdata-move-legend-add-trend}
plotPhenoData(example_sceset, aes(x = log10_exprs_feature_controls,
                                  y = total_features, colour = Mutation_Status)) +
    theme(legend.position = "top") +
    stat_smooth(method = "lm", se = FALSE, size = 2, fullrange = TRUE)
```

Tapping into the powerful capabilities of `ggplot2`, the possibilities are many.

A particularly useful plot for cell QC is plotting the percentage of expression
accounted for by feature controls against total_features.

```{r plot-pdata-pct-exprs-controls}
plotPhenoData(example_sceset, aes(x = total_features,
                                  y = pct_exprs_feature_controls,
                                  colour = Mutation_Status)) +
    theme(legend.position = "top") +
    stat_smooth(method = "lm", se = FALSE, size = 2, fullrange = TRUE)
```

On real data, we expect to see well-behaved cells with relatively high total_features
(number of features with detectable expression) and low percentage of expression
from feature controls. High percentage expression from feature controls and low
total_features are indicative of blank and failed cells.

The `plotPhenoData` function is useful for exploring the relationships between
the many QC metrics computed by `calculateQCMetrics` above. Often, problematic
cells can be identified from such plots.


### PCA and dimensionality reduction

The `plotPCA` function makes it easy to produce a PCA plot directly from an
`SCESet` object, which is useful for visualising cells.

The default plot shows the first two principal components and if any cell
controls have been defined, plots these cells in a different colour.

```{r plot-pca-default}
plotPCA(example_sceset)
```

By default, the PCA plot
is produced using the 500 features with the most variable expression across all
cells. The number of most-variable features used can be changed with the `ntop`
argument. Alternatively, a specific set of features to use for the PCA can be
defined with the `feature_set` argument. This allows, for example, using only
housekeeping features or control features to produce a PCA plot.

By default the PCA plot uses the expression values in the `exprs` slot of the
`SCESet` object, but other expression values can be used by specifying the
`exprs_values` argument:

```{r plot-pca-cpm}
plotPCA(example_sceset, exprs_values = "cpm")
```

A subset of features can be used to produce a PCA plot. In the code below only
the features defined as "feature controls" are used for the PCA (plot not shown).

```{r plot-pca-feature-controls, fig.show = FALSE}
plotPCA(example_sceset, feature_set = fData(example_sceset)$is_feature_control)
```

The function allows more than just the first two components to be plotted, and
also allows phenotype variables to be used to define the colour, shape and size
of points in the scatter plot.

```{r plot-pca-4comp-colby-shapeby, fig.height=5.5}
plotPCA(example_sceset, ncomponents = 4, colour_by = "Treatment",
        shape_by = "Mutation_Status")
```

When more than two components are plotted, the diagonal boxes in the scatter
plot matrix show the density for each component.

We can also use the colour and size of point in the plot to reflect feature
expression values.

```{r plot-pca-4comp-colby-sizeby-exprs, fig.height=5.5}
plotPCA(example_sceset, colour_by = "Gene_0001", size_by = "Gene_1000")
```


```{r plot-pca-4comp-sizeby-exprs, fig.height=5.5, fig.show = FALSE}
plotPCA(example_sceset, size_by = "Gene_1000")
```

```{r plot-pca-4comp-colby-exprs, fig.height=5.5, fig.show = FALSE}
plotPCA(example_sceset, colour_by = "Gene_1000")
```

Thus, expression levels of two marker genes or transcripts can be shown overlaid
onto reduced-dimension representations of cells. This also works for `plotTSNE`
plots.

```{r plot-tsne-1comp-colby-sizeby-exprs, fig.height=5.5}
plotTSNE(example_sceset, colour_by = "Gene_0001", size_by = "Gene_1000")
```

And for diffusion map plots.

```{r plot-difmap-1comp-colby-sizeby-exprs, fig.height=5.5}
plotDiffusionMap(example_sceset, colour_by = "Gene_0001", size_by = "Gene_1000")
```

The `SCESet` matrix has a `reducedDimension` slot, where coordinates for a
reduced dimension representation of the cells can be stored. If we so wish, the
top principal components can be added to the `reducedDimension` slot:

```{r plot-pca-4comp-colby-shapeby-save-pcs, fig.show = FALSE}
example_sceset <- plotPCA(example_sceset, ncomponents = 4,
                          colour_by = "Treatment", shape_by = "Mutation_Status",
                          return_SCESet = TRUE, theme_size = 12)
head(reducedDimension(example_sceset))
```

As shown above, the function `reducedDimension` (as well as the shorthand
`redDim`) can be used to access the reduced dimension coordinates.

This means that any other dimension reduction method can be applied and the
coordinates stored. For example, we might wish to use t-distributed stochastic
nearest neighbour embedding (t-SNE) or Gaussian process latent variable models
or any other dimensionality reduction method. We can store these in the `SCESet`
object and produce plots just as we did with PCA:

```{r plot-reduceddim-4comp-colby-shapeby}
plotReducedDim(example_sceset, ncomponents = 4, colour_by = "Treatment",
               shape_by = "Mutation_Status")
```

As for the PCA plots we can also colour and size points by feature expression.

```{r plot-reduceddim-4comp-colby-sizeby-exprs, fig.show = FALSE}
plotReducedDim(example_sceset, ncomponents = 4, colour_by = "Gene_1000",
               size_by = "Gene_0500")
```

(Here, the dimensionality reduction is just PCA, so we have the same plot as the
PCA plot above.)

Based on PCA or dimensionality reduction plots we may identify outlier cells
and, if we wish, filter them out of the analysis. There is also an outlier 
detection option available with the plotPCA function.

```{r plot-pca-outlier}
example_sceset <- plotPCA(example_sceset, pca_data_input = "pdata", 
                          detect_outliers = TRUE, return_SCESet = TRUE)

```

The `$outlier` element of the pData (phenotype data) slot of the `SCESet` 
contains indicator values about whether or not each cell has been designated as
an outlier based on the PCA. Here, these values can be accessed for filtering 
low quality cells with `example_sceset$outlier`. 

We can also use t-SNE and diffusion maps to visualise cells in a
reduced-dimension space:

```{r plot-tsne-colby-shapeby, fig.height=5.5, fig.show = FALSE}
plotTSNE(example_sceset, colour_by = "Treatment", shape_by = "Mutation_Status")
```

```{r plot-tsne-colby-sizeby-exprs, fig.height=5.5, fig.show = FALSE}
plotTSNE(example_sceset, colour_by = "Gene_0001", size_by = "Gene_1000")
```

```{r plot-tsne-4comp-sizeby-exprs, fig.height=5.5, fig.show = FALSE}
plotDiffusionMap(example_sceset, size_by = "Gene_1000")
```

```{r plot-tsne-4comp-colby-exprs, fig.height=5.5, fig.show = FALSE}
plotDiffusionMap(example_sceset, colour_by = "Gene_1000")
```



### Filtering cells

On this example dataset there are no cells that need filtering, but the
subsetting capabilities of `scater` make it easy to filter out unwanted cells.
Column subsetting selects cells, while row subsetting selects features (genes
or transcripts).

## QC of experimental variables

See the `plotQC` options below. The various plotting functions enable
visualisation of the relationship betwen experimental variables and the
expression data.

We can look at the relative importance of different explanatory variables with
some of the `plotQC` function options. We can compute the median marginal $R^2$
for each variable in `pData(example_sceset)` when fitting a linear model
regressing `exprs` values against just that variable.

The default approach looks at all variables in `pData(object)` and plots the top
`nvars_to_plot` variables (default is 10).

```{r plot-qc-expl-variables-all, warning=FALSE}
plotQC(example_sceset, type = "expl")
```

Alternatively, we can choose a subset of variables to plot in this manner.

```{r plot-qc-expl-variables-select-variables}
plotQC(example_sceset, type = "expl",
       variables = c("total_features", "total_counts", "Mutation_Status", "Treatment",
                     "Cell_Cycle"))
```

We can also easily produce plots to identify PCs that correlate with
experimental and QC variables of interest. The function ranks the principal
components in decreasing order of $R^2$ from a linear model regressing PC value
against the variable of interest.

We can also produce a pairs plot of potential explanatory variables ranked by
their median percentage of expression variance explained in a marginal (only
one explanatory variable) linear model.

```{r plot-qc-pairs-pc}
plotQC(example_sceset, type = "expl", method = "pairs", theme_size = 6)
```

In this small dataset total_counts and total_features explain a very large proportion of the
variance in feature expression. The proportion of variance that they explain for a
real dataset should be much smaller (say 1-5%).

The default is to plot six most-associated principal components against the
variable of interest.

```{r plot-qc-find-pcs-pcs-vs-vars, fig.width=8, fig.height=7}
p1 <- plotQC(example_sceset, type = "find-pcs", variable = "total_features",
        plot_type = "pcs-vs-vars")
p2 <- plotQC(example_sceset, type = "find-pcs", variable = "Cell_Cycle",
       plot_type = "pcs-vs-vars")
multiplot(p1, p2, cols = 2)
```

An alternative is to produce a pairs plot of the top five PCs.

```{r plot-qc-find-pcs-pairs, fig.width=10, fig.height=7}
p1 <- plotQC(example_sceset, type = "find-pcs", variable = "total_features",
       plot_type = "pairs-pcs")
p2 <- plotQC(example_sceset, type = "find-pcs", variable = "Cell_Cycle",
       plot_type = "pairs-pcs")
multiplot(p1, p2, cols = 2)
```


Combined with the excellent subsetting capabilities of the `SCESet` class, we
have convenient tools for conducting QC and pre-processing (e.g. filtering) data
for downstream analysis.


#  Data normalisation

High levels of variability between cells characterise single-cell expression
data. In almost all settings, many sources of unwanted variation should be
accounted for before proceeding with more sophisticated analysis. Below, we show
some of `scater`'s capabilities for normalising data for downstream analyses.

We can use feature controls to help address differences between cells arising from
different sets of transcripts being expressed and differences in library
composition.

Important experimental variables and latent factors (if used) can be regressed
out, so that normalised data has these effects removed.


#  Feature and cell pairwise distance matrices

In many single-cell expression analyses we may want to generate and store
pairwise distance matrices for both cells and features.

We can first look at a multidimensional scaling (MDS) plot using Euclidean
distance between cells.

```{r cell-pairwise-distance-matrices-euclidean, eval=TRUE}
cell_dist <- as.matrix(dist(t(exprs(example_sceset))))
cellPairwiseDistances(example_sceset) <- cell_dist
limma::plotMDS(cellDist(example_sceset))
```

Second, we could also look at an MDS plot using the count data and the Canberra
distance metric (plot not shown).

```{r cell-pairwise-distance-matrices-canberra, eval=TRUE, fig.show = FALSE}
cell_dist <- as.matrix(dist(t(counts(example_sceset)), method = "canberra"))
cellPairwiseDistances(example_sceset) <- cell_dist
limma::plotMDS(cellDist(example_sceset))
```

We could also look at an MDS plot for the features, here using Euclidean
distance (plot not shown).

```{r feature-pairwise-distance-matrices, eval=FALSE}
feature_dist <- as.matrix(dist(exprs(example_sceset)))
featurePairwiseDistances(example_sceset) <- feature_dist
limma::plotMDS(featDist(example_sceset))
```


#  Using **kallisto** to quantify transcript abundance from within R

Lior Pachter's group at Berkeley has recently released a new software tool
called **kallisto** for rapid quantification of transcript abundance from RNA-seq
data. In `scater`, a couple of wrapper functions to `kallisto` enable easy and
extremely fast quantification of transcript abundance from within R.

```{r kallisto-demo-kallisto-test-data, eval=FALSE}
################################################################################
### Tests and Examples

# Example if in the kallisto/test directory
setwd("/home/davis/kallisto/test")
kallisto_log <- runKallisto("targets.txt", "transcripts.idx", single_end=FALSE,
            output_prefix="output", verbose=TRUE, n_bootstrap_samples=10)

sce_test <- readKallistoResults(kallisto_log, read_h5=TRUE)
sce_test
```

An example using real data from a project investigating cell cycle. Not that
this analysis is not 'live' (the raw data are not included in the package), but
it demonstrates what can be done with `scater` and `kallisto`.

```{r kallisto-cell-cycle-example, eval=FALSE}
setwd("/home/davis/021_Cell_Cycle/data/fastq")
system("wc -l targets.txt")
ave_frag_len <- mean(c(855, 860, 810, 760, 600, 690, 770, 690))

kallisto_test <- runKallisto("targets.txt",
                             "Mus_musculus.GRCm38.rel79.cdna.all.ERCC.idx",
                             output_prefix="kallisto_output_Mmus", n_cores=12,
                             fragment_length=ave_frag_len, verbose=TRUE)
sce_kall_mmus <- readKallistoResults(kallisto_test, read_h5=TRUE)
sce_kall_mmus

sce_kall_mmus <- readKallistoResults(kallisto_test)

sce_kall_mmus <- getBMFeatureAnnos(sce_kall_mmus)

head(fData(sce_kall_mmus))
head(pData(sce_kall_mmus))
sce_kall_mmus[["start_time"]]

counts(sce_kall_mmus)[sample(nrow(sce_kall_mmus), size=15), 1:6]

## Summarise expression at the gene level
sce_kall_mmus_gene <- summariseExprsAcrossFeatures(
    sce_kall_mmus, exprs_values="tpm", summarise_by="feature_id")

datatable(fData(sce_kall_mmus_gene))

sce_kall_mmus_gene <- getBMFeatureAnnos(
    sce_kall_mmus_gene, filters="ensembl_gene_id",
    attributes=c("ensembl_gene_id", "mgi_symbol", "chromosome_name",
                 "gene_biotype", "start_position", "end_position",
                 "percentage_gc_content", "description"),
    feature_symbol="mgi_symbol", feature_id="ensembl_gene_id",
    biomart="ensembl", dataset="mmusculus_gene_ensembl")

datatable(fData(sce_kall_mmus_gene))

## Add gene symbols to featureNames to make them more intuitive
new_feature_names <- featureNames(sce_kall_mmus_gene)
notna_mgi_symb <- !is.na(fData(sce_kall_mmus_gene)$mgi_symbol)
new_feature_names[notna_mgi_symb] <- fData(sce_kall_mmus_gene)$mgi_symbol[notna_mgi_symb]
notna_ens_gid <- !is.na(fData(sce_kall_mmus_gene)$ensembl_gene_id)
new_feature_names[notna_ens_gid] <- paste(new_feature_names[notna_ens_gid],
          fData(sce_kall_mmus_gene)$ensembl_gene_id[notna_ens_gid], sep="_")
sum(duplicated(new_feature_names))
featureNames(sce_kall_mmus_gene) <- new_feature_names
head(featureNames(sce_kall_mmus_gene))
tail(featureNames(sce_kall_mmus_gene))
sum(duplicated(fData(sce_kall_mmus_gene)$mgi_symbol))

```