---
title: "Using pRoloc for spatial proteomics data analysis"
author:
- name: Lisa M. Breckels
  affiliation: Cambridge Centre for Proteomics, University of Cambridge, UK
- name: Laurent Gatto
  affiliation: de Duve Institute, UCLouvain, Belgium
package: pRoloc
abstract: >
 This tutorial illustrates the usage of the `pRoloc` R
 package for the analysis and interpretation of spatial proteomics
 data. It walks the reader through the creation of *MSnSet* instances
 that hold the quantitative proteomics data and meta-data, and
 introduces several aspects of data analysis, including data
 visualisation and the application of machine learning to predict
 protein localisation.
output:
  BiocStyle::html_document:
   toc_float: true
bibliography: pRoloc.bib
vignette: >
  %\VignetteIndexEntry{Using pRoloc for spatial proteomics data analysis}
  %\VignetteKeywords{Bioinformatics, Machine learning, Organelle, Spatial Proteomics}
  %\VignetteEncoding{UTF-8}
  %\VignetteEngine{knitr::rmarkdown}
editor_options: 
  chunk_output_type: console
---

```{r style, echo = FALSE, results = 'asis'}
BiocStyle::markdown()
```

```{r env, include=FALSE, echo=FALSE, cache=FALSE}
library("knitr")
opts_chunk$set(stop_on_error = 1L)
suppressPackageStartupMessages(library("MSnbase"))
suppressWarnings(suppressPackageStartupMessages(library("pRoloc")))
suppressPackageStartupMessages(library("pRolocdata"))
suppressPackageStartupMessages(library("class"))
set.seed(1)
```

# Foreword {-}

*[MSnbase](http://bioconductor.org/packages/MSnbase)* and
*[pRoloc](http://bioconductor.org/packages/pRoloc)* are under active
development; current functionality is evolving and new features are
added on a regular basis. This software is free and open-source. If
you use it, please support the project by citing it in publications:

> Gatto L. and Lilley K.S. *MSnbase - an R/Bioconductor package for
> isobaric tagged mass spectrometry data visualization, processing and
> quantitation*. [Bioinformatics 28, 288-289 (2011)](https://academic.oup.com/bioinformatics/article-lookup/doi/10.1093/bioinformatics/btr645).

> Gatto L, Breckels LM, Wieczorek S, Burger T, Lilley
> KS. *Mass-spectrometry-based spatial proteomics data analysis using
> pRoloc and pRolocdata.*
> [Bioinformatics. 2014 May 1;30(9):1322-4.](https://academic.oup.com/bioinformatics/article-lookup/doi/10.1093/bioinformatics/btu013).

> Breckels LM, Mulvey CM, Lilley KS and Gatto L. A Bioconductor
> workflow for processing and analysing spatial proteomics
> data. F1000Research 2016, 5:2926
> [doi: 10.12688/f1000research.10411.1](https://f1000research.com/articles/5-2926/).

If you are using the *phenoDisco* function, please also cite

> Breckels L.M., Gatto L., Christoforou A., Groen A.J., Lilley K.S.
> and Trotter M.W.  *The effect of organelle discovery upon
> sub-cellular protein localisation.* J Proteomics,
> [S1874-3919(13)00094-8 (2013)](http://www.sciencedirect.com/science/article/pii/S1874391913000948).

If you are using any of the transfer learning functions, please also
cite

> Breckels LM, Holden S, Wojnar D, Mulvey CM, Christoforou A, Groen A,
> Trotter MW, Kohlbacher O, Lilley KS and Gatto L (2016). *Learning
> from heterogeneous data sources: an application in spatial
> proteomics*. PLoS Comput Biol 12(5):e1004920. doi:
> [10.1371/journal.pcbi.1004920](http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1004920).

If you are using any of the Bayesian generative models, please also
cite

> Crook OM, Mulvey CM, Kirk PDW, Lilley KS and Gatto L. *A Bayesian
> Mixture Modelling Approach For Spatial Proteomics*. bioRxiv 282269;
> doi: [10.1101/282269](https://doi.org/10.1101/282269).

For an introduction to spatial proteomics data analysis:

> Gatto L, Breckels LM, Burger T, Nightingale DJ, Groen AJ, Campbell
> C, Nikolovski N, Mulvey CM, Christoforou A, Ferro M, Lilley KS. *A
> foundation for reliable spatial proteomics data analysis*. Mol Cell
> Proteomics. [2014 Aug;13(8):1937-52](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4125728/). doi:
> 10.1074/mcp.M113.036350.

The *[pRoloc](http://bioconductor.org/packages/pRoloc)* package
contains additional vignettes and reference material:

* *pRoloc-tutorial*: pRoloc tutorial (this vignette).
* *pRoloc-ml*: Machine learning techniques available in pRoloc.
* *pRoloc-transfer-learning*: A transfer learning algorithm for spatial proteomics.
* *pRoloc-bayesian*: Bayesian spatial proteomics with pRoloc.

# Questions and bugs {-}

You are welcome to contact us directly about `r Biocpkg("pRoloc")`.
For bugs, typos, suggestions or other questions, please file an issue
in our issue tracking system
(<https://github.com/lgatto/pRoloc/issues>) providing as much
information as possible, a reproducible example and the output of
`sessionInfo()`.

If you wish to reach a broader audience for general questions about
proteomics analysis using `R`, you may want to use the Bioconductor
support site: <https://support.bioconductor.org/>.


# Introduction {#sec:intro}

## Spatial proteomics

Spatial (or organelle) proteomics is the study of the localisation of
proteins inside cells. The sub-cellular compartments of interest can
be organelles, i.e. structures defined by lipid bi-layers,
macro-molecular assemblies of proteins and nucleic acids, or large
protein complexes. In this document, we will focus on
mass-spectrometry based approaches that assay a population of cells,
as opposed to microscopy-based techniques that monitor single cells;
the former is the primary concern of `r Biocpkg("pRoloc")`, although
the techniques described below and the infrastructure in place could
also be applied to processed image data. The typical experimental
use-case for `r Biocpkg("pRoloc")` is
a set of fractions, originating from a total cell lysate. These
fractions can originate from a continuous gradient, like in the LOPIT
[@Dunkley2006] or PCP [@Foster2006] approaches, or can be
discrete fractions. The content of the fractions is then identified
and quantified (using labelled or un-labelled quantitation
techniques). Using relative quantitation of known organelle residents,
termed organelle markers, organelle-specific profiles along the
gradient are determined and new residents are identified based on
matching of these distribution profiles. See for example
[@Gatto2010] and references therein for a detailed review on
organelle proteomics.

It should be noted that large protein complexes that are not
necessarily separately enclosed within their own lipid bi-layer can
also be detected by such techniques, as long as a distinct profile can
be defined across the fractions.

## About R and *pRoloc*

R [@Rstat] is a statistical programming language and interactive
working environment. It can be expanded by so-called packages to
confer new functionality to users. Many such packages have been
developed for the analysis of high-throughput biology, notably through
the Bioconductor project [@Gentleman2004]. Two packages are of
particular interest here, namely `r Biocpkg("MSnbase")` [@Gatto2012]
and `r Biocpkg("pRoloc")`. The former provides flexible infrastructure
to store and manipulate quantitative proteomics data and the
associated meta-data and the latter implements specific algorithmic
technologies to analyse organelle proteomics data.

Among the advantages of R are robust statistical procedures, good
visualisation capabilities, excellent documentation, reproducible
research^[The content of this document is compiled (the code is executed and its output, text and figures, is displayed dynamically) to generate the html file.],
the power and flexibility of the R language and environment itself,
and a rich ecosystem of specialised functionality in many domains of
bioinformatics: tools for many omics technologies, including
proteomics, bio-statistics, gene ontology and biological pathway
analysis, and more. Although some specific graphical user interfaces
(GUIs) exist, interaction with R happens through a command line
interface. While this mode of interaction might look alien to new
users, experience has proven that, after an initial steep learning
curve, great results can be achieved by non-programmers. Furthermore,
specific and general documentation is plentiful, and beginner and
advanced course material is also widely available.

Once R is started, the first step to enable the functionality of
specific packages is to load them using the `library` function, as
shown in the code chunk below:

```{r libraries}
library("MSnbase")
library("pRoloc")
library("pRolocdata")
```

`r Biocpkg("MSnbase")` implements the data containers that are used by
`r Biocpkg("pRoloc")`. `r Biocexptpkg("pRolocdata")` is a data package
that supplies several published organelle proteomics data sets.

# Data structures {#sec:data}

## Example data

The data used in this tutorial has been published in
[@Tan2009]. The LOPIT technique [@Dunkley2006] is used to
localise integral and associated membrane proteins in
*Drosophila melanogaster* embryos. Briefly, embryos were
collected at 0 -- 16 hours, homogenised and centrifuged to collect the
supernatant, removing cell debris and nuclei. Membrane fractionation
was performed on an iodixanol gradient and fractions were quantified
using iTRAQ isobaric tags [@Ross2004] as follows: fractions 4/5,
114; fractions 12/13, 115; fraction 19, 116 and fraction 21,
117. Labelled peptides were then separated using cation exchange
chromatography and analysed by LC-MS/MS on a QSTAR XL
quadrupole-time-of-flight mass spectrometer (Applied Biosystems). The
original localisation analysis was performed using partial least
square discriminant analysis (PLS-DA). Relative quantitation data was
retrieved from the supplementary file `pr800866n_si_004.xls`
(<http://pubs.acs.org/doi/suppl/10.1021/pr800866n/suppl_file/pr800866n_si_004.xls>)
and imported into R as described below. We will concentrate on the
first replicate.

## Importing and loading data

This section illustrates how to import data in comma-separated value
(csv) format into an appropriate R data structure. The first section
shows the original `csv` spreadsheet, as published by the authors, and
how one can read such a file into R using the *read.csv*
function. This spreadsheet file is similar to the output of many
quantitation software packages.

In the next section, we show 2 `csv` files containing a subset of the
columns of the original `pr800866n_si_004-rep1.csv` file and another
short file, created manually, that will be used to create the
appropriate R data.

### The original data file {#sec:orgcsv}

```{r readCsvData0}
## The original data for replicate 1, available
## from the pRolocdata package
f0 <- dir(system.file("extdata", package = "pRolocdata"),
      full.names = TRUE,
      pattern = "pr800866n_si_004-rep1.csv")
csv <- read.csv(f0)
```

The first three lines of the original spreadsheet, containing the data
for replicate one, are illustrated below (using the function
*head*). It contains `r nrow(csv)` rows (proteins) and `r ncol(csv)`
columns, including protein identifiers, database accession numbers,
gene symbols, reporter ion quantitation values, information related to
protein identification, etc.

```{r showOrgCsv}
head(csv, n=3)
```

### From `csv` files to R data {#sec:csv}

There are several ways to create the desired R data object, termed
*MSnSet*, that will be used to perform the actual sub-cellular
localisation prediction. Here, we illustrate a method that uses
separate spreadsheet files for the quantitation data, the feature
meta-data and the sample (fraction) meta-data, together with the
`readMSnSet` constructor function; this will hopefully be the most
straightforward approach for new users.

```{r readCsvData1, tidy = FALSE}
## The quantitation data, from the original data
f1 <- dir(system.file("extdata", package = "pRolocdata"),
      full.names = TRUE, pattern = "exprsFile.csv")
exprsCsv <- read.csv(f1)
## Feature meta-data, from the original data
f2 <- dir(system.file("extdata", package = "pRolocdata"),
      full.names = TRUE, pattern = "fdataFile.csv")
fdataCsv <- read.csv(f2)
## Sample meta-data, a new file
f3 <- dir(system.file("extdata", package = "pRolocdata"),
      full.names = TRUE, pattern = "pdataFile.csv")
pdataCsv <- read.csv(f3)
```



* `exprsFile.csv` containing the quantitation (expression) data for
  the `r nrow(exprsCsv)` proteins and 4 reporter tags.
```{r showExprsFile}
head(exprsCsv, n=3)
```

* `fdataFile.csv` containing meta-data for the `r nrow(fdataCsv)`
  features (here proteins).
```{r showFdFile}
head(fdataCsv, n=3)
```

* `pdataFile.csv` containing sample (here fraction) meta-data. This
  simple file has been created manually.
```{r showPdFile}
pdataCsv
```


A self-contained data structure, called *MSnSet* (defined in
the `r Biocpkg("MSnbase")` package) can now easily be generated using the
*readMSnSet* constructor, providing the respective
`csv` file names shown above and specifying that the data is
comma-separated (with `sep = ","`). Below, we call that object
`tan2009r1` and display its content.

```{r makeMSnSet}
tan2009r1 <- readMSnSet(exprsFile = f1,
            featureDataFile = f2,
            phenoDataFile = f3,
            sep = ",")
tan2009r1
```


## A shorter input work flow

The `readMSnSet2` function provides a simplified import pipeline.  It
takes a single spreadsheet as input (default is `csv`) and extracts
the columns identified by `ecol` to create the expression data, while
the others are used as feature meta-data. `ecol` can be a `character`
with the respective column labels or a `numeric` with their
indices. In the former case, it is important to make sure that the
names match exactly. Special characters like `'-'` or `'('` will be
transformed by R into `'.'` when the `csv` file is read in.
Optionally, one can also specify a column to be used as feature names.
Note that these must be unique to guarantee the validity of the final
object.

```{r readMSnSet2}
ecol <- paste("area", 114:117, sep = ".")
fname <- "Protein.ID"
eset <- readMSnSet2(f0, ecol, fname)
eset
```

The `ecol` columns can also be queried interactively from R using the
`getEcols` and `grepEcols` functions. The former returns a character
with all column names, given a splitting character, i.e. the
separation value of the spreadsheet (typically `","` for `csv`,
`"\t"` for `tsv`, ...). The latter can be used to grep a
pattern of interest to obtain the relevant column indices.

```{r ecols}
getEcols(f0, ",")
grepEcols(f0, "area", ",")
e <- grepEcols(f0, "area", ",")
readMSnSet2(f0, e)
```

The `phenoData` slot can now be updated accordingly using the
replacement functions `phenoData<-` or `pData<-` (see `?MSnSet` for
details).
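
For example, the fractions pooled into each iTRAQ reporter (see the
experimental design above) could be recorded as a new sample
variable. The sketch below assumes the `eset` object created above and
a hypothetical `fractions` variable name.

```{r updatePdata, eval = FALSE}
## record the gradient fractions pooled into each reporter tag
pData(eset)$fractions <- c("4/5", "12/13", "19", "21")
pData(eset)
```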

### The *MSnSet* class

Although there are additional specific sub-containers for additional
meta-data (for instance to make the object MIAPE compliant), the
feature meta-data (the sub-container, or slot, `featureData`) and
sample meta-data (the `phenoData` slot) are the most important
ones. They need to meet the following validity requirements (see the
[figure below](#fig:msnset) and the code chunk after this list):

* the number of rows in the expression/quantitation data and feature
  data must be equal and the row names must match exactly, and

* the number of columns in the expression/quantitation data and number
  of rows in the sample meta-data must be equal and the column/row
  names must match exactly.
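
These requirements can be checked directly, as sketched below for the
`tan2009r1` object created above; `validObject` performs these (and
other) checks automatically.

```{r msnsetValidity, eval = FALSE}
## feature names must match between the expression and feature data
stopifnot(identical(rownames(exprs(tan2009r1)),
                    rownames(fData(tan2009r1))))
## sample names must match between the expression and sample data
stopifnot(identical(colnames(exprs(tan2009r1)),
                    rownames(pData(tan2009r1))))
validObject(tan2009r1)
```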

It is common, in the context of `r Biocpkg("pRoloc")`, to update the
feature meta-data (described in section \@ref(sec:analysis)) by adding
new columns, without breaking the object's validity. Similarly, the
sample meta-data can also be updated by adding new sample variables. A
detailed description of the `MSnSet` class is available by typing
`?MSnSet` in the R console.

![Dimension requirements for the respective expression, feature and sample meta-data slots.](./Figures/msnset.png){#fig:msnset}

The individual parts of this data object can be accessed with their
respective accessor methods:

* the quantitation data can be retrieved with `exprs(tan2009r1)`,
* the feature meta-data with `fData(tan2009r1)` and
* the sample meta-data with `pData(tan2009r1)`.

The advantage of this structure is that it can be manipulated as a
whole and the respective parts of the data object will remain
compatible. The code chunk below, for example, shows how to extract
the first 5 proteins and the first 2 samples:

```{r showSubset}
smallTan <- tan2009r1[1:5, 1:2]
dim(smallTan)
exprs(smallTan)
```

Several data sets, including the 3 replicates from [@Tan2009], are
distributed as *MSnSet* instances in the `r Biocexptpkg("pRolocdata")`
package. Others include the *Arabidopsis thaliana* LOPIT data from
[@Dunkley2006] (`dunkley2006`) and the mouse PCP data from
[@Foster2006] (`foster2006`). Each data set can be loaded with the
`data` function, as shown below for the first replicate from
[@Tan2009].

```{r rmtan, echo=FALSE}
## remove manually constructed data
rm(tan2009r1)
data(tan2009r1)
```

```{r loadTan1}
data(tan2009r1)
```

The original marker proteins are available as a feature meta-data
variable called `markers.orig`, and the output of the partial least
square discriminant analysis, applied in the original publication, in
the `PLSDA` feature variable. The most up-to-date marker list for the
experiment can be found in `markers`. This feature meta-data column
can be added from a simple `csv` marker file using the `addMarkers`
function - see `?addMarkers` for details.

The organelle markers are illustrated below using the convenience
function *getMarkers*, but could also be done manually by accessing
feature variables directly using *fData()*.

```{r lookAtTan}
getMarkers(tan2009r1, fcol = "markers.orig")
getMarkers(tan2009r1, fcol = "PLSDA")
```

**Important** As can be seen above, some proteins are labelled
`"unknown"`, defining non-marker proteins. This is a convention in
many `r Biocpkg("pRoloc")` functions. Missing annotations (empty
strings) will not be considered as of unknown localisation; we prefer
to avoid empty strings and make the absence of known localisation
explicit by using the `"unknown"` tag. This information will be used
to separate marker and non-marker (unlabelled) proteins when
proceeding with data visualisation and clustering (sections
\@ref(sec:viz) and \@ref(sec:usml)) and classification analysis
(section \@ref(sec:sml)).
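
The marker and unlabelled sub-sets can be extracted with the
`markerMSnSet` and `unknownMSnSet` convenience functions respectively,
as sketched below.

```{r mrkSubsets, eval = FALSE}
## split the data into marker and unlabelled (unknown) sub-sets
mrk <- markerMSnSet(tan2009r1, fcol = "markers.orig")
unk <- unknownMSnSet(tan2009r1, fcol = "markers.orig")
dim(mrk)
dim(unk)
```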

## *pRoloc*'s organelle markers

Exploring protein annotations and defining sub-cellular localisation
markers (i.e. known residents of a specific sub-cellular niche in a
species, under a condition of interest) play important roles in the
analysis of spatial proteomics data. The latter is essential for
downstream supervised machine learning (ML) classification for protein
localisation prediction (as discussed later and in the
`vignette("pRoloc-ml", package = "pRoloc")`) and the former is interesting for
initial biological interpretation through matching annotations to the
data structure.

Robust protein-localisation prediction relies on markers that
reflect the true sub-cellular diversity of the multivariate data. The
validity of markers is generally assured by expert curation. This can
be time consuming and difficult owing to the limited number of marker
proteins that exist in databases and elsewhere. The Gene Ontology (GO)
database, and in particular its cellular component (CC) namespace,
provides a good starting point for protein annotation and marker
definition. Nevertheless, automatic extraction from databases, and in
particular GO CC, is only a first step in sub-cellular localisation
analysis and requires additional curation to counter unreliable
annotation based on data that is inaccurate or out of context for the
biological question under investigation.

The `r Biocpkg("pRoloc")` package distributes starting sets of markers that
have been obtained by mining the `r Biocexptpkg("pRolocdata")`
datasets and curation by various members of the
[Cambridge Centre for Proteomics](http://proteomics.bio.cam.ac.uk/). The
available marker sets can be obtained and loaded using the
`pRolocmarkers` function:

```{r markers}
pRolocmarkers()
head(pRolocmarkers("dmel"))
table(pRolocmarkers("dmel"))
```

These markers can then be added to a new *MSnSet* using the
*addMarkers* function by matching the marker names (protein
identifiers) and the feature names of the *MSnSet*. See *?addMarkers*
for examples. 
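
For instance, the *Drosophila* markers above could be added to the
`tan2009r1` data, as sketched below (assuming the marker names match
the feature names of the experiment; we use `mcol = "markers.dmel"` so
as not to overwrite the existing `markers` column).

```{r addMarkersSketch, eval = FALSE}
## add the Drosophila marker set as a new feature variable
dmelmrk <- pRolocmarkers("dmel")
tan2009r1 <- addMarkers(tan2009r1, dmelmrk, mcol = "markers.dmel")
getMarkers(tan2009r1, fcol = "markers.dmel")
```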

## Data processing

The quantitation data obtained in the supplementary file is normalised
to the sum of intensities of each protein; the sum of fraction
quantitation for each protein equals 1 (considering rounding
errors). This can quickly be verified by computing the row sums of the
expression data.

```{r relativeQuants}
summary(rowSums(exprs(tan2009r1)))
```

The `normalise` method (also available as `normalize`) from the
`r Biocpkg("MSnbase")` package can be used to obtain relative
quantitation data, as illustrated below on another iTRAQ test data
set, available from `r Biocpkg("MSnbase")`. Several normalisation
`methods` are available and described in `?normalise`.  For many
algorithms, including classifiers in general and support vector
machines in particular, it is important to properly pre-process the
data.  Centering and scaling of the data is also available with the
`scale` method.
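
Below is a minimal sketch of centering and scaling the `tan2009r1`
data; after scaling, each channel has mean 0 and unit variance.

```{r scaleSketch, eval = FALSE}
## centre each channel and scale it to unit variance
tanscaled <- scale(tan2009r1, center = TRUE, scale = TRUE)
## the column means are now (close to) 0
summary(colMeans(exprs(tanscaled)))
```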


In the code chunk below, we first create a test *MSnSet*
instance^[Briefly, the `itraqdata` raw iTRAQ4-plex data is quantified by trapezoidation using `r Biocpkg("MSnbase")` functionality. See the `MSnbase-demo` vignette for details.]
and illustrate the effect of `normalise(..., method = "sum")`.

```{r norm, echo=TRUE, message=FALSE, cache=TRUE}
## create a small illustrative test data
data(itraqdata)
itraqdata <- quantify(itraqdata, method = "trap",
              reporters = iTRAQ4)
## the quantification data
head(exprs(itraqdata), n = 3)
summary(rowSums(exprs(itraqdata)))
## normalising to the sum of feature intensities
itraqnorm <- normalise(itraqdata, method = "sum")
processingData(itraqnorm)
head(exprs(itraqnorm), n = 3)
summary(rowSums(exprs(itraqnorm)))
```

Note above how the processing undergone by the *MSnSet*
instances `itraqdata` and `itraqnorm` is stored in
another such specific sub-container, the `processingData`
slot.

The different features (proteins in the `tan2009r1` data
above, but these could also represent peptides or MS$^2$ spectra) are
characterised by unique names. These can be retrieved with the
`featureNames` function.

```{r featurenames}
head(featureNames(tan2009r1))
```

If we look back at section \@ref(sec:csv), we see that these have been
automatically assigned using the first columns in the `exprsFile.csv`
and `fdataFile.csv` files. It is thus crucial for these respective
first columns to be identical. Similarly, the sample names can be
retrieved with `sampleNames(tan2009r1)`.

# Data visualisation {#sec:viz}

The following sections will focus on two closely related aspects, data
visualisation and data analysis (i.e. organelle assignments). Data
visualisation is used in the context of quality control, to convince
ourselves that the data displays the expected properties so that the
output of further processing can be trusted. Visualising results of
the localisation prediction is also essential, to control the validity
of these results, before proceeding with orthogonal (and often
expensive) *dry* or *wet* validation.

## Profile plots {#sec:profplot}

The underlying principle of gradient approaches is that we have
separated organelles along the gradient and by doing so, generated
organelle-specific protein distributions along the gradient
fractions. The most natural visualisation is shown on figure
\@ref(fig:plotdist1), obtained using the sub-setting functionality of
*MSnSet* instances and the `plotDist` function, as illustrated below.

```{r showplotdist, echo=TRUE, eval=FALSE}
## indices of the mito markers
j <- which(fData(tan2009r1)$markers.orig == "mitochondrion")
## indices of all proteins assigned to the mito
i <- which(fData(tan2009r1)$PLSDA == "mitochondrion")
plotDist(tan2009r1[i, ],
     markers = featureNames(tan2009r1)[j])
```


```{r plotdist1, echo=FALSE, fig.cap="Distribution of protein intensities along the fractions of the separation gradient for 4 organelles: mitochondrion (red), ER/Golgi (blue, ER markers and green, Golgi markers) and plasma membrane (purple)."}
par(mfrow = c(2,2), mar = c(4, 4, 2, 0.1))
cls <- getStockcol()[1:4]
plotDist(tan2009r1[which(fData(tan2009r1)$PLSDA == "mitochondrion"), ],
     markers = featureNames(tan2009r1)
     [which(fData(tan2009r1)$markers.orig == "mitochondrion")],
     mcol = cls[1])
title(main = "mitochondrion")
plotDist(tan2009r1[which(fData(tan2009r1)$PLSDA == "ER/Golgi"), ],
     markers = featureNames(tan2009r1)
     [which(fData(tan2009r1)$markers.orig == "ER")],
     mcol = cls[2])
title(main = "ER")
plotDist(tan2009r1[which(fData(tan2009r1)$PLSDA == "ER/Golgi"), ],
     markers = featureNames(tan2009r1)
     [which(fData(tan2009r1)$markers.orig == "Golgi")],
     mcol = cls[3])
title(main = "Golgi")
plotDist(tan2009r1[which(fData(tan2009r1)$PLSDA == "PM"), ],
     markers = featureNames(tan2009r1)
     [which(fData(tan2009r1)$markers.orig == "PM")],
     mcol = cls[4])
title(main = "PM")
```

## Sub-cellular cluster dendrogram {#sec:dendro}

To gain a quick overview of the distance/similarity between the
sub-cellular clusters, it can be useful to compare average marker
profiles, rather than all profiles, as presented in the profile plots
above. The `mrkHClust` function calculates average class profiles and
generates the resulting dendrogram.

```{r dendro, fig.cap="Hierarchical clustering. Average distance between organelle classes."}
mrkHClust(tan2009r1)
```

On figure \@ref(fig:dendro), we see that the lysosome and the ribosome
60S are separated by the smallest distance. The advantage of this
representation is that it provides a quick snapshot of the average
similarity between organelles using the complete profiles (as opposed
to a PCA plot, discussed in the next section). The main drawback is
that it ignores any variability in individual markers (cluster
tightness). It is however a good guide for a more thorough exploration
of the data, as described in the next sections.


Note the colours of the labels on the dendrogram (figure
\@ref(fig:dendro)), which match the colours used to annotate PCA
plots, described in the next section (figure \@ref(fig:plot2d)). These
colours are defined at the session level (see `getStockcol` and
`setStockcol`) and re-used throughout `r Biocpkg("pRoloc")` for
consistent annotation.
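
For example, the default colours can be inspected and modified as
sketched below; appending an alpha level produces semi-transparent
colours, which can be useful on dense plots.

```{r stockcols, eval = FALSE}
## inspect the default colours used for annotation
head(getStockcol())
## use semi-transparent versions of the default colours
setStockcol(paste0(getStockcol(), "80"))
## restore the defaults
setStockcol(NULL)
```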

## Average organelle class profile plot  {#sec:avrprofplot}

We can also visualise the average organelle class profiles generated by
`mrkConsProfiles` using `plotConsProfiles`. We can optionally order the
organelle classes on the y-axis according to the hierarchical clustering
from `mrkHClust`. See `?mrkHClust` for more details.

```{r plotavrprof, fig.cap="Average organelle class profiles. Protein intensity indicated by colour. Organelle classes ordered by hierarchical clustering"}
## dendrogram
hc <- mrkHClust(tan2009r1, plot = FALSE)

## order of the marker classes according to the dendrogram
mm <- getMarkerClasses(tan2009r1)
m_order <- levels(factor(mm))[order.dendrogram(hc)]

## average marker profile
fmat <- mrkConsProfiles(tan2009r1)

plotConsProfiles(fmat, order=m_order)
```

## Dimensionality reduction {#sec:pcalot}

Alternatively, we can combine all organelle groups in one single
two-dimensional figure by applying a dimensionality reduction technique
such as Principal Component Analysis (PCA) using the `plot2D` function
(see figure \@ref(fig:plot2d)). The protein profile vectors are
summarised into 2 values that can be visualised in two dimensions, so
that the variability in the data will be maximised along the first
principal component (PC1). The second principal component (PC2) is
then chosen so as to be orthogonal to PC1 while explaining as much
variance in the data as possible, and so on for PC3, PC4, etc.

Using a PCA representation to visualise a spatial proteomics
experiment, we can easily plot all the proteins on the same figure, as
well as many sub-cellular clusters. These clusters are defined in a
feature meta-data column (slot `featureData`, accessed as a
`data.frame` with the `fData` function) that is
declared with the `fcol` argument (the default is
`"markers"`, which contains the most current known markers for
this experiment, while the original markers published with the data
can be found in the feature variable `"markers.orig"`).


```{r plot2d, echo=TRUE, fig.asp=1, fig.cap="PCA plot. Representation of the proteins of `tan2009r1` after reduction of the 4 reporter quantitation data to 2 principal components."}
plot2D(tan2009r1, fcol = "markers")
addLegend(tan2009r1, fcol = "markers", cex = .7,
      where = "bottomright", ncol = 2)
```

As the default value for the `fcol` argument is `"markers"`, it is not
necessary to specify it. It is however mandatory to specify the feature
variable explicitly when using another annotation, for example to
visualise the set of markers described in the original publication.


```{r plot2dorg, echo=TRUE, fig.asp=1, fig.cap="PCA plot. Reduced set of markers for the `tan2009r1` data projected onto 2 principal components."}
plot2D(tan2009r1, fcol = "markers.orig")
addLegend(tan2009r1, fcol = "markers.orig", where = "bottomright")
```

It is also possible to visualise the data along 3 dimensions using the
*plot3D* function, which works like the two-dimensional version
([figure below](#fig:plot3D)). The resulting figure can be rotated by
the user.

```{r plot3Danim, echo=FALSE, eval=FALSE}
## here's how the animation below was created
data(hyperLOPIT2015)
suppressPackageStartupMessages(library("rgl"))
plot3D(hyperLOPIT2015)
movie3d(spin3d(c(1, 1, 1), rpm = 10), duration = 5,
    clean = FALSE,
    dir = "~/dev/00_github/pRoloc/vignettes/Figures/")
```

```{r plot3D, eval = FALSE}
plot3D(tan2009r1)
```

![Snapshot of the 3-dimensional PCA plot. The `tan2009r1` data is represented along the first 3 principal components.](./Figures/plot3D.png){#fig:plot3D}

```{r scree0, echo=FALSE}
pcvar <- plot2D(tan2009r1, method = "scree", plot = FALSE)
pcvar <- round(pcvar, 2)
```

As can be seen on the figures \@ref(fig:plot2d), \@ref(fig:plot2dorg)
and on the [3D plot above](#fig:plot3D), we indicate on the axis
labels the percentage of total variance explained by the individual
PCs. It is not unusual to reach over 75% along the first two PCs,
even for experiments with several tens of fractions. One can calculate
this information for all PCs by setting `method = "scree"` in
`plot2D`. On figure \@ref(fig:scree), we see that the four PCs in the
`tan2009r1` data account for `r paste(pcvar[1:3], collapse = ", ")`
and `r pcvar[4]` percent of the total variance.


```{r scree, fig.cap="Percentage of variance explained."}
plot2D(tan2009r1, method = "scree")
```

A variety of dimensionality reduction methods are available in
`plot2D`: `r paste(pRoloc:::plot2Dmethods, collapse = ", ")`.
Except for `scree` (see above) and `none` (no data transformation,
which can be useful when the data is already transformed and only
needs to be plotted), these can be used to produce visualisations of
the data in two dimensions. Two are worth some discussion here; the
readers are redirected to the manual page for more details.

Linear discriminant analysis (LDA) will project the protein occupancy
profiles in a new set of dimensions using as a criterion the
separation of marker classes by maximising the between class variance
to the within class variance ratio. As opposed to the
*unsupervised* PCA, the *supervised* LDA should not be
used to explore the data or as a quality control, but can be
useful to assess if one or more organelles have been preferentially
separated.

The t-Distributed Stochastic Neighbour Embedding
(t-SNE)^[<https://lvdmaaten.github.io/tsne/>] [@tsne:2008] is widely
applied in many areas of computational biology and, more generally, in
fields that need to visualise high-dimensional data. The t-SNE method
is non-linear, and will emphasise separation of different features
while grouping features with similar profiles. In addition, different
transformations are applied on different regions, leading to plots that
can substantially differ from a PCA plot. As a result, proximity in
two dimensions and tightness of the clusters can't be related to these
quantities in the original data. See *How to Use t-SNE
Effectively*^[<http://distill.pub/2016/misread-tsne/>] for a useful
non-technical introduction.

The results of the algorithm crucially depend on the values of its
input parameters, in particular the *perplexity*, which balances
global and local aspects of the data. The suggested range of values is
5 to 50, and the perplexity should not be greater than the number of
data points (a situation we can safely assume does not arise in modern
spatial proteomics experiments). Below, we test the effect of this
parameter along the suggested range, including the default value of
30.

```{r tsne, eval=FALSE}
data(dunkley2006)
perps <- sort(c(30, seq(5, 50, 15)))
par(mfrow = c(3, 2))
plot2D(dunkley2006, main = "PCA")
sapply(perps,
       function(perp) {
       plot2D(dunkley2006, method = "t-SNE",
          methargs = list(perplexity = perp))
       title(main = paste("t-SNE, perplexity", perp))
       })
```

![Effect of t-SNE's perplexity parameter on the `dunkley2006` data.](./Figures/tsne.png){#fig:tsne}


Other parameters that can affect the results are the number of
iterations and the learning rate epsilon.

The t-SNE algorithm takes much more time to complete than the other
available methods. In such cases, saving the results and re-plotting
with method `none` can be useful (see `?plot2D`). In the case of this
document, the [figure above](#fig:tsne) was pre-generated rather than
computed upon compilation.
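
One possible pattern, sketched below following the documentation in
`?plot2D`, is to compute the projection once, keep the coordinates
returned (invisibly) by `plot2D`, and re-plot them with
`method = "none"`, passing the original *MSnSet* through `methargs` to
provide the feature meta-data.

```{r tsneCache, eval = FALSE}
## compute the (slow) t-SNE projection once and keep the coordinates
coords <- plot2D(dunkley2006, method = "t-SNE", plot = FALSE)
## re-plot the saved coordinates without re-computing them
plot2D(coords, method = "none", methargs = list(dunkley2006))
```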

It is also possible to visualise the data using Uniform Manifold
Approximation and Projection (UMAP) [@McInnes:2018]. UMAP is another
non-linear dimensionality reduction method that has become
increasingly popular in recent years in the omics field. It works in
a similar way to t-SNE in that it tries to find a low-dimensional
representation that preserves the relationships between neighbours in
high-dimensional space. Without going into the theory, UMAP claims to
preserve both local and global structure in the data, whereas t-SNE
claims to preserve just the local structure.

Like t-SNE, the results of running an embedding depend on the input
parameters. The `plot2D` function uses the default settings from
the CRAN package `r CRANpkg("umap")` [@umap] (see `umap.defaults`). One
of the main parameters to optimise is the number of nearest neighbours.
In the below code chunk we test `c(5, 10, 30, 75, 100)`. The optimal
number of neighbours will depend on several factors, including the
total number of proteins or peptides in the data and also the number of
subcellular classes in the data. The default number of training epochs,
i.e. the number of iterations used in the initial low dimensional
embedding, should also be explored; it is set to 200
(`n_epochs = 200`) by default. Larger values are expected to give a
more accurate embedding. See the `umap`
vignette^[<https://cran.r-project.org/web/packages/umap/vignettes/umap.html>]
for more details.

```{r umap, eval=FALSE}
nn <- c(5, 10, 30, 75, 100)
par(mfrow = c(3, 2))
plot2D(dunkley2006, main = "PCA")
sapply(nn,
       function(nn) {
       set.seed(1)
       plot2D(dunkley2006, method = "UMAP",
          methargs = list(n_neighbors = nn, n_epochs = 50))
       title(main = paste("UMAP, nearest neighbours = ", nn))
       })
```

![Effect of UMAP's nearest neighbours parameter on the `dunkley2006` data.](./Figures/umap.png)

## Features of interest

In addition to highlighting sub-cellular niches as coloured clusters
on the PCA plot, it is also possible to define some arbitrary
*features of interest* that represent, for example, proteins of a
particular pathway or a set of interaction partners. Such sets of
proteins are recorded as *FeaturesOfInterest* instances, as illustrated
below (using the first ten features of our experiment):

```{r foi0}
foi1 <- FeaturesOfInterest(description = "Feats of interest 1",
               fnames = featureNames(tan2009r1[1:10]))
description(foi1)
foi(foi1)
```

Several such features of interest can be combined into collections:

```{r}
foi2 <- FeaturesOfInterest(description = "Feats of interest 2",
               fnames = featureNames(tan2009r1[880:888]))
foic <- FoICollection(list(foi1, foi2))
foic
```

*FeaturesOfInterest* instances can now be overlaid on the PCA plot
with the `highlightOnPlot` function. The `highlightOnPlot3D` can be
used to overlay data onto a 3 dimensional figure produced by `plot3D`.


```{r plotfoi, fig.asp=1, fig.cap="Adding features of interest on a PCA plot."}
plot2D(tan2009r1, fcol = "PLSDA")
addLegend(tan2009r1, fcol = "PLSDA",
      where = "bottomright",
      cex = .7)
highlightOnPlot(tan2009r1, foi1,
        col = "black", lwd = 2)
highlightOnPlot(tan2009r1, foi2,
        col = "purple", lwd = 2)
legend("topright", c("FoI 1", "FoI 2"),
       bty = "n", col = c("black", "purple"),
       pch = 1)
```


See `?FeaturesOfInterest` and `?highlightOnPlot` for more details.

## Interactive visualisation {#sec:gui}

The `r Biocpkg("pRolocGUI")` application allows one to explore the
spatial proteomics data using an interactive, web-based
`r CRANpkg("shiny")` interface [@shiny]. The package is available from
Bioconductor and can be installed and started as follows:

```{r guiinstall, eval = FALSE}
library("BiocManager")
BiocManager::install("pRolocGUI")
library("pRolocGUI")
data(Barylyuk2020ToxoLopit)
pRolocVis(Barylyuk2020ToxoLopit)
```

![Screenshot of the `r Biocpkg("pRolocGUI")` interface. The GUI is also available as an online app for the hyperLOPIT experiment from [@Barylyuk:2020] at <https://proteome.shinyapps.io/toxolopittzex/>.](./Figures/Screenshot_PCA2.png){#fig:gui}


More details are available in the vignette that can be started from
the application by clicking on any of the question marks, by starting
the vignette from R with `vignette("pRolocGUI")` or can be accessed
online
(<http://bioconductor.org/packages/devel/bioc/vignettes/pRolocGUI/inst/doc/pRolocGUI.html>).


# Assessing sub-cellular resolution {#sec:qsep}

## QSep metrics

The sub-cellular resolution of a spatial proteomics experiment,
i.e. the quantitation of how well the respective sub-cellular niches
are separated, can be computed with the `QSep` function. Briefly, this
function compares, for each pair of sub-cellular niches, the ratio
between the average Euclidean distance between niches *i* and *j* and
the average within-cluster distance of cluster *j*. A large ratio
indicates that *i* and *j* are well separated with respect to the
tightness of cluster *j*. The larger the distances, the better the
spatial proteomics experiment.

Below, we calculate and visualise the *QSep* distances for the
`hyperLOPIT2015` data:

```{r qsep, fig.asp=1}
library("pRolocdata")
data(hyperLOPIT2015)

## Create the object and get a summary
hlq <- QSep(hyperLOPIT2015)
hlq
levelPlot(hlq)
plot(hlq)
```

See [Assessing sub-cellular resolution in spatial proteomics
experiments](https://doi.org/10.1101/377630) [@Gatto:2018] for
details, including a large meta-analysis on 29 different spatial
proteomics experiments.
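
The normalised distance matrix underlying these plots can also be
extracted with the `qsep` accessor for further numerical exploration,
as sketched below.

```{r qsepSummary, eval = FALSE}
## access the normalised distance matrix and summarise it
qmat <- qsep(hlq)
summary(as.numeric(qmat))
```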

## Euclidean distance metrics

In addition to the `QSep` metrics, the `ClustDist` framework provides a
set of classes and methods that, given a set of proteins, compute the
(normalised) distances between protein correlation profiles. Whilst the
motivation behind `QSep` is to assess the resolution of a correlation
profiling experiment by looking at between-cluster (relative to
within-cluster) distances, the `clustDist` method was developed to
assess cluster tightness and, moreover, to correlate annotation
information, e.g. markers or GO terms, with the multivariate data space
to identify densely and sparsely annotated regions that reflect the fit
of the markers. Given a set of proteins that share some property,
e.g. a specified GO term or subcellular class label, the `clustDist`
method performs an initial k-means clustering of all proteins annotated
to the class (testing small values of `k`, the default is `k = 1:5`)
and then, for each number of `k` components tested, all pairwise
Euclidean distances are calculated per component and then
normalised. The minimum mean normalised distance is then extracted and
used as a measure of fit of the annotation or markers to the data. This
is repeated for all terms/annotation sets. These sets can then be
ranked according to minimum mean normalised distance and displayed and
explored using the `r Biocpkg("pRolocGUI")` package.

The motivation behind performing an initial clustering for each
term/marker set is to account for sets which might have several
distributions, for example, components of a protein complex or
sub-compartment resolution. If the annotation is protein markers, the
best `k` per protein set is likely to be 1, as on average we expect
residents of known organelles to co-fractionate in correlation
profiling experiments. We may find 2, 3 or more components as the best
`k` when evaluating annotation sets that have multiple distributions or
if the annotation is more ambiguous.

Before running the `clustDist` algorithm, a matrix of markers must be
created, where the rows of the matrix match the proteins in the
`MSnSet` (as denoted by `featureNames`) and the columns denote whether
the protein is present in that term/organelle. A matrix of markers is
used instead of a standard character vector (e.g. `fcol = "markers"`)
to allow the flexibility of proteins being allocated to multiple
locations. If we wish to test an existing `fcol` in the `MSnSet`, the
convenience function `mrkVecToMat` converts a character vector,
e.g. `"markers"`, to a matrix of markers. In the below code chunk we
use the `mrkVecToMat` function to take the `"markers"` column from the
`hyperLOPIT2015` dataset and convert it to a matrix of markers, which
we call `"MM"`, in the `fData` slot of the `MSnSet`. A new object with
this information is created, called `hld`.

```{r mrkframework}
hld <- mrkVecToMat(hyperLOPIT2015, vfcol = "markers", mfcol = "MM")
head(fData(hld)$MM)
```

We see the function has converted the `"markers"` vector in the `fData`
slot to a matrix of markers called `MM`, where a 1 denotes that the
protein is annotated to that term/class and a 0 means that it is not.

In the below code chunk we run the `clustDist` function, testing values
of k between 1 and 3 per class. We also specify the minimum number of
proteins per class to be 5. If any of the k-means components result in
fewer than `n` proteins per component, an NA value is recorded.

```{r clusterDist, eval = TRUE, verbose = FALSE}
dd <- clustDist(hld, fcol = "MM", k = 1:3, n = 5, verbose = FALSE)
dd
```

The output is a `"ClustDistList"` of length equal to the number of
sets/classes tested. The `"ClustDist"` and `"ClustDistList"` classes
summarise the algorithm information, including the number of k
components tested, the size of each component, and the mean and
normalised pairwise Euclidean distances per number of component
clusters tested.

```{r}
dd[[1]]
dd[[2]]
```

The plotting method for a `ClustDistList` object is a boxplot of the
normalised distances per term.

```{r visualiseRes, fig.asp=1, eval = TRUE}
plot(dd)
```


It is also possible to view the individual k-means component clusterings for
each class by selecting an element in the `"ClustDistList"` by name or index.
For example, in the code chunk below we look at the list of available classes
from the matrix of markers we created `"MM"` and then select interesting terms
to plot.

```{r plotClusterings, fig.width=8, fig.height=4, eval = TRUE}
getMarkerClasses(hld, "MM")
plot(dd[["Endoplasmic reticulum/Golgi apparatus"]], hld)
plot(dd[["Actin cytoskeleton"]], hld)
```

The output is a set of principal component analysis (PCA) plots, one
for each `k` tested, highlighting the component clusters found
according to the algorithm. We see for the first example that all of
the `k = 1:3` clusterings were fit, but for the second example only one
clustering, of `k = 1`, was fit, as the higher values of `k` gave rise
to components with fewer than `n` members per cluster.

It is also possible to plot the raw Euclidean distances by setting the
argument `method = "raw"`.

```{r visualiseRaw, fig.asp=1, eval = TRUE}
plot(dd, method = "raw")
```


# Data analysis {#sec:analysis}

Classification of proteins, i.e. assigning sub-cellular localisation
to proteins, is the main aspect of the present data analysis. The
principle is the following and is, in its basic form, a 2 step
process. First, an algorithm learns from the known markers that are
shown to it and models the data space accordingly. This phase is also
called the training phase. In the second phase, un-labelled proteins,
i.e. those that have not been labelled as resident of any organelle,
are matched to the model and assigned to a group (an organelle). This
2 step process is called machine learning (ML), because the computer
(machine) learns by itself how to recognise instances that possess
certain characteristics and classifies them without human
intervention. That does not, however, mean that results can be trusted
blindly.

In the above paragraph, we have defined what is called supervised ML,
because the algorithm is presented with some known instances from which
it learns (see section \@ref(sec:sml)). Alternatively, un-supervised
ML does not make any assumptions about the group memberships, and uses
the structure of the data itself to define sub-groups (see section
\@ref(sec:usml)). It is of course possible to classify data based on
labelled and unlabelled data. This extension of the supervised
classification problem described above is called semi-supervised
learning. In this case, the training data consists of both labelled
and unlabelled instances, with the obvious goal of generating a better
classifier than would be possible with the labelled data only. The
TAGM methods fall under this class of algorithms and are available in
`pRoloc` and described in the *pRoloc-bayesian* vignette.

## Unsupervised ML {#sec:usml}

The `plot2D` function can also be used to visualise the data on a PCA
plot omitting any marker definitions, as shown on figure
\@ref(fig:plot2dnull). This approach avoids any bias towards marker
definitions and concentrates on the data and its underlying structure
itself.

```{r plot2dnull, echo=TRUE, fig.asp=1, fig.cap="Plain PCA representation of the `tan2009r1` data."}
plot2D(tan2009r1, fcol = NULL)
```

Alternatively, `r Biocpkg("pRoloc")` also gives access to
`r Biocpkg("MLInterfaces")`'s `MLearn` unified interface for, among
others, unsupervised approaches using k-means (figure
\@ref(fig:plotKmeans)), hierarchical (figure \@ref(fig:plotHclust)) or
partitioning around medoids (figure \@ref(fig:plotPam)) clustering.

```{r plotKmeans, echo=TRUE, fig.asp=1, fig.cap="k-means clustering on the `tan2009r1` data."}
kcl <- MLearn( ~ ., tan2009r1,  kmeansI, centers=5)
plot(kcl, exprs(tan2009r1))
```

```{r plotHclust, echo=TRUE, tidy = FALSE, fig.asp=1, fig.cap="Hierarchical clustering on the `tan2009r1` data."}
hcl <- MLearn( ~ ., tan2009r1,
          hclustI(distFun = dist,
              cutParm = list(k = 5)))
plot(hcl, labels = FALSE)
```


```{r plotPam, echo=TRUE, fig.asp=1, fig.cap="Partitioning around medoids on the `tan2009r1` data."}
pcl <- MLearn( ~ ., tan2009r1,  pamI(dist), k = 5)
plot(pcl, data = exprs(tan2009r1))
```

## Supervised ML {#sec:sml}

In this section, we show how to use `r Biocpkg("pRoloc")` to run a typical
supervised ML analysis. Several ML methods are available, including
k-nearest neighbour (knn), partial least square discriminant analysis
(plsda), random forest (rf), support vector machines (svm), etc. The
detailed description of each method is outside of the scope of this
document. We will use support vector machines to illustrate a typical
pipeline and the important points that should be paid attention
to. These points are equally valid and work, from a `r Biocpkg("pRoloc")`
user perspective, exactly the same for the other approaches, as
illustrated below.
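
As an illustration, a k-nearest neighbour analysis would follow the
same optimisation/classification pattern detailed in the next
sections, with the number of neighbours `k` as the hyper-parameter to
tune; the sketch below uses a reduced number of iterations in the
interest of time.

```{r knnSketch, eval = FALSE}
## optimise the number of neighbours k over a grid of values
kparams <- knnOptimisation(tan2009r1, fcol = "markers.orig",
                           k = seq(3, 15, 2), times = 10,
                           verbose = FALSE)
## classify the unlabelled proteins with the best k
knnres <- knnClassification(tan2009r1, kparams,
                            fcol = "markers.orig")
```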

### Classification algorithm parameters optimisation

Before actually generating a model on the new markers and classifying
unknown residents, one has to take care of properly setting the model
parameters. Wrongly set parameters can have a very negative impact on
classification performance. To do so, we create training (to model) and
testing (to predict) subsets using known residents, i.e. marker
proteins. By comparing observed and expected classification
prediction, we can assess how well a given model works using the macro
F1 score (see below). This procedure is repeated for a range of
possible model parameter values (this is called a grid search), and
the best performing set of parameters is then used to construct a
model on all markers and predict un-labelled proteins. For this
parameter optimisation procedure to perform well and produce useful
results, it is essential to run it with a reasonable amount of
markers. In our experience, 15 such marker proteins are necessary.


Model accuracy is evaluated using the F1 score, $F1 = 2 ~
\frac{precision \times recall}{precision + recall}$, calculated as the
harmonic mean of the precision ($precision = \frac{tp}{tp+fp}$, a
measure of *exactness* -- returned output is a relevant result) and
recall ($recall=\frac{tp}{tp+fn}$, a measure of *completeness* --
indicating how much was missed from the output). What we are aiming
for is high generalisation accuracy, i.e. high $F1$, indicating that
the marker proteins in the test data set are consistently correctly
assigned by the algorithms.
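
As a toy numerical illustration of these formulas:

```{r f1Toy, eval = FALSE}
## toy counts of true positives, false positives and false negatives
tp <- 8; fp <- 2; fn <- 4
precision <- tp / (tp + fp)                         ## 0.8
recall <- tp / (tp + fn)                            ## 0.67
f1 <- 2 * precision * recall / (precision + recall) ## 0.73
f1
```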


In order to evaluate how well a classifier performs on profiles it was
not exposed to during its creation, we implemented the following
scheme. Each set of marker protein profiles, i.e. labelled with known
organelle association, is separated into *training* and
*test/validation* partitions by sampling 80\% of the profiles
corresponding to each organelle (i.e. stratified) without replacement
to form the training partition $S_{tr}$, with the remainder becoming
the test/validation partition $S_{ts}$. The SVM regularisation
parameter $C$ and Gaussian kernel width $\sigma$ are selected using a
further round of stratified five-fold cross-validation on each
training partition. All pairs of parameters $(C_i, \sigma_j)$ under
consideration are assessed using the macro F1 score and the pair that
produces the best performance is subsequently employed in training a
classifier on all training profiles $S_{tr}$ prior to assessment on
all test/validation profiles $S_{ts}$. This procedure is repeated $N$
times (in the example below 10) in order to produce $N$ macro F1
estimated generalisation performance values (figures
\@ref(fig:params1) and \@ref(fig:params2)). This procedure is
implemented in the `svmOptimisation` function. See `?svmOptimisation`
for details, in particular the range of $C$ and $\sigma$ parameters
and how the relevant feature variable is defined by the `fcol`
parameter, which defaults to `"markers"`.

```{r svmParamOptim, eval = FALSE, message = FALSE, tidy=FALSE}
params <- svmOptimisation(tan2009r1, 
                          fcol = "markers.orig",
                          times = 100, xval = 5,
                          verbose = FALSE)
```

In the interest of time, the optimisation is not executed
but loaded with

```{r loadParams, echo = TRUE}
fn <- dir(system.file("extdata", package = "pRoloc"),
      full.names = TRUE, pattern = "params.rda")
load(fn)
```

```{r showParams}
params
```


```{r params1, echo=TRUE, fig.cap="Assessing parameter optimisation. Here, we see the respective distributions of the 10 macro F1 scores for the best cost/sigma parameter pairs. See also the output of *f1Count* in relation to this plot."}
plot(params)
```

```{r params2, echo=TRUE, fig.cap="Assessing parameter optimisation. Visualisation of the averaged macro F1 scores, for the full range of parameter values."}
levelPlot(params)
```



In addition to the plots on figures \@ref(fig:params1) and
\@ref(fig:params2), `f1Count(params)` returns, for each combination of
parameters, the number of best (highest) F1 observations. One can use
`getParams` to see the default set of parameters that are chosen based
on the executed optimisation. Currently, the first best set is
automatically extracted, and users are advised to critically assess
whether this is the wisest choice.

```{r f1count}
f1Count(params)
getParams(params)
```

**Important** It is essential to emphasise that the accuracy scores
obtained during parameter optimisation are only a reflection of the
classification performance on a set of distinct and ideally separated
spatial clusters. Here, we assume that the data is characterised by
good separation of the various spatial niches which are reflected by
sub-cellular markers. Quality control of the data and the markers
using the visualisation described in section \@ref(sec:viz) is
essential and subsequent analyses are doomed to fail in the absence of
such separation.

These classification scores are not representative of the reliability
of the final classification (described in section \@ref(sec:sml)), in
particular along the boundaries separating the different sub-cellular
niches. High scores at the optimisation stage are a requirement to
proceed with the analysis, but are by no means indicative of the false
positive rate in the final sub-cellular assignment of non-marker
proteins.

### Classification

We can now re-use the result from our parameter optimisation (a best
cost/sigma pair is going to be automatically extracted, using the
`getParams` method, although it is possible to set them
manually), and use it to build a model with all the marker proteins
and predict unknown residents using the *svmClassification*
function (see the manual page for more details). By default, the
organelle markers will be defined by the `"markers"` feature
variable (this can be changed with the `fcol` parameter);
here we use the original markers in `"markers.orig"` as a
use case. New feature variables containing the organelle assignments
and assignment probabilities, called scores hereafter, are
automatically added to the `featureData` slot; in this case,
using the `svm` and `svm.scores` labels.

**Important** The calculation of the classification probabilities is
dependent on the classification algorithm.  These probabilities are
not to be compared across algorithms; they do **not** reflect any
**biologically relevant** sub-cellular localisation probability but
rather an algorithm-specific classification confidence score.

```{r svmRes0, warning=FALSE, eval = FALSE, tidy = FALSE}
## manual setting of parameters
svmres <- svmClassification(tan2009r1, 
                            fcol = "markers.orig",
                            sigma = 1, cost = 1)
```

```{r svmRes, warning=FALSE, tidy = FALSE}
## using default best parameters
svmres <- svmClassification(tan2009r1, 
                            fcol = "markers.orig",
                            assessRes = params)

processingData(svmres)
tail(fvarLabels(svmres), 4)
```

The original markers, classification results and scores can be
accessed with the `fData` accessor method, e.g.  `fData(svmres)$svm`
or `fData(svmres)$svm.scores`.  Two helper functions, `getMarkers` and
`getPredictions`, are available and add some level of automation and
functionality, assuming that the default feature labels are used. Both
(invisibly) return the corresponding feature variable (the markers or
assigned classification) and print a summary table. The `fcol`
parameter must be specified for `getPredictions`. It is also possible
to define a classification probability below which classifications
are set to `"unknown"`.

```{r getPredictions}
p1 <- getPredictions(svmres, fcol = "svm")
p1 <- fData(p1)$svm.pred
minprob <- median(fData(svmres)$svm.scores)
p2 <- getPredictions(svmres, fcol = "svm", t = minprob)
p2 <- fData(p2)$svm.pred
table(p1, p2)
```

To graphically illustrate the organelle-specific score distributions,
we can use a boxplot and plot the scores for the respective predicted
SVM classes, as shown on figure \@ref(fig:predscores). As can be seen,
different organelles are characterised by different score
distributions. Using a single threshold (`minprob` with value
`r round(minprob, 2)` above) results in accepting
`r round(sum(p2 == "ER")/sum(p1 == "ER"), 2) * 100`% of the
initial ER predictions and only
`r round(sum(p2 == "mitochondrion")/sum(p1 == "mitochondrion"), 2) * 100`%
of the mitochondrion predictions. The *getPredictions* function also
accepts organelle-specific score thresholds. Below, we calculate
organelle-specific median scores.


```{r predscores, fig.cap="Organelle-specific SVM score distributions."}
boxplot(svm.scores ~ svm, data = fData(svmres),
    ylab = "SVM scores")
abline(h = minprob, lwd = 2, lty = 2)
```

```{r medscores2, tidy = FALSE}
ts <- orgQuants(svmres, fcol = "svm", t = .5)
```

Using these scores equates to choosing the 50\% of predictions with
the highest scores per organelle.

```{r medscorepreds, tidy = FALSE}
getPredictions(svmres, fcol = "svm", t = ts)
```


We can now visualise these results using the plotting functions
presented in section \@ref(sec:usml), as shown on figure
\@ref(fig:svmres). We clearly see that besides the organelle marker
clusters that have been assigned high confidence members, many other
proteins have substantially lower prediction scores.


```{r svmres, fig.asp=1, fig.cap="Representation of the svm prediction on the `tan2009r1` data set. The svm scores have been used to set the point size (`cex` argument; the scores have been transformed to emphasise the extremes). Different symbols (`fpch`) are used to differentiate markers and new assignments."}
ptsze <- exp(fData(svmres)$svm.scores) - 1
plot2D(svmres, fcol = "svm", fpch = "markers.orig",
       cex = ptsze)
addLegend(svmres, fcol = "svm",
      where = "bottomright",
      cex = .5)
```

### Class imbalance

Traditional ML methods assume datasets with an equal number of labelled
instances per class; for example, here we would assume the same number of
marker proteins per organelle class. In subcellular spatial proteomics
data, the number of marker proteins varies greatly from organelle to
organelle; this is known as class imbalance in the field of ML.

In the below code chunk we again load a dataset from a LOPIT experiment on
pluripotent mouse embryonic stem cells ([Christoforou et al 2016](http://www.nature.com/ncomms/2016/160112/ncomms9992/full/ncomms9992.html)),
available and documented in the `r Biocexptpkg("pRolocdata")` data package as
`hyperLOPIT2015`. Looking at the markers in the data, we have over
300 markers for some organelles, e.g. the mitochondria, but only 13 for
other compartments such as the endosome.

```{r}
data("hyperLOPIT2015")
getMarkers(hyperLOPIT2015, fcol = "markers")
```

We can account for this class imbalance by setting specific class weights
when generating the SVM model. In the code chunks below, the weights are
set to be inversely proportional to the class frequencies.

```{r weights, eval = TRUE, echo = TRUE}
w <- classWeights(hyperLOPIT2015, fcol = "markers")
w
```

```{r svmParamOptim2, eval = FALSE, message = FALSE, tidy=FALSE}
params2 <- svmOptimisation(hyperLOPIT2015, 
                           fcol = "markers",
                           times = 100, xval = 5,
                           class.weights = w)
```

As above, in the interest of time the results are pre-computed and available in
the extdata package directory.

```{r}
fn <- dir(system.file("extdata", package = "pRoloc"),
full.names = TRUE, pattern = "params2.rda")
load(fn)
```

```{r}
params2
```

```{r}
hl <- svmClassification(hyperLOPIT2015, 
                        params2, 
                        fcol = "markers",
                        class.weights = w)
```

We can set a classification threshold below which we choose not to assign
a localisation and leave the protein unlabelled, i.e. in `pRoloc` terminology
these are denoted by the term `"unknown"`. In the below code chunk we use the
`orgQuants` function to calculate an organelle-specific threshold according
to the third quartile of each organelle's SVM scores.

```{r medscores3, tidy = FALSE}
ts <- orgQuants(hl, fcol = "svm", t = .75)
hl <- getPredictions(hl, fcol = "svm", scol = "svm.scores", t = ts)
```

The data is visualised as described previously; we plot the filtered
SVM results using the `plot2D` function (figure \@ref(fig:plothl)).

```{r plothl, fig.asp=1, echo=FALSE, fig.cap="Classification results following prediction from a class-weight SVM classifier on LOPIT data produced from mouse embryonic stem cells."}
## plot new predictions
plot2D(hl, fcol = "svm.pred")
addLegend(hl, fcol = "svm.pred", where = "bottomleft", bty = "n", cex = .5)
```

## Bayesian generative models {#sec:bayes}

We also offer generative models that, as opposed to the discriminative
classifiers presented above, explicitly model the spatial proteomics
data. In `pRoloc`, we propose two models using T-augmented Gaussian
mixtures, using respectively an Expectation-Maximisation approach to
*maximum a posteriori* estimation of the model parameters (TAGM-MAP),
and an MCMC approach (TAGM-MCMC) that enables proteome-wide
uncertainty quantitation. These methods are described in the
*pRoloc-bayesian* vignette. For a detailed description of the methods
and their validation, please refer to [@Crook:2018].
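
As a brief preview (a sketch only; the *pRoloc-bayesian* vignette
documents the recommended workflow), the TAGM-MAP model can be trained
on the markers and used to classify the unlabelled proteins as
follows:

```{r tagmSketch, eval = FALSE}
## fit the TAGM-MAP model and predict sub-cellular allocations
tagmparams <- tagmMapTrain(tan2009r1)
tan2009r1 <- tagmPredict(tan2009r1, params = tagmparams)
table(fData(tan2009r1)$tagm.map.allocation)
```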


# Conclusions {#sec:ccl}

This tutorial focuses on practical aspects of organelle proteomics
data analysis using `r Biocpkg("pRoloc")`. Two important aspects have
been illustrated: (1) data generation, manipulation and visualisation
and (2) application of contemporary and novel machine learning
techniques.  Other crucial parts of a full analysis pipeline that were
not covered here are raw mass-spectrometry quality control,
quantitation, post-analysis and data validation.


Data analysis is not a trivial task, and in general, one cannot
assume that any off-the-shelf algorithm will perform well. As such,
one of the emphases of the software presented in this document is
to allow users to track data processing and critically evaluate the
results.


# Acknowledgement {-}

We would like to thank Dr Daniel J.H. Nightingale, Dr Arnoud J. Groen,
Dr Claire M. Mulvey and Dr Andy Christoforou for their organelle
marker contributions.

# Session information {-}

All software and respective versions used to produce this document are
listed below.

```{r sessioninfo, echo=FALSE}
sessionInfo()
```

# References {-}