%\VignetteEngine{knitr::knitr}
%\VignetteIndexEntry{pRoloc tutorial}
%\VignetteKeywords{Bioinformatics, Machine learning, Organelle, Proteomics}
%\VignettePackage{pRoloc}

\documentclass[12pt, oneside]{article}

<<style, eval=TRUE, echo=FALSE, results="asis">>=
BiocStyle::latex()
@


\author{
  Laurent Gatto\footnote{\email{lg390@cam.ac.uk}}~and Lisa M. Breckels
}


\bioctitle[\Biocpkg{pRoloc} tutorial]{A short tutorial on using
  \Biocpkg{pRoloc} for spatial proteomics data analysis}

\begin{document}

\maketitle

%% Abstract and keywords %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\vskip 0.3in minus 0.1in
\hrule
\begin{abstract}
  This tutorial illustrates the usage of the \Biocpkg{pRoloc} \R{}
  package for the analysis and interpretation of spatial proteomics
  data. It walks the reader through the creation of \Rclass{MSnSet}
  instances, that hold the quantitative proteomics data and meta-data
  and introduces several aspects of data analysis, including data
  visualisation and application of machine learning to predict protein
  localisation.
\end{abstract}
\textit{Keywords}: Bioinformatics, organelle proteomics, machine learning, visualisation 
\vskip 0.1in minus 0.05in
\hrule
\vskip 0.2in minus 0.1in
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\newpage

\tableofcontents

<<env, include=FALSE, echo=FALSE, cache=FALSE>>=
library("knitr")
opts_chunk$set(fig.align = 'center', 
               fig.show = 'hold', 
               par = TRUE,
               prompt = TRUE,
               eval = TRUE,
               stop_on_error = 1L,
               comment = NA)
options(replace.assign = TRUE, 
        width = 55)

suppressPackageStartupMessages(library("MSnbase"))
suppressWarnings(suppressPackageStartupMessages(library("pRoloc")))
suppressPackageStartupMessages(library("pRolocdata"))
suppressPackageStartupMessages(library("class"))
set.seed(1)
@ 
%%$

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% Section
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\newpage

\input{Foreword.tex}

\input{Bugs.tex}

\newpage

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% Section
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%


\section{Introduction}\label{sec:intro} 

\subsection{Spatial proteomics}

Spatial (or organelle) proteomics is the study of the localisation of
proteins inside cells. The sub-cellular compartments can be organelles,
i.e. structures defined by lipid bi-layers, macro-molecular assemblies
of proteins and nucleic acids, or large protein complexes. In this
document, we will focus on mass-spectrometry based approaches that
assay a population of cells, as opposed to microscopy based techniques
that monitor single cells, as the former is the primary concern of
\Biocpkg{pRoloc}, although the techniques described below and the
infrastructure in place could also be applied to processed image
data. The typical experimental use-case for \Biocpkg{pRoloc} is
a set of fractions originating from a total cell lysate. These
fractions can originate from a continuous gradient, as in the LOPIT
\cite{Dunkley2006} or PCP \cite{Foster2006} approaches, or can be
discrete fractions. The content of the fractions is then identified
and quantified (using labelled or label-free quantitation
techniques). Using relative quantitation of known organelle residents,
termed organelle markers, organelle-specific profiles along the
gradient are determined and new residents are identified based on the
matching of these distribution profiles. See for example
\cite{Gatto2010} and references therein for a detailed review on
organelle proteomics.

It should be noted that large protein complexes, which are not
necessarily separately enclosed within their own lipid bi-layer, can
be detected by such techniques, as long as a distinct profile can be
defined across the fractions.

\subsection{About \R{} and \Biocpkg{pRoloc}}

\R{} \cite{Rstat} is a statistical programming language and interactive
working environment. It can be expanded by so-called packages to
confer new functionality to users. Many such packages have been
developed for the analysis of high-throughput biology, notably through
the Bioconductor project \cite{Gentleman2004}. Two packages are of
particular interest here, namely \Biocpkg{MSnbase} \cite{Gatto2012}
and \Biocpkg{pRoloc}. The former provides flexible infrastructure to
store and manipulate quantitative proteomics data and the associated
meta-data and the latter implements specific algorithmic technologies
to analyse organelle proteomics data.

Among the advantages of \R{} are robust statistical procedures, good
visualisation capabilities, excellent documentation, reproducible
research\footnote{The content of this document is compiled (the code
  is executed and its output, text and figures, is displayed
  dynamically) to generate the pdf file.}, the power and flexibility of
the \R{} language and environment itself, and a rich ecosystem of
specialised functionality in many domains of bioinformatics: tools for
many omics technologies, including proteomics, bio-statistics, gene
ontology and biological pathway analysis,~\ldots Although there exist
some specific graphical user interfaces (GUIs), interaction with \R{}
is typically executed through a command line interface. While this
mode of interaction might look alien to new users, experience has
proven that after an initial steep learning curve, great results can
be achieved by non-programmers. Furthermore, specific and general
documentation is plentiful, and beginner and advanced course materials
are also widely available.

Once \R{} is started, the first step to enable the functionality of
specific packages is to load them using the \Rfunction{library}
function, as shown in the code chunk below:

<<libraries>>=
library("MSnbase")
library("pRoloc")
library("pRolocdata")
@ 

\Biocpkg{MSnbase} implements the data containers that are used by
\Biocpkg{pRoloc}. \Biocexptpkg{pRolocdata} is a data package that
supplies several published organelle proteomics data sets.

As a final setup step, we set the default colour palette for some of
our custom plotting functionality to use semi-transparent colours in
the code chunk below (see \Rfunction{?setStockcol} for details).  This
facilitates visualisation of overlapping points.

<<setcols>>=
setStockcol(paste0(getStockcol(), 70))
@ 

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% Section
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\newpage

\section{Data structures}\label{sec:data}

\subsection{Example data}

The data used in this tutorial has been published in
\cite{Tan2009}. The LOPIT technique \cite{Dunkley2006} was used to
localise integral and associated membrane proteins in
\textit{Drosophila melanogaster} embryos. Briefly, embryos were
collected at 0 -- 16 hours, homogenised and centrifuged to collect the
supernatant, removing cell debris and nuclei. Membrane fractionation
was performed on an iodixanol gradient and fractions were quantified
using iTRAQ isobaric tags \cite{Ross2004} as follows: fractions 4/5,
114; fractions 12/13, 115; fraction 19, 116 and fraction 21,
117. Labelled peptides were then separated using cation exchange
chromatography and analysed by LC-MS/MS on a QSTAR XL
quadrupole time-of-flight mass spectrometer (Applied Biosystems). The
original localisation analysis was performed using partial least
squares discriminant analysis (PLS-DA). Relative quantitation data was
retrieved from the supplementary file
\texttt{pr800866n\_si\_004.xls}\footnote{\url{http://pubs.acs.org/doi/suppl/10.1021/pr800866n/suppl\_file/pr800866n\_si\_004.xls}}
and imported into \R{} as described below. We will concentrate on the
first replicate.

\subsection{Importing and loading data}

This section illustrates how to import data in comma-separated values
(\texttt{csv}) format into an appropriate \R{} data structure. The
first section shows the original \texttt{csv} spreadsheet, as
published by the authors, and how one can read such a file into \R{}
using the \Rfunction{read.csv} function. This spreadsheet file is
similar to the output of many quantitation software packages.

In the next section, we show 2 \texttt{csv} files containing a subset
of the columns of the original \texttt{pr800866n\_si\_004-rep1.csv} file
and another short file, created manually, that will be used to create
the appropriate \R{} data.

\subsubsection{The original data file}\label{sec:orgcsv}

<<readCsvData0>>=
## The original data for replicate 1, available
## from the pRolocdata package
f0 <- dir(system.file("extdata", package = "pRolocdata"), 
          full.names = TRUE, pattern = "pr800866n_si_004-rep1.csv")
csv <- read.csv(f0)
@ 

The first three lines of the original spreadsheet, containing the data
for replicate one, are illustrated below (using the function
\Rfunction{head}). It contains \Sexpr{nrow(csv)} rows (proteins) and
\Sexpr{ncol(csv)} columns, including protein identifiers, database
accession numbers, gene symbols, reporter ion quantitation values,
information related to protein identification, \ldots

<<showOrgCsv>>=
head(csv, n=3)
@ 

\subsubsection{From \texttt{csv} files to \R{} data}\label{sec:csv}

There are several ways to create the desired \R{} data object, termed
\Rclass{MSnSet}, that will be used to perform the actual sub-cellular
localisation prediction. Here, we illustrate a method that uses
separate spreadsheet files for quantitation data, feature meta-data
and sample (fraction) meta-data and the \Rfunction{readMSnSet}
constructor function, that will hopefully be the most straightforward
for new users.

<<readCsvData1, tidy = FALSE>>=
## The quantitation data, from the original data
f1 <- dir(system.file("extdata", package = "pRolocdata"), 
          full.names = TRUE, pattern = "exprsFile.csv")
exprsCsv <- read.csv(f1)
## Feature meta-data, from the original data
f2 <- dir(system.file("extdata", package = "pRolocdata"), 
          full.names = TRUE, pattern = "fdataFile.csv")
fdataCsv <- read.csv(f2)
## Sample meta-data, a new file
f3 <- dir(system.file("extdata", package = "pRolocdata"), 
          full.names = TRUE, pattern = "pdataFile.csv")
pdataCsv <- read.csv(f3)
@ 

\begin{description}

\item[\texttt{exprsFile.csv}] containing the quantitation (expression) data for the \Sexpr{nrow(exprsCsv)} proteins and 4 reporter tags.
<<showExprsFile>>=
head(exprsCsv, n=3)
@
\item[\texttt{fdataFile.csv}] containing meta-data for the \Sexpr{nrow(fdataCsv)} features (here proteins).
<<showFdFile>>=
head(fdataCsv, n=3)
@
\item[\texttt{pdataFile.csv}] containing samples (here fractions) meta-data. This simple file has been created manually.
<<showPdFile>>=
pdataCsv
@
\end{description}

A self-contained data structure, called \Rclass{MSnSet} (defined in
the \Biocpkg{MSnbase} package) can now easily be generated using the
\Rfunction{readMSnSet} constructor, providing the respective
\texttt{csv} file names shown above and specifying that the data is
comma-separated (with \Robject{sep = ","}). Below, we call that object
\Robject{tan2009r1} and display its content.

<<makeMSnSet>>=
tan2009r1 <- readMSnSet(exprsFile = f1,
                        featureDataFile = f2,
                        phenoDataFile = f3,
                        sep = ",")
tan2009r1
@ 

\subsubsection{The \Rclass{MSnSet} class}

Although there are specific sub-containers for additional meta-data
(for instance to make the object MIAPE compliant), the feature (the
sub-container, or slot, \Robject{featureData}) and sample (the
\Robject{phenoData} slot) meta-data are the most important ones. They
need to meet the following validity requirements (see figure
\ref{fig:msnset}):

\begin{itemize}
\item the number of rows in the expression/quantitation data and
  feature data must be equal and the row names must match exactly, and
\item the number of columns in the expression/quantitation data and
  the number of rows in the sample meta-data must be equal and the
  column/row names must match exactly.
\end{itemize}

It is common, in the context of \Biocpkg{pRoloc}, to update the
feature meta-data (described in section \ref{sec:analysis}) by adding
new columns, without breaking the object's validity. Similarly, the
sample meta-data can also be updated by adding new sample variables. A
detailed description of the \Rclass{MSnSet} class is available by
typing \Rfunction{?MSnSet} in the \R{} console.
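
As a brief sketch of such an update (not executed here;
\Robject{myVar} is a hypothetical feature variable name), a new column
can be added and the object's validity verified as follows:

<<addFvarSketch, eval = FALSE>>=
## add a hypothetical feature variable 'myVar' to the feature
## meta-data, tagging all features as "unknown"
fData(tan2009r1)$myVar <- rep("unknown", nrow(tan2009r1))
## the dimension requirements still hold, so the object stays valid
validObject(tan2009r1)
@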

\begin{figure}[!hbt]
\centering
    \includegraphics[width=0.5\textwidth]{./Figures/msnset.png}
\caption{Dimension requirements for the respective expression, feature and sample meta-data slots. }
\label{fig:msnset}
\end{figure}


The individual parts of this data object can be accessed with their
respective accessor methods, as sketched in the code chunk after this
list:

\begin{itemize}
\item the quantitation data can be retrieved with \Rfunction{exprs(tan2009r1)}, 
\item the feature meta-data with \Rfunction{fData(tan2009r1)} and 
\item the sample meta-data with \Rfunction{pData(tan2009r1)}. 
\end{itemize}
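
For illustration, the chunk below applies each accessor to the
\Robject{tan2009r1} object (not executed here; the output is easy to
reproduce interactively):

<<accessorSketch, eval = FALSE>>=
exprs(tan2009r1)[1:3, ] ## quantitation matrix
head(fData(tan2009r1))  ## feature meta-data
pData(tan2009r1)        ## sample meta-data
@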
  
The advantage of this structure is that it can be manipulated as a
whole and the respective parts of the data object will remain
compatible. The code chunk below, for example, shows how to extract
the first 5 proteins and first 2 samples:

<<showSubset>>= 
smallTan <- tan2009r1[1:5, 1:2]
dim(smallTan)
exprs(smallTan)
@ 

Several data sets, including the 3 replicates from \cite{Tan2009}, are
distributed as \Rclass{MSnSet} instances in the \Biocexptpkg{pRolocdata}
package. Others include the \textit{Arabidopsis
  thaliana} LOPIT data from \cite{Dunkley2006} (\Robject{dunkley2006})
and the mouse PCP data from \cite{Foster2006}
(\Robject{foster2006}). Each data set can be loaded with the
\Rfunction{data} function, as shown below for the first replicate from
\cite{Tan2009}.

<<rmtan, echo=FALSE>>=
## remove manually constructed data
rm(tan2009r1)
data(tan2009r1)
@ 

<<loadTan1>>= 
data(tan2009r1)
@ 

The original marker proteins are available as a feature meta-data
variable called \Robject{markers.orig} and the output of the partial
least squares discriminant analysis, applied in the original
publication, in the \Robject{PLSDA} feature variable. The most
up-to-date marker list for the experiment can be found in
\Robject{markers}. This feature meta-data column can be added from a
simple \texttt{csv} marker file using the \Rfunction{addMarkers}
function -- see \Rfunction{?addMarkers} for details.
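
For example, assuming a hypothetical spreadsheet
\texttt{mymarkers.csv} mapping feature names to marker classes, the
markers could be added as sketched below (not executed here):

<<addMarkersSketch, eval = FALSE>>=
## 'mymarkers.csv' is a hypothetical marker spreadsheet mapping
## feature names to sub-cellular marker classes
tan2009r1 <- addMarkers(tan2009r1, "mymarkers.csv")
getMarkers(tan2009r1, fcol = "markers")
@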

The organelle markers are illustrated below using the convenience
function \Rfunction{getMarkers}, but could also be done manually with
\Rfunction{table(fData(tan2009r1)\$markers.orig)} and
\Rfunction{table(fData(tan2009r1)\$PLSDA)} respectively.

<<lookAtTan>>=
getMarkers(tan2009r1, fcol = "markers.orig")
getMarkers(tan2009r1, fcol = "PLSDA")
@ 


\paragraph*{Important} As can be seen above, some proteins are
labelled \texttt{"unknown"}, defining non-marker proteins. This is a
convention in many \Biocpkg{pRoloc} functions. Missing annotations
(empty strings) will not be considered as of unknown localisation; we
prefer to avoid empty strings and make the absence of known
localisation explicit by using the \texttt{"unknown"} tag. This
information will be used to separate marker and non-marker
(unlabelled) proteins when proceeding with data visualisation and
clustering (sections \ref{sec:viz} and \ref{sec:usml}) and
classification analysis (section \ref{sec:sml}).

\subsection{\Biocpkg{pRoloc}'s organelle markers}

The \Biocpkg{pRoloc} package distributes a set of markers that has
been obtained by mining the \Biocexptpkg{pRolocdata} datasets and
curated by various members of the Cambridge Centre for
Proteomics. The available marker sets can be listed and loaded using
the \Rfunction{pRolocmarkers} function:

<<markers>>=
pRolocmarkers()
head(pRolocmarkers("dmel"))
table(pRolocmarkers("dmel"))
@

These markers can then be added to a new \Rclass{MSnSet} using the
\Rfunction{addMarkers} function by matching the marker names (protein
identifiers) and the feature names of the \Rclass{MSnSet}. See
\Rfunction{?addMarkers} for examples.
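
As a sketch (not executed here, and assuming the \Robject{mcol}
argument documented in \Rfunction{?addMarkers}), the
\textit{Drosophila} markers shown above could be matched against the
\Robject{tan2009r1} feature names and stored in a new feature
variable:

<<addDmelMarkers, eval = FALSE>>=
## add the Drosophila markers, storing them in a new feature
## variable to avoid overwriting existing marker columns
dmelm <- pRolocmarkers("dmel")
tan2009r1 <- addMarkers(tan2009r1, dmelm, mcol = "markers.dmel")
getMarkers(tan2009r1, fcol = "markers.dmel")
@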

\subsection{Data processing}

The quantitation data obtained in the supplementary file is normalised
to the sum of intensities of each protein, i.e. the sum of the
fraction quantitations for each protein equals 1 (up to rounding
errors). This can quickly be verified by computing the row sums of the
expression data.

<<relativeQuants>>=
summary(rowSums(exprs(tan2009r1)))
@ 

The \Rfunction{normalise} method (also available as
\Rfunction{normalize}) from the \Biocpkg{MSnbase} package can be used
to obtain relative quantitation data, as illustrated below on another
iTRAQ test data set available from \Biocpkg{MSnbase}. Several
normalisation \Robject{methods} are available and described in
\Rfunction{?normalise}.  For many algorithms, including classifiers in
general and support vector machines in particular, it is important to
properly pre-process the data.  Centering and scaling of the data are
also available with the \Rfunction{scale} method, described in the
\Rfunction{scale} manual.
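
For example, centering and scaling the \Robject{tan2009r1} data could
be sketched as follows (not executed here):

<<scaleSketch, eval = FALSE>>=
## centre and scale the quantitation data;
## see ?scale for argument details
tan2009scaled <- scale(tan2009r1, center = TRUE, scale = TRUE)
@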


In the code chunk below, we first create a test \Rclass{MSnSet}
instance\footnote{%%
  Briefly, the \Robject{itraqdata} raw iTRAQ4-plex data is quantified
  by trapezoidation using \Biocpkg{MSnbase} functionality. See the
  \Robject{MSnbase-demo} vignette for details.} and illustrate the
effect of \Rfunction{normalise(..., method = "sum")}.

<<norm, echo=TRUE, message=FALSE, cache=TRUE>>=
## create a small illustrative test data
data(itraqdata)
itraqdata <- quantify(itraqdata, method = "trap", 
                      reporters = iTRAQ4, 
                      verbose = FALSE, parallel = FALSE)
## the quantification data
head(exprs(itraqdata), n = 3)
summary(rowSums(exprs(itraqdata)))
## normalising to the sum of feature intensities
itraqnorm <- normalise(itraqdata, method = "sum")
processingData(itraqnorm)
head(exprs(itraqnorm), n = 3)
summary(rowSums(exprs(itraqnorm)))
@ 

Note above how the processing undergone by the \Rclass{MSnSet}
instances \Robject{itraqdata} and \Robject{itraqnorm} is stored in
another such specific sub-container, the \Robject{processingData}
slot.

\bigskip

The different features (proteins in the \Robject{tan2009r1} data
above, but these could also represent peptides or MS$^2$ spectra) are
characterised by unique names. These can be retrieved with the
\Rfunction{featureNames} function.

<<featurenames>>=
head(featureNames(tan2009r1))
@ 

If we look back at section \ref{sec:csv}, we see that these have been
automatically assigned using the first columns of the
\texttt{exprsFile.csv} and \texttt{fdataFile.csv} files. It is thus
crucial for these respective first columns to be identical. Similarly,
the sample names can be retrieved with
\Rfunction{sampleNames(tan2009r1)}.
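
This consistency could be verified directly on the imported
spreadsheets, as sketched below (not executed; \Robject{exprsCsv} and
\Robject{fdataCsv} are the data frames read in section
\ref{sec:csv}):

<<checkNamesSketch, eval = FALSE>>=
## sanity check: the first columns of the expression and
## feature meta-data spreadsheets must match exactly
stopifnot(identical(as.character(exprsCsv[, 1]),
                    as.character(fdataCsv[, 1])))
@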

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% Section
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\section{Data visualisation}\label{sec:viz}

The following sections will focus on two closely related aspects, data
visualisation and data analysis (i.e. organelle assignments). Data
visualisation is used in the context of quality control, to convince
ourselves that the data displays the expected properties, so that the
output of further processing can be trusted. Visualising the results
of the localisation prediction is also essential to assess their
validity before proceeding with orthogonal (and often expensive)
\textit{dry} or \textit{wet} validation.

\bigskip

The underlying principle of gradient approaches is that we have
separated organelles along the gradient and by doing so, generated
organelle-specific protein distributions along the gradient
fractions. The most natural visualisation is shown on figure
\ref{fig:plotdist1}, obtained using the sub-setting functionality of
\Rclass{MSnSet} instances and the \Rfunction{plotDist} function, as
illustrated below.

<<showplotdist, echo=TRUE, eval=FALSE, prompt=FALSE>>=
## indices of the mito markers
j <- which(fData(tan2009r1)$markers.orig == "mitochondrion")
## indices of all proteins assigned to the mito
i <- which(fData(tan2009r1)$PLSDA == "mitochondrion")
plotDist(tan2009r1[i, ],
         markers = featureNames(tan2009r1)[j])
@ 

\begin{figure}[!htb]
<<plotdist1, dev='pdf', fig.width=9, fig.height=7, echo=FALSE>>=
par(mfrow = c(2,2), mar = c(4, 4, 2, 0.1))
cls <- getStockcol()[1:4]
plotDist(tan2009r1[which(fData(tan2009r1)$PLSDA == "mitochondrion"), ],
         markers = featureNames(tan2009r1)
         [which(fData(tan2009r1)$markers.orig == "mitochondrion")],
         mcol = cls[1])
title(main = "mitochondrion")
plotDist(tan2009r1[which(fData(tan2009r1)$PLSDA == "ER/Golgi"), ],
         markers = featureNames(tan2009r1)
         [which(fData(tan2009r1)$markers.orig == "ER")],
         mcol = cls[2])
title(main = "ER")
plotDist(tan2009r1[which(fData(tan2009r1)$PLSDA == "ER/Golgi"), ],
         markers = featureNames(tan2009r1)
         [which(fData(tan2009r1)$markers.orig == "Golgi")],
         mcol = cls[3])
title(main = "Golgi")
plotDist(tan2009r1[which(fData(tan2009r1)$PLSDA == "PM"), ],
         markers = featureNames(tan2009r1)
         [which(fData(tan2009r1)$markers.orig == "PM")],
         mcol = cls[4])
title(main = "PM")
@ 
%% $
\caption{Distribution of protein intensities along the fractions of
  the separation gradient for 4 organelles: mitochondrion (red),
  ER/Golgi (blue, ER markers and green, Golgi markers) and plasma
  membrane (purple). }
\label{fig:plotdist1}
\end{figure}


Alternatively, we can combine all organelle groups in a single
two-dimensional figure by applying a dimensionality reduction
technique such as Principal Component Analysis (PCA) using the
\Rfunction{plot2D} function (see figure \ref{fig:plot2d}). The protein
profile vectors are summarised into 2 values that can be visualised in
two dimensions, such that the variability in the data is maximised
along the first principal component (PC1). The second principal
component (PC2) is then chosen so as to be orthogonal to PC1 while
explaining as much of the remaining variance in the data as possible,
and so on for PC3, PC4, etc.

Using a PCA representation to visualise a spatial proteomics
experiment, we can easily plot all the proteins on the same figure, as
well as many sub-cellular clusters (see figure \ref{fig:pdres2} for a
case with 11 clusters). These clusters are defined in a feature
meta-data column (slot \Robject{featureData}/\Robject{fData}) that is
declared with the \Robject{fcol} argument (the default is
\texttt{"markers"}, which contains the most current known markers for
the experiment under investigation; the original markers published
with the data can be found in the feature variable
\texttt{"markers.orig"}).
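
For instance, a PCA plot highlighting the default (most recent)
markers could be generated as follows (a sketch, not executed here;
the figures in this document use other feature variables):

<<plot2dMarkersSketch, eval = FALSE>>=
## PCA plot coloured by the default 'markers' feature variable
plot2D(tan2009r1, fcol = "markers")
addLegend(tan2009r1, fcol = "markers",
          where = "bottomright", bty = "n", cex = .7)
@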

\begin{figure}[!hbt]
<<plot2d, dev='pdf', fig.width=5, fig.height=5, echo=TRUE>>=
plot2D(tan2009r1, fcol = "PLSDA", fpch = "markers.orig")
addLegend(tan2009r1, fcol = "PLSDA", 
          where = "bottomright", bty = "n", cex = .7)
@ 
\caption{ Representation of the \Sexpr{nrow(tan2009r1)} proteins of
  \Robject{tan2009r1} after reduction of the 4 reporter quantitation
  data to 2 principal components.}
\label{fig:plot2d}
\end{figure}

\subsection{Features of interest}

In addition to highlighting sub-cellular niches as coloured clusters
on the PCA plot, it is also possible to define arbitrary
\textit{features of interest} that represent, for example, proteins of
a particular pathway or a set of interaction partners. Such sets of
proteins are recorded as \Rclass{FeaturesOfInterest} instances, as
illustrated below (using the first ten features of our experiment):

<<foi0>>=
foi1 <- FeaturesOfInterest(description = "Feats of interest 1",
                           fnames = featureNames(tan2009r1[1:10]))
description(foi1)
foi(foi1)
@

Several such features of interest can be combined into collections:

<<>>=
foi2 <- FeaturesOfInterest(description = "Feats of interest 2",
                           fnames = featureNames(tan2009r1[880:888]))
foic <- FoICollection(list(foi1, foi2))
foic
@

\Rclass{FeaturesOfInterest} instances can now be overlaid on the PCA
plot with the \Rfunction{highlightOnPlot} function.

\begin{figure}[!hbt]
<<plotfoi, dev='pdf', fig.width=5, fig.height=5, echo=TRUE>>=
plot2D(tan2009r1, fcol = "PLSDA")
addLegend(tan2009r1, fcol = "PLSDA", 
          where = "bottomright", bty = "n", cex = .7)
highlightOnPlot(tan2009r1, foi1, col = "black", lwd = 2)
highlightOnPlot(tan2009r1, foi2, col = "purple", lwd = 2)
legend("topright", c("FoI 1", "FoI 2"), bty = "n",
       col = c("black", "purple"), pch = 1)
@ 
\caption{Adding features of interest on a PCA plot. }
\label{fig:plotfoi}
\end{figure}

See \Rfunction{?FeaturesOfInterest} and \Rfunction{?highlightOnPlot}
for more details.

\clearpage

\subsection{Interactive visualisation}\label{sec:gui}

The \Biocpkg{pRolocGUI} application allows one to explore the spatial
proteomics data using an interactive, web-based \CRANpkg{shiny}
interface \cite{shiny}. The package is available from Bioconductor
and can be installed and started as follows:

<<guiinstall, eval = FALSE>>=
library("BiocInstaller")
biocLite("pRolocGUI")
library("pRolocGUI")
pRolocVis(tan2009r1)
@

\begin{figure}[h]
  \centering
  \includegraphics[width=.65\linewidth]{./Figures/Screenshot_PCA2.png}
  \caption{Screenshot of the \Biocpkg{pRolocGUI} interface.}
  \label{fig:gui}
\end{figure}

More details are available in the vignette that can be started from
the application by clicking on any of the question marks, by starting
the vignette from \R{} with \Rfunction{vignette("pRolocGUI")} or can be
accessed
online\footnote{\url{http://bioconductor.org/packages/devel/bioc/vignettes/pRolocGUI/inst/doc/pRolocGUI.html}}.

\clearpage

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% Section
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\section{Data analysis}\label{sec:analysis}

Classification of proteins, i.e. assigning sub-cellular localisation
to proteins, is the main aspect of the present data analysis. The
principle is the following and is, in its basic form, a 2-step
process. First, an algorithm learns from the known markers that are
shown to it and models the data space accordingly. This phase is also
called the training phase. In the second phase, un-labelled proteins,
i.e. those that have not been labelled as residents of any organelle,
are matched to the model and assigned to a group (an organelle). This
2-step process is called machine learning (ML), because the computer
(machine) learns by itself how to recognise instances that possess
certain characteristics and classifies them without human
intervention. That does not, however, mean that the results can be
trusted blindly.

In the above paragraph, we have defined what is called supervised ML,
because the algorithm is presented with some known instances from which
it learns (see section \ref{sec:sml}). Alternatively, un-supervised ML
does not make any assumptions about group memberships, and uses
the structure of the data itself to define sub-groups (see section
\ref{sec:usml}). It is of course possible to classify data based on
both labelled and unlabelled data. This extension of the supervised
classification problem described above is called semi-supervised
learning. In this case, the training data consists of both labelled
and unlabelled instances, with the obvious goal of generating a better
classifier than would be possible with the labelled data only. The
\textit{phenoDisco} algorithm will be illustrated in that context
(section \ref{sec:ssml}).

\subsection{Unsupervised ML}\label{sec:usml}

The \Rfunction{plot2D} function can also be used to visualise the data
on a PCA plot omitting any marker definitions, as shown on figure
\ref{fig:plot2dnull}. This approach avoids any bias towards marker
definitions and concentrates on the data and its underlying structure
itself.

\begin{figure}[!hbt]
<<plot2dnull, dev='pdf', fig.width=5, fig.height=5, echo=TRUE>>=
plot2D(tan2009r1, fcol = NULL)
@ 
\caption{Plain PCA representation of the \Robject{tan2009r1} data.}
\label{fig:plot2dnull}
\end{figure}

Alternatively, \Biocpkg{pRoloc} also gives access to
\Biocpkg{MLInterfaces}'s \Rfunction{MLearn} unified interface for,
among others, unsupervised clustering approaches using k-means (figure
\ref{fig:plotKmeans} on page \pageref{fig:plotKmeans}), hierarchical
clustering (figure \ref{fig:plotHclust} on page
\pageref{fig:plotHclust}) or partitioning around medoids (figure
\ref{fig:plotPam} on page \pageref{fig:plotPam}).

%% TODO - describe these plots in more details.
%% partition plot (private from MLInterfaces)
%% silhouette plot
%% screeplot
%% PCA plot
%% hierarchical cluster

%% ALSO
%%  - maybe some mclust-inspired plots.

\begin{figure}[!hbt]
<<plotKmeans, dev='pdf', fig.width=8, fig.height=6, echo=TRUE>>=
kcl <- MLearn( ~ ., tan2009r1,  kmeansI, centers=5)
plot(kcl, exprs(tan2009r1))
@ 
\caption{k-means clustering on the \Robject{tan2009r1} data. }
\label{fig:plotKmeans}
\end{figure}

\begin{figure}[!hbt]
<<plotHclust, dev='pdf', fig.width=8, fig.height=6, echo=TRUE, tidy = FALSE>>=
hcl <- MLearn( ~ ., tan2009r1,  
              hclustI(distFun = dist, cutParm = list(k = 5)))
plot(hcl, labels = FALSE)
@ 
\caption{Hierarchical clustering on the \Robject{tan2009r1} data.}
\label{fig:plotHclust}
\end{figure}

\begin{figure}[!hbt]
<<plotPam, dev='pdf', fig.width=8, fig.height=6, echo=TRUE>>=
pcl <- MLearn( ~ ., tan2009r1,  pamI(dist), k = 5)
plot(pcl, data = exprs(tan2009r1))
@ 
\caption{Partitioning around medoids on the \Robject{tan2009r1} data.}
\label{fig:plotPam}
\end{figure}

\clearpage

\subsection{Supervised ML}\label{sec:sml}

In this section, we show how to use \Biocpkg{pRoloc} to run a typical
supervised ML analysis. Several ML methods are available, including
k-nearest neighbour (knn), partial least squares discriminant analysis
(plsda), random forest (rf), support vector machines (svm), \ldots The
detailed description of each method is outside of the scope of this
document. We will use support vector machines to illustrate a typical
pipeline and the important points that require attention. These points
are equally valid and work, from a \Biocpkg{pRoloc} user perspective,
in exactly the same way for the other approaches.

\subsubsection{Classification algorithm parameters optimisation}

Before actually generating a model on the new markers and classifying
unknown residents, one has to take care of properly setting the model
parameters. Wrongly set parameters can have a very negative impact on
classification performance. To do so, we create training (to model)
and testing (to predict) subsets using known residents. By comparing
observed and expected classifications, we can assess how well a given
model works using the macro F1 score (see below). This procedure is
repeated for a range of possible model parameter values (this is
called a grid search), and the best performing set of parameters is
then used to construct a model on all markers and predict un-labelled
proteins.

Model accuracy is evaluated using the F1 score, $F1 = 2 ~
\frac{precision \times recall}{precision + recall}$, calculated as the
harmonic mean of the precision ($precision = \frac{tp}{tp+fp}$, a
measure of \textit{exactness} -- returned output is a relevant result)
and recall ($recall=\frac{tp}{tp+fn}$, a measure of
\textit{completeness} -- indicating how much was missed from the
output). What we are aiming for is a high generalisation accuracy,
i.e. a high $F1$, indicating that the marker proteins in the test data
set are consistently correctly assigned by the algorithm.
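
As a toy numerical illustration of these formulas (using hypothetical
counts, not values from the data analysed here), consider a class with
8 true positives, 2 false positives and 4 false negatives; the macro
F1 then averages such per-class F1 scores across all organelles:

<<f1toy, eval = FALSE>>=
## hypothetical counts for a single class
tp <- 8; fp <- 2; fn <- 4
precision <- tp / (tp + fp) ## 0.8
recall    <- tp / (tp + fn) ## 0.667
f1 <- 2 * precision * recall / (precision + recall)
f1                          ## 0.727
@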

\bigskip

In order to evaluate how well a classifier performs on profiles it was
not exposed to during its creation, we implemented the following
schema. Each set of marker protein profiles, i.e. labelled with known
organelle association, is separated into \textit{training} and
\textit{test/validation} partitions by sampling 80\% of the profiles
corresponding to each organelle (i.e. stratified) without replacement
to form the training partition $S_{tr}$, with the remainder becoming
the test/validation partition $S_{ts}$. The svm regularisation
parameter $C$ and Gaussian kernel width $\sigma$ are selected using a
further round of stratified five-fold cross-validation on each
training partition. All pairs of parameters $(C_i, \sigma_j)$ under
consideration are assessed using the macro F1 score and the pair that
produces the best performance is subsequently employed in training a
classifier on all training profiles $S_{tr}$ prior to assessment on
all test/validation profiles $S_{ts}$. This procedure is repeated $N$
times (in the example below 10) in order to produce $N$ macro F1
estimated generalisation performance values (figure
\ref{fig:params}). This procedure is implemented in the
\Rfunction{svmOptimisation} function. See \Rfunction{?svmOptimisation}
for details, in particular the range of $C$ and $\sigma$ parameters
and how the relevant feature variable is defined by the \Robject{fcol}
parameter, which defaults to \Robject{"markers"}.  Note that here, we
demonstrate the function with only 10 iterations\footnote{%%
  In the interest of time, the optimisation is not executed but loaded
  from \Rfunction{dir(system.file("extdata", package = "pRoloc"),
    full.names = TRUE, pattern = "params.rda")}.%%
} (\Robject{times = 10}), which is enough for testing, but we
recommend 100 (the default value) for a more robust analysis.

<<svmParamOptim, eval = FALSE, message = FALSE, tidy=FALSE>>=
params <- svmOptimisation(tan2009r1, fcol = "markers.orig", 
                          times = 10, xval = 5, verbose = FALSE)
@ 

<<loadParams, echo = FALSE>>=
fn <- dir(system.file("extdata", package = "pRoloc"), 
          full.names = TRUE, pattern = "params.rda")
load(fn)
rm(fn)
@ 

<<showParams>>=
params
@ 


\begin{figure}[!hbt]
<<params, dev='pdf', fig.width=4, fig.height=4, echo=TRUE, out.width='.49\\linewidth'>>=
plot(params)
levelPlot(params)
@ 
\caption{Assessing parameter optimisation. On the left, we see the
  respective distributions of the 10 macro F1 scores for the best
  cost/sigma parameter pairs. See also the output of
  \Rfunction{f1Count} in relation to this plot. On the right, we see
  the averaged macro F1 scores, for the full range of parameter
  values.}
\label{fig:params}
\end{figure}

In addition to the plots on figure \ref{fig:params},
\Rfunction{f1Count(params)} returns, for each combination of
parameters, the number of best (highest) F1 observations. One can use
\Rfunction{getParams} to see the default set of parameters that is
chosen based on the executed optimisation. Currently, the first best
set is automatically extracted, and users are advised to critically
assess whether this is the wisest choice.

<<f1count>>=
f1Count(params)
getParams(params)
@ 

\subsubsection{Classification}\label{sec:classification}

We can now re-use the result from our parameter optimisation (the best
cost/sigma pair is automatically extracted, using the
\Rfunction{getParams} method, although it is possible to set them
manually), and use it to build a model with all the marker proteins
and predict unknown residents using the \Rfunction{svmClassification}
function (see the manual page for more details). By default, the
organelle markers will be defined by the \Robject{"markers"} feature
variable (this can be changed with the \Robject{fcol} parameter);
here, we use the original markers in \Robject{"markers.orig"} as a use
case. New feature variables containing the organelle assignments and
assignment probabilities\footnote{%%
  The calculation of the classification probabilities is dependent on
  the classification algorithm.  These probabilities are not to be
  compared across algorithms; they do \textbf{not} reflect any
  \textbf{biologically relevant} sub-cellular localisation probability
  but rather an algorithm-specific classification confidence score.}%%
, called scores hereafter, are automatically added to the
\Robject{featureData} slot; in this case, using the \Robject{svm} and
\Robject{svm.scores} labels.

<<svmRes0, warning=FALSE, eval = FALSE, tidy = FALSE>>=
## manual setting of parameters
svmres <- svmClassification(tan2009r1, fcol = "markers.orig", 
                            sigma = 1, cost = 1)
@ 

<<svmRes, warning=FALSE, tidy = FALSE>>=
## using default best parameters
svmres <- svmClassification(tan2009r1, fcol = "markers.orig",
                            assessRes = params)
processingData(svmres)
tail(fvarLabels(svmres), 4)
@ 

The original markers, classification results and scores can be
accessed with the \Rfunction{fData} accessor method, e.g.
\Rfunction{fData(svmres)\$svm} or
\Rfunction{fData(svmres)\$svm.scores}.  Two helper functions,
\Rfunction{getMarkers} and \Rfunction{getPredictions}, are available
and add some level of automation and functionality, assuming that the
default feature labels are used. Both (invisibly) return the
corresponding feature variable (the markers or assigned
classification) and print a summary table. The \Robject{fcol}
parameter must be specified for \Rfunction{getPredictions}. It is also
possible to define a classification probability below which
classifications are set to \texttt{"unknown"}.

<<getPredictions>>=
p1 <- getPredictions(svmres, fcol = "svm")
minprob <- median(fData(svmres)$svm.scores)
p2 <- getPredictions(svmres, fcol = "svm", t = minprob)
table(p1, p2)
@ %%$

To graphically illustrate the organelle-specific score distributions,
we use a boxplot and mark the global \Robject{minprob} threshold
(figure \ref{fig:predscores}).

\begin{figure}[!hbt]
<<predscoresPlot, dev='pdf', fig.width=6, fig.height=6, echo=TRUE, out.width='.55\\linewidth'>>=
boxplot(svm.scores ~ svm, data = fData(svmres), ylab = "SVM scores")
abline(h = minprob, lwd = 2, lty = 2)
@ 
\caption{ Organelle-specific SVM score distributions. }
\label{fig:predscores}
\end{figure}

As can be seen on figure \ref{fig:predscores}, different organelles are
characterised by different score distributions. Using a unique
threshold (\Robject{minprob} with value \Sexpr{round(minprob, 2)}
above) results in accepting 
\Sexpr{round(sum(p2 == "ER")/sum(p1 == "ER"), 2) * 100}
\% of the initial ER predictions and only
\Sexpr{round(sum(p2 == "mitochondrion")/sum(p1 == "mitochondrion"), 2)  * 100} 
\% of the mitochondrion predictions. The
\Rfunction{getPredictions} function also accepts organelle-specific
score thresholds. Below, we calculate organelle-specific median
scores.

<<medscores1, tidy = FALSE>>=
## including marker scores
(ts1 <- tapply(fData(svmres)$svm.scores, fData(svmres)$svm, median))
@


Above, we included marker proteins (that have scores of 1 by default)
when calculating the respective organelle-specific
scores. Alternatively, one might choose to ignore them.

<<medscores2, tidy = FALSE>>=
## ignoring markers scores (i.e. scores == 1)
(ts2 <- tapply(fData(svmres)$svm.scores, fData(svmres)$svm,
               function(x) {
                   ## assuming scores of 1 are markers
                   scr <- median(x[x != 1])
                   ## in case no proteins were assigned to an organelle, 
                   ## scr would be NA. Setting these cases to 1.
                   ifelse(is.na(scr), 1, scr)
               }))
@

Using these scores equates to choosing, for each organelle, the 50\%
of predictions with the highest scores.

<<medscorepreds, tidy = FALSE>>=
getPredictions(svmres, fcol = "svm", t = ts2)
@

\bigskip

We can now visualise these results using the plotting functions
presented in section \ref{sec:usml}, as shown on figure
\ref{fig:svmres}. We clearly see that, besides the organelle marker
clusters that have been assigned high confidence members, many other
proteins have substantially lower prediction scores.

\begin{figure}[!hbt]
<<svmresfig, dev='pdf', fig.width=5, fig.height=5, echo=TRUE>>=
ptsze <- exp(fData(svmres)$svm.scores) - 1
plot2D(svmres, fcol = "svm", fpch = "markers.orig", cex = ptsze)
addLegend(svmres, fcol = "svm", 
          where = "bottomright", bty = "n", cex = .5)
@ 
%% $
\caption{ Representation of the svm predictions for the
  \Sexpr{nrow(tan2009r1)} proteins of the \Robject{tan2009r1} data
  set. The svm scores have been used to
  set the point size (\texttt{cex} argument; the scores have been
  transformed to emphasise the extremes). Different symbols
  (\texttt{fpch}) are used to differentiate markers and new
  assignments. }
\label{fig:svmres}
\end{figure}

\clearpage

\subsection{Semi-supervised ML}\label{sec:ssml}

It is obvious that the original set of markers initially used
(\Sexpr{paste(levels(fData(tan2009r1)$markers.orig)[1:4], collapse = ",
  ")}) %%$
is not a biologically realistic representation of the organelle
diversity. Manually finding markers is, however, time consuming, as it
requires careful verification of the annotation, and is critical for
the subsequent analysis, as markers are directly used in the training
phase of the supervised ML approach.

As can be seen in the PCA plots above, there is inherent structure in
the data that can be exploited to automate the detection of new
clusters. The \textit{phenoDisco} algorithm \cite{Breckels2013} is an
iterative method that combines classification of proteins to known
groups and detection of new clusters. It is available in
\Biocpkg{pRoloc} through the \Rfunction{phenoDisco}
function\footnote{In the interest of time, \Rfunction{phenoDisco} is
  not executed when the vignette is dynamically built. The data object
  can be located with \Rfunction{dir(system.file("extdata", package =
    "pRoloc"), full.names = TRUE, pattern = "pdres.rda")} and loaded
  with \Rfunction{load}.}.

<<runPhenoDisco, eval=FALSE, warning=FALSE>>=
pdres <- phenoDisco(tan2009r1, GS = 10, times = 100, fcol = "PLSDA")
@ 

<<loadpdres, echo=FALSE>>=
fn <- dir(system.file("extdata", package = "pRoloc"), 
          full.names = TRUE, pattern = "pdres.rda")
load(fn)
rm(fn)
@ 

The results are also appended to the \Robject{featureData} slot.

<<phenoDiscoFvar>>=
processingData(pdres)
tail(fvarLabels(pdres), 3)
@ 

The \Rfunction{plot2D} function can, as previously, be utilised to
visualise the results, as shown on figure \ref{fig:pdres}.

\begin{figure}[!hbt]
<<pdresfig, dev='pdf', fig.width=5, fig.height=5, echo=TRUE>>=
plot2D(pdres, fcol = "pd")
addLegend(pdres, fcol = "pd", ncol = 2, 
          where = "bottomright", bty = "n", cex = .5)
@ 
%% $
\caption{ Representation of the phenoDisco prediction and cluster discovery results. }
\label{fig:pdres}
\end{figure}

\clearpage

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% Section
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\subsection{Following up on novelty discovery}\label{sec:followup}

The newly discovered phenotypes need to be carefully validated prior
to further analysis.  Indeed, as the structure of the data is
exploited by the discovery algorithm, some phenotypes might represent
peculiar structure in the data and not match biologically relevant
groups.  The \Robject{tan2009r1} data has been subjected to a careful
\Rfunction{phenoDisco} analysis and validation in
\cite{Breckels2013}. The resulting new, augmented marker set is
available in the \Robject{pd.markers} feature data. These markers
represent a combined set of the original markers and validated
proteins from the new phenotypes.

<<pdmarkers>>=
getMarkers(tan2009r1, fcol = "pd.markers")
@ 

The augmented set of markers is now employed to repeat the
classification using the support vector machine classifier. We apply a
slightly different analysis than described in section \ref{sec:sml}
(page \pageref{sec:sml}). In the code chunks below\footnote{%%
  As previously, the results are pre-computed and available in the
  \texttt{extdata} package directory.%%
}, we use class-specific weights when creating the svm model; the
weights are set to be inversely proportional to the class frequencies.

<<weights, eval = TRUE, echo = TRUE>>=
w <- table(fData(tan2009r1)[, "pd.markers"])
(w <- 1/w[names(w) != "unknown"])
@ 

<<pdsvmParams, eval = FALSE, tidy = FALSE>>=
params2 <- svmOptimisation(tan2009r1, fcol = "pd.markers",                           
                           times = 10, xval = 5, 
                           class.weights = w,
                           verbose = FALSE)
@ 

<<loadParams2, echo = FALSE>>=
fn <- dir(system.file("extdata", package = "pRoloc"), 
          full.names = TRUE, pattern = "params2.rda")
load(fn)
rm(fn)
@ 

<<pdsvm, cache = TRUE, message = FALSE, warning = FALSE, tidy = FALSE>>=
tan2009r1 <- svmClassification(tan2009r1, params2, 
                               class.weights = w,
                               fcol = "pd.markers")
@ 

The data is visualised as described previously, where we use the svm
classification a-posteriori probability to set the point size.

\begin{figure}[!hbt]
<<pdres2fig, dev='pdf', fig.width=6, fig.height=6, echo=TRUE, tidy=FALSE>>=
ptsze <- exp(fData(tan2009r1)$svm.scores) - 1
plot2D(tan2009r1, fcol = "svm", cex = ptsze)
addLegend(tan2009r1, fcol = "svm", where = "bottomright",
          ncol = 2, bty = "n", cex = .5)
@ 
%% $
\caption{ Second round of classification using the augmented set of markers obtained 
  using \Rfunction{phenoDisco} as detailed in \cite{Breckels2013} 
  and a weighted svm classifier. }
\label{fig:pdres2}
\end{figure}


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% Section
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\clearpage

\section{Conclusions}\label{sec:ccl}

This tutorial focuses on practical aspects of organelle proteomics
data analysis using \Biocpkg{pRoloc}. Two important aspects have
been illustrated: (1) data generation, manipulation and visualisation
and (2) application of contemporary and novel machine learning
techniques.  Other crucial parts of a full analysis pipeline that were
not covered here are raw mass-spectrometry quality control,
quantitation, post-analysis and data validation.

\bigskip

Data analysis is not a trivial task and, in general, one cannot
assume that any off-the-shelf algorithm will perform well.  As such,
one of the emphases of the software presented in this document is to
allow users to track data processing and critically evaluate the
results.



\clearpage

\section*{Acknowledgement}

We would like to thank Mr Daniel J.H. Nightingale, Dr Arnoud J. Groen,
Dr Claire M. Mulvey and Dr Andy Christoforou for their organelle
marker contributions.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% Section
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\section*{Session information}\label{sec:sessionInfo} 

All software and respective versions used to produce this document are
listed below.

<<sessioninfo, results='asis', echo=FALSE>>=
toLatex(sessionInfo())
@

\bibliography{pRoloc}

\end{document}