---
title: "A short introduction to *MSnbase* development"
author: 
- name: Laurent Gatto
  affiliation: Computational Proteomics Unit, Cambridge, UK.
- name: Johannes Rainer
  affiliation: Center for Biomedicine, EURAC, Bolzano, Italy.
- name: Sebastian Gibb
  affiliation: Department of Anesthesiology and Intensive Care, University Medicine Greifswald, Germany.
package: MSnbase
abstract: >
  This vignette describes the classes implemented in \Biocpkg{MSnbase}
  package.  It is intended as a starting point for developers or users
  who would like to learn more or further develop/extend mass
  spectrometry and proteomics data structures.
bibliography: MSnbase.bib
output:
  BiocStyle::html_document2:
    toc_float: true
vignette: >
  %\VignetteIndexEntry{A short introduction to `MSnbase` development}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteKeywords{Mass Spectrometry, Proteomics, Infrastructure }
  %\VignetteEncoding{UTF-8}
---


```{r environment, echo=FALSE}
suppressPackageStartupMessages(library("MSnbase"))
suppressPackageStartupMessages(library("BiocStyle"))
```

```{r include_forword, echo=FALSE, results="asis"}
cat(readLines("./Foreword.md"), sep = "\n")
```

```{r include_bugs, echo=FALSE, results="asis"}
cat(readLines("./Bugs.md"), sep = "\n")
```

# Introduction

This document is not a replacement for the individual manual pages,
that document the slots of the `r Biocpkg("MSnbase")` classes. It is a
centralised high-level description of the package design.

`r Biocpkg("MSnbase")` aims at being compatible with the 
`r Biocpkg("Biobase")` infrastructure [@Gentleman2004].  Many meta data
structures that are used in `eSet` and associated classes are also
used here. As such, knowledge of the *Biobase development and the new
eSet* vignette would be beneficial; the vignette can directly be
accessed with `vignette("BiobaseDevelopment", package="Biobase")`.

The initial goal is to use the `r Biocpkg("MSnbase")` infrastructure
for MS2 labelled (iTRAQ [@Ross2004] and TMT [@Thompson2003]) and
label-free (spectral counting, index and abundance) quantitation
- see the documentation for the `quantify` function for details. The
infrastructure is currently extended to support a wider range of
technologies, including metabolomics.


# `r Biocpkg("MSnbase")` classes

All classes have a `.__classVersion__` slot, of class `Versioned` from
the `r Biocpkg("Biobase")` package. This slot documents the class
version for any instance to be used for debugging and object update
purposes. Any change in a class implementation should trigger a
version change.


## `pSet`: a virtual class for raw mass spectrometry data and meta data

This virtual class is the main container for mass spectrometry data,
i.e spectra, and meta data. It is based on the `eSet` implementation
for genomic data. The main difference with `eSet` is that the
`assayData` slot is an environment containing any number of
`Spectrum` instances (see the [`Spectrum` section](#Spectrum)).

One new slot is introduced, namely `processingData`, that contains one
`MSnProcess` instance (see the [`MSnProcess` section](#MSnProcess)).
and the `experimentData` slot is now expected to contain `MIAPE` data.
The `annotation` slot has not been implemented, as no prior feature
annotation is known in shotgun proteomics.


```{r pSet}
getClass("pSet")
```

## `MSnExp`: a class for MS experiments


`MSnExp` extends `pSet` to store MS experiments.  It does not add any
new slots to `pSet`. Accessors and setters are all inherited from
`pSet` and new ones should be implemented for `pSet`.  Methods that
manipulate actual data in experiments are implemented for
`MSnExp` objects.

```{r MSnExp}
getClass("MSnExp")
```

## `OnDiskMSnExp`: a on-disk implementation of the `MSnExp` class


The `OnDiskMSnExp` class extends `MSnExp` and inherits all of its
functionality but is aimed to use as little memory as possible based
on a balance between memory demand and performance. Most of the
spectrum-specific data, like retention time, polarity, total ion
current are stored within the object's `featureData` slot. The actual
M/Z and intensity values from the individual spectra are, in contrast
to `MSnExp` objects, not kept in memory (in the `assayData` slot), but
are fetched from the original files on-demand. Because mzML files are
indexed, using the `r Biocpkg("mzR")` package to read the relevant
spectrum data is fast and only moderately slower than for in-memory
`MSnExp`^[The *benchmarking* vignette compares data size and operation speed of the two implementations.].

To keep track of data manipulation steps that are applied to spectrum
data (such as performed by methods `removePeaks` or `clean`) a *lazy
execution* framework was implemented. Methods that manipulate or
subset a spectrum's M/Z or intensity values can not be applied
directly to a `OnDiskMSnExp` object, since the relevant data is not
kept in memory. Thus, any call to a processing method that changes or
subset M/Z or intensity values are added as `ProcessingStep` items to
the object's `spectraProcessingQueue`. When the spectrum data is then
queried from an `OnDiskMSnExp`, the spectra are read in from the file
and all these processing steps are applied on-the-fly to the spectrum
data before being returned to the user.

The operations involving extracting or manipulating spectrum data are
applied on a per-file basis, which enables parallel processing. Thus,
all corresponding method implementations for `OnDiskMSnExp` objects
have an argument `BPPARAM` and users can set a `PARALLEL_THRESH`
option flag^[see `?MSnbaseOptions` for details.] that enables to
define how and when parallel processing should be performed (using the
`r Biocpkg("BiocParallel")` package).

Note that all data manipulations that are not applied to M/Z or
intensity values of a spectrum (e.g. sub-setting by retention time
etc) are very fast as they operate directly to the object's
`featureData` slot.

```{r OnDiskMSnExp}
getClass("OnDiskMSnExp")
```

The distinction between `MSnExp` and `OnDiskMSnExp` is often not
explicitly stated as it should not matter, from a user's perspective,
which data structure they are working with, as both behave in
equivalent ways. Often, they are referred to as *in-memory* and
*on-disk* `MSnExp` implementations.


## `MSnSet`: a class for quantitative proteomics data

This class stores quantitation data and meta data after running
`quantify` on an `MSnExp` object or by creating an `MSnSet` instance
from an external file, as described in the *MSnbase-io* vignette and
in `?readMSnSet`, `readMzTabData`, etc. The quantitative data is in
form of a *n* by *p* matrix, where *n* is the number of
features/spectra originally in the `MSnExp` used as parameter in
`quantify` and *p* is the number of reporter ions. If read from an
external file, *n* corresponds to the number of features (protein
groups, proteins, peptides, spectra) in the file and $p$ is the number
of columns with quantitative data (samples) in the file.

This prompted to keep a similar implementation as the `ExpressionSet`
class, while adding the proteomics-specific annotation slot introduced
in the `pSet` class, namely `processingData` for objects of class
`MSnProcess`.

```{r MSnSet}
getClass("MSnSet")
```

The `MSnSet` class extends the virtual `eSet` class to provide
compatibility for `ExpressionSet`-like behaviour.  The experiment
meta-data in `experimentData` is also of class `MIAPE` .  The
`annotation` slot, inherited from `eSet` is not used. As a result, it
is easy to convert `ExpressionSet` data from/to `MSnSet` objects with
the coersion method `as`.

```{r as}
data(msnset)
class(msnset)
class(as(msnset, "ExpressionSet"))

data(sample.ExpressionSet)
class(sample.ExpressionSet)
class(as(sample.ExpressionSet, "MSnSet"))
```

## `MSnProcess`: a class for logging processing meta data {#MSnProcess}


This class aims at recording specific manipulations applied to
`MSnExp` or `MSnSet` instances. The `processing`
slot is a `character` vector that describes major
processing. Most other slots are of class `logical` that
indicate whether the data has been centroided, smoothed, \ldots
although many of the functionality is not implemented yet.  Any new
processing that is implemented should be documented and logged here.

It also documents the raw data file from which the data originates
(`files` slot) and the `r Biocpkg("MSnbase")` version that was in
use when the `MSnProcess` instance, and hence the
`MSnExp`/`MSnSet` objects, were originally created.

```{r MSnProcess}
getClass("MSnProcess")
```


## `MIAPE`: Minimum Information About a Proteomics Experiment

The Minimum Information About a Proteomics Experiment
[@Taylor2007; @Taylor2008] `MIAPE` class describes the experiment,
including contact details, information about the mass spectrometer and
control and analysis software.

```{r MIAPE}
getClass("MIAPE")
```


## `Spectrum`  *et al.*: classes for MS spectra {#Spectum}

`Spectrum` is a virtual class that defines common attributes to all
types of spectra. MS1 and MS2 specific attributes are defined in the
`Spectrum1` and `Spectrum2` classes, that directly extend `Spectrum`.


```{r Spectrum}
getClass("Spectrum")
```

```{r Spectrum1}
getClass("Spectrum1")
```

```{r Spectrum2}
getClass("Spectrum2")
```


## `ReporterIons`: a class for isobaric tags

The iTRAQ and TMT (or any other peak of interest) are implemented
`ReporterIons` instances, that essentially defines an expected MZ
position for the peak and a width around this value as well a names
for the reporters.

```{r ReporterIons}
getClass("ReporterIons")
```


## `NAnnotatedDataFrame`: multiplexed `AnnotatedDataFrame`s

The simple expansion of the `AnnotatedDataFrame` classes adds the
`multiplex` and `multiLabel` slots to document the number and names of
multiplexed samples.

```{r NAnnotatedDF}
getClass("NAnnotatedDataFrame")
```

## Other classes

### Lists of `MSnSet` instances {-}

When several `MSnSet` instances are related to each other and should
be stored together as different objects, they can be grouped as a list
into and `MSnSetList` object. In addition to the actual `list` slot,
this class also has basic logging functionality and enables iteration
over the `MSnSet` instances using a dedicated `lapply` methods.

```{r msl}
getClass("MSnSetList")
```

# Miscellaneous

#### Unit tests {-}

`r Biocpkg("MSnbase")` implements unit tests with the
`r CRANpkg("testthat")` package.

#### Processing methods {-}

Methods that process raw data, i.e. spectra should be implemented for
`Spectrum` objects first and then `eapply`ed (or similar) to the
`assayData` slot of an `MSnExp` instance in the specific method.


# Session information

```{r si}
sessionInfo()
```

# References