%
% NOTE -- ONLY EDIT THE .Rnw FILE!!!  The .tex file is
% likely to be overwritten.
%
%\VignetteIndexEntry{The TargetSearch Package}
%\VignetteDepends{TargetSearchData}
%\VignetteKeywords{preprocess, GC-MS, analysis}
%\VignettePackage{TargetSearch}
\documentclass[12pt]{article}

\usepackage{hyperref}
\usepackage{graphicx}
\usepackage[authoryear,round]{natbib}

\newcommand{\Robject}[1]{{\texttt{#1}}}
\newcommand{\Rfunction}[1]{{\texttt{#1}}}
\newcommand{\Rpackage}[1]{{\textit{#1}}}
\newcommand{\Rclass}[1]{{\textit{#1}}}
\newcommand{\Rmethod}[1]{{\textit{#1}}}
\newcommand{\Rfunarg}[1]{{\textit{#1}}}

\textwidth=6.2in
\textheight=8.5in
%\parskip=.3cm
\oddsidemargin=.1in
\evensidemargin=.1in
\headheight=-.3in 


\title{The TargetSearch Package}
\author{Alvaro Cuadros-Inostroza, Jan Lisec,\\Henning Redestig and Matthew A Hannah\\
\small{Max Planck Institute for Molecular Plant Physiology}\\
\small{Potsdam, Germany}\\
\small{\url{http://www.mpimp-golm.mpg.de/}}
}

\begin{document}
\maketitle

This document describes how to use \Rpackage{TargetSearch} to preprocess
GC-MS data.

<<LibraryPreload, echo = FALSE, results = hide>>=
library(xcms)
library(TargetSearch)
library(TargetSearchData)
@

\section{Supplied Files}

This section describes the files that have to be prepared before running 
\Rpackage{TargetSearch}. They are the sample file, the reference library 
file and the retention marker definition. For example purposes, we will 
use the example files provided by the package 
\Rpackage{TargetSearchData}. 

\subsection{NetCDF Files and Sample File} 

\Rpackage{TargetSearch} can currently read only NetCDF files. Many GC-MS 
software packages are able to convert raw chromatograms to NetCDF. It is 
also recommended to baseline correct your chromatograms before exporting 
to NetCDF. Please refer to your software documentation for details. 

After exporting the NetCDF files, please place them in a convenient 
location. Then prepare a text file to describe your samples. It must be 
tab-delimited and have at least two columns: ``CDF\_FILE'' and 
``MEASUREMENT\_DAY''. Other columns such as sample name, sample group, 
treatment, etc. may be additionally included to aid sample sub-setting 
and downstream analyses. An example is shown in table~\ref{sample:file}. 


\begin{table}[ht]
\centering
\begin{tabular}{lll}
\hline
CDF\_FILE & MEASUREMENT\_DAY & TIME\_POINT \\
\hline
7235eg08.cdf & 7235 & 1 \\
7235eg11.cdf & 7235 & 1 \\
7235eg26.cdf & 7235 & 1 \\
7235eg04.cdf & 7235 & 3 \\
7235eg30.cdf & 7235 & 3 \\
7235eg32.cdf & 7235 & 3 \\
...&&\\
\hline
\end{tabular}
\caption{Sample file example, ``samples.txt''}
\label{sample:file}
\end{table}

To import the sample list into R, use the function 
\Rfunction{ImportSamples()}. You also need to specify the directory 
where the NetCDF files are (\Rfunarg{CDFpath}) and a directory where the 
transformed cdf files, the so called RI files, will be saved 
(\Rfunarg{RIpath}). 


<<ImportSample>>=
library(TargetSearchData)
library(TargetSearch)
cdf.path <- system.file("gc-ms-data", package = "TargetSearchData")
sample.file <- file.path(cdf.path, "samples.txt")
samples <- ImportSamples(sample.file, CDFpath = cdf.path,
                         RIpath = ".")
@

You could alternatively create a \Robject{tsSample} object by using the sample class
methods.

<<ImportSample2>>=
cdffiles <- dir(cdf.path, pattern="cdf$")
# define the RI file names
rifiles  <- paste("RI_", sub("cdf","txt", cdffiles), sep = "")
# take the measurement day info from the cdf file names.
days <- substring(cdffiles, 1, 4)
# sample names 
smp_names <- sub("\\.cdf", "", cdffiles)
# add some sample info
smp_data <- data.frame(CDF_FILE =cdffiles, GROUP = gl(5,3))
# create the sample object
samples <- new("tsSample", Names = smp_names, CDFfiles = cdffiles,
               CDFpath = cdf.path, RIpath = ".", days = days,
               RIfiles = rifiles, data = smp_data)
@

\subsection{Retention Time Markers}

The RI markers time definition, time window and $m/z$ values has to be provided
in a tab-delimited text file. The first two columns indicate the lower and
upper window limits where the retention marker will be searched. The third
column is the RI of that particular marker. An example is shown in
table \ref{rim:file}.

\begin{table}[ht]
\centering
\begin{tabular}{rrr}
\hline
LowerLimit & UpperLimit & RIstandard \\
\hline
230 & 280 &	262320\\
290	& 340 &	323120\\
350 & 400 &	381020\\
\hline
\end{tabular}
\caption{Retention time definition file example, ``rimLimits.txt''}
\label{rim:file}
\end{table}

Use the function \Rfunction{ImportFameSettings()} to import the limits and
to set the $mz/z$ markers. Alternatively, the markers could be specified
by an additional column in the RI markers file.

<<ImportFameSettings>>=
rim.file  <- file.path(cdf.path, "rimLimits.txt")
rimLimits <- ImportFameSettings(rim.file, mass = 87)
@

This will import the limits in ``rimLimits.txt'' file and set the
marker mass to 87.

If you do not use RI markers, you can skip this part by setting the parameter
\Rfunarg{rimLimits} to \Robject{NULL} in \Rfunction{RIcorrect} function.
Please note that in this case, no retention time correction will be performed.

\section{Baseline correction, peak identification and RI correction}

Initially, \Rpackage{TargetSearch} identifies the local apex intensities 
in all chromatograms, finds the retention time of the RI markers and converts
the retention time to RI using linear interpolation \citep{VandenDool1963}.
This is done by the function \Rfunction{RIcorrect()}. It takes as parameter the
sample and retention time limits objects, the mass range to extract (\Rfunarg{
MassRange = c(85,500)}), the intensity threshhold (\Rfunarg{IntThreshold = 10}),
the peak picking method (\Rfunarg{pp.method}) and a \Rfunarg{Window} parameter
that will be used by said method. The function will return a matrix
with the retention times of the retention time markers and creates a
tab-delimited file (RI file) per every chromatogram in the selected directory.
These files contein the extracted peak list of the respective NetCDF file.


<<RIcorrection>>=
RImatrix <- RIcorrect(samples, rimLimits, massRange = c(85,500),
            IntThreshold = 10, pp.method = "smoothing", Window = 7)
@


There are two peak picking methods available. ``smoothing'' implements the
algorithm used by Tagfinder \citet{Luedemann2008}. The ``ppc'' algorithm is a
port from the package \Rpackage{ppc} function \Rfunction{ppc.peaks}. It looks for
the local maxima within a given time window. The \Rfunarg{Window} parameter
sets the window size of the chosen peak picking method.

In case that the chromatograms were not baseline corrected by the platform
specific GC-MS software, it is possible to do so with the function
\Rfunction{RIcorrect()} as well. There are to extra arguments you have
to include: \Rfunarg{baseline} is a logical value that tells \Rfunction{RIcorrect()}
whether to baseline correct the chromatograms or not. \Rfunarg{baseline.opts}
is a list of options passed to \Rfunction{baselineCorrection()} function.
This algorithm is based on the work of \citet{chang07} and a description
is found in \Rfunction{baselineCorrection()} documentation.

After that, it is possible to check for outliers in the samples by using
the function \Rfunction{FAMEoutliers()}. It creates a PDF report
of the retention time markers informing of possible outliers that the
user can remove or not. Alternatively, retention index markers can
be checked manually by the function \Rfunction{plotFAME}. Figure~\ref{fig:01}
shows the retention times of the marker 1.

<<PeakIdentification>>=
outliers <- FAMEoutliers(samples, RImatrix, threshold = 3)
@

Here \Rfunarg{threshold} sets the number of standard deviation a sample will
be considered outlier. In this example, no outliers were detected. 

\begin{figure}[htp]
\centering
<<plotFAME, fig=true, width=8, height=5>>=
plotFAME(samples, RImatrix, 1)
@
\caption{Retention Index Marker 1.}\label{fig:01}
\end{figure}

\section{Library Search}

\subsection{Reference Library File}

The ``reference library'' file contains the information of the
metabolites or mass spectral tags (MSTs) that will be searched for in
the chromatograms. A public spectra database could be found here
\url{http://csbdb.mpimp-golm.mpg.de/csbdb/gmd/gmd.html} at
\emph{The Golm Metabolome Database} \citep{Kopka2005}.

Required information is the metabolite name
(``Name''), expected retention time index (``RI''), 
selective masses (``SEL\_MASS''), most abundant masses
(``TOP\_MASS''), spectrum (``SPECTRUM'') and RI deviations
(``Win\_1'', ``Win\_2'', ``Win\_3''). See example in table~\ref{library:file}.
The columns ``Name'' and ``RI'' are mandatory and you have
at least to include one of the columns ``SEL\_MASS'', ``TOP\_MASS''
or ``SPECTRUM'' in the file (see below). The RI deviation columns are
optional.

\begin{table}[ht]
\centering
{\footnotesize
\begin{tabular}{lllll}
\hline
Name & RI & Win\_1 & SEL\_MASS & SPECTRUM\\
\hline
Pyruvic acid    & 222767 & 4000 & 89;115;158;174;189  & 85:7 86:14 87:7 88:5 8... \\
Glycine (2TMS)  & 228554 & 4000 & 86;102;147;176;204  & 86:26 87:19 88:8 89:4...\\
Valine & 271500 & 2000 & 100;144;156;218;246          & 85:8 86:14 87:6 88:5 8...\\
Glycerol (3TMS) & 292183 & 2000 & 103;117;205;293     & 85:14 86:2 87:16 88:13...\\
Leucine         & 306800 & 1500 & 102;158;232;260     & 158:999 159:148 160:45...\\
Isoleucine      & 319900 & 1500 & 102;103;158;163;218 & 90:11 91:2 92:1 93:1 9...\\
Glycine & 325000 & 2000 & 86;100;174;248;276          & 85:6 86:245 87:24 88:12...\\
\hline
\end{tabular}
}
\caption{Reference Library example, ``library.txt''}
\label{library:file}
\end{table}

In this file, masses and intensities must be positive integers. RIs and RI deviations
can be any positive real number. The selective and most abundant masses list
must be delimited by semicolon (;). The spectrum is described by
a list of mass and intensity pair. Every mass-intensity pair is separated by
colon (:) and different pairs are separated by spaces.

The function \Rfunction{ImportLibrary()} imports the reference file. 

<<ImportLibrary>>=
lib.file <- file.path(cdf.path, "library.txt")
lib      <- ImportLibrary(lib.file, RI_dev = c(2000,1000,200),
            TopMasses = 15, ExcludeMasses = c(147, 148, 149))
@


Here we set the RI window deviations to 2000, 1000 and 200 RI units. Since ``Win\_1''
column is already in the file, the first value (2000) is ignored. Also, the 15th most
abundant masses are taken but excluding the masses 147, 148 and 149 (common
confounding masses)

\subsection{Library Search Algorithm}

The library search is performed in three steps. First, for every metabolite,
selective masses are searched in a given time window around the expected RI.
This is done by the function \Rfunction{medianRILib()}. This function calculates
the median RI of the selective masses and return new library object with the
updated RI. The time deviation is given either in the library file (column
``Win\_1'') or when the library is imported (see \Rfunction{ImportLibrary()}).


<<LibrarySearch1>>=
lib <- medianRILib(samples, lib)
@

It is also posible to examinate visually the RI deviation of the metabolites
by setting the parameter \Rfunarg{makeReport=TRUE}, which creates a pdf report
like the one shown in figure~\ref{fig:02}. This may help to set or update
the expected RI deviation.

\begin{figure}[htp]
\centering
<<medianLib, fig=true, width=8, height=8, echo=false>>=
resPeaks <- FindPeaks(RIfiles(samples), refLib(lib, w = 1, sel = TRUE))
plotRIdev(lib, resPeaks, libId = 1:9)
@
\caption{RI deviation of first 9 metabolites in the library.}\label{fig:02}
\end{figure}

In the second step, the function \Rfunction{sampleRI()} searches the selective
masses again, but using the updated RI and the RI deviation defined in the library
object (``Win\_2''). After that, the intensities of the selected masses
are normalised to the median of the day, and then used to extract
other masses with correlated apex profiles. The masses for which
the Pearson correlation coefficient is above \Rfunarg{r\_thres} are
taken as metabolite markers and their RIs are averaged on a per
sample basis. This average RI represents the exact position where
the metabolite elutes in the respective sample, which is returned
in a matrix form.

<<LibrarySearch2>>=
cor_RI <- sampleRI(samples, lib, r_thres = 0.95,
                     method = "dayNorm")
@

The third step will look up for all the masses (selective and most abundant masses)
in all the samples. This is done by the function \Rfunction{peakFind()}. It returns
a \Rclass{tsMSdata} object with the intensities and RI of every mass (rows) and every
sample (columns) that were search for.


<<LibrarySearch3>>=
peakData <- peakFind(samples, lib, cor_RI)
@


The intensity and RI matrices can be accessed by using the \Rmethod{Intensity}
and \Rmethod{retIndex} methods.


<<LibrarySearch4>>=
met.RI        <- retIndex(peakData)
met.Intensity <- Intensity(peakData)
@


\section{Metabolite Profile}

The function \Rfunction{Profile} makes a profile of the MS data by
averaging all the normalised mass intensities whose Pearson coefficient
is greater that \Rfunarg{r\_thresh}.


<<MetaboliteProfile>>=
MetabProfile <- Profile(samples, lib, peakData, r_thres = 0.95,
                      method = "dayNorm")
@


A \Rclass{msProfile} object is returned. The averaged intensities
and RI matrices that can be obtained by \Rmethod{Intensity} and \Rmethod{retIndex}
methods. The profile information is represented by a \Rclass{data.frame}
in the \Rmethod{info} slot (accessible by \Rmethod{profileInfo} method). The
columns are:

\begin{description}

\item[Name]
    The metabolite/analyte name. 

\item[Lib\_RI]
    The expected RI (library RI).

\item[Mass\_count]
    The number of correlating masses.

\item[Non\_consecutive\_Mass\_count]
     Same as above, but not counting the consecutive masses.

\item[Sample\_count]
    The total number of masses that were found in the samples.
   
\item[Masses]
    The correlating masses.

\item[RI]
    The average RI.

\item[Score\_all\_masses]
    The similarity score calculated using the average intensity of all
	the masses that were searched for, regardless of whether they are
	correlating masses.

\item[Score\_cor\_masses]
    Same as above, but only correlating masses are considered.

\end{description}

As metabolites with similar selective masses and RIs can be
present in metabolite libraries, it is necessary to reduce redundancy.
This is performed by the function \Rfunction{ProfileCleanUp} which selects
peaks for which the RI gap is smaller than \Rfunarg{timeSplit} and
computes the Pearson correlation between them. When two
metabolites within such a time-group are tightly correlated (given by \Rfunarg{r\_thres})
only the one with more correlated masses is retained.

<<MetaboliteProfile2>>=
finalProfile <- ProfileCleanUp(MetabProfile, timeSplit = 500,
                                r_thres = 0.95)
@

The function returns a \Rclass{msProfile} object. The \Rmethod{info} slot
is similar as described above, but extra columns with a ``Cor\_'' preffix
(e.g., ``Cor\_Name'') are included. They provide information about
metabolite redundancy.

\section{Peaks and Spectra Visualisation}

Finally, it may be of interest to check the chromatographic peak of selected
metabolites and compare the median spectra of the metabolites, i.e., the median
intensities of the selected masses across all the samples, with the reference
spectra of the library. There are two functions to do so: \Rfunction{plotPeak}
and \Rfunction{plotSpectra}.

For example, we can compare the median spectrum of ``Valine'' against its
spectrum reference. Here we look for the library index of ``Valine'' and plot the
spectra comparison in a ``head-tail'' plot (figure~\ref{fig:03}).

\begin{figure}[htp]
\centering
<<plotSpectra, fig=true, width=8, height=6>>=
grep("Valine", libName(lib))
plotSpectra(lib, peakData, libId = 3, type = "ht")
@
\caption{Spectra comparison of ``Valine''}\label{fig:03}
\end{figure}

To look at the chromatographic peak of ``Valine'' in a given sample,
we use the functions \Rfunction{peakCDFextraction} to extract
the raw chromatogram and \Rfunction{plotPeak} to plot the peak (figure~\ref{fig:04}).

\begin{figure}[htp]
\centering
<<plotPeak, fig=true, width=8, height=6>>=
# select "Valine" top masses
top.masses <- topMass(lib)[[3]]
# we select the first sample
sample.id <- 1
cdf.file  <- file.path(cdf.path, cdffiles[sample.id])
rawpeaks  <- peakCDFextraction(cdf.file, massRange = c(85,500))
plotPeak(rawpeaks, time.range = libRI(lib)[3] + c(-2000,2000),
    masses = top.masses, useRI = TRUE, rimTime = RImatrix[,sample.id],
    standard = rimStandard(rimLimits), massRange = c(85, 500),
    main = "Valine")
@
\caption{Chromatographic peak of Valine.}\label{fig:04}
\end{figure}

Refer to the documentation of the functions \Rfunction{plotPeak}
and \Rfunction{plotSpectra} for further options not covered here.

\section{TargetSearch GUI}

We provide a grahical user interface intended to facilitate the use
of \emph{TargetSearch} for users unfamiliar with R. Many parameters that
would be set calling the individual \Rpackage{TargetSearch} functions as
described in this document can be set here ``in one go'' before running the
complete analysis. A screenshot of the GUI is shown in figure~\ref{fig:05}.

\begin{figure}[htp]
\centering
\includegraphics[scale=0.7]{tsGUI.png}
\caption{The \Rpackage{TargetSearch} GUI.}\label{fig:05}
\end{figure}

This is a descript of all the GUI options.

\begin{description}
\item[Working Directory]: Use the \emph{Browse}-button to select the folder
on your hard drive containing all your GC-MS data files. The output of
\Rpackage{TargetSearch} will be written to this folder too.

\item[File Import]: Clicking \emph{NetCDF Data} or \emph{Apex Data} radio buttons will open
a file select dialog. Choose the files you would like to be processed. You
may check your selection pressing the \emph{Show}-button.

\item[Baseline Correction]: Clicking \emph{on/off} button will perform
baseline correction before peak detection. If selected, the \Rfunarg{threshold}
parameter is a numeric value between 0 and 1. A value of one returns a baseline
above the noise, 0.5 in the middle of the noise and 0 below the noise. See
\Rfunction{baselineCorrection} documentation for further details.

\item[Retention Index Correction]: Retention Index Correction is neccessary and
applied only if you supply NetCDF Data (Apex Data contain already Retention
Indices). You may \emph{Load} or \emph{Create} the search windows for your RI-Markers
here. 

\item[Peak Detection]: \emph{Search Windows} refers to the allowed RI deviation of your
metabolites which are narrowed in 3 consecutive searches. \emph{Intensity Counts
threshold} defines the minimum apex intensity incorporated in the analysis.
A value of 1 would include all peaks. \emph{Mass Range} allows to limit the mass
values ($m/z$) to be included in the analysis. \emph{Smoothing} averages raw data
to eliminate some inherent noise leading to multiple peaks otherwise. 

\item[Library]: A Library (to detect metabolites) usable by \Rpackage{TargetSearch} contains
at least information about the metabolite `Name', its expected `RI' and the
selective masses in its spectrum `SEL\_MASS'. You may \emph{Load} or \emph{Create} one
yourself using the respective buttons. A more detailed description of
the file formats can be found in \Rfunction{ImportLibrary}.

\item[Normalization]: This selects how the data will be normalized during
the metabolite search. Options are \Rfunarg{dayNorm}, a day based median
normalization, \Rfunarg{medianNorm}, normalization using the median of all the intensities
of a given mass, and \Rfunarg{none}, no normalization at all.

\item[Final Profiles]: Here you may set the parameters used by the functions
\Rfunction{Profile} and \Rfunction{ProfileCleanUp}. \emph{timesplit} sets an
RI window that will be used to look for metabolites that could have been redundanly
identified. \emph{correl. thr.} is the correlation threshold and \emph{min. number
of correlation samples} is a threshold used to make sure that correlations are
computed with at least said number of observations.

\item[Parameters]: You may \emph{Save} the current parameters as an \texttt{*.RData}
file or \emph{Load} previously saved parameters to compare the outcome of different settings or
just repeat the analysis.

\item[Program]: \emph{Run} starts to process all currently selected files using the current
parameters and saving output to \emph{Working Directory}. \emph{Quit} closes the GUI.
\end{description}

\section{Untargeted search}

Although \Rpackage{TargetSearch} was designed to be targeted oriented, it is possible
to perform untargeted searches. The basic idea is to create a library that contains
evenly distributed ``metabolites'' in time and every ``metabolite'' uses the whole
range of possible masses as selective masses. An example:

<<untargeted1>>=
metRI    <- seq(200000, 300000, by = 5000)
metMZ    <- 85:250
metNames <- paste("Metab",format(metRI,digits=6), sep = "_")
@

Here we define a set of metabolites located every 5000 RI units
from each other in the RI range 200000-300000, with selective masses
in the range of 85-300, and assign them a name. After that, we
create an \Robject{tsLib} object.

<<untargeted2>>=
metLib   <- new("tsLib", Name = metNames, RI = metRI,
     selMass = rep(list(metMZ), length(metRI)), RIdev = c(3000, 1500, 500))
@

Now we can use this library object to perform a targetive search with
this library on the \emph{E. coli} samples as we did before.

<<untargeted3>>=
metLib <- medianRILib(samples, metLib)
metCorRI <- sampleRI(samples, metLib)
metPeakData <- peakFind(samples, metLib, metCorRI)
metProfile <- Profile(samples, metLib, metPeakData)
metFinProf <- ProfileCleanUp(metProfile, timeSplit = 500)
@

The \Robject{metFinProf} object can be used to create
a new library by taking only the metabolites that have, for
example, more than 5 correlating masses and using the correlating
masses as selective ones.

<<untargeted4>>=
sum(  profileInfo(metFinProf)$Mass_count > 5)
tmp   <-  profileInfo(metFinProf)$Mass_count > 5
metRI    <- profileInfo(metFinProf)$RI[tmp]
metNames <-  as.character( profileInfo(metFinProf)$Name[tmp] )
metMZ <- sapply(profileInfo(metFinProf)$Masses[tmp],
	function(x) as.numeric(unlist(strsplit(x,";"))) )
metLib   <- new("tsLib", Name = metNames, RI = metRI,
     selMass = metMZ, RIdev = c(1500, 750, 250))
@

After the new library object is created, the process can be
repeated as shown above.

Finally, by using the function \Rfunction{writeMSP},
it is possible to export the spectrum of the
unknown metabolites to MSP format used by NIST mass spectra
search (\url{http://www.nist.gov/srd/mslist.htm}),
so the unknown spectra can be search
against known metabolite spectra databases.

\bibliographystyle{plainnat}
\bibliography{biblio}

\end{document}