%\VignetteIndexEntry{tigre User Guide}
%\VignetteKeywords{TimeCourse, GeneExpression, Transcription}
%\VignetteDepends{Biobase, annotate, puma, BiocStyle}
%\VignettePackage{tigre}
\documentclass[a4paper]{article}
\usepackage{url}

<<style, eval=TRUE, echo=FALSE, results=tex>>=
BiocStyle::latex()
@

\title{tigre User Guide}
\author{Antti Honkela, Pei Gao,\\
  Jonatan Ropponen, Miika-Petteri Matikainen,\\
  Magnus Rattray, and Neil D. Lawrence}

\newcommand{\tigre}{\Biocpkg{tigre}}

\usepackage{Sweave}
\begin{document}
\maketitle


\section{Abstract}

The \tigre{} package implements our methodology of Gaussian process
differential equation models for analysis of gene expression time
series from single input motif networks.  The package can be used for
inferring unobserved transcription factor (TF) protein concentrations
from expression measurements of known target genes, or for ranking
candidate targets of a TF.

\section{Citing \tigre{}}

The \tigre{} package is based on a body of methodological research.
Citing \tigre{} in publications will usually involve citing one or
more of the methodology papers
\cite{Honkela2010PNAS,Gao2008,Lawrence2007} that the software is
based on as well as citing the software package itself
\cite{Honkela2011}.


\section{Introductory example analysis - Drosophila development}
\label{section:Introductory example}

In this section we introduce the main functions of the \Rpackage{puma}
package by repeating some of the analysis from the PNAS
paper~\cite{Honkela2010PNAS}\footnote{Note that the results reported
  in the paper were run using an earlier version of this package for
  MATLAB, so there can be minor differences.}.

\subsection{Installing the \tigre{} package}

The recommended way to install \tigre{} is to use the
\Rfunction{biocLite} function available from the bioconductor
website. Installing in this way should ensure that all appropriate
dependencies are met.

\begin{Schunk}
\begin{Sinput}
> source("http://www.bioconductor.org/biocLite.R")
> biocLite("tigre")
\end{Sinput}
\end{Schunk}

% To install the tigre software, unpack the software and run
% \begin{verbatim}
% R CMD INSTALL tigre
% \end{verbatim}

To load the package start R and run
\begin{Schunk}
\begin{Sinput}
> library(tigre)
\end{Sinput}
\end{Schunk}

\subsection{Loading the data}

To get started, you need some preprocessed time series expression
data.  If the data originates from Affymetrix arrays, we highly
recommend processing it with \Rfunction{mmgmos} from the
\Rpackage{puma} package.  This processing extracts error bars on the
expression measurements directly from the array data to allow judging
the reliability of individual measurements.  This information is
directly utilised by all the models in this package.

To start from scratch on Affymetrix data, the .CEL files from
\url{ftp://ftp.fruitfly.org/pub/embryo_tc_array_data/} may be
processed using:
\begin{Schunk}
\begin{Sinput}
> # Names of CEL files
> expfiles <- c(paste("embryo_tc_4_", 1:12, ".CEL", sep=""),
+               paste("embryo_tc_6_", 1:12, ".CEL", sep=""),
+               paste("embryo_tc_8_", 1:12, ".CEL", sep=""))
> # Load the CEL files
> expdata <- ReadAffy(filenames=expfiles,
+                     celfile.path="embryo_tc_array_data")
> # Setup experimental data (observation times)
> pData(expdata) <- data.frame("time.h" = rep(1:12, 3),
+                              row.names=rownames(pData(expdata)))
> # Run mmgMOS processing (requires several minutes to complete)
> drosophila_mmgmos_exprs <- mmgmos(expdata)
> drosophila_mmgmos_fragment <- drosophila_mmgmos_exprs
\end{Sinput}
\end{Schunk}

This data needs to be further processed to make it suitable for our
models.  This can be done using
\begin{Schunk}
\begin{Sinput}
> drosophila_gpsim_fragment <-
+   processData(drosophila_mmgmos_fragment,
+               experiments=rep(1:3, each=12))
\end{Sinput}
\end{Schunk}

Here the last argument specifies that we have three independent time
series of measurements.

In order to save time with the demos, a part of the result of this is
included in this package and can be loaded using
\begin{Schunk}
\begin{Sinput}
> data(drosophila_gpsim_fragment)
\end{Sinput}
\end{Schunk}

\subsection{Learning individual models}

Let us now recreate some the models shown in the plots of the PNAS
paper~\cite{Honkela2010PNAS}:
\begin{Schunk}
\begin{Sinput}
> # FBgn names of target genes
> targets <- c('FBgn0003486', 'FBgn0033188', 'FBgn0035257')
> # Load gene annotations
> library(annotate)
> aliasMapping <- getAnnMap("ALIAS2PROBE",
+                   annotation(drosophila_gpsim_fragment))
> # Get the probe identifier for TF 'twi'
> twi <- get('twi', env=aliasMapping)
> # Load alternative gene annotations
> fbgnMapping <- getAnnMap("FLYBASE2PROBE",
+                  annotation(drosophila_gpsim_fragment))
> # Get the probe identifiers for target genes
> targetProbes <- mget(targets, env=fbgnMapping)
> st_models <- list()
> # Learn single-target models for each gene individually
> for (i in seq(along=targetProbes)) {
+   st_models[[i]] <- GPLearn(drosophila_gpsim_fragment,
+                             TF=twi, targets=targetProbes[i],
+                             quiet=TRUE)
+ }
> # Learn a joint model for all targets
> mt_model <- GPLearn(drosophila_gpsim_fragment, TF=twi,
+                     targets=targetProbes,
+                     quiet=TRUE)
> # Display the joint model parameters
> show(mt_model)
\end{Sinput}
\begin{Soutput}
Gaussian process driving input single input motif model:
  Number of time points: 
  Driving TF: 143396_at
  Target genes (3):
    148227_at
    152715_at
    147995_at
  Basal transcription rate:
    Gene 1: 23.0628118698704
    Gene 2: 0.00777234767448941
    Gene 3: 9.61284386889363e-07
  Kernel:
    Multiple output block kernel:
    Block 1
    Normalised version of the kernel.
    RBF inverse width: 0.7713904 (length scale 1.138578)
    RBF variance: 1.756029
    Block 2
    Normalised version of the kernel
    DISIM decay: 0.07312269
    DISIM inverse width: 0.7713904 (length scale 1.138578)
    DISIM Variance: 1
    SIM decay: 1472.725
    SIM Variance: 0.002370256
    RBF Variance: 1.756029
    Block 3
    Normalised version of the kernel
    DISIM decay: 0.07312269
    DISIM inverse width: 0.7713904 (length scale 1.138578)
    DISIM Variance: 1
    SIM decay: 0.4974917
    SIM Variance: 0.03221548
    RBF Variance: 1.756029
    Block 4
    Normalised version of the kernel
    DISIM decay: 0.07312269
    DISIM inverse width: 0.7713904 (length scale 1.138578)
    DISIM Variance: 1
    SIM decay: 0.0001589233
    SIM Variance: 0.003267953
    RBF Variance: 1.756029
  Log-likelihood: -31.84627 
\end{Soutput}
\begin{Sinput}
> # Learn a model without TF mRNA and TF protein translation
> nt_model <- GPLearn(drosophila_gpsim_fragment,
+                     targets=c(twi, targetProbes[1:2]), quiet=TRUE)
\end{Sinput}
\end{Schunk}

\subsection{Visualising the models}

The models can be plotted using commands like
\begin{Schunk}
\begin{Sinput}
> GPPlot(st_models[[1]], nameMapping=getAnnMap("FLYBASE",
+                         annotation(drosophila_gpsim_fragment)))
> GPPlot(mt_model, nameMapping=getAnnMap("FLYBASE",
+                   annotation(drosophila_gpsim_fragment)))
> GPPlot(nt_model, nameMapping=getAnnMap("FLYBASE",
+                   annotation(drosophila_gpsim_fragment)))
\end{Sinput}
\end{Schunk}

\begin{figure}
  \begin{center}
\includegraphics{tigre-010}
\end{center}
\caption{Single target models for the gene FBgn0003486. The models for
  each repeated time series are shown in different columns.}
\end{figure}

\begin{figure}
  \begin{center}
\includegraphics{tigre-011}
\end{center}
\caption{Multiple-target model for all the example genes.  The
  call creates independent figures for each repeated time series.}
\end{figure}

\begin{figure}
  \begin{center}
\includegraphics{tigre-012}
\end{center}
\caption{Multiple-target model without TF protein translation for
  selected example genes without.  The
  call creates independent figures for each repeated time series.}
\end{figure}

\subsection{Ranking the targets}

Bulk ranking of candidate targets can be accomplished using
\begin{Schunk}
\begin{Sinput}
> ## Rank the targets, filtering weakly expressed genes with average
> ## expression z-score below 1.8
> scores <- GPRankTargets(drosophila_gpsim_fragment, TF=twi,
+                         testTargets=targetProbes,
+                         options=list(quiet=TRUE),
+                         filterLimit=1.8)
> ## Sort the returned list according to log-likelihood
> scores <- sort(scores, decreasing=TRUE)
> write.scores(scores)
\end{Sinput}
\begin{Soutput}
"log-likelihood" "null_log-likelihood"
"147995_at" 6.75316505600457 -487.893231050121
"148227_at" -1.48337994116984 -73.4806804255218
"152715_at" -1.51379773568251 -539.73619673943
\end{Soutput}
\end{Schunk}

To save space, \Rfunction{GPRankTargets} does not return the models by
default.  If those are needed later e.g. for plotting, they can be
recreated using the inferred parameters saved together with the
ranking using
\begin{Schunk}
\begin{Sinput}
> topmodel <- generateModels(drosophila_gpsim_fragment,
+                            scores[1])
> show(topmodel)
\end{Sinput}
\begin{Soutput}
[[1]]
Gaussian process driving input single input motif model:
  Number of time points: 
  Driving TF: 143396_at
  Target genes (1):
    147995_at
  Basal transcription rate:
    Gene 1: 0.000141988952743346
  Kernel:
    Multiple output block kernel:
    Block 1
    Normalised version of the kernel.
    RBF inverse width: 0.7604155 (length scale 1.146765)
    RBF variance: 1.808011
    Block 2
    Normalised version of the kernel
    DISIM decay: 0.02024595
    DISIM inverse width: 0.7604155 (length scale 1.146765)
    DISIM Variance: 1
    SIM decay: 0.02016988
    SIM Variance: 0.002793356
    RBF Variance: 1.808011
  Log-likelihood: 6.753165 
\end{Soutput}
\end{Schunk}

\subsection{Ranking using known targets with multiple-target models}

Ranking using known targets with multiple-target models can be
accomplished simply by adding the \texttt{knownTargets} argument
\begin{Schunk}
\begin{Sinput}
> ## Rank the targets, filtering weakly expressed genes with average
> ## expression z-score below 1.8
> scores <- GPRankTargets(drosophila_gpsim_fragment, TF=twi,
+                         knownTargets=targetProbes[1],
+                         testTargets=targetProbes[2:3],
+                         options=list(quiet=TRUE),
+                         filterLimit=1.8)
> ## Sort the returned list according to log-likelihood
> scores <- sort(scores, decreasing=TRUE)
> write.scores(scores)
\end{Sinput}
\begin{Soutput}
"log-likelihood" "null_log-likelihood"
"152715_at" -28.0630300387158 -539.73619673943
"147995_at" -240.364785399555 -487.893231050121
\end{Soutput}
\end{Schunk}

\subsection{Running ranking in a batch environment}

\Rfunction{GPRankTargets} can be easily run in a batch environment
using the argument \texttt{scoreSaveFile}.  This indicates a file to
which scores are saved after processing each gene.  Thus one could,
for example, split the data to, say, 3 separate blocks according to
the reminder after division by 3 and run each of these independently.
The first for loop could then be run in parallel (e.g. as separate
jobs on a cluster), as each step is independent of the others.  After
these have all completed, the latter loop could be used to gather the
results.
\begin{Schunk}
\begin{Sinput}
> for (i in seq(1, 3)) {
+   targetIndices <- seq(i,
+     length(featureNames(drosophila_gpsim_fragment)), by=3)
+   outfile <- paste('ranking_results_', i, '.Rdata', sep='')
+   scores <- GPrankTargets(preprocData, TF=twi,
+                           testTargets=targetIndices,
+                           scoreSaveFile=outfile)
+ }
> for (i in seq(1, 3)) {
+   outfile <- paste('ranking_results_', i, '.Rdata', sep='')
+   load(outfile)
+   if (i==1)
+     scores <- scoreList
+   else
+     scores <- c(scores, scoreList)
+ }
> show(scores)
\end{Sinput}
\end{Schunk}

\section{Experimental feature: Using non-Affymetrix data}

Using non-Affymetrix data, or data without associated uncertainty
information for the expression data in general, requires more because
of two reasons
\begin{itemize}
\item noise variances need to be estimated together with other model
  parameters; and
\item weakly expressed genes cannot be easily filtered \emph{a
    priori}.
\end{itemize}

The first of these is automatically taken care of by all the above
functions, but the latter requires some extra steps after fitting the
models.

In order to get started, you need to create an
\Rfunction{ExpressionTimeSeries} object of your data set.  This can be
accomplished with the function
\begin{Schunk}
\begin{Sinput}
> procData <- processRawData(data, times=c(...),
+                            experiments=c(...))
\end{Sinput}
\end{Schunk}

Filtering of weakly expressed genes requires more care and visualising
the fitted models is highly recommended to avoid mistakes.

Based on initial experiments, it seems possible to perform the
filtering based on the statistic \texttt{loglikelihoods(scores) -
  baseloglikelihoods(scores)}, but selection of suitable threshold is
highly dataset specific.

\section{Exporting results to an SQLite database}

The results of the analysis can be stored to an SQLite database. The
database can then be browsed and queried using the
\href{http://users.ics.tkk.fi/ahonkela/tigre/}{tigreBrowser}
result browser. The data
is inserted to the database by using \Rfunction{export.scores} function.

An example of the usage of \Rfunction{export.scores} is given below
\begin{Schunk}
\begin{Sinput}
> export.scores(scores, datasetName='Drosophila',
+               experimentSet='GPSIM/GPDISIM',
+               database='database.sqlite',
+               preprocData=drosophila_gpsim_fragment,
+               models=models,
+               aliasTypes=c('SYMBOL', 'GENENAME', 'FLYBASE', 'ENTREZID'))
\end{Sinput}
\end{Schunk}

In this example, \texttt{scores} is the return value of
\Rfunction{GPRankTargets}, \texttt{'Drosophila'} is the name of a dataset in
database and \texttt{'GPSIM/GPDISIM'} is the name of an experiment set in
database. In general, results with the same dataset name are considered to be
part of same dataset. That is, if no results with a given dataset are already
in the database, a new dataset entry is created.  On the other hand, if the
database already contains results with the same dataset name, new results will
be added to the old dataset.

Also, results from different experiments can be combined into a set of
experiments by giving them the same experiment set name. This is useful as a
result browser may display results depending on the experiment set.

\texttt{database.sqlite} is the filename of a database file. The file will
be created if it does not exist already.

The function will create model figures and add them to the database if
preprocessed data is given. In this example, models are given to the
function as a parameter. This is not necessary, however, as the function can
create models if preprocessed data is supplied. Nevertheless, the function will
finish faster if it does not have to (re-)create models.

In addition to log likelihoods and z-scores, this function will also export
different gene names and aliases to the database. By default, the function will
read GENENAME, SYMBOL and ENTREZID datas from relevant annotations and insert
those into the database. The function takes also \texttt{aliasTypes} argument
which is used to define which annotation information is inserted. In the
example above, FLYBASE gene numbers are also added to the genes in the
database. The insertion of alias annotations and z-scores requires that the
preprocessed data is supplied.

\section{Session Info}

\begin{Schunk}
\begin{Sinput}
> sessionInfo()
\end{Sinput}
\begin{Soutput}
R version 3.1.1 (2014-07-10)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=fi_FI.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8    
 [5] LC_MONETARY=fi_FI.UTF-8    LC_MESSAGES=en_GB.UTF-8   
 [7] LC_PAPER=fi_FI.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices
[6] utils     datasets  methods   base     

other attached packages:
 [1] drosgenome1.db_3.0.0  org.Dm.eg.db_3.0.0   
 [3] RSQLite_0.11.4        DBI_0.3.1            
 [5] annotate_1.43.5       XML_3.98-1.1         
 [7] AnnotationDbi_1.27.16 GenomeInfoDb_1.1.23  
 [9] IRanges_1.99.28       S4Vectors_0.2.4      
[11] tigre_1.19.1          Biobase_2.25.0       
[13] BiocGenerics_0.11.5  

loaded via a namespace (and not attached):
[1] BiocStyle_1.3.14   bitops_1.0-6       caTools_1.17.1    
[4] gdata_2.13.3       gplots_2.14.2      gtools_3.4.1      
[7] KernSmooth_2.23-13 tools_3.1.1        xtable_1.7-4      
\end{Soutput}
\end{Schunk}

\bibliography{gpsim}

\end{document}