%\VignetteIndexEntry{pint}
%The above line is needed to remove a warning in R CMD check
\documentclass[a4paper]{article}

\title{pint:\\Pairwise integration of functional genomics data}

\author{Olli-Pekka Huovilainen\footnote{ohuovila@gmail.com} and Leo Lahti\\
  Department of Information and Computer Science,\\Aalto University
  School of Science and Technology, Finland}

\usepackage{Sweave}
\usepackage{float}

\usepackage{amsmath,amssymb,amsfonts}
\usepackage[authoryear,round]{natbib}

\usepackage{hyperref}
\newcommand{\Rfunction}[1]{{\texttt{#1}}}
\newcommand{\Robject}[1]{{\texttt{#1}}}
\newcommand{\Rpackage}[1]{{\textit{#1}}}

\begin{document}

\maketitle

%\VignetteIndexEntry{Finding chromosome areas with strong dependencies between data sets}

\section{Introduction}

%* motivation

Multiple types of genomic observations from the same patients are
increasingly available in biomedical studies, including measurements
of gene- and miRNA expression levels, gene copy number, and
methylation status.  By investigating the dependencies between these
data sets it is possible to discover functional mechanisms and
interactions that are not seen in the individual data sets. For
example, integration of gene expression and copy number has been shown
to reveal cancer-associated chromosomal regions and associated genes
with potential diagnostic, prognostic and clinical impact
\cite{Lahti09mlsp}. 

We demonstrate how to integrate gene or micro-RNA expression with DNA
copy number (aCGH) measurements to discover functionally active
chromosomal aberrations. The models capture the shared signal in
paired observations, and indicate the affected genes and patients.
The methods are potentially applicable also to other types of
biomedical data, including methylation, SNPs, alternative splicing and
transcription factor binding, or in other application fields.

The package provides general-purpose tools for the discovery and
analysis of statistical dependencies between co-occurring data
sources. The methods are based on a principled framework,
probabilistic canonical correlation analysis \cite{Bach05} and its
extensions \cite{Archambeau06,Klami08,Lahti09mlsp}. Probabilistic
formulation deals rigorously with uncertainty associated with small
sample sizes common in biomedical studies, and the package also
provides tools to guide dependency modeling through Bayesian priors
\cite{Lahti09mlsp}.

%* references

% \bibliographystyle{IEEEbib}
% \bibliography{../../doc/mi,../../doc/gene/gene}

\section{Examples}

This Section shows how to apply the methods for dependency detection
in functional genomics studies. The general dependency modeling
framework and tools are described in Section ~\ref{sec:framework}.

\subsection{Example data}

Use of the package is demonstrated with an example data set containing
paired observations of gene expression and copy number from a set of
gastric cancer patients \cite{Myllykangas08}.

Load the package and example data:

<<>>=
require(pint)
data(chromosome17)
@

Each example data set ($geneExp$ and $geneCopyNum$) consists of a list
with two items: $data$ and $info$. The probes in gene expression and
gene copy number are assumed to be paired. $data$ is a data matrix
with gene expression or gene copy number data. Genes are in rows and
samples in columns and rows and columns should be named. $info$ is a
data frame with additional information about genes. It has three
elements: $loc$, $chr$ and $arm$. $loc$ indicates the genomic location
of the probes in base pairs (numeric); $chr$ and $arm$ are factors 
indicating the chromosome and chromosomal arm of the probe.

\subsubsection{Modeling assumptions}

Note that the currently implemented dependency models assume
approximately Gaussian distributed observations. With microarray data
sets, this is typically obtained by presenting the data in the
\(log_2\) domain; this is the default in many microarray preprocessing
methods.

\subsection{Discovering functionally active copy number changes}

Chromosomal regions with frequent copy number alterations and
associated gene expression changes are potential candidates for cancer
genes. Such regions are assumed to have high dependency between gene
copy number and expression levels. To detect these regions, we measure
the dependency between expression and copy number for each region and
pick the regions showing the highest dependency. In practice, a
sliding window over the genome is used.


The following example screens chromosome arm 17q for dependendent
regions.

<<>>=
models <- screen.cgh.mrna(geneExp, geneCopyNum, windowSize = 10, chr = 17, arm = 'q')
@

The dependency is measured separately for each gene within a
chromosomal region ('window') around the gene. A fixed dimensionality
(window size) is necessary to ensure comparability of the dependency
scores between windows. The scale of the chromosomal regions can be
tuned by changing the window size ('windowSize'). The default
dependency modeling method is a constrained version of probabilistic
CCA; \cite{Lahti09mlsp}. See help(screen.cgh.mrna) for further
options.

\subsection{Visualizing the results}

A dependency plot reveals chromosomal regions with the strongest
dependency between gene expression and copy number:

\begin{figure}[H]
\begin{center}
<<fig = true>>=
plot(models, showTop = 10)
@
\end{center}
\caption{The dependency plot reveals chromosomal regions with the strongest
  dependency between gene expression and copy number.}
\label{fig:pSimCCA}
\end{figure}

Here the highest dependency is between 30-40Mbp which is a known
gastric cancer-associated region. The top-5 genes with the highest
dependency in their chromosomal neighborghood can be retrieved with:

<<>>=
topGenes(models, 5)
@

It is possible investigate the contribution of individual patients or
probes on the overall dependency (Fig. ~\ref{fig:modelplot}). 

This is based on the model parameters \(W\) and the latent variable
\(z\) that are easily retrieved from the learned dependency model (see
Section~\ref{sec:framework} for description of the model
parameters). In 1-dimensional case the interpretation is
straightforward: \(z\) indicates the strength of shared signal in each
sample (patient), and \(W\) describes how this signal is captured by
each gene expression or copy number probe. With multi-dimensional
\(W\) and \(z\), the variable- and sample effects are approximated
(for visualization purposes) by the loadings and projection scores
corresponding of the first principal component of \(Wz\) which
describes the shared signal in each data set.

\subsection{Additional parameters}

The dimensionality of the shared latent variable \(Z\) and the
dependency modeling method can also be set by the user. For example,
use probabilistic CCA with 1-dimensional \(Z\):

<<eval = FALSE>>=
model17qpCCA <- screen.cgh.mrna(geneExp, geneCopyNum, windowSize = 10, chr = 17, arm = 'q', method = "pCCA", params = list("zDimension = 1")) 
@


\begin{figure}[H]
\begin{center}
<<fig=true>>=
model <- topModels(models, 1)[[1]]
plot(model, geneExp, geneCopyNum)
@
\end{center}
\caption{Samples and variable contribution to the dependencies around
  the gene with the highest dependency score between gene expression
  and copy number measurements in the chromosomal region. The
  visualization highlights the affected patients and genes.}
\label{fig:modelplot}
\end{figure}


\section{Dependency modeling framework}
\label{sec:framework}

\begin{figure}[htb]
\centering \includegraphics[height=3.5cm, keepaspectratio=TRUE]{cca2}
\caption{Graphical description of the shared latent variable model showing
generation of data sets $X$ and $Y$ from latent shared variable $\mathbf{z}$ 
through $W_x$ and $W_y$}
\label{modelpic}
\end{figure}

Modeling of dependencies is based on the probabilistic canonical
correlation analysis framework \cite{Bach05} and its extensions
\cite{Archambeau06,Lahti09mlsp}. This is a latent variable model that
assumes that the two data sets, \(X\) and \(Y\) can be decomposed in
{\it shared} and {\it data set-specific} components
(Figure~\ref{modelpic}). Our task is to discover these components,
given modeling assumptions.

The shared signal is modeled with a shared latent variable
$\mathbf{z}$. Intuitively, this measures the strength of the shared
signal in each patient. While the variation is shared, it can have
different manifestation in each data set. This is described by
\(W_xz\) and \(W_yz\) where \(W_x\), \(W_y\) indicate how the shared
signal is observed in the individual data sets. Assuming a Gaussian
model for the shared latent variable and data set-specific effects,
this leads to the following model:

\begin{equation}\label{eq:model}             
X \sim \mathcal{N}(W_x\mathbf{z}, \Psi_x)   
\end{equation}
\begin{equation}\label{eq:model2}             
Y \sim \mathcal{N}(W_y\mathbf{z}, \Psi_y)                                      
\end{equation}

The latent variable \(\mathbf{z}\) is assumed to follow standard
multivariate normal distribution, i.e. \(\mathbf{z} ~ N(0, I)\). The
data set-specific effects are described by the covariance matrices
$\Psi_x$ and $\Psi_y$. The model parameters are estimated with an
expectation-maximization (EM) algorithm (see
'fit.dependency.model'). After fitting the model parameters \(W\),
\(\Psi\), a maximum-likelihood estimate of \(\mathbf{z}\) can be
calculated (see 'z.expectation').

\subsection{Special cases}

Particular models are obtained as special cases of the above modeling
framework. This leads to a set of alternative models for dependency
detection, including

\begin{itemize}
  \item probabilistic PCA (pPCA)
  \item probabilistic factor analysis (pFA)  
  \item probabilistic CCA (pCCA)
  \item similarity-constrained probabilistic CCA (pSimCCA)
\end{itemize}

These correspond to different assumptions regarding the structure of
the data set-specific effects and types of dependency. While PCA and
factor analysis are typically used for analysing individual data sets,
they are special cases of the described framework and can therefore
also be used to model dependencies between data sets. For discussion
of the differences between these models, see \cite{Bach05,
  Lahti09mlsp}.

\subsubsection{Probabilistic PCA}

Probabilistic PCA (pPCA) assumes an isotropic model for the data
set-specific effects, with identical covariance matrices:

\begin{equation}\label{eq:pPCA}
\Psi_x = \Psi_y = \sigma I.
\end{equation}

This model is called pPCA since it is identical to concatenating
\(X\), \(Y\), and fitting ordinary probabilistic PCA on the
concatenated data set.

\subsubsection{Probabilistic factor analysis}

Probabilistic factor analysis (pFA) assumes a diagonal model for
\(\Psi_x\), \(\Psi_y\). Note that in general, \(\Psi_x \neq
\Psi_y\). The package also implements a special case with isotropic
but not necessarily identical (as in pPCA) covariance matrices.

This model is called pFA since it is identical to concatenating \(X\),
\(Y\), and fitting ordinary probabilistic factor analysis on the
concatenated data set.

\subsubsection{Probabilistic CCA}

Probabilistic CCA (pCCA) assumes full covariance matrices \(\Psi_x\),
\(\Psi_y\), giving the most detailed model for the data set specific
effects in the described modeling framework. The connection of this
latent variable model and the traditional canonical correlation
analysis has been established in \cite{Bach05}.

\subsubsection{Probabilistic SimCCA}

We also provide toos to guide dependency modeling through Bayesian
priors \cite{Lahti09mlsp}. Similarity-constrained probabilistic CCA
(pSimCCA) imposes a prior on the relation between $W_x$ and
$W_y$. This can be used to guide modeling to focus on certain types of
dependencies, and to avoid overfitting.

The relationship between $W_x$ and $W_y$ is described with \(W_y =
TW_x\). A prior on \(T\) can be used to focus the modeling on certain
types of dependencies. We use matrix normal prior distribution:

\begin{equation}\label{T}
P(T) = N_m(H, \sigma^2_TI, \sigma^2_TI)
\end{equation}

By default, $H = I$ and \(\sigma^2_T = 0\), which results in identical
manifestation of the shared signal in the two data sets: $W_y =
W_x$. This model is denoted pSimCCA in the package. However, the prior
can be loosened by tuning $sigma^2_T$. With $sigma^2_T \rightarrow
\infty$, estimation of $W_x$ and $W_y$ become independent, which leads
to ordinary probabilistic CCA. It is also possible to tune the mean
matrix $H$. This would set a particular relationship between the
manifestations of the shared component in each data set, and
$sigma^2_T$ is again be used to tune the strength of such prior.

\subsection{Measuring the dependency}

Dependency between the data sets \(X\), \(Y\) is measured by the ratio
of shared vs. data set-specific signal (see '?dependency.score'):

\begin{equation}\label{depscore}
  \frac{Tr(WW^T)}{Tr(\Psi)}
\end{equation}

\subsection{Functions for dependency modeling}

The package implements the dependency modeling framework (see
'fit.dependency.model'), and provides wrappers for the special cases
of the model.


\subsection*{Acknowledgements}

We would like to thank prof. Sakari Knuutila (University of Helsinki)
for providing the example data set for the package.

\begin{thebibliography}{1}

\bibitem{Archambeau06}
C{\'e}dric Archambeau, Nicolas Delannay, and Michel Verleysen.
\newblock Robust probabilistic projections.
\newblock In W.W. Cohen and A.~Moore, editors, {\em Proceedings of the 23rd
  International conference on machine learning}, pages 33--40. ACM, 2006.

\bibitem{Bach05}
Francis~R. Bach and Michael~I. Jordan.
\newblock A probabilistic interpretation of canonical correlation analysis.
\newblock Technical Report 688, Department of Statistics, University of
  California, Berkeley, 2005.

\bibitem{Klami08}
Arto Klami and Samuel Kaski.
\newblock Probabilistic approach to detecting dependencies between data sets.
\newblock {\em Neurocomputing}, 72(1-3):39--46, 2008.

\bibitem{Lahti09mlsp}
Leo Lahti, Samuel Myllykangas, Sakari Knuutila, and Samuel Kaski.
\newblock Dependency detection with similarity constraints.
\newblock In {\em Proc. MLSP'09 IEEE International Workshop on Machine Learning
  for Signal Processing}, 2009.

\bibitem{Myllykangas08}
Samuel Myllykangas, Siina Junnila, Arto Kokkola, Reija Autio, Ilari Scheinin, Tuula Kiviluoto, Marja-Liisa Karjalainen-Lindsberg, Jaakko Hollm\'{e}n, Sakari Knuutila, Pauli Puolakkainen, Outi Monni
\newblock Integrated gene copy number and expression microarray analysis of gastric cancer highlights potential target genes
\newblock {\em International Journal of Cancer}, 123(4):817-25, 2008.


\end{thebibliography}


\end{document}