%\VignetteIndexEntry{SDAMS Vignette}
%\VignettePackage{SDAMS}
%\VignetteKeyword{Semi-parametric Differential Adundance Analysis}

\documentclass[12pt]{article}

\usepackage{float}
\usepackage{Sweave}
\usepackage{amsmath}
\usepackage{amssymb}


\RequirePackage{Bioconductor}
\AtBeginDocument{\bibliographystyle{unsrturl}}

\renewcommand{\baselinestretch}{1.3}


\SweaveOpts{keep.source=TRUE,eps=FALSE,include=TRUE,width=4,height=4}



\author{Yuntong Li$^{1}$, Chi Wang$^{2,3}$\footnote{to whom correspondence
should be addressed}, Li Chen$^{2,3}$\footnote{to whom correspondence
should be addressed}\\[1em]
\small{$^{1}$Department of Statistics , University of Kentucky,Lexington, KY;}\\
\small{$^{2}$Markey Cancer Center, University of Kentucky, Lexington, KY ;}\\
\small{$^{3}$Department of Biostatistics, University of Kentucky,
Lexington, KY;}\\
\small{\texttt{yuntong.li@uky.edu}}\\
\small{\texttt{chi.wang@uky.edu}}\\
\small{\texttt{lichenuky@uky.edu}}}



\title{\textsf{\textbf{The SDAMS package}}}

%\bibliographystyle{abbrv}

\begin{document}
\SweaveOpts{concordance=TRUE}
\maketitle

\begin{abstract}
This vignette introduces the use of the Bioconductor package
{\tt SDAMS}, which is designed for differential abundance analysis for
metabolomics and proteomics data from mass spectrometry. These data may contain
a large fraction of zero values and non-zero part may not be normally
distributed. {\tt SDAMS} considers a two-part semi-parametric model, a logistic
regression for the zero proportion and a semi-parametric log-linear model for
the non-zero values. A kernel-smoothed likelihood method is proposed to estimate
regression coefficients in the two-part model and a likelihood ratio test is
constructed for differential abundant analysis.

\end{abstract}


\newpage

\tableofcontents

\newpage


\section{Citation}
The package {\tt SDAMS} implements statistical methods from the following
publication. If you use {\tt SDAMS} in the published research, please cite: \\
Yuntong Li, Teresa W.M. Fan, Andrew N. Lane, Woo-Young Kang, Susanne M. Arnold,
Arnold J. Stromberg, Chi Wang and Li Chen: A Two-Part Semi-Parametric Model for
Metabolomics and Proteomics Data. (Manuscript)

\section{Quick Start}
This section show the most basic {\tt SDAMS} work flow for a differential
abundance analysis for metabolomics and proteomics data from mass spectrometry:
\begin{enumerate}
\item Create a {\tt SummarizedExperiment} object using function
      {\tt createSEFromMatrix} or {\tt createSEFromCSV}.
      In this section we use an example {\tt SummarizedExperiment} object
      directly, which is an object of {\tt SummarizedExperiment} class named
      {\tt exampleSumExp} contained in this package.
\item Perform a differential abundance analysis using {\tt SDA}.
\end{enumerate}


<<quick start , eval=FALSE >>=
library("SDAMS")
data("exampleSumExp")
results <- SDA(exampleSumExp)
@

Here, the {\tt SummarizedExperiment} class object {\tt exampleSumExp} contained
in the package is the proteomics dataset, which a matrix-like container for
proteomic features with experimental subject grouping information. There are
560 features for 202 experimental subjects with 49 prostate cancer subjects and
153 healthy subjects (0 for healthy control and 1 for patient in this case).
This is a 10\% subsample of the original dataset. The features are stored as a
matrix in the assay slot. Each row in this matrix represents a proteomic
feature and each column represents a subject. See Reference~
\cite{siwy2011human} for detailed information regarding this dataset.



\section{Data Input}


\subsection{Create SummarizedExperiment object from csv files}
The proteomics or metabolomics data is stored as a matrix with each
row being a feature and each column corresponding to a subject. All data in this
matrix are non-negative. Another information required is the phenotype
covariates. Here we focus on the binary grouping information, such as numeric 1
for control group and 0 for case group. But it can also be characters, such as
"healthy" and "disease". To utilize {\tt SDAMS} package, we should have two
separate csv files (for example 'feature.csv' and 'group.csv') as inputs for
{\tt createSEfromCSV} to creat a {\tt SummarizedExperiment} object.

Note:
\begin{enumerate}
\item The $1^{st}$ column in 'feature.csv' represents feature names and the
      $1^{st}$ row represents subject codes.
\item The $1^{st}$ column in 'group.csv' represents subject codes, for example,
      Subject1, Subject2....
\end{enumerate}


The format for "csv files" should look like as Figure~\ref{example feature}
and Figure~\ref{example group}:

\begin{figure}[h!]
  \centering
  \includegraphics{feature.png}
  \caption{Example of 'feature.csv' pattern}
  \label{example feature}
\end{figure}
\begin{figure}[ht]
  \centering
  \includegraphics[width=2cm]{group.png}
  \caption{Example of 'group.csv' pattern}
  \label{example group}
\end{figure}


After creating the two csv files, we need the paths for the two csv files:

<<directory, eval=FALSE>>=
path1 <- "/path/to/your/feature.csv/"
path2 <- "/path/to/your/group.csv/"
@

Here for demonstration purpose, we use the data stored in {\tt inst/extdata}
directory. This is the csv format of the data in exampleSumExp which is a
{\tt SummarizedExperiment} object we described before.

<<GetDirectory>>=
directory1 <- system.file("extdata", package = "SDAMS", mustWork = TRUE)
path1 <- file.path(directory1, "ProstateFeature.csv")
directory2 <- system.file("extdata", package = "SDAMS", mustWork = TRUE)
path2 <- file.path(directory2, "ProstateGroup.csv")
@

then use the function {\tt createSEFromCSV} after loading the {\tt SDAMS}
package.
<<CsvInput>>=
library("SDAMS")
exampleSE1 <- createSEFromCSV(path1, path2)
exampleSE1
@

The feature data and grouping information can be accessed using
{\tt SummarizedExperiment} commands:
<<Accessors>>=
head(assay(exampleSE1)[,1:10])
head(colData(exampleSE1)$grouping)
@


\subsection{Create SummarizedExperiment object from seperate matrix}
If the two datasets have been already claeaned and loaded into R as matrices,
then we can use {\tt createSEFromMatrix} to create a
{\tt SummarizedExperiment} object.

<<MatrixInput>>=
set.seed(100)
featureInfo <- matrix(runif(800, -2, 5), ncol = 40)
featureInfo[featureInfo<0] <- 0
rownames(featureInfo) <- paste("feature", 1:20, sep = '')
colnames(featureInfo) <- paste('subject', 1:40, sep = '')
groupInfo <- data.frame(grouping=matrix(sample(0:1, 40, replace = TRUE),
                        ncol = 1))
rownames(groupInfo) <- colnames(featureInfo)

exampleSE2 <- createSEFromMatrix(feature = featureInfo, group = groupInfo)
exampleSE2
head(assay(exampleSE2)[,1:10])
head(colData(exampleSE2)$grouping)

@

\section{Data Analysis}

Finally, we perform differential abundance analyais using
{\tt SummarizedExperiment} object created in the last section. This can be done
by using function {\tt SDA}. The theory behind {\tt SDA} can be reached at
section \ref{theory}. A list with point estimates, p-values, q-values
and corresponding feature names is returned. Below is the results generated by
using the {\tt SummarizedExperiment} object exampleSE1.

<<resultsForGamma>>=
results <- SDA(exampleSE1)
head(results$gamma)
head(results$pv_gamma)
head(results$qv_gamma)
@
In this example , there is only one group covariate applied to each subject.
Here $\textbf{X}_i$ is one dimension. The covariate effect on the fraction of
zero values for feature $g$ is $\gamma_g$, which is estimated to be 0.11 for
the first feature, and 0.86 for the second feature, etc. The
corresponding hypothesis is $H_0$: $\gamma_g=0$ vs. $H_1$: $\gamma_g \ne 0$. The
p-values calculated from likelihood ratio test are returned in {\tt pv\_gamma}.
Users can determine their own significance level to make inference, such as 0.05
nominal level. We also provide a FDR adjustment method
\cite{storey2003statistical} used in {\tt SDA} for multiple comparison issues.
Those results for $\gamma_g$ are stored in {\tt qv\_gamma}.

<<resultsForBeta>>=
head(results$beta)
head(results$pv_beta)
head(results$qv_beta)
@
The model parameter $\beta_g$ is the log fold change in the non-zero abundance
comparing different values of the single group covariate for feature $g$. The
corresponding two-sided hypothesis is $H_0$: $\beta_g=0$ vs. $H_1$:
$\beta_g \ne 0$. Again, {\tt SDA} will return p-values and adjusted
p-values(q-values) for parameter $\beta_g$, and they are stored in
{\tt pv\_beta} and {\tt qv\_beta} respectively.

<<resultsFor2part>>=
head(results$pv_2part)
head(results$qv_2part)
@

Hypothesis testing on overall effect of group covariate on the $g$th feature is
performed by assessing $\gamma_{g}$ and $\beta_{g}$. The null hypothesis $H_0:$
$\gamma_{g}=0$ and $\beta_{g}=0$ against alternative hypothesis $H_1:$ at least
one of the two parameters is non-zero. The p-values are calculated based on
chi-square distribution with 2 degrees of freedom. And the corresponding
q-values are calculated using the same procedure as in one-part model.

<<outputForFeatureName>>=
head(results$feat.names)
@
A vector of feature names is returned for convenience which corresponds to the
results in the other components.
\section{Theory for SDAMS}\label{theory}

As mentioned in the abstract, MS data is a mixture of zero intensity values
and possibly non-normally distributed non-zero intensity values. Therefore, the
differential abundance analysis needs to be performed to compare both the zero
proportion and the mean of non-zero values between groups. SDA is a two-part
model which addresses these issues that uses a logistic regression model
to characterize the zero proportion and a semiparametric model to characterize
non-zero values.
\subsection{A two-part semi-parametric model}

The differential abundance analysis in SDAMS has the following forms.
For binary part:
\[\mathrm{log}(\frac{\pi_{ig}}{1-\pi_{ig}})=\gamma_{0g}+\boldsymbol{\gamma}_g
\boldsymbol{X}_{i} , \]
For continuous non-zero part:
\[\mathrm{log}(Y_{ig})=\boldsymbol{\beta}_g \boldsymbol{X}_i +
\varepsilon_{ig}, \]

where $Y_{ig}$ is the random variable representing the abundance of feature $g$
in subject $i$, $\pi_{ig}=Pr(Y_{ig}=0)$ is the probability of point mass.
$\boldsymbol{X}_i=(X_{i1},X_{i2},...,X_{iQ})^T$ is a $Q$-vector covariates that
specifies the treatment conditions applied to subject $i$. The corresponding
$Q$-vector of model parameters $\boldsymbol{\gamma}_g=(\gamma_{1g},\gamma_{2g},
...,\gamma_{Qg})^T$ quantify the covariates effects on the fraction of zero
values for feature $g$ and $\gamma_{0g}$ is the intercept.
$\boldsymbol{\beta}_g=(\beta_{1g},\beta_{2g},...,\beta_{Qg})$ is a
$Q$-vector of model parameters quantifying the covariates effects on the
non-zero values for the feature, and $\varepsilon_{ig}$'s $(i=1,2,..n)$ are
independent error terms with a common but completely unspecified density
function $f_g$. Importantly, we do not impose any distributional assumption on
$f_g$. Without assuming a specific parametric distribution for
$\varepsilon_{ig}$, this model is much more flexible to characterize data with
unknown and possibly non-normal distribution.

\subsection{Identification of differentially abundant features}
We replace $f_g$ by its kernel density estimator in the likelihood function. The
maximum likelihood estimator is obtained through a trust region maximization
algorithm. The likelihood ratio test is performed on the null hypothesis $H_0:$
$\gamma_{qg}=0$ and $\beta_{qg}=0$ against alternative hypothesis $H_1:$ at
least one of the two parameters is non-zero. We also consider the hypotheses for
testing $\gamma_{qg}=0$ and $\beta_{qg}=0$ separately. To adjust for multiple
comparisons across features, the false discovery discovery rate (FDR) q-value is
calculated based on the {\tt qvalue} function in {\tt qvalue} package in
R/Bioconductor.

\section{Session Info}

<<sessionInfo, results=tex, print=TRUE, eval=TRUE>>=
toLatex(sessionInfo())
@


\bibliography{reference}


\end{document}