%\VignetteIndexEntry{mBPCR}
%\VignetteDepends{}
%\VignetteKeywords{Bayesian Piecewise Constant Regression for DNA copy number estimation}
%\VignettePackage{mBPCR}

\documentclass[11pt]{article}


\usepackage{amsmath}

\SweaveOpts{echo=FALSE}

\begin{document}

\setkeys{Gin}{width=0.99\textwidth}

\title{\bf mBPCR: A package for DNA copy number profile estimation}

\author{P. M. V. Rancoita$^{1,2,3}$ and M. Hutter$^{4}$}
\date{}
\maketitle


\noindent
$^1$Istituto Dalle Molle di Studi sull'Intelligenza Artificiale (IDSIA), Manno-Lugano, Switzerland\\
$^2$Laboratory of Experimental Oncology, Oncology Institute of Southern Switzerland (IOSI), Bellinzona, Switzerland\\
$^3$Dipartimento di Matematica, Universit\`{a} degli Studi di Milano, Milano, Italy\\
 $^4$RSISE$\,@\,$ ANU and SML$\,@\,$NICTA, Canberra,  ACT, 0200, Australia


\begin{center}

{\tt paola@idsia.ch}
\end{center}


\tableofcontents

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Introduction}

The algorithm mBPCR is a tool for estimating the  profile of the
log$_2$ratio of copy number data. The procedure is  a Bayesian
piecewise constant regression and can be applied, generally, to
estimate  any piecewise constant function (like the log$_2$ratio of
the copy number data). The method is described in
\cite{Rancoita:08a} and represents a significant improvement of the
original algorithm BPCR, presented in \cite{Hutter:07pcregx} and
\cite{Hutter:07pcreg}.

This document shows several examples of how to use  the package. The
data used are principally: the Affymetrix GeneChip Mapping 10K Array data of
cell line REC-1 \cite{Rinaldi:06a} and the Affymetrix GeneChip
Mapping 250K Array data of chromosome 11 of cell line JEKO-1
(unpublished). 



\section{Example 1: profile estimation}

In this example we estimate the copy number profile of sample {\tt rec10k}.


<<echo=TRUE,print=FALSE>>=
library(mBPCR)
@ 

\noindent
First, we import the 10K Array data of cell line REC-1. 

<<echo=TRUE,print=FALSE>>=
data(rec10k)
@ 

\noindent
During the computation, the algorithm needs to create a vector of size ({\tt maxProbeNumber}+1)({\tt maxProbeNumber}+2)/2,
where {\tt maxProbeNumber} is the maximum number of probes of a chromosome (or arm of a chromosome, for denser Array).
Hence, before the estimation, we must verify if  we have enough RAM to allocate such a vector.
In case of the 10K Array data, we know that all chromosomes have less than 1000 probes, thus we verify
if we can set {\tt maxProbeNumber=1000}, with the following commands,

<<echo=TRUE,print=FALSE>>=
maxProbeNumber <- 1000
@ 

<<echo=TRUE,print=FALSE>>=
A <- array(1, dim=(maxProbeNumber+1)*(maxProbeNumber+2)/2)
@ 

\noindent
If last command does not give any error regarding the memory allocation, then we can set {\tt maxProbeNumber=1000} and remove {\tt A} to save space.


<<echo=TRUE,print=FALSE>>=
remove(A)
@ 

\noindent
To estimate the profile of one or more chromosomes, we need to set the parameter {\tt chrToBeAnalyzed} with the vector
of the names of the chromosomes that we want to analyze (the names allowed are: {\tt X}, {\tt Y} and any integer from {\tt 1} to {\tt 22}). In the following example, we estimate the profile of chromosomes 3 and 5 of sample REC-1.
Instead, to estimate the profile of the whole genome, we need to set  {\tt chrToBeAnalyzed = c(1:22,"X")}.


<<echo=TRUE,print=FALSE>>=
results <- estProfileWithMBPCR(rec10k$SNPname, rec10k$Chromosome, rec10k$PhysicalPosition, rec10k$log2ratio, chrToBeAnalyzed=c(3,5), maxProbeNumber=1000)
@ 

\noindent

We can nicely write the results on tab delimited files in the working directory, by using the function {\tt writeEstProfile}
 (Tables \ref{tab_mBPCR} and \ref{tab_bounds} show the first lines of the two tables created by the command below). 
Setting {\tt sampleName="rec10k"}, the name of the files will contain the 
name of the sample {\tt rec10k}. If {\tt path=NULL}, the tables will not be written on files, but only returned by the function.

<<echo=TRUE,print=FALSE>>=
writeEstProfile(path='', sampleName='rec10k',rec10k$SNPname, rec10k$Chromosome, rec10k$PhysicalPosition, rec10k$log2ratio, chrToBeWritten=c(3,5), results$estPC, results$estBoundaries)
@ 

<<echo=FALSE,print=FALSE>>=
library(xtable)
temp <- writeEstProfile(path=NULL, sampleName='rec10k',rec10k$SNPname,rec10k$Chromosome,  rec10k$PhysicalPosition, rec10k$log2ratio,chrToBeWritten=c(3,5), results$estPC, results$estBoundaries) 
@

<<genTable1,echo=FALSE,results=tex>>=
xx <- xtable(head(temp$mBPCRestimate))
label(xx) <- "tab_mBPCR"
caption(xx) <- "Example of table containing the profile estimated with mBPCR."
align(xx) <- c("|c|","|c|","c|","c|","c|","c|")
print(xx,latex.environments=c("center","small"),include.rownames=FALSE)
@

\tabcolsep=2pt

<<genTable2,echo=FALSE,results=tex>>=
xx <- xtable(head(temp$mBPCRbreakpoints))
label(xx) <- "tab_bounds"
caption(xx) <- "Example of table containing a summary of the breakpoints estimated with mBPCR."
align(xx) <- c("|c|","|c|","c|","c|","c|","c|","c|","c|")
print(xx,latex.environments=c("center","small"),include.rownames=FALSE)
@




\noindent
We can also estimate the profile with a Bayesian regression curve  \cite{Rancoita:08a}. For example, with the following command
we estimate the profile of chromosome~3 using both mBPCR and the Bayesian Regression Curve with $\hat{K}_2$.

<<echo=TRUE,print=FALSE>>=
results <- estProfileWithMBPCR(rec10k$SNPname, rec10k$Chromosome, rec10k$PhysicalPosition, rec10k$log2ratio, chrToBeAnalyzed=3, regr='BRC', maxProbeNumber=1000)
@ 

\noindent
After the estimation, we can plot the profiles using the function {\tt plotEstProfile}. For example, the
following command plots the  profile of chromosome~3 estimated with both methods.

<<echo=TRUE,fig=TRUE, height=6, width=6>>=
plotEstProfile(sampleName='rec10k', rec10k$Chromosome, rec10k$PhysicalPosition, rec10k$log2ratio, chrToBePlotted=3, results$estPC, maxProbeNumber=2000, regrCurve=results$regrCurve, regr='BRC')
@




As second example, we estimate the profile of chromosome 11 of sample
JEKO-1. Notice that we need to set {\tt maxProbeNumber <- 9000}
 (because both arms of chromosome 11 contain less than 9000 probes) and, if this is
possible on your machine,  the computation can be long. Moreover, for
the estimation, we use the estimates of the parameters computed on
the whole genome to achieve a better profile (for the estimation of the global parameters, see the use of function
{\tt estGlobParam} in Section~3).

<<echo=TRUE,print=FALSE,eval=FALSE>>=
data(jekoChr11Array250Knsp)
@

<<echo=TRUE,print=FALSE,eval=FALSE>>=
maxProbeNumber <- 9000
@ 

<<echo=TRUE,print=FALSE,eval=FALSE>>=
A <- array(1, dim=(maxProbeNumber+1)*(maxProbeNumber+2)/2)
@ 


<<echo=TRUE,print=FALSE,eval=FALSE>>=
remove(A)
@ 

<<echo=TRUE,print=FALSE,eval=FALSE>>=
results <- estProfileWithMBPCR(jekoChr11Array250Knsp$SNPname, jekoChr11Array250Knsp$Chromosome, jekoChr11Array250Knsp$PhysicalPosition, jekoChr11Array250Knsp$log2ratio, chrToBeAnalyzed=11, maxProbeNumber=9000, rhoSquare=0.0479, nu=-3.012772e-10, sigmaSquare=0.0699)
@

<<echo=TRUE,eval=FALSE>>=
plotEstProfile(sampleName='jeko250Knsp', jekoChr11Array250Knsp$Chromosome, jekoChr11Array250Knsp$PhysicalPosition, jekoChr11Array250Knsp$log2ratio, chrToBePlotted=11, results$estPC, maxProbeNumber=9000)
@


\section{Example 2: use of function {\tt estGlobParam}}

In general, even if we are not interested in the analysis of the whole genome, the global parameters
should be estimated on the entire sample, using the function {\tt estGlobParam}. Here, we estimate the
global parameters of sample REC-1 (in the following, the variance of the segment $\rho^2$ is estimated with $\hat{\rho}^2_1$).

<<echo=TRUE,print=FALSE>>=
data(rec10k)
@

<<echo=TRUE,print=FALSE>>=
estGlobParam(rec10k$log2ratio)
@

\section{Example 3: use of function {\tt computeMBPCR}}

If we are interested in estimating only a part of a chromosome or a
simulated sample, we should not use the function {\tt
estProfileWithMBPCR}, but use  the function {\tt computeMBPCR} which
estimates the profile directly. In the following example, we
estimates the profile of a part of chromosome 11 of sample JEKO-1.



<<echo=TRUE,print=FALSE>>=
data(jekoChr11Array250Knsp)
@ 
 
 \noindent
We select a part of chromosome 11.
 
<<echo=TRUE,print=FALSE>>=
y <- jekoChr11Array250Knsp$log2ratio[10600:11200]
@
 
<<echo=TRUE,print=FALSE>>=
p <- jekoChr11Array250Knsp$PhysicalPosition[10600:11200]
@

\noindent
We estimate the profile with mBPCR and BRC with $\hat{K}_2$, using the global parameters estimated on the whole genome.

 
<<echo=TRUE,print=FALSE>>=
results <- computeMBPCR(y, nu=-3.012772e-10, rhoSquare=0.0479, sigmaSquare=0.0699, regr='BRC')
@

\noindent
Finally, we plot the results.

<<echo=TRUE,fig=TRUE, height=6, width=6>>=
plot(p,y)
points(p, results$estPC, type='l', col='red')
points(p, results$regrCurve, type='l', col='green')
legend(x='bottomleft', legend=c('mBPCR', 'BRC with K_2'), lty=c(1, 1), col=c(4, 2))
@
 



\section{Example 4: {\tt importCNData}, an easy function to import  data}

There is also the possibility to easily import external data, by using
{\tt importCNData}. The data should be in a tab delimited file and the data table should  have at least four columns representing, respectively, the probe names, 
the chromosome to which each probe belongs, the physical positions of the probes inside the chromosome and the copy number data (an example of table
can be found in Table \ref{tab_example}). The allowed names of the chromosomes are: {\tt X}, {\tt Y} and any integer from {\tt 1} to {\tt 22}). 
 In the following example, we import  data of
sample REC-1.

\begin{table}
  \centering
  \begin{small}
  \begin{tabular}{|c|c|c|c|}
    \hline
    % after \\: \hline or \cline{col1-col2} \cline{col3-col4} ...
    SNPname & Chromosome & PhysicalPosition  &  log2ratio\\
    \hline
SNP\_A-1509443  & 1  & 2882121 & -0.184424571\\
 SNP\_A-1518557 &  1&
3985402& 0.097610797 \\
SNP\_A-1517286  & 1   &4804829& 0.443606651 \\
SNP\_A-1516024& 1 &4982250 &-1.089267338\\
 SNP\_A-1514538  & 1&
5468765 &-0.862496476\\ SNP\_A-1516403& 1  & 5596686 &1.097610797\\
$\vdots$&$\vdots$&$\vdots$&$\vdots$\\
    \hline
  \end{tabular}
  \end{small}
  \caption{Example of data table.}\label{tab_example}
\end{table}


As first step, we need to set a variable with the path of the file
containing the data. To import our data, we set {\tt path} as the path of REC-1 data in the folder of package {\tt mBPCR},

<<echo=TRUE,print=FALSE>>=
path <- system.file("extdata", "rec10k.txt", package = "mBPCR")
@



\noindent
Then, we use  function {\tt importCNData}. The parameter {\tt NRowSkip} denotes how
many rows there are before the table (notice that the name of the
columns must be skipped). If the copy
number data are not in log$_2$ratio scale, the parameter {\tt
ifLogRatio} should be put as zero.

<<echo=TRUE,print=FALSE>>=
rec10k <- importCNData(path, NRowSkip=1)
@

\noindent
Now, the SNP name are in the variable {\tt SNPname}, the chromosomes of the probes are in {\tt chr}, the physical positions
in {\tt position} and the raw log$_2$ratio data in {\tt logratio}. Here, we plot the raw data of chromosome 3.

<<echo=TRUE,fig=TRUE, height=6, width=6>>=
plot(rec10k$position[rec10k$chr == 3], rec10k$logratio[rec10k$chr == 3], xlab='Chromosome 3', ylab='log2ratio')        
@



\section{Example 5: estimation of samples in an {\tt oligoSnpSet} object}

In this example, we estimate the profile of chromosome 2 of a sample that is contained in
an {\tt oligoSnpSet} object. We load the data that are contained in the package {\tt oligoClasses}. 


<<echo=TRUE,print=FALSE>>=
data(oligoSetExample, package="oligoClasses")
@

The object {\tt oligoSet} contains the data of a HapMap sample. Since the sample should not have 
many copy number changes, we use {\tt rhoSquare} equal to the one of the sample REC-1. Moreover, the data are not in log$_2$ratio scale, 
thus we set {\tt ifLogRatio=0}.

<<echo=TRUE,print=FALSE>>=
library(mBPCR)
@ 

<<echo=TRUE,print=FALSE>>=
library(SNPchip)
@

<<echo=TRUE,print=FALSE>>=
r <-estProfileWithMBPCRforOligoSnpSet(oligoSet, sampleToBeAnalyzed=1, chrToBeAnalyzed=2, maxProbeNumber=1000, ifLogRatio=0, rhoSquare=0.0889637)
@

After the estimation, we can plot the profiles using a function of the package {\tt SNPchip}.
<<echo=TRUE,print=FALSE>>=
cc <- r$estPC
@
<<echo=TRUE,print=FALSE>>=
cc1 <- cc[chromosome(cc) == "2",1]
@
<<echo=TRUE,print=FALSE>>=
par(las=1)
@
<<echo=TRUE,print=FALSE>>=
plot(position(cc1), copyNumber(cc1), ylim=c(-0.23, 0.1), ylab="copy number",xaxt="n")
@
<<echo=TRUE,print=FALSE>>=
plotIdiogram(2,  build=genomeBuild(oligoSet), ylim=c(-0.225, -0.19), new=FALSE)
@


\section{Suggestions}

\noindent

For an optimal use of mBPCR, especially in case of samples coming from patients, we suggest to take care
to the following issues:
\begin{itemize}
  \item even if the goal is to estimate the profile of only a part of the genome, the global parameters should be
 estimated on the whole genome;
  \item if the goal is to estimate the profile of one or more patients, it is better to estimate the variance of the segment
 levels ($\rho^2$) on a cell line, or on a sample with many aberrations, and use this value in the profile 
 estimation of all patients. 
 In fact, we need many aberrations to estimate well $\rho^2$.
\end{itemize}


\begin{thebibliography}{99}

\bibitem{Hutter:07pcregx}
 M. Hutter. Exact Bayesian regression of piecewise constant
functions. {\it {Bayesian Analysis}},  2(4): 635--664, 2007.


\bibitem{Hutter:07pcreg}
 M. Hutter. Bayesian Regression of Piecewise Constant
Functions. In J.M. Bernardo,  M.J. Bayarri,  J.O. Berger,  A.P.
                David,  D.  Heckerman,  A.F.M. Smith,  and M. West, editors, {\it {Bayesian Statistics: Proceedings of the Eighth Valencia International Meeting}}. %pages?
Universitat de Val\`{e}ncia and International  Society for Bayesian
Analysis, 2007.



\bibitem{Rancoita:08a}
P.M.V. Rancoita, M. Hutter, F. Bertoni, and I. Kwee. Bayesian DNA
copy number analysis. {\it BMC Bioinformatics}, 10(10), 2009.


\bibitem{Rinaldi:06a}
 A. Rinaldi, I. Kwee, M.  Taborelli, C.  Largo, S. Uccella, V.
Martin, G.   Poretti, G. Gaidano, G. Calabrese, G. Martinelli, {\it
et al.}. Genomic and expression profiling identifies the B-cell
associated tyrosine kinase Syk as a possible therapeutic target in
mantle cell lymphoma. {\it {British Journal of Haematology}},  132:
303--316, 2006.

\end{thebibliography}

\end{document}