%\VignetteIndexEntry{An Introduction to intansv}
%\VignetteDepends{}
%\VignetteKeywords{Structural variation}
%\VignettePackage{intansv}
\documentclass[10pt]{article}

\textwidth=6.5in
\textheight=8.5in
%\parskip=.3cm
\oddsidemargin=-.1in
\evensidemargin=-.1in
\headheight=-.3in

\newcommand{\Rfunction}[1]{{\texttt{#1}}}
\newcommand{\Robject}[1]{{\texttt{#1}}}
\newcommand{\Rpackage}[1]{{\textit{#1}}}
\newcommand{\Rmethod}[1]{{\texttt{#1}}}
\newcommand{\Rfunarg}[1]{{\texttt{#1}}}
\newcommand{\Rclass}[1]{{\textit{#1}}}
\newcommand{\Rcode}[1]{{\texttt{#1}}}

\newcommand{\program}[1]{\textsf{#1}}
\newcommand{\R}{\program{R}}
\newcommand{\intansv}{\Rpackage{intansv}}

\usepackage{hyperref}

\usepackage[authoryear,round]{natbib}
\usepackage{times}

\bibliographystyle{plainnat}

\title{An Introduction to \intansv{}}
\author{Wen Yao}
\date{\today}

\begin{document}
\SweaveOpts{concordance=TRUE}

\maketitle

\tableofcontents

<<options,echo=FALSE>>=
options(width=72)
@

\section{Introduction}

Dozens of programs have been developed to predict structural
variations (SVs) from next-generation sequencing data. However, the 
predictions of multiple methods have to be integrated to obtain relatively 
reliable results \citep{Drosophila}. The \intansv{} package is designed for 
integrating the results of different methods, annotating the effects of SVs on 
genes and their elements, displaying the genomic distribution of SVs and 
visualizing SVs in specific genomic regions. In this vignette,
we rely on a simple, illustrative example dataset to explain 
the usage of \intansv{}.

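The \intansv{} package is available at bioconductor.org and can be
installed via \Rfunction{BiocManager::install}, which replaced the retired
\Rfunction{biocLite} script:

<<install, eval=FALSE>>=
## Install BiocManager first if it is not available yet
if (!requireNamespace("BiocManager", quietly=TRUE))
    install.packages("BiocManager")
BiocManager::install("intansv")
@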
<<initialize, results=hide>>=
library(intansv)
@


\section{Usage of \intansv{} with example dataset}

The example dataset was obtained by running seven different programs on the 
alignment of a publicly available NGS dataset (NCBI SRA accession number
SRA012177) to the rice reference genome 
(\url{http://rice.plantbiology.msu.edu/}). 
These seven programs, BreakDancer \citep{BreakDancer}, 
Pindel \citep{Pindel}, CNVnator \citep{CNVnator}, DELLY \citep{DELLY},
Lumpy \citep{Lumpy}, softSearch \citep{softSearch} and 
SVseq2 \citep{svseq}, were run with default parameters.
The predicted SVs on two chromosomes, \emph{chr05} and \emph{chr10}, 
are provided in the example dataset. 

\subsection{Read in predictions of different programs}

The raw output of most programs is tedious to work with and needs to be 
formatted for further analysis. \intansv{} provides functions to read in the 
predictions of seven popular programs: BreakDancer, Pindel, CNVnator, 
DELLY, SVseq2, Lumpy and softSearch. During this process, low-quality SV 
predictions are filtered out and overlapping predictions are resolved.

We begin by reading in some example data.
<<read in results of BreakDancer>>=
breakdancer.file.path <- system.file("extdata/ZS97.breakdancer.sv",
                               package="intansv")
breakdancer.file.path
breakdancer <- readBreakDancer(breakdancer.file.path)
str(breakdancer)
@

BreakDancer is able to predict deletions and inversions. It reports the
predictions of all types of SVs in a single file. You can feed
this file to \Rfunction{readBreakDancer}, which does all the tedious work
for you.

<<read in results of cnvnator>>=
cnvnator.dir.path <- system.file("extdata/cnvnator", package="intansv")
cnvnator.dir.path
cnvnator <- readCnvnator(cnvnator.dir.path)
str(cnvnator)
@

CNVnator is able to predict deletions and duplications. The final output of
CNVnator usually consists of several files, one per chromosome. 
You should put all these files in the same directory and
pass the path of this directory to \Rfunction{readCnvnator}, which does 
the rest. Note that the directory given to \Rfunction{readCnvnator}
should contain only the final output files of CNVnator. See the example dataset 
for more details.

<<read in results of SVseq2>>=
svseq.dir.path <- system.file("extdata/svseq2", package="intansv")
svseq.dir.path
svseq <- readSvseq(svseq.dir.path)
str(svseq)
@

SVseq2 is able to predict deletions. The final output of SVseq2 consists of 
several files, one per chromosome. In addition,
different types of SVs are written to different files. You should put all these
files in the same directory and pass the path of this directory to 
\Rfunction{readSvseq}. The files in this directory that contain the predicted 
deletions must be named with the suffix \emph{.del}. See the example dataset 
for more details.

<<read in results of DELLY>>=
delly.dir.path <- system.file("extdata/delly", package="intansv")
delly.dir.path
delly <- readDelly(delly.dir.path)
str(delly)
@

DELLY is able to predict deletions, inversions and duplications. DELLY writes 
its final predictions for different types of SVs to different 
files. DELLY v0.7.6 writes these files in VCF format. You should put all these files 
in the same directory and pass the path of this directory to \Rfunction{readDelly}. 
See the example dataset for more details.

<<read in results of Pindel>>=
pindel.dir.path <- system.file("extdata/pindel", package="intansv")
pindel.dir.path
pindel <- readPindel(pindel.dir.path)
str(pindel)
@

Pindel is able to predict deletions, inversions and duplications. Pindel writes 
its final predictions for different chromosomes, and for different types of 
SVs, to different files. 
You should put all these files in the same directory and pass the path of 
this directory to \Rfunction{readPindel}. The files in this directory that 
contain the predicted deletions, inversions and duplications must be 
named with the suffixes \emph{\_D}, \emph{\_INV} and \emph{\_TD}, respectively. See 
the example dataset for more details.
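
Before calling \Rfunction{readPindel}, it can be helpful to check which files 
in a directory carry the expected suffixes. The following is a minimal sketch 
using base \R{} only. The directory and file names are hypothetical and are 
created in a temporary location purely for illustration; they are not part of 
the example dataset.

<<check_pindel_suffix, eval=FALSE>>=
## Hypothetical file names written to a temporary directory (illustration only)
pindel.demo.dir <- file.path(tempdir(), "pindel_demo")
dir.create(pindel.demo.dir, showWarnings=FALSE)
demo.files <- c("chr05_D", "chr05_INV", "chr05_TD", "chr05_LI")
invisible(file.create(file.path(pindel.demo.dir, demo.files)))

## Keep only the files whose names end in _D, _INV or _TD
all.files <- list.files(pindel.demo.dir)
sv.files <- all.files[grepl("_(D|INV|TD)$", all.files)]
sv.files
@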

<<read in results of Lumpy>>=
lumpy.file.path <- system.file("extdata/ZS97.lumpy.pesr.bedpe",
                   package="intansv")
lumpy.file.path
lumpy <- readLumpy(lumpy.file.path)
str(lumpy)
@

Lumpy is able to predict deletions, inversions and duplications. It reports the 
predictions of all types of SVs in a single file. You can feed
this file to \Rfunction{readLumpy}, which does all the tedious work
for you.

<<read in results of softSearch>>=
softSearch.file.path <- system.file("extdata/ZS97.softsearch", package="intansv")
softSearch.file.path
softSearch <- readSoftSearch(softSearch.file.path)
str(softSearch)
@

softSearch is able to predict deletions, inversions and duplications. It 
reports the predictions of all types of SVs in a single file. 
You can feed this file to \Rfunction{readSoftSearch}, which does all the 
tedious work for you.

The results of these seven programs have now been read into \R{} and stored as
lists whose components represent different types
of SVs. We only consider three types of SVs: deletions, duplications and
inversions.
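
To make this list structure concrete, the sketch below builds a toy list by 
hand and counts the SVs of each type. All values are made up. The component 
names \Rcode{dup} and \Rcode{inv} are assumed by analogy with the \Rcode{del} 
component used later in this vignette, and the real objects may carry 
additional columns.

<<toy_sv_list, eval=FALSE>>=
## Toy stand-in for the output of a read function (values are made up;
## the component names dup and inv are assumptions)
toy.sv <- list(
    del=data.frame(chromosome=c("chr05", "chr05", "chr10"),
                   pos1=c(1000, 52000, 30000),
                   pos2=c(1500, 53200, 30900)),
    dup=data.frame(chromosome="chr10", pos1=80000, pos2=95000),
    inv=data.frame(chromosome=character(0), pos1=numeric(0),
                   pos2=numeric(0)))

## One component per SV type
names(toy.sv)
## Number of predicted SVs of each type
sv.counts <- sapply(toy.sv, nrow)
sv.counts
@

The same two calls, \Rfunction{names} and \Rfunction{sapply}, can be applied 
to any of the objects created above to get a quick overview.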


\subsection{Integrate predictions of different programs}

To get more reliable results, we need to integrate the predictions of
different programs. See our paper \citep{intansv} for more details on 
how \intansv{} integrates the predictions of different programs.
<<MethodsMerge>>=
sv_all_methods <- methodsMerge(breakdancer, pindel, cnvnator, delly, svseq)
str(sv_all_methods)
@

The integrated results contain 897 deletions, 13 duplications and 12 inversions. 
However, you do not need to provide \intansv{} with the output of all five programs.
The following example shows the integration of four programs: BreakDancer, CNVnator,
DELLY and SVseq2.
<<MethodsMergeNoPindel>>=
sv_all_methods.nopindel <- methodsMerge(breakdancer, cnvnator, delly, svseq)
str(sv_all_methods.nopindel)
@

\intansv{} is also able to integrate the SV predictions of other programs. However,
these predictions must be provided as a data frame with six columns:

\begin{itemize}
\item \Rcode{chromosome}: the chromosome ID of an SV
\item \Rcode{pos1}: the start coordinate of an SV
\item \Rcode{pos2}: the end coordinate of an SV
\item \Rcode{size}: the size of an SV
\item \Rcode{type}: the type of an SV
\item \Rcode{methods}: the program used to get this SV prediction
\end{itemize}
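
As a minimal sketch of such a data frame, the chunk below assembles three 
made-up predictions with the six columns listed above. The program name 
\Rcode{myCaller}, the type labels \Rcode{del}, \Rcode{inv} and \Rcode{dup}, 
and the convention that the size equals \Rcode{pos2 - pos1 + 1} are 
assumptions made for illustration; consult the package documentation for the 
values that \intansv{} expects.

<<custom_sv_df, eval=FALSE>>=
## Made-up predictions from a hypothetical program called "myCaller"
custom.sv <- data.frame(
    chromosome=c("chr05", "chr05", "chr10"),
    pos1=c(10000, 250000, 40000),
    pos2=c(12000, 251500, 90000),
    type=c("del", "inv", "dup"),      # assumed type labels
    methods="myCaller",
    stringsAsFactors=FALSE)

## Derive the size column from the coordinates (assumed convention)
custom.sv$size <- custom.sv$pos2 - custom.sv$pos1 + 1

## Reorder the columns to match the layout listed above
custom.sv <- custom.sv[, c("chromosome", "pos1", "pos2", "size",
                           "type", "methods")]
str(custom.sv)
@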

\subsection{Annotate the effects of SVs}

To annotate the effects of SVs on genes, we need a genomic annotation file. 
To avoid confusion, this file should be a plain text file with six columns: 
the chromosome ID, the tag of each genome element, the start coordinate, the end
coordinate, the strand and the ID of each genome element. An example 
genomic annotation file is stored in the example dataset of \intansv{}.


<<read in annotation file into R>>=
anno.file.path <- system.file("extdata/chr05_chr10.anno.txt", package="intansv")
anno.file.path
msu_gff_v7 <- read.table(anno.file.path, header=TRUE, as.is=TRUE)
head(msu_gff_v7)
@

<<sv annotation>>=
sv_all_methods.anno <- llply(sv_all_methods,svAnnotation,
                             genomeAnnotation=msu_gff_v7)
names(sv_all_methods.anno)
head(sv_all_methods.anno$del)
@

Since the function \Rfunction{svAnnotation} only accepts structural variations 
in data frame format, we need to apply it to each component of 
\emph{sv\_all\_methods}, which is a list.
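
The pattern of applying a data frame function to every component of a list, 
while passing along an extra argument, can be sketched with base \R{} alone. 
In the chunk below, \Rfunction{addSize} is a hypothetical stand-in for 
\Rfunction{svAnnotation}, \Rfunction{lapply} plays the role of 
\Rfunction{llply}, and all values are made up.

<<apply_over_list, eval=FALSE>>=
## Toy list of data frames, one per SV type (values are made up)
toy.sv <- list(
    del=data.frame(pos1=c(1000, 52000), pos2=c(1500, 53200)),
    dup=data.frame(pos1=80000, pos2=95000))

## Hypothetical per-data-frame function with an extra argument,
## standing in for svAnnotation
addSize <- function(df, offset) {
    df$size <- df$pos2 - df$pos1 + offset
    df
}

## lapply passes offset=1 through to every call, in the same way that
## llply passes genomeAnnotation through to svAnnotation
toy.sv.sized <- lapply(toy.sv, addSize, offset=1)
names(toy.sv.sized)
toy.sv.sized$del
@

The result is again a list with the same component names, which is why 
\Rcode{sv\_all\_methods.anno} can be indexed by SV type.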

\subsection{Display the genomic distribution of SVs}

Now, let's get a genomic view of the SVs. We first split the chromosomes into 
windows of 1 Mb and then display the number of SVs in each window as a circular 
barplot. We also need the length of each chromosome, which is stored 
in the example dataset of \intansv{}.
%
<<genomic_distribution, fig=TRUE, include=FALSE, eps=FALSE, width=7,height=7>>=
genome.file.path <- system.file("extdata/chr05_chr10.genome.txt", package="intansv")
genome.file.path
genome <- read.table(genome.file.path, header=TRUE, as.is=TRUE)
plotChromosome(genome, sv_all_methods, 1000000)
@

\begin{figure}[!htb]
  \begin{center}
     \includegraphics[width=0.8\textwidth]{intansvOverview-genomic_distribution}
     \caption{\label{genomic_distribution}%
      Distribution of SVs across the whole genome. Blue: deletions, 
      Red: duplications, Green: inversions.}
  \end{center}
\end{figure}
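
The windowing step behind such a plot can be sketched in a few lines of base 
\R{}. The chunk below assigns made-up SV start coordinates to 1 Mb windows by 
integer division and counts the SVs per window. It assumes that each SV is 
counted once, by its start coordinate, and uses a hypothetical chromosome 
length; \Rfunction{plotChromosome} may count SVs differently.

<<window_counts, eval=FALSE>>=
## Made-up SV start coordinates on a single chromosome
sv.pos1 <- c(120000, 950000, 1000001, 1800000, 2500000, 2500500)
window.size <- 1000000
chr.length <- 4000000     # hypothetical chromosome length

## 1-based index of the window that contains each start coordinate
window.index <- (sv.pos1 - 1) %/% window.size + 1

## Count SVs per window; the factor levels keep empty windows in the table
n.windows <- ceiling(chr.length / window.size)
sv.per.window <- table(factor(window.index, levels=seq_len(n.windows)))
sv.per.window
@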


\subsection{Visualize SVs in specific genomic region}

We can also visualize the SVs in a specific genomic region. Here, we again need the 
genomic annotation file. 

%
<<region_visualization, fig=TRUE, include=FALSE, eps=FALSE, width=7,height=7>>=
head(msu_gff_v7, n=3)
plotRegion(sv_all_methods,msu_gff_v7, "chr05", 1, 200000)
@

\begin{figure}[!htb]
  \begin{center}
     \includegraphics[width=0.7\textwidth]{intansvOverview-region_visualization}  
     \caption{\label{region_visualization}%
      SVs in genomic region chr05:1-200000. Blue: genes, Red: deletions, 
      Green: duplications, Purple: inversions.
     }
  \end{center}
\end{figure}

This command shows the SVs in the genomic region \emph{chr05:1-200000}. 
The genes and SVs are drawn as circular rectangles in different colors.
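
To list the SVs that fall into such a region, a data frame can be subset by 
chromosome and coordinates. The sketch below uses made-up deletions and 
treats an SV as belonging to the region if the two intervals overlap at all. 
Whether \Rfunction{plotRegion} also draws SVs that only partially overlap the 
region is not covered here, so this criterion is an assumption.

<<region_subset, eval=FALSE>>=
## Made-up deletions with the chromosome, pos1 and pos2 columns
toy.del <- data.frame(
    chromosome=c("chr05", "chr05", "chr05", "chr10"),
    pos1=c(5000, 190000, 300000, 1000),
    pos2=c(8000, 210000, 305000, 4000),
    stringsAsFactors=FALSE)

## Region of interest: chr05:1-200000
region.chr <- "chr05"
region.start <- 1
region.end <- 200000

## An SV overlaps the region if it starts before the region ends
## and ends after the region starts (assumed criterion)
in.region <- toy.del$chromosome == region.chr &
    toy.del$pos1 <= region.end & toy.del$pos2 >= region.start
toy.del[in.region, ]
@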

\pagebreak[4]

\section{Session Information}

The versions of \R{} and of the packages loaded when generating this vignette were:

<<echo=FALSE>>=
sessionInfo()
@

\bibliography{intansvOverview}

\end{document}