% \VignetteIndexEntry{gmapR}
% \VignetteKeywords{gmap,gsnap}
% \VignettePackage{gmapR}
\documentclass[10pt]{article}

\usepackage{times}
\usepackage{hyperref}
\usepackage{Sweave}

\textwidth=6.5in
\textheight=8.5in
% \parskip=.3cm
\oddsidemargin=-.1in
\evensidemargin=-.1in
\headheight=-.3in

\newcommand{\Rfunction}[1]{{\texttt{#1}}}
\newcommand{\Robject}[1]{{\texttt{#1}}}
\newcommand{\Rpackage}[1]{{\textit{#1}}}
\newcommand{\Rmethod}[1]{{\texttt{#1}}}
\newcommand{\Rfunarg}[1]{{\texttt{#1}}}
\newcommand{\Rclass}[1]{{\textit{#1}}}
\newcommand{\Rcode}[1]{{\texttt{#1}}}

\newcommand{\software}[1]{\textsf{#1}}
\newcommand{\R}{\software{R}}

\title{gmapR: Use the GMAP Suite of Tools in R}
\author{Michael Lawrence, Cory Barr}
\date{\today}

\begin{document}

\maketitle
\tableofcontents
\newpage
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Introduction}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

The \Rpackage{gmapR} packages provides users with a way to access
\software{GSNAP}, \software{bam\_tally}, and other utilities from the
\software{GMAP} suite of tools within an R session. In this vignette,
we briefly look at the \software{GMAP} suite of tools available
through the \Rpackage{gmapR} package and work through an example.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{What is \software{GMAP}, \software{GSNAP}, and \software{bam\_tally}?}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

The \software{GMAP} suite offers useful tools for the following:

\begin{itemize}

  \item Genomic mapping: Given a cDNA, find where it best aligns to
    an entire genome

  \item Genomic alignment: Given a cDNA and a genomic segment,
    provide a nucleotide-level correspondence for the exons of the
    cDNA to the genomic segment

  \item Summarization via coverage plus reference and allele
    nucleotide polymorphism counts for an aligned set of sequencing
    reads over a given genomic location

\end{itemize}

\software{GMAP} (Genomic Mapping and Alignment Program) is
particularly suited to relatively long mRNA and EST sequences such as
those that are obtained from Roche 454 or Pacific Biosciences
sequencing technologies. (At present, only \software{GSNAP} is
available through the \Rpackage{gmapR}. \software{GMAP} integration is
scheduled for the near future.)

\software{GSNAP} (Genomic Short-read Nucleotide Alignment Program)
also provides users with genomic mapping and alignment capabilities,
but is optimized to handle issues that arise when dealing with the
alignment of short reads generated from sequencing technologies such
as those from Illumina/Solexa or ABI/SOLiD. \software{GSNAP} offers
the following functionality, as mentioned in
\href{http://bioinformatics.oxfordjournals.org/content/26/7/873.full}{Fast
  and SNP-tolerant detection of complex variants and splicing in short
  reads} by Thomas D. Wu and Serban Nacu:

\begin{itemize}
  
  \item fast detection of complex variants and splicing in short
    reads, based on a successively constrained search process of
    merging and filtering position lists from a genomic index

  \item alignment of both single- and paired-end reads as short as 14
    nt and of arbitrarily long length
    
  \item detection of short- and long-distance splicing, including
    interchromosomal splicing, in individual reads, using
    probabilistic models or a database of known splice sites
    
  \item SNP-tolerant alignment to a reference space of all possible
    combinations of major and minor alleles

  \item alignment of reads from bisulfite-treated DNA for the study of
    methylation state
    
\end{itemize}
    
\software{bam\_tally} provides users with the coverage as well as
counts of both reference alleles, alternative alleles, and indels over
a genomic region.

For more detailed information on the \software{GMAP} suite of tools
including a detailed explication of algorithmic specifics, see
\url{http://research-pub.gene.com/gmap/}.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Create a \Rclass{GmapGenome} Object}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
To align reads with \software{GMAP} or \software{GSNAP}, or to use
\software{bam\_tally}, you will need to either obtain or create a
\Rclass{GmapGenome} object. \Rclass{GmapGenome} objects can be created
from FASTA files or \Rclass{BSgenome} objects, as the following
example demonstrates:

<<create_GmapGenome, eval=FALSE>>=
library(gmapR)

if (!suppressWarnings(require(BSgenome.Dmelanogaster.UCSC.dm3))) {
  source("http://bioconductor.org/biocLite.R")
  biocLite("BSgenome.Dmelanogaster.UCSC.dm3")
  library(BSgenome.Dmelanogaster.UCSC.dm3)
}

gmapGenomePath <- file.path(getwd(), "flyGenome")
gmapGenomeDirectory <- GmapGenomeDirectory(gmapGenomePath, create = TRUE)
##> gmapGenomeDirectory
##GmapGenomeDirectory object
##path: /reshpcfs/home/coryba/projects/gmapR2/testGenome 

gmapGenome <- GmapGenome(genome=Dmelanogaster,
                         directory=gmapGenomeDirectory,
                         name="dm3",
                         create=TRUE,
                         k = 12L)
##> gmapGenome
##GmapGenome object
##genome: dm3 
##directory: /reshpcfs/home/coryba/projects/gmapR2/testGenome 
@

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Aligning with \software{GSNAP}}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

The \software{GSNAP} algorithm incorporates biological knowledge to
provide accurate alignments, particularly for RNA-seq data. In this
section, we will align reads from an RNA-seq experiment provided in a
fastq file to a selected region of the human genome. In this example,
we will align to the region of the human genome containing TP53 plus
an additional one megabase on each side of this gene.

First we need to obtain the desired region of interest from the genome:

<<get_TP53_coordinates>>=
library("org.Hs.eg.db")
library("TxDb.Hsapiens.UCSC.hg19.knownGene")
eg <- org.Hs.eg.db::org.Hs.egSYMBOL2EG[["TP53"]]
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
tx <- transcripts(txdb, vals = list(gene_id = eg))
roi <- range(tx) + 1e6
strand(roi) <- "*"
@ 

Next we get the genetic sequence and use it to create a GmapGenome
object. (Please note that the TP53\_demo GmapGenome object is used by
many examples and tests in the gmapR and VariationTools packages. If
the object has been created before, its subsequent creation will be
instantaneous.)

<<get_TP53_seq>>=
library("BSgenome.Hsapiens.UCSC.hg19")
library("gmapR")
p53Seq <- getSeq(BSgenome.Hsapiens.UCSC.hg19::Hsapiens, roi,
                 as.character = FALSE) 
names(p53Seq) <- "TP53"
gmapGenome <- GmapGenome(genome = p53Seq, 
                         name = paste0("TP53_demo_", 
                           packageVersion("TxDb.Hsapiens.UCSC.hg19.knownGene")), 
                         create = TRUE, 
                         k = 12L)
@

We add the known transcripts (splice sites) to the genome index:
<<set_TP53_splicesites>>=
exons <- gmapR:::subsetRegion(exonsBy(txdb), roi, "TP53")
spliceSites(gmapGenome, "knownGene") <- exons
@ 

The data package \software{LungCancerLines} contains fastqs of reads
obtained from sequencing H1993 and H2073 cell lines. We will use these
fastqs to demonstrate \software{GSNAP} alignment with \software{gmapR}.

<<get_lung_cancer_fastqs>>=
library("LungCancerLines")
fastqs <- LungCancerFastqFiles()
@ 

\software{GSNAP} is highly configurable. Users create
\Rclass{GsnapParam} objects to specify desired \software{GSNAP}
behavior. 

<<create_gsnapParam>>=
##specify how GSNAP should behave using a GsnapParam object
gsnapParam <- GsnapParam(genome=gmapGenome,
                         unique_only=FALSE,
                         suboptimal_levels=2L, 
                         npaths=10L,
                         novelsplicing=TRUE,
                         splicing="knownGene",
                         indel_penalty=1L,
                         distant_splice_penalty=1L,
                         clip_overlap=TRUE)
@ 

Now we are ready to align.

<<align_with_gsnap, eval=FALSE>>=
gsnapOutput <- gsnap(input_a=fastqs["H1993.first"],
                     input_b=fastqs["H1993.last"],
                     params=gsnapParam)

##gsnapOutput
##An object of class "GsnapOutput"
##Slot "path":
##[1] "/local/Rtmporwsvr"
##
##Slot "param":
##A GsnapParams object
##genome: dm3 (/reshpcfs/home/coryba/projects/gmapR2/testGenome)
##part: NULL
##batch: 2
##max_mismatches: NULL
##suboptimal_levels: 0
##snps: NULL
##mode: standard
##nthreads: 1
##novelsplicing: FALSE
##splicing: NULL
##npaths: 10
##quiet_if_excessive: FALSE
##nofails: FALSE
##split_output: TRUE
##extra: list()
##
##Slot "version":
## [1] NA NA NA NA NA NA NA NA NA NA NA NA
##
@ 

The \Rcode{gsnapOutput} object will be of the class
\Rclass{GsnapOutput}. It will provide access to the BAM files of
alignments that \software{GSNAP} outputs along with other utilities.

<<get_gsnap_bam_files, eval=FALSE>>=
##> dir(path(gsnapOutput), full.names=TRUE, pattern="\\.bam$")
##[1] "/local/Rtmporwsvr/file1cbc73503e9.1.nomapping.bam"        
##[2] "/local/Rtmporwsvr/file1cbc73503e9.1.unpaired_mult.bam"    
##[3] "/local/Rtmporwsvr/file1cbc73503e9.1.unpaired_transloc.bam"
##[4] "/local/Rtmporwsvr/file1cbc73503e9.1.unpaired_uniq.bam"    
@ 

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Using \Rmethod{bam\_tally}}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

Running the \Rmethod{bam\_tally} method will return a \Rclass{GRanges}
of information per nucleotide. Below is an example demonstrating how
to find variants in the TP53 gene of the human genome. See the
documentation for the \Rmethod{bam\_tally} method for more details.

\software{gmapR} provides access to a demo genome for examples. This
genome encompasses the TP53 gene along with a 1-megabase flanking
region on each side.
<<TP53Genome_accessor>>=
genome <- TP53Genome()
@ 

The \software{LungCancerLines} R package contains a BAM file of reads
aligned to the TP53 region of the human genome. We'll use this file to
demonstrate the use of \software{bam\_tally} through the
\software{gmapR} package. The resulting data structure will contain
the needed information such as number of alternative alleles, quality
scores, and read position for each allele.

The call to \Rfunction{bam\_tally} returns an opaque pointer to a
C-level data structure. We anticipate having multiple means of
summarizing these data into R data structures. For now, there is one:
\Rfunction{variantSummary}, which returns a \Rcode{VRanges} object
describing putative genetic variants in the sample.
<<run_bamtally, eval=FALSE>>=
bam_file <- system.file("extdata/H1993.analyzed.bam", 
                        package="LungCancerLines", mustWork=TRUE)
breaks <- c(0L, 15L, 60L, 75L)
bqual <- 56L
mapq <- 13L
param <- BamTallyParam(genome,
                       minimum_mapq = mapq,
                       concordant_only = FALSE, unique_only = FALSE,
                       primary_only = FALSE,
                       min_depth = 0L, variant_strand = 1L,
                       ignore_query_Ns = TRUE,
                       indels = FALSE, include_soft_clips = 1L)
tallies <-bam_tally(bam_file,
                    param)
variantSummary(tallies, high_base_quality = 23L)
@ 

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Creating a \Rclass{GmapGenome} Package}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

After creating a \Rclass{GmapGenome} object, you might want to
distribute it for collaboration or version it for reproducible
research. The function \Rcode{makeGmapGenomePackage} allows you to do
this. Continuing on with the D. melanogaster example from above, here
is how to archive the D. melanogaster \Rclass{GmapGenome} object in a
\Rclass{GmapGenome} package:

<<create_GmapGenomePackge, eval=FALSE>>=
makeGmapGenomePackage(gmapGenome=gmapGenome,
                      version="1.0.0",
                      maintainer="<your.name@somewhere.com>",
                      author="Your Name",
                      destDir="myDestDir",
                      license="Artistic-2.0",
                      pkgName="GmapGenome.Dmelanogaster.UCSC.dm3")
@

After creating the package, you can run \Rcode{R CMD INSTALL
  myDestDir} to install it, or run \Rcode{R CMD build myDestDir} to
create a distribution of the package.

Many users with be working with the human genome. Many of the examples
used in \software{gmapR} make use of a particular build of the human
genome. As such, creating a \Rclass{GmapGenome} of hg19 is
recommended. Here is one way to create it, using a \Rclass{BSgenome}
object:

<<create_hg19_GmapGenome, eval=FALSE>>=
if (!suppressWarnings(require(BSgenome.Hsapiens.UCSC.hg19))) {
  source("http://bioconductor.org/biocLite.R")
  biocLite("BSgenome.Hsapiens.UCSC.hg19")
  library(BSgenome.Hsapiens.UCSC.hg19)
}
gmapGenome <- GmapGenome(genome=Hsapiens,
                         directory = "Hsapiens",
                         name = "hg19",
                         create = TRUE)
destDir <- "HsapiensGmapGenome"
pkgName <- "GmapGenome.Hsapiens.UCSC.hg19"
makeGmapGenomePackage(gmapGenome=gmapGenome,
                      version="1.0.0",
                      maintainer="<your.name@somewhere.com>",
                      author="Your Name",
                      destDir=destDir,
                      license="Artistic-2.0",
                      pkgName="GmapGenome.Hsapiens.UCSC.hg19")
@ 

After running the above code, you should be able to run \Rcode{R CMD
  INSTALL GmapGenome.Hsapiens.UCSC.hg19} in the appropriate directory
from the command line, submit GmapGenome.Hsapiens.UCSC.hg19 to a
repository, etc. After installation,
\Rcode{library("GmapGenome.Hsapiens.UCSC.hg19")} will load a
\Rclass{GmapGenome} object that has the same name as the package.



%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%\section{Aligning with SNP Tolerance}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

%Both \software{GSNAP} and \software{GMAP} have the ability to perform
%alignment without penalizing against a set of known SNPs. In this
%section, we work through an example of aligning with this SNP
%tolerance. By using SNP-tolerance, we'll perform alignments without
%penalizing reads that match SNPs contained in the provided SNPs. In
%this case, we'll use dbSNP to provide the SNPs.
%
%This example will use hg19. If you have not installed
%\Rcode{GmapGenome.Hsapiens.UCSC.hg19}, please refer to the section
%\ref{create_hg19_GmapGenome}.
%
%First we load the Hsapiens \Rclass{GmapGenome} object:
%
%<<<load_GmapGenomeHsapiens>>=
%library("GmapGenome.Hsapiens.UCSC.hg19")
%@ 

<<SessionInfo>>=
sessionInfo()
@ 
\end{document}