\documentclass{article}
\usepackage{array}
% \VignetteIndexEntry{sangerseqR}
% \VignetteEngine{knitr::knitr}

<<style, eval=TRUE, echo=FALSE, results="asis">>=
  BiocStyle::latex()
@
\renewcommand{\arraystretch}{1.5}

\title{Walkthrough for using the \Biocpkg{sangerseqR} package}
\author{Jonathon T. Hill, PhD}

\begin{document}

<<knitr, eval=TRUE, include=FALSE>>=
  library(knitr, quietly=TRUE)
  library(sangerseqR, quietly=TRUE)
  library(Biostrings, quietly=TRUE)
  opts_chunk$set(tidy=TRUE)

#modified from: 
#https://github.com/yihui/knitr-examples/blob/master/077-wrap-output.md
hook_output = knit_hooks$get("output")
knit_hooks$set(output = function(x, options) {
        n <- 90
        x <- knitr:::split_lines(x)
        # any lines wider than n should be wrapped
        if (any(nchar(x) > n)) {
           x <- gsub(sprintf('(.{%d})', n), "\\1\n## ", x)
        }
        
    hook_output(x, options)
})
@
\maketitle
\tableofcontents

\section{Introduction}
The \Biocpkg{sangerseqR} package provides basic functions for importing and 
working with sanger sequencing data files. It currently functions with Scf and 
ABIF files. The Scf file specification is an open source, although somewhat 
limited, data file type. Several tools designed to view and or edit chromatogram
data can convert file types to Scf. The ABIF file specification is a proprietary
data storage file specification for sequencing data generated by Applied 
Biosystems machines. More information on each filetype can be found at the 
following sites:

\begin{description}
\item[]{}
\item[ABIF:] {
\url{http://home.appliedbiosystems.com/support/software_community/
ABIF_File_Format.pdf}}
\item[]{}
\item[SCF:] {\url{http://staden.sourceforge.net/manual/formats_unix_2.html}}
\item[]{}
\end{description}

The objects and functions included in this package were developed as part of the
Poly Peak Parser web application 
(\url{http://yost.genetics.utah.edu/software.php}), which automates the process 
of seperating ambiguous double peaks from Sanger sequencing individuals 
containing heterozygous indels. This package contains a complete working copy of 
the Poly Peak Parser web application that can be run locally using the 
\Rfunction{PolyPeakParser} function. In addition to the web program, this package
also provides general objects and functions for working with Sanger sequencing
data. 

This vignette will walk you through a typical workflow using two sequence files:
1) homozygous.scf and 2) heterozygous.ab1. As their names indicate, the first 
example contains results typical of sequencing from a PCR product of a 
homozygous individual or from a plasmid. The second example contains results 
from sequencing the same region in an individual with a small indel. 

\section{Loading Data}
The first step of most workflows will be to upload data from a sequencing 
results file. This can be done using one of three included functions: 
\Rfunction{read.abif}, \Rfunction{read.scf} and \Rfunction{readsangerseq}. The 
first two functions directly import all of the fields into \Rclass{abif} and 
\Rclass{scf} class objects, respectively. These classes are meant as 
intermediate classes and exist to allow the user to inspect the file contents, 
as file contents may vary between basecallers and sequencing machines. Users 
should generally use the \Rfunction{readsangerseq} function. This function 
automatically detects and reads in the file type and then extracts the fields 
necessary to create a \Rclass{sangerseq} class object, which is used by all of 
the other functions in this package.

\subsection{read.abif}
\Rfunction{read.abif} takes a single argument for the filename of the abif file 
to be read. The resulting object contains three major parts. The header, 
containing information on the file structure, the directory, containing 
information on each of the data fields included in the file, and the data 
fields. Here is an example:

<<>>=
hetab1 <- read.abif(system.file("extdata", 
                                "heterozygous.ab1", 
                                package="sangerseqR"))
str(hetab1, list.len=20)
@

\noindent As you can see, the file is very long and contains a lot of Data 
fields (130 in this example). However, most of these contain run information and
only a few are directly relevant to data analysis: 

\begin{tabular}{ >{\bfseries}p{1.5cm} p{14cm} }
DATA.9--DATA.12 & Vectors containing the signal intensities for each channel. \\
FWO\_.1 & A string containing the base corresponding to each channel. For 
example, if it is "ACGT", then DATA.9 = A, DATA.10 = C, DATA.11 = G and 
DATA.12 = T.\\
PLOC.2 & Peak locations as an index of the trace vectors.\\
PBAS.1, PBAS.2 & Primary basecalls. PBAS.1 may contain bases edited in the 
original basecaller, while PBAS.2 always contains the basecaller's calls.\\
P1AM.1 & Amplitude of primary basecall peaks. \\
P2BA.1 & (optional) Contains the secondary basecalls.\\
P2AM.1 & (optional) Amplitude of the secondary basecall peaks. \\
\end{tabular}

\subsection{read.scf}
Like \Rfunction{read.abif}, \Rfunction{read.scf} takes a single argument with 
the filename. However, the data structure of the resulting \Rclass{scf} object 
is far less complicated, containing only a header with file structure 
information, a matrix of the trace data (\Rcode{@sample\_points}), a matrix of 
relative probabilities of each base at each position (\Rcode{@sequence\_probs}),
basecall positions (\Rcode{@basecall\_positions}), basecalls 
(\Rcode{@basecalls}) and optionally a comments sections with the run data 
(\Rcode{@comments}). The last slot (\Rcode{@private}) is rarely used and 
impossible to interpret without knowing how it was created. 

<<>>=
homoscf <- read.scf(system.file("extdata", 
                                "homozygous.scf", 
                                package="sangerseqR"))
str(homoscf)
@

\subsection{readsangerseq}
The \Rfunction{readsangerseq} function is a convenience function equivalent to 
\Rcode{sangerseq(read.abif(file))} or \\
\Rcode{sangerseq(read.scf(file))}. It should generally be used when the contents
of the file do not need to be directly accessed because it returns a 
\Rclass{sangerseq} object, described below. 
\newpage
\section{Sangerseq Class Objects}
The \Rclass{sangerseq} class is the backbone of the sangerseqR package and 
contains the chromatogram data necesary to perform all other functions. It can 
be created in two ways: from an \Rclass{abif} or \Rclass{scf} object using the 
\Rfunction{sangerseq} method or directly from an abif or scf file using 
\Rfunction{readsangerseq}.

<<>>=
#from a sequence file object
homosangerseq <- sangerseq(homoscf)

#directly from the file
hetsangerseq <- readsangerseq(system.file("extdata", 
                                          "heterozygous.ab1", 
                                          package="sangerseqR"))
str(hetsangerseq)
@

\noindent The slots are as follows:

\begin{tabular}{ >{\bfseries}l p{13cm} }
primarySeqID & Identification of the primary Basecalls.\\
primarySeq & The primary Basecalls formatted as a DNAString object.\\
secondarySeqID & Identification of the secondary Basecalls.\\
secondarySeq & The secondary Basecalls formatted as a DNAString object.\\
traceMatrix & A numerical matrix containing 4 columns corresponding to the 
normalized signal values for the chromatogram traces. Column order = A,C,G,T.\\
peakPosMatrix & A numerical matrix containing the position of the maximum peak 
values for each base within each Basecall window. If no peak was detected for a 
given base in a given window, then "NA". Column order = A,C,G,T.\\
peakAmpMatrix & A numerical matrix containing the maximum peak amplitudes for 
each 
base within each Basecall window. If no peak was detected for a given base in 
a given window, then 0. 
Column order = A,C,G,T.\\
 & \\
\end{tabular}

Accessor functions also exist for each slot in the sangerseq object. Most of the
accessors return the data in its native format, but the \Rfunction{primarySeq} 
and \Rfunction{secondarySeq} accessors can optionally return the data as a 
character string or a \Rclass{DNAString} class object from the 
\Biocpkg{Biostrings} package by setting \Rcode{string=TRUE} or 
\Rcode{string=FALSE}, respectively. The \Rclass{DNAString} class contains 
several convenient functions for manipulating the sequence, including 
generating the reverse compliment and performing alignments. The 
\Biocpkg{Biostrings} package is automatically loaded with the 
\Biocpkg{sangerseq} package, so all methods should be available.

<<>>=
#default is to return a DNAString object
Seq1 <- primarySeq(homosangerseq)
reverseComplement(Seq1)

#can return as string
primarySeq(homosangerseq, string=TRUE)
@

\section{Creating Chromatograms}
Basic chromatogram plots can be made using the \Rfunction{chromatogram} 
function. These plots are optimized for printing, so they contain several rows 
to plot all of the data simultaneously. The downside of this is that it can give
an error if the graphics device dimensions are not large enough. If this occurs,
we suggest you provide a filename in the command to save it to a pdf 
automatically sized to fit everything. Several parameters can also be set to 
affect how the plot appears. These are documented in the chromatogram help file.

<<>>=
chromatogram(hetsangerseq, width=200, height=2, trim5=50, trim3=100, 
             showcalls='both', filename="chromatogram.pdf")
@
\incfig{chromatogram.pdf}{\textwidth}{Example Chromatogram.}
      {Chromatogram of heterozygous sequencing results. Notice the double peak 
      region caused by an indel.}
\newpage
\section{Making Basecalls}
As shown in the chromatogram, secondary basecalls are sometimes provided in ab1 
files (Scf files are unable to show them). However, the exact nature of these 
calls is inconsistent. In the heterozygous.ab1 file used here, it is any peak 
near the primary peak, no matter how small. For example, base 100 (first base on
the second line) has a primary call of "C" and a secondary call of "A", even 
though the A peak is very small and likely noise. In homozygous sequencing 
results, these calls should simply be ignored and are hidden in the 
chromatogram by default (\Rcode{showcalls="primary"}). When heterozygous 
regions of the sequence are present, the \Rfunction{makeBaseCalls} can be used 
to determine whether a particular peak is homozygous or heterozygous and call 
the appropriate bases. 

Let's use the chromatogram we created in the previous section as an example. The
chromatogram contains a homozygous region from bases 1 to approximately 160, but
then breaks down into a series of double peaks for the remainder of the 
chromatogram. This is due to an indel in one allele of the sequenced region. 
\Rfunction{makeBaseCalls} can be used to show this more clearly or to add the 
secondary basecalls if the data file does not contain them. The function 
essentially divides the sequence into a series of basecall windows and 
identifies the tallest peak for each fluorescence channel within the window. 
These peaks are converted to signal ratio to the tallest peak. A cutoff ratio is
then applied to determine if a peak is signal or noise. Peaks below this ratio 
are ignored. Remaining peaks in each window are used to make primary and 
secondary basecalls.

<<>>=
hetcalls <- makeBaseCalls(hetsangerseq, ratio=0.33)
hetcalls
@
    
The resulting file now contains the maximum peak in the \Rcode{@primarySeq} slot
and the second tallest peak, if it is above the cutoff, in the 
\Rcode{@secondarySeq} slot. If only one peak is above the cutoff ratio, then 
this call matches the primary basecall. If three peaks were above the cutoff 
ratio, then the peak with the maximum amplitude is the primary basecall and an 
ambiguous base code is used as the secondary basecall. The resulting 
chromatogram also shows this:

<<>>=
chromatogram(hetcalls, width=100, height=2, trim5=50, trim3=100, 
             showcalls='both', filename="chromatogram2.pdf")
@
\incfig{chromatogram2.pdf}{\textwidth}
{Chromatogram from above after making basecalls.}
{Chromatogram of heterozygous sequencing results after making basecalls. Primary
and secondary basecalls now match for homozygous peaks}
\newpage
\section{Parsing Wildtype and Alternate Alleles}
Although \Rfunction{makeBaseCalls} has fixed the primary and secondary peak 
calls. It still does not tell us anything about the nature of the mutation. For 
this, we need to set the allele phase using a reference base sequence from an 
online source or from another sequencing run on a homozygous sample. The 
examples used in this vignette are from heterozygous and homozygous siblings, so
we will use the primary basecalls from the homozygous sibling (loaded earlier) 
as our reference. The beginnings and ends of these sequences do not need to 
match, but the reference should ideally encompass the sequenced region. 
\Rfunction{setAllelePhase} will then separate the primary and secondary 
basecalls into reference and non-reference bases at each position and set 
(\Rcode{@primarySeq}) to the reference and \Rcode{@secondarySeq} to the 
non-reference allele. 

<<>>=
ref <- subseq(primarySeq(homosangerseq, string=TRUE), start=30, width=500)
hetseqalleles <- setAllelePhase(hetcalls, ref, trim5=50, trim3=300)
hetseqalleles
@

At this point, we could plot the chromatogram again, but it is more informative 
to align the resulting sequences to see how the alleles differ. Since 
\Biocpkg{sangerseqR} depends on \Biocpkg{Biostrings}, 
\Rfunction{pairwiseAlignment} can be used.

<<>>=
pa <- pairwiseAlignment(primarySeq(hetseqalleles)[1:400], 
                        secondarySeq(hetseqalleles)[1:400], 
                        type="global-local")
writePairwiseAlignments(pa)
@

\section{Conclusion}
In this vignette, we have walked you through the basic functions in the 
\Biocpkg{sangerseqR} package. This work is a work in progress and we hope to 
improve its functionality. For example, improving the base calling algorithm and
adding an interactive chromatogram function. If you have any suggestions or 
requested features, please email Jonathon Hill at 
\email{jhill@genetics.utah.edu}.
\end{document}