%\VignetteIndexEntry{An introduction to ShortRead}
%\VignetteDepends{}
%\VignetteKeywords{Short read, I/0, quality assessment}
%\VignettePackage{ShortRead}
\documentclass[]{article}

\usepackage{times}
\usepackage{hyperref}

\newcommand{\Rfunction}[1]{{\texttt{#1}}}
\newcommand{\Robject}[1]{{\texttt{#1}}}
\newcommand{\Rpackage}[1]{{\textit{#1}}}
\newcommand{\Rmethod}[1]{{\texttt{#1}}}
\newcommand{\Rfunarg}[1]{{\texttt{#1}}}
\newcommand{\Rclass}[1]{{\textit{#1}}}
\newcommand{\Rcode}[1]{{\texttt{#1}}}

\newcommand{\software}[1]{\textsf{#1}}
\newcommand{\R}{\software{R}}
\newcommand{\ShortRead}{\Rpackage{ShortRead}}

\newcommand{\ELAND}{\software{ELAND}}
\newcommand{\MAQ}{\software{MAQ}}
\newcommand{\Bowtie}{\software{Bowtie}}

\title{An Introduction to \Rpackage{ShortRead}}
\author{Martin Morgan}
\date{Modified: 28 September 2010. Compiled: \today}

\begin{document}

\maketitle

<<options,echo=FALSE>>=
options(width=60)
@ 

<<preliminaries>>=
library("ShortRead")
@ 

The \Rpackage{ShortRead} package aims to provide key functionality for
input, quality assurance, and basic manipulation of `short read' DNA
sequences such as those produced by Solexa, 454, and related
technologies, including flexible import of common short read data
formats. This vignette introduces key functionality.

Support is most fully developed for Solexa; contributions from the
community are welcome.

\section{A first workflow}

This section walks through a simple work flow. It outlines the
hierarchy of files produced by Solexa. It then illustrates a common
way for reading short read data into \R{}.

\subsection{\Rclass{SolexaPath}: navigating Solexa output}

\Rclass{SolexaPath} provides functionality to navigate files produced
by Solexa Genome Analyzer pipeline software. A typical way to start a
\ShortRead{} session is to point to the root of the output file
hierarchy. The \ShortRead{} package includes a very small subset of
files emulating this hierarchy. The root is found at
<<SolexaPath-root>>=
exptPath <- system.file("extdata", package="ShortRead")
@ 
%% 
Usually \Rcode{exptPath} would be a location on the users' file system. Key
components of the hierarchy are parsed into \R{} with
<<SolexaPat>>=
sp <- SolexaPath(exptPath)
sp
@ 
%% 
\Rfunction{SolexaPath} scans the directory hierarchy to identifying
useful directories. For instance, image intensity files are in the
`Firecrest' directory, while summary and alignment files are in the
analysis directory
<<firecrest>>=
imageAnalysisPath(sp)
analysisPath(sp)
@ 
%% 
Most functionality in \ShortRead{} uses \Rcode{baseCallPath} or
\Rcode{analysisPath}. Solexa documentation provides details of file
content. \Rfunction{SolexaPath} accepts additional arguments that
allow individual file paths to be specified.
 
Many functions for Solexa data input `know' where appropriate files
are located. Specifying \Rcode{sp} is often sufficient for identifying
the desired directory path. Examples of this are illustrated below,
with for instance \Rfunction{readAligned} and \Rfunction{readFastq}.

Displaying an object, e.g., \Robject{sp}, provides hints at how to
access information in the object, e.g., \Rfunction{analysisPath}. This
is a convention in \ShortRead{}.

\subsection{\Rfunction{readAligned}: reading aligned data into \R{}}

Solexa \texttt{s\_N\_export.txt} files (\texttt{\_N\_} is a
placeholder for the lane identifier) represent one place to start
working the short read data in \R{}. These files result from running
ANALYSIS eland\_extended in the Solexa Genome Analyzer. The files
contain information on all reads, including alignment information for
those reads successfully aligned to the genome.  \ShortRead{} parses
additional alignment files, including \MAQ{} binary and text
(\texttt{mapview}) files and \Bowtie{} text files; consult the help page
for \Rfunction{readAligned} for details. \ShortRead{} flexibly parses
many other Solexa files; aligned reads represent just one entry point.

To read a single \texttt{s\_N\_export.txt} file into \R{}, for instance
from lane 2, use the command
<<readAligned-simple>>=
aln <- readAligned(sp, "s_2_export.txt")
aln
@ 
%% 
This illustrates the convention used for identifying files for input
into \R{} and used by \ShortRead{}. The function takes a directory
path and a pattern (as a regular expression, similar to the \R{}
function \Rfunction{list.files}) of file names to match in the
directory. Usually, all files matching the pattern are read into a
\emph{single} \R{} object; this behavior is desirable for several of
the input functions in \ShortRead{}. In the present case the usual
expectation is that a single \texttt{s\_N\_export.txt} file will be
read into a single \R{} object, so the \Rfunarg{pattern} argument will
identify a single file.

\subsubsection{Input of other aligned read files}

\ELAND{} software provides access to much interesting data, in
addition to alignments, but if the interest is in aligned reads then
input may come from any of a number of different software
packages. Many of these alignments can be input with
\Rpackage{ShortRead}.

\Bowtie{} is a very fast aligner, taking a few tens of minutes to
align entire lanes of reads to reference genomes. Use
\Rfunction{readAligned} with the \Rfunarg{type="Bowtie"} argument to
input alignments.  Reading \Bowtie{} output using
\Rfunction{readAligned} produces the same class of object as reading
\ELAND{} output.  Like \ELAND{}, \Bowtie{} provides information on
short read, quality, chromosome, position, and strand; there is no
information on alignment quality avaiable from \Bowtie{}. \ELAND{} and
\Bowtie{} provide very different auxiliary information.  Consult the
\Rfunction{readAligned} and help page \Bowtie{} manual for additional
detail.

\MAQ{} is another poplar aligner. \Rpackage{ShortRead} can input
\MAQ{} binary or text formats (see the arguments
\Rfunarg{type="MAQMapShort"}, \Rfunarg{"MAQMap"}, and
\Rfunarg{"MAQMapview"}). As with \Bowtie{}, \MAQ{} provides essential
information about reads and their aligments, plus additional
information that differs somewhat from the additional information
provided by \ELAND{}.

Alignment information may come in a variety of different text-based
formats. Not all of these will be supported by
\Rpackage{ShortRead}. There are a number of tools available to input
this into \R{}. 

A basic strategy is involves two passes over the data, followed by
synthesis of results into an \Rclass{AlignedRead} object.  First,
input alignment data using functions such as
\Rfunction{read.table}. Use the \Rfunarg{colClasses} argument to
`mask-out' (i.e., avoid importing) DNA and quality sequences. Next,
use \Rfunction{readXStringColumns} or \Rfunction{readFastq} to import
the short read and quality information. Finally, use the alignment
data and reads as arugments to the \Rfunction{AlignedRead} function to
synthesize the input. The following illustrates use of
\Rfunction{readXStringColumns} and \Rfunction{readFastq}. These
functions receive further attention below.

\subsubsection{Cautions}

There are several confusing areas of input. (1) Some alignment
programs and genome resources start numbering nucleotides of the
subject sequence at 0, whereas others start at 1. (2) Some alignment
programs report matches on the minus strand in terms of the
`left-most' position of the read (i.e., the location of the 3' end of
the aligned read), whereas other report `five-prime''matches (i.e., in
terms of the 5' end of the read), regardless of whether the alignment
is on the plus or minus strand. (3) Some alignment programs reverse
complement the sequence of reads aligned to the minus strand. (4) Base
qualities are sometimes encoded as character strings, but the encoding
differs between `fastq' and `solexa fastq'. It seems that all
combinations of these choices are common `in the wild'.

The help page for \Rfunction{readAligned} attempts to be explicit
about how reads are formatted. Briefly:
\begin{itemize}
\item Subject sequence nucleotides are numbered starting at 1, rather
  than zero. \Rfunction{readAligned} adjusts the coordinate system of
  input reads if necessary (e.g., reading \MAQ{} alignments).
\item Alignments on the minus strand are reported in `left-most'
  coordinates systems.
\item \ELAND{} and \Bowtie{} alignments on the minus strand are not reverse
  complemented.
\item Character-encoded base quality scores are intrepreted as the
  default for the software package being parsed, e.g., as `Solexa
  fastq' for \ELAND{}. The object returned by \Rfunction{quality}
  applied to an \Rclass{AlignedRead} object is either
  \Rclass{FastqQuality} or \Rclass{SFastqQuality}.
\end{itemize}
Alignment programs sometimes offer the opportunity to custommize
output; such customization needs to be accomodated when reads are
input using \Rpackage{ShortRead}.

\subsubsection{Filtering input}

Downstream analysis may often want to use a well-defined subset of
reads. These can be selected with the \Rfunarg{filter} argument of
\Rfunction{readAligned}. There are built-in filters, for instance to
remove all reads containing an \texttt{N} nucleotide, to select just
those reads that map to the genome file \texttt{chr5.fa}, to select
reads on the \texttt{+} strand, or to `level the playing field' by
selecting only a single read for any chromosome, position and strand:
<<filter-egs>>=
nfilt <- nFilter()
cfilt <- chromosomeFilter('chr5.fa')
sfilt <- strandFilter("+")
ofilt <- occurrenceFilter(withSread=FALSE)
@
%% 
Here we select only those reads that map to \texttt{chr5.fa}:
<<readAligned-filter>>=
chr5 <- readAligned(sp, "s_2_export.txt", filter=cfilt)
@ 
%% 
Filters can be `composed' to act in unison, e.g., selecting only reads
mapping to \texttt{chr5.fa} and on the \texttt{+} strand:
<<readAligned-compose-filter>>=
filt <- compose(cfilt, sfilt)
chr5plus <- readAligned(sp, "s_2_export.txt", filter=filt)
@ 
%% 
Filters can subset aligned reads at other stages in the work flow,
using a paradigm like the following:
<<AlignedRead-filter>>=
chr5 <- aln[cfilt(aln)]
@ 
%% 
Users can easily create their own filter by writing a function that
accepts an object of class \Rcode{AlignedRead}, and returns a logical
vector indicating which reads in the object pass the filter. See the
example on the \Rcode{srFilter} help page for details, and for
information about additional built-in filters.

\subsection{Exploring \ShortRead{} objects}

\Robject{aln} is an object of \Rclass{\Sexpr{class(aln)}} class. It
contains short reads and their (calibrated) qualities:
<<aln-sread-quality>>=
sread(aln)
quality(aln)
@ 

The short reads are stored as a \Rclass{DNAStringSet} class.  This
class is defined in \Rpackage{Biostrings}. It represents DNA sequence
data relatively efficiently.  There are a number of very useful
methods defined for \Rclass{DNAStringSet}. Some of these methods are
illustrated in this vignette. Other methods are described in the help
pages and vignettes of the \Rpackage{Biostrings} and
\Rpackage{IRanges} packages.

Qualities are represented as \Sexpr{class(quality(aln))}-class
objects. The qualities in the \Robject{aln} object returned by
\Rfunction{readAligned} are of class \Rclass{BStringSet}. The
\Rclass{BStringSet} class is also defined in \Rpackage{Biostrings},
and shares many methods with those of \Rclass{DNAStringSet}.

The \Robject{aln} object contains additional information about
alignments. Some of this additional information is expected from any
alignment, whether generated by Solexa or other software.  For
example, \Robject{aln} contains the particular sequence within a
target (e.g., chromosomes in a genome assembly), the position (e.g.,
base pair coordinate), and strand to which the alignment was made, and
the quality of the alignment. The display of \Robject{aln} suggests
how to access this information. For instance, the strand to which
alignments are made can be extracted (as a factor with three levels
and possibly \Rcode{NA}; the level \Rcode{"*"} corresponds to reads
for which strand alignment is intrinsically not meaningful, whereas
\Rcode{NA} represents the traditional concept of information not
available, e.g., because the read did not align at all) and tabulated
using familiar \R{} functions.
<<chromosomes>>=
whichStrand <- strand(aln)
class(whichStrand)
levels(whichStrand)
table(whichStrand, useNA="ifany")
@ 
%% 
This shows that about 
%% 
\Sexpr{format(100*sum(is.na(whichStrand))/ length(whichStrand))}\%{}
%% 
of reads were not aligned (level \Rcode{NA}).

The \Robject{aln} object contains information in addition to that
expected of all alignments. This information is accessible using
\Rfunction{alignData}:
<<alignData>>=
alignData(aln)
@ 
%% 
Users familiar with the \Rclass{ExpressionSet} class in
\Rpackage{Biobase} will recognize this as an
\Rclass{AnnotatedDataFrame}-like object, containing a data frame with
rows for each short read. The \Rclass{AlignedDataFrame} contains
additional meta data about the meaning of each column. For instance,
data extracted from the Solexa export file includes:
<<varMetadata>>=
varMetadata(alignData(aln))
@ 
%% 
Guides to the precise meaning of this data are on the help page for
the \Rclass{AlignedRead} class, and in the manufacturer manuals. 

Simple information about the alignments can be found by querying this
object. For instance, unaligned reads have \Rcode{NA} as their
position, and reads passing Solexa `filtering' (their base purity and
chastity criteria) have a component of their auxiliary
\Robject{alignData} set to \Rcode{"Y"}. Thus the fraction of unaligned
reads and reads passing filtering are
<<aln-okreads>>=
mapped <- !is.na(position(aln))
filtered <- alignData(aln)[["filtering"]] =="Y"
sum(!mapped) / length(aln)
sum(filtered) / length(aln)
@ 

Extracting the reads that passed filtering but were unmapped is
accomplished with
<<aln-failed>>=
failedAlign <- aln[filtered & !mapped]
failedAlign
@ 
%% 
Alternatively, we can extract just the short reads, and select the
subset of those that failed filtering.
<<sread-filter-fail-subset>>=
failedReads <- sread(aln)[filtered & !mapped]
@ 

\subsection{Quality assessment}

The \Rfunction{qa} function provides a convenient way to summarize
read and alignment quality. One way of obtaining quality assessment
results is
<<qa>>=
qaSummary <- qa(sp)
@ 
%% 
The \Robject{qa} object is a list-like structure. As invoked above and
currently implemented, \Rfunction{qa} visits all
\texttt{s\_N\_export.txt} files in the appropriate directory. It
extracts useful information from the files, and summarizes the results
into a nested list-like structure. 

Evaluating \Rfunction{qa} for a single lane can take several
minutes. For this reason a common use case is to evaluate
\Rfunction{qa} and save the result to disk for later use, e.g.,
<<eval=FALSE>>=
save(qaSummary, file="/path/to/file.rda")
@ 
%% 
A feature of \ShortRead{} is the use of \Rpackage{Rmpi} or
\Rpackage{multicore} and coarse-grained parallel processing when
available. Thus commands such as
<<eval=FALSE>>=
library("Rmpi")
mpi.spawn.Rslaves(nsl=8)
qaSummary <- qa(sp)
mpi.close.Rslaves()
@ 
%% 
or
<<eval=FALSE>>=
library(multicore)
qaSummary <- qa(sp)
@ 
%% 
will distribute the task of processing each lane to each of the
\Rpackage{Rmpi} workers or \Rpackage{multicore} cores. In the
\Rpackage{Rmpi} example, all 8 lanes of a Solexa experiment are
processed in the time take to process a single lane.
\Rpackage{multicore} may impose significant memory demands, as each
core will attempt to load a full lane of data.

The elements of the quality assessment list are suggested by the
output:
<<qa-elements>>=
qaSummary
@ 
%% 
For instance, the count of reads in each lane is summarized in the
\Robject{readCounts} element, and can be displayed as
<<qa-readCounts>>=
qaSummary[["readCounts"]]
qaSummary[["baseCalls"]]
@ 
%% 
The \Robject{readCounts} element contains a data frame with 1 row and
3 columns (these dimensions are indicated in the parenthetical
annotation of \Robject{readCounts} in the output of
\Rcode{qaSummary}). The rows represent different lanes. The columns
indicated the number of reads, the number of reads surviving the
Solexa filtering criteria, and the number of reads aligned to the
reference genome for the lane. The \Robject{baseCalls} element
summarizes base calls in the unfiltered reads.

Other elements of \Robject{qaSummary} are more complicated, and their
interpretation is not directly obvious. Instead, a common use is to
forward the results of \Rfunction{qa} to a report generator. 
<<report, eval=FALSE>>=
report(qaSummary, dest="/path/to/report_directory")
@ 
%%
The report includes \R{} code that can be used to understand how
\Sexpr{class(qaSummary)}-class objects can be processed; reports are
generated as HTML suitable for browser viewing.

The functions that produce the report tables and graphics are 
internal to the package. They can be accessed through calling 
ShortRead:::functionName where functionName is one of the functions
listed below, organized by report section. 
\begin{description}
\item [] Run Summary : .ppnCount, .df2a, .laneLbl, .plotReadQuality
\item [] Read Distribution : .plotReadOccurrences, .freqSequences
\item [] Cycle Specific : .plotCycleBaseCall, .plotCycleQuality
\item [] Tile Performance : .atQuantile, .colorkeyNames, .plotTileLocalCoords, .tileGeometry,
.plotTileCounts, .plotTileQualityScore
\item [] Alignment : .plotAlignQuality
\item [] Multiple Alignment : .plotMultipleAlignmentCount
\item [] Depth of Coverage : .plotDepthOfCoverage
\item [] Adapter Contamination : .ppnCount
\end{description}

\section{Using \Rpackage{ShortRead} for data exploration}

\subsection{Data I/O}

\ShortRead{} provides a variety of methods to read data into \R{}, in
addition to \Rfunction{readAligned}. 

\subsubsection{\Rfunction{readXStringColumns}}

\Rfunction{readXStringColumns} reads a column of DNA or other
sequence-like data. For instance, the Solexa files
\texttt{s\_N\_export.txt} contain lines with the following
information:
<<export>>=
pattern <- "s_2_export.txt"
fl <- file.path(analysisPath(sp), pattern)
strsplit(readLines(fl, n=1), "\t")
length(readLines(fl))
@ 
% 
Column 9 is the read, and column 10 the ASCII-encoded Solexa Fastq
quality score; there are 1000 lines (i.e., 1000 reads) in this sample
file. 

Suppose the task is to read column 9 as a \Rclass{DNAStringSet} and
column 10 as a \Rclass{BStringSet}. \Rclass{DNAStringSet} is a class
that contains IUPAC-encoded DNA strings (IUPAC code allows for
nucleotide ambiguity); \Rclass{BStringSet} is a class that contains
any character with ASCII code 0 through 255. Both of these classes are
defined in the \Rpackage{Biostrings}
package. \Rfunction{readXStringColumns} allows us to read in columns
of text as these classes.

Important arguments for \Rfunction{readXStringColumns} are the
\Rfunarg{dirPath} in which to look for files, the \Rfunarg{pattern} of
files to parse, and the \Rfunarg{colClasses} of the columns to be
parsed. The \Rfunarg{dirPath} and \Rfunarg{pattern} arguments are like
\Rfunarg{list.files}. \Rfunarg{colClasses} is like the corresponding
argument to \Rfunction{read.table}: it is a \Rclass{list} specifying
the class of each column to be read, or \Robject{NULL} if the column
is to be ignored. In our case there are 21 columns, and we would like
to read in columns 9 and 10. Hence
<<colClasses>>=
colClasses <- rep(list(NULL), 21)
colClasses[9:10] <- c("DNAString", "BString")
names(colClasses)[9:10] <- c("read", "quality")
@ 
% 
We use the class of the type of sequence (e.g., \Rclass{DNAString} or
\Rclass{BString}), rather than the class of the set that we will
create ( e.g., \Rclass{DNAStringSet} or \Rclass{BStringSet}).
Applying names to \Robject{colClasses} is not required, but makes
subsequent manipulation easier. We are now ready to read our file
<<readXStringColumns>>=
cols <- readXStringColumns(analysisPath(sp), pattern, colClasses)
cols
@ 
% 
The file has been parsed, and appropriate data objects were created.

A feature of \Rfunction{readXStringColumns} and other input functions
in the \Rpackage{ShortRead} package is that all files matching
\Rfunarg{pattern} in the specified \Rfunarg{dirPath} will be read into
a single object. This provides a convenient way to, for instance,
parse all tiles in a Solexa lane into a single \Rclass{DNAStringSet}
object.

There are several advantages to reading columns as \Rclass{XStringSet}
objects. These are more compact than the corresponding character
representation:
<<size>>=
object.size(cols$read)
object.size(as.character(cols$read))
@ 
% 
They are also created much more quickly. And the \Rclass{DNAStringSet} and
related classes are used extensively in \Rpackage{ShortRead},
\Rpackage{Biostrings}, \Rpackage{BSgenome} and other packages relevant
to short read technology.

\subsubsection{\Rfunction{readFastq}}

\Rfunction{readXStringColumns} should be considered a `low-level'
function providing easy access to columns of data. Another flexible
input function is \Rfunction{readFastq}. Fastq files combine reads and
their base qualities in four-line records such as the following:
<<fastq-format>>=
fqpattern <- "s_1_sequence.txt"
fl <- file.path(analysisPath(sp), fqpattern)
readLines(fl, 4)
@ 
% 
The first and third lines are an identifier (encoding the machine, run,
lane, tile, x and y coordinates of the cluster that gave rise to the
read, in this case). The second line is the read, and the fourth line
the per-base quality. Files of this sort can be read in as
<<readFastq>>=
fq <- readFastq(sp, fqpattern)
fq
@ 
% 
This resulting object (of class \Sexpr{class(fq)}) contains the short
reads, their qualities, and the identifiers:
<<ShortReadQ>>=
reads <- sread(fq)
qualities <- quality(fq)
class(qualities)
id(fq)
@ 
% 
Notice that the class of the qualities is \Rclass{SFastqQuality}, to
indicate that these are quality scores derived using the Solexa
convention, rather than ordinary \Rclass{BStringSet} objects.

The object has essential operations for convenient manipulation, for
instance simultaneously forming the subset of all three components:
<<ShortReadQ-subset>>=
fq[1:5]
@ 

\subsubsection{Additional input functions}

\ShortRead{} includes additional functions to facilitate input. For
instance, \Rfunction{readPrb} reads Solexa \texttt{\_prb.txt}
files. These files contain base-specific quality information, and
\Rfunction{readPrb} returns an \Rclass{SFastqQuality}-class object
representing the fastq-encoded base-specific quality scores of all
reads.

As a second example, the \texttt{s\_N\_LLLL\_int.txt} files in the
\Rfunction{imageAnalysisPath} directory contain lines, one line per
read, of nucleotide intensities. Each line contain lane, tile, X and Y
coordinate information, followed by quadruplets of intensity values.
There are as many quadruplets as there are cycles. Each quadruplet
represents the intensity of the \texttt{A}, \texttt{C}, \texttt{G},
and \texttt{T} nucleotide at the corresponding cycle. These (and their
error estimates, if available), are input with
<<intensity-files>>=
int <- readIntensities(sp, withVariability=FALSE)
int
@ 
%% 
An interesting exercise is to display the intensities at cycle 2
(below) and to compare these to cycle, e.g., 30.
<<intensities-cycle-2, fig=TRUE>>=
print(splom(intensity(int)[[,,2]], pch=".", cex=3))
@ 
 
Additional files can be parsed using standard \R{} input methods. 

\subsection{Sorting}

Short reads can be sorted using \Rfunction{srsort}, 
or the permutation required to bring the short read into 
lexicographic order can be determined using
\Rfunction{srorder}. These functions are different from
\Rfunction{sort} and \Rfunction{order} because the result is
independent of the locale, and they operate quickly on
\Rclass{DNAStringSet} and \Rclass{BStringSet} objects.

The function \Rfunction{srduplicated} identifies duplicate reads. This
function returns a logical vector, similar to \Rfunction{duplicated}.
The negation of the result from \Rfunction{srduplicated} is useful to
create a collection of unique reads. An experimental scenario where
this might be useful is when the sample preparation involved PCR. In this
case, replicate reads may be due to artifacts of sample preparation,
rather than differential representation of sequence in the sample
prior to PCR.

\subsection{Summarizing read occurrence}

The \Rfunction{tables} function summarizes read occurrences, for instance,
<<tables>>=
tbls <- tables(aln)
names(tbls)
tbls$top[1:5]
head(tbls$distribution)
@ 
%% 
The \Robject{top} component returned by \Robject{tables} is a list
tallying the most commonly occurring sequences in the short
reads. Knowledgeable readers will recognize the top-occurring read as a
close match to one of the manufacturer adapters.

The \Robject{distribution} component returned by \Robject{tables} is a
data frame that summarizes how many reads (e.g., \Sexpr{tbls[["distribution"]][1,"nReads"]}) 
are represented exactly \Sexpr{tbls[["distribution"]][1,"nOccurrences"]} times.

\subsection{Finding near matches to short sequences}

Facilities exist for finding reads that are near matches to specific
sequences, e.g., manufacturer adapter or primer
sequences. \Rfunction{srdistance} reports the edit distance between
each read and a reference sequence. \Rfunction{srdistance} is
implemented to work efficiently for reference sequences whose length
is of the same order as the reads themselves (10's to 100's of bases).
To find reads close to the most common read in the example above, one
might say
<<srdistance>>=
dist <- srdistance(sread(aln), names(tbls$top)[1])[[1]]
table(dist)[1:10]
@ 
%% 
`Near' matches can be filtered from the alignment, e.g.,
<<aln-not-near>>=
alnSubset <- aln[dist>4]
@

A different strategy can be used to tally or eliminate reads that consist
predominantly of a single nucleotide. \Rfunction{alphabetFrequency}
calculates the frequency of each nucleotide (in DNA strings) or letter
(for other string sets) in each read. Thus one could identify and
eliminate reads with more than 30 adenine nucleotides with
<<polya>>=
countA <- alphabetFrequency(sread(aln))[,"A"] 
alnNoPolyA <- aln[countA < 30]
@ 
%% 
\Rfunction{alphabetFrequency}, which simply counts nucleotides, is
much faster than \Rfunction{srdistance}, which performs full pairwise
alignment of each read to the subject.

Users wanting to use \R{} for whole-genome alignments or more flexible
pairwise aligment are encouraged to investigate the
\Rpackage{Biostrings} package, especially the \Rclass{PDict} class and
\Rfunction{matchPDict} and \Rfunction{pairwiseAlignment} functions.

\subsection{The \Rfunction{coverage} function}

The \Rfunction{coverage} function provides a way to summarize where
reads align on a reference sequence. The idea is that the aligned
reads, or under some analyses the extension of those aligned reads by
an amount meant to estimate the actual fragment size, `pile up' on top
of nucleotide positions in the reference sequence. A convenient
summary of the alignment of many reads is thus a vector describing the
depth of the pile at each position in the reference sequence. A
typical work flow invokes \Rfunction{coverage} on an instance of the
\Rclass{AlignedRead} class obtained from \Rfunction{readAligned};
additional methods offering greater control operate on
\Rclass{IRanges} directly.  The \Rfunction{coverage} methods returns a
run-length encoding of the pile-up (or a list of such run length
encodings). The run-length encoding returned by \Rfunction{coverage}
is a space-efficient representation; the long integer vector can be
recovered with \Rcode{as.integer}.

There are complicated issues associated with use of
\Rfunction{coverage}, relating to how software reports the `position'
of an alignment, especially on the minus strand. These issues are
illustrated in figure~\ref{fig:coverage}.
\begin{figure}
\begin{verbatim}
'leftmost':
                             P
                             +++++----------
'+' strand: 5' ....|....|....|....|....|....|.. 3'
'-' strand: 3' ....|....|....|....|....|....|.. 5'
                   ----------+++++
                             P

'fiveprime':
                             P
                             +++++----------
'+' strand: 5' ....|....|....|....|....|....|.. 3'
'-' strand: 3' ....|....|....|....|....|....|.. 5'
                   ----------+++++
                                 P
\end{verbatim}
  \caption{Alignment schemes used by
    \Rfunction{coverage}. \texttt{+++} represents the read and
    \texttt{---} the extension. \texttt{P} is the alignment position
    as recorded under the corresponding leftmost or fiveprime
    schemes.}
  \label{fig:coverage}
\end{figure}
In the figure, the two strands are represented by \verb"....|",
aligned reads by \verb"+++", and extensions by \verb"---".  The idea
is that 5-nucleotide reads have been aligned to a reference sequence,
and the alignment extended by 10 nucleotides. In the `leftmost'
notation (used by \ELAND{}) and assuming that the reference sequence
is always numbered in relation to the plus strand and indexed starting
at 1 (\Rfunction{readAligned} translates reported alignment positions
so they are indexed from 1), the reported position is 15 for the
alignments on either the plus or the minus strand. In contrast the
`fiveprime' scheme the alignment to the plus strand is 15, and to the
minus strand 19. This is the scheme used by \MAQ{}, for instance.

The default behavior of \Rfunction{coverage} is to use the `leftmost'
coordinate system. This is appropriate for data derived from \ELAND{}.

\section{Advanced features}

\subsection{The \Rfunarg{pattern} argument to input functions}

Most \ShortRead{} input functions are designed to accept a directory
path argument, and a \Rfunarg{pattern} argument. The latter is a
grep-like pattern (as used by, e.g., \Rfunction{list.files}). Many
input functions are implemented so that all files matching the pattern
are read into a single large input object. Thus the
\texttt{s\_N\_LLLL\_seq.txt} files consist of four numeric columns and a
fifth column corresponding to the short read. The following code
illustrates the file structure and inputs the final column into a
\Rclass{DNAStringSet}:
<<readSeq>>=
seqFls <- list.files(baseCallPath(sp), "_seq.txt", full=TRUE)
strsplit(readLines(seqFls[[1]], 1), "\t")
colClasses <- c(rep(list(NULL), 4), "DNAString")
reads <- readXStringColumns(baseCallPath(sp), "s_1_0001_seq.txt",
                            colClasses=colClasses)
@ 
%% 
The more general pattern
<<readSeq-all>>=
reads <- readXStringColumns(baseCallPath(sp), "s_1_.*_seq.txt",
                            colClasses=colClasses)
@ 
%% 
inputs all lane 1 tile files into a single \Rclass{DNAStringSet} object.

\subsection{\Rfunction{srapply}}

Solexa and other short read technologies often include many files,
e.g., one \texttt{s\_L\_NNNN\_int.txt} file per tile, 300 tiles per
lane, 8 lanes per flow cell for 2400 \texttt{s\_L\_NNNN\_int.txt} files
per flow cell. A natural way to extract information from these is to
write short functions, e.g., to find the average intensity per base at
cycle 12.
<<calcInt-demo>>=
calcInt <- function(file, cycle, verbose=FALSE)
{
    if (verbose)
        cat("calcInt", file, cycle, "\n")
    int <- readIntensities(dirname(file), basename(file),
                           intExtension="", withVariability=FALSE)
    apply(intensity(int)[,,12], 2, mean)
}
@ 
One way to apply this function to all intensity files in a Solexa run
is
<<calcInt-sapply>>=
intFls <- list.files(imageAnalysisPath(sp), ".*_int.txt$", full=TRUE)
lres <- lapply(intFls, calcInt, cycle=12)
@ 
%% 
The files are generally large and numerous, so even simple
calculations consume significant computational resources. The
\Rfunction{srapply} function is meant to provide a transparent way to
perform calculations like this distributed over multiple nodes of an
MPI cluster, or across multiple cores of a single machine. Thus
<<srapply-simple>>=
srres <- srapply(intFls, calcInt, cycle=12)
identical(lres, srres)
@ 
%% 
evaluates the function as \Rfunction{lapply}, whereas
<<srapply-mpi, eval=FALSE>>=
library("Rmpi")
mpi.spawn.Rslaves(nsl=16)
srres <- srapply(intFls, calcInt, cycle=12)
mpi.close.Rslaves()
@ 
%% 
distributes the calculation over available workers, while 
<<srapply-multicore, eval=FALSE>>=
library(multicore)
srres <- srapply(intFls, calcInt, cycle=12)
@ 
%% 
distributes tasks across cores of a single machine. The result is a
speedup approximately inversely proportional to the number of
available compute nodes or cores; memory requirements for
the \Rpackage{multicore} approach may be substantial.

\section{Conclusions and directions for development}

\ShortRead{} provides tools for reading, manipulation, and quality
assessment of short read data. Current facilities in \ShortRead{}
emphasize processing of single-end Solexa data. 

Development priorities in the near term include expanded facilities
for importing key file types from additional manufactures, more
extensive quality assessment methodologies, and development of
infrastructure for paired-end reads.

%---------------------------------------------------------
% SessionInfo
%---------------------------------------------------------
\begin{table*}[tbp]
\begin{minipage}{\textwidth}
<<sessionInfo, results=tex, print=TRUE>>=
toLatex(sessionInfo())
@ 
\end{minipage}
\caption{\label{tab:sessioninfo}%
The output of \Rfunction{sessionInfo} on the build system 
after running this vignette.}
\end{table*}

\end{document}