%\VignetteIndexEntry{Working with aligned nucleotides}
%\VignetteDepends{GenomicAlignments, RNAseqData.HNRNPC.bam.chr14, pasillaBamSubset, BSgenome.Hsapiens.UCSC.hg19, BSgenome.Dmelanogaster.UCSC.dm3, GenomicFeatures, TxDb.Hsapiens.UCSC.hg19.knownGene, TxDb.Dmelanogaster.UCSC.dm3.ensGene}
%\VignetteKeywords{sequence, sequencing, alignments}
%\VignettePackage{GenomicAlignments}

\documentclass{article}

<<style, eval=TRUE, echo=FALSE, results=tex>>=
BiocStyle::latex()
@

\title{Working with aligned nucleotides (WORK-IN-PROGRESS!)}
\author{Herv\'e Pag\`es}
\date{Last modified: January 2014; Compiled: \today}

\begin{document}

\maketitle

\tableofcontents



%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Introduction}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

This vignette belongs to the \Rpackage{GenomicAlignments} package.
It illustrates how to use the package for working with the nucleotide
content of aligned reads.

After the reads generated by a high-throughput sequencing experiment have
been aligned to a reference genome, the questions that are being asked
about these alignments typically fall in two broad categories:
{\bf positional only} and {\bf nucleotide-related}.

{\bf Positional only} questions are about the position of the alignments
with respect to the reference genome. Note that the position of an alignment
is actually better described in terms of genomic ranges (1 range for an
alignment with no gaps, 2 or more ranges for an alignment with gaps).
Knowing the ranges of the alignments is sufficient to perform common tasks
like {\it read counting} or for {\it computing the coverage}.
{\it Read counting} is the process of counting the number of aligned
reads per gene or exon and is typically performed in the context of
a differential analysis. This task can be accomplished with the
\Rfunction{summarizeOverlaps} function provided in the
\Rpackage{GenomicAlignments} package and is explained in details
in the ``Counting reads with summarizeOverlaps'' vignette (also located
in this package).
{\it Computing the coverage} is often the preliminary step to peak
detection (ChIP-seq analysis) or to a copy number analysis.
It can be accomplished with the \Rfunction{coverage} function.
See \Rcode{?\`{}coverage-methods\`{}} for more information.

{\bf Nucleotide-related} questions are about the nucleotide content
of the alignments. In particular how this content compares to the
corresponding nucleotides in the reference genome. These questions
typically arise in the context of small genetic variation detection
between one or more samples and a reference genome.

The \Rpackage{GenomicAlignments} package provides a suite of low- to
mid-level tools for dealing with {\bf nucleotide-related} questions
about the alignments. In this vignette we illustrate their use on the
single-end and paired-end reads of an RNA-seq experiment.

Note that these tools do NOT constitute a complete variant toolbox.
If this is what you're looking for, other \Bioconductor{} packages
might be more appropriate. See the GeneticVariability and SNP views
at this URL
\url{http://bioconductor.org/packages/release/BiocViews.html#___AssayDomains}
for a complete list of packages that deal with small genetic variations.
Most of them provide tools of higher level than the tools described in
this vignette. See for example the \Rpackage{VariantTools} and
\Rpackage{VariantAnnotation} packages for a complete variant toolbox
(including variant calling capabilities).



%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Load the aligned reads and their sequences from a BAM file}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%


In this section, we illustrate how aligned reads and their sequences
can be loaded from a BAM file. The reads we're going to use for this
are paired-end reads from a published study by Zarnack et al., 2012.
A subset of these reads are stored in the BAM files located in the
\Rpackage{RNAseqData.HNRNPC.bam.chr14} data package. The package
contains 8 BAM files, 1 per sequencing run:

<<bamfiles>>=
library(RNAseqData.HNRNPC.bam.chr14)
bamfiles <- RNAseqData.HNRNPC.bam.chr14_BAMFILES
names(bamfiles)  # the names of the runs
@

Each BAM file was obtained by (1) aligning the reads (paired-end) to
the full hg19 genome with TopHat2, and then (2) subsetting to keep
only alignments on chr14. See \Rcode{?RNAseqData.HNRNPC.bam.chr14}
for more information about this data set.

As a preliminary step, we check whether the BAM files contain single-
or paired-end alignments. This can be done with the
\Rfunction{quickBamFlagSummary} utility from the \Rpackage{Rsamtools}
package:

<<quickBamFlagSummary>>=
library(Rsamtools)
quickBamFlagSummary(bamfiles[1], main.groups.only=TRUE)
@

This confirms that all the alignments in the 1st BAM file (run ERR127306)
are paired-end. This means that we should preferably load them with the
\Rfunction{readGAlignmentPairs} function from the \Rpackage{GenomicAlignments}
package. However for the purpose of keeping things simple, we will
ignore the pairing for now and load only the alignments corresponding to
the first segment of the pairs. We will use the \Rfunction{readGAlignments}
function from the \Rpackage{GenomicAlignments} package for this, together
with a \Rclass{ScanBamParam} object for the filtering. See
\Rcode{?ScanBamParam} in the \Rpackage{Rsamtools} package for the details.

Furthermore, while preparing the \Rclass{ScanBamParam} object to perform
the filtering, we'll also get rid of the PCR or optical duplicates
(flag bit 0x400 in the SAM format, see the SAM Spec
\footnote{\url{http://samtools.sourceforge.net/}} for the details),
as well as reads not passing quality controls (flag bit 0x200 in the
SAM format).

Finally we also request the read sequences (a.k.a. the {\it segment sequences}
in the SAM Spec, stored in the SEQ field) via the \Rclass{ScanBamParam}
object:

<<ScanBamParam>>=
flag1 <- scanBamFlag(isFirstMateRead=TRUE, isSecondMateRead=FALSE,
                     isDuplicate=FALSE, isNotPassingQualityControls=FALSE)
param1 <- ScanBamParam(flag=flag1, what="seq")
@

We're now ready to load the alignments and their sequences. Note that we
use \Rcode{use.names=TRUE} in order to also load the {\it query names}
(a.k.a. the {\it query template names} in the SAM Spec, stored in the
QNAME field) from the BAM file. \Rfunction{readGAlignments} will use
them to set the names of the returned object:

<<readGAlignments>>=
library(GenomicAlignments)
gal1 <- readGAlignments(bamfiles[1], use.names=TRUE, param=param1)
@

This returns a \Rclass{GAlignments} object. The read sequences are stored
in the \Rcode{seq} metadata column of the object:

<<read_sequences>>=
mcols(gal1)$seq
@



%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Compute the {\it original query sequences}}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%


Because the BAM format imposes that the read sequence is ``reverse
complemented'' when a read aligns to the minus strand, we need to
``reverse complement'' it again to restore the {\it original query
sequences} (i.e. the sequences before alignment, that is, as they can
be seen in the FASTQ file assuming that the aligner didn't perform
any hard-clipping on them):

<<original-query-sequences>>=
oqseq1 <- mcols(gal1)$seq
is_on_minus <- as.logical(strand(gal1) == "-")
oqseq1[is_on_minus] <- reverseComplement(oqseq1[is_on_minus])
@

Because the aligner used to align the reads can report more than 1
alignment per read (i.e. per sequence stored in the FASTQ file), we
shouldn't expect the names of \Rcode{gal1} to be unique:

<<is_dup>>=
is_dup <- duplicated(names(gal1))
table(is_dup)
@

However, sequences with the same {\it query name} should correspond to
the same {\it original query} and therefore should be the same. Let's do
a quick sanity check:

<<same-name-implies-same-seq-in-U1-oqseq>>=
dup2unq <- match(names(gal1), names(gal1))
stopifnot(all(oqseq1 == oqseq1[dup2unq]))
@

Finally, let's reduce \Rcode{oqseq1} to one {\it original query sequence}
per unique {\it query name} (like in the FASTQ file containing the 1st end
of the unaligned reads):

<<oqseq1>>=
oqseq1 <- oqseq1[!is_dup]
@


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Mismatches, indels, and gaps}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%


Because the aligner possibly tolerated a small number of mismatches, indels,
and/or gaps during the alignment process, the sequences in
\Rcode{mcols(gal1)\$seq} gnerally don't match exactly the reference genome.

The information of where indels and/or gaps occur in the alignments is
represented in the CIGAR strings. Let's have a look at these string.
First the most frequent cigars:

<<most_frequent_cigars>>=
head(sort(table(cigar(gal1)), decreasing=TRUE))
@

Then a summary of the total number of insertions (I), deletions (D), and
gaps (N):

<<cigarOpTable>>=
colSums(cigarOpTable(cigar(gal1)))
@

This tells us that the aligner that was used supports indels (I/D) and
junction reads (N).

Finally we count and summarize the number of gaps per alignment:

<<table_njunc>>=
table(njunc(gal1))
@

Some reads contain up to 3 gaps (i.e. span 3 splice junctions).



%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Put the read sequences and reference sequences ``side by side''}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%


TODO (with \Rfunction{sequenceLayer})


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{OLD STUFF (needs to be recycled/updated)}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Load paired-end reads from a BAM file}

BAM file {\tt untreated3\_chr4.bam} (located in the
\Rpackage{pasillaBamSubset} data package) contains paired-end reads from
the ``Pasilla'' experiment and aligned against the dm3 genome (see
\Rcode{?untreated3\_chr4} in the \Rpackage{pasillaBamSubset} package for
more information about those reads).
We use the \Rfunction{readGAlignmentPairs} function to load them
into a \Rclass{GAlignmentPairs} object:

<<readGAlignmentPairs>>=
library(pasillaBamSubset)
flag0 <- scanBamFlag(isDuplicate=FALSE, isNotPassingQualityControls=FALSE)
param0 <- ScanBamParam(flag=flag0)
U3.galp <- readGAlignmentPairs(untreated3_chr4(), use.names=TRUE, param=param0)
head(U3.galp)
@

The \Rcode{show} method for \Rclass{GAlignmentPairs} objects displays
two {\tt ranges} columns, one for the {\it first} alignment in the pair (the
left column), and one for the {\it last} alignment in the pair (the right
column). The {\tt strand} column corresponds to the strand of the {\it first}
alignment.

<<first-and-last-U3.galp>>=
head(first(U3.galp))
head(last(U3.galp))
@

According to the SAM format specifications, the aligner is expected to mark
each alignment pair as {\it proper} or not (flag bit 0x2 in the SAM format).
The SAM Spec only says that a pair is {\it proper} if the {\it first} and
{\it last} alignments in the pair are ``properly aligned according to the
aligner''. So the exact criteria used for setting this flag is left to the
aligner.

We use \Rcode{isProperPair} to extract this flag from the
\Rclass{GAlignmentPairs} object:

<<isProperPair>>=
table(isProperPair(U3.galp))
@

Even though we could do {\it overlap encodings} with the full object,
we keep only the {\it proper} pairs for our downstream analysis:

<<keep-only-proper-pairs>>=
U3.GALP <- U3.galp[isProperPair(U3.galp)]
@

Because the aligner used to align those reads can report more than 1 alignment
per {\it original query template} (i.e. per pair of sequences stored in the
input files, typically 1 FASTQ file for the {\it first} ends and 1 FASTQ file
for the {\it last} ends), we shouldn't expect the names of \Rcode{U3.GALP} to
be unique:

<<U3.GALP_names_is_dup>>=
U3.GALP_names_is_dup <- duplicated(names(U3.GALP))
table(U3.GALP_names_is_dup)
@

Storing the {\it query template names} in a factor will be useful:

<<U3.GALP_qnames>>=
U3.uqnames <- unique(names(U3.GALP))
U3.GALP_qnames <- factor(names(U3.GALP), levels=U3.uqnames)
@

as well as having the mapping between each {\it query template name} and its
first occurence in \Rcode{U3.GALP\_qnames}:

<<U3.GALP_dup2unq>>=
U3.GALP_dup2unq <- match(U3.GALP_qnames, U3.GALP_qnames)
@

Our reads can have up to 1 gap per end:

<<gaps-in-U3.GALP>>=
head(unique(cigar(first(U3.GALP))))
head(unique(cigar(last(U3.GALP))))
table(njunc(first(U3.GALP)), njunc(last(U3.GALP)))
@

Like for our single-end reads, the following tables indicate that indels were
not allowed/supported during the alignment process:

<<no-indels-in-U3.GALP>>=
colSums(cigarOpTable(cigar(first(U3.GALP))))
colSums(cigarOpTable(cigar(last(U3.GALP))))
@



%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{\Rcode{sessionInfo()}}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

<<sessionInfo>>=
sessionInfo()
@

\end{document}