%\VignetteIndexEntry{Assessing and Adjusting for Technical Bias in High Throughput Sequencing Data} 
%\VignettePackage{seqbias}
%\VignettePackage{seqbias, Rsamtools}

\documentclass{article}

\usepackage{Sweave}
\usepackage{graphicx}
\usepackage{utopia}

\SweaveOpts{keep.source=TRUE,eps=FALSE,include=FALSE}

\title{\texttt{seqbias} \\ Assessing and Adjusting for Technical Bias in High Throughput
Sequencing Data}

\author{Daniel Jones \\
\small{\texttt{<dcjones@cs.washington.edu>}} \\
\small{Computer Science \& Engineering} \\
\small{University of Washington}}

\date{\today}

\begin{document}

\maketitle


\section{Introduction}

This package is designed as a means to assess and adjust for technical bias in
high-throughput sequencing datasets, RNA-Seq being a specific target. As noted
in previous studies, RNA-Seq is often subject to protocol specific bias.  That
is, the number of reads mapping to a particular position of the genome is
dependent on the the surrounding nucleotide sequence (as well as the abundance
of the RNA transcript) \cite{Hansen2010} \cite{Li2010}. Accounting for this bias
increases uniformity of coverage and may result in more accurate
quantification.

The approach implemented here trains a simple Bayesian network classifier and
uses it to evaluate the per position bias.  This builds off work done by Hansen,
et. al.  \cite{Hansen2010}, available in the \texttt{Genominator} Bioconductor
package.  Another approach is taken by Li, et. al.  \cite{Li2010} in the
\texttt{mseq} package, available from CRAN.

For this vignette, we will use some example data taken from Mortazavi, et. al.
\cite{Mortazavi2008} (NCBI accession number SRR001358). Because of space
constraints, we have mapped the reads (using Bowtie \cite{Langmead2009}) to an
artificial genome consisting of approximately 100kb of exonic DNA.

This ``artificial genome'' is given as,
<<>>=
library(seqbias)
library(Rsamtools)
ref_fn <- system.file( "extra/example.fa", package = "seqbias" )
ref_f <- FaFile( ref_fn )
open.FaFile( ref_f )
@

And the mapped reads,
<<>>=
reads_fn <- system.file( "extra/example.bam", package = "seqbias" )
@


\section{Assessment}

As a natural first step, we would like to assess whether our sample is
significantly biased. If this proves to be the case, we may wish to correct
for this. A simple procedure to do so will be covered in the next section.

To assess the nucleotide frequency we will use a very simple procedure:

\begin{enumerate}
\item Generate a random sample of intervals across our reference genome.
\item Extract sequences for these intervals from a FASTA file.
\item Extract read counts across these intervals from a BAM file.
\item Using these sequences and counts, compute and plot nucleotide or $k$-mer
frequencies.
\end{enumerate}

\subsection*{Sampling}

For this step, we could use collection of known exons, but trustworthy
annotations are not always available, and biasing the analysis by known exons
may be a concern in some instances. Fortunately, \texttt{seqbias} provides a
function to generate random intervals.

First, we extract a vector of sequence lengths, in the reference genome. Given
an FASTA file than has been indexed with the \texttt{samtools faidx} command, we
can use the \texttt{Rsamtools} package to read off the sequence lengths and to
extract the sequence. First, the lengths,

<<>>=
ref_seqs <- scanFaIndex( ref_f )
@

Once we have this, we generate 5 intervals of 100kb. It most cases we would want
to generate a larger sample, but since we are working here with small reference
sequence with dense coverage, we can get an accurate measurement with a few
intervals. 

<<>>=
I <- random.intervals( ref_seqs, n = 5, m = 100000 )
@



\subsection*{Sequences}

Next we extract the nucleotide sequences,
<<>>=
seqs <- scanFa( ref_f, I )
@

The \texttt{scanFa} function does not respect strand, so we must be sure to perform
the reverse complement ourselves.

<<>>=
neg_idx <- as.logical( I@strand == '-' )
seqs[neg_idx] <- reverseComplement( seqs[neg_idx] )
@


\subsection*{Counts}

Finally, we count the number of reads mapping to each position in our sampled
intervals.
<<>>=
counts <- count.reads( reads_fn, I, binary = T )
@

Unless the \texttt{binary} argument is FALSE, this function returns a 0-1 vector,
where a position is 0 if no reads map to it, and 1 if \emph{at least} one read maps to
it. This is a more robust way to measure sequencing bias, as the frequencies can not
get dominated by a few very high peaks.


\subsection*{Frequencies}

At last, we compute the $k$-mer frequency (where $k=1$, by default).
<<>>=
freqs <- kmer.freq( seqs, counts )
@

A nice way to plot this is with the \texttt{ggplot2} package, if available.

<<fig1,fig=TRUE,width=7,height=4>>=
if( require(ggplot2) ) {
  P <- qplot( x = pos,
              y = freq,
              ylim = c(0.15,0.4),
              color = seq,
              data  = freqs,
              geom  = "line" )
  P <- P + facet_grid( seq ~ . )
  print(P)
} else {
  par( mar = c(5,1,1,1), mfrow = c(4,1) )
  with( subset( freqs, seq == "a" ),
        plot( freq ~ pos, ylim = c(0.15,0.4), sub = "a", type = 'l' ) )
  with( subset( freqs, seq == "c" ),
        plot( freq ~ pos, ylim = c(0.15,0.4), sub = "c", type = 'l' ) )
  with( subset( freqs, seq == "g" ),
        plot( freq ~ pos, ylim = c(0.15,0.4), sub = "g", type = 'l' ) )
  with( subset( freqs, seq == "t" ),
        plot( freq ~ pos, ylim = c(0.15,0.4), sub = "t", type = 'l' ) )
} 
@

Doing so will produce the following plot,
\begin{center}
\includegraphics[width=\textwidth]{overview-fig1}
\end{center}

The x-axis shows the nucleotide position relative to the read start. Negative
number occur in the genome to the left of mapped reads. In this set, the reads
consist of positions 0-24.

We can see a clear bias here in positions 0-15, approximately. The rest of the
plot looks relatively flat, as we would expect if the experiment was measuring
abundance only and not a biased by the nucleotide sequence.  In the next section
we will adjust read counts to account for this.


\section{Compensation}

\subsection*{Training}

To begin, we must fit a seqbias model to our dataset. This done very easily with
the \texttt{seqbias.fit} function. This will take only a few seconds, but when
more reads are available a more accurate model can be trained at the expense of
the training procedure taking up to several minutes.

<<results=hide>>=
sb <- seqbias.fit( ref_fn, reads_fn, L = 5, R = 15 )
@

The \texttt{L} and \texttt{R} arguments control the maximum number of positions
to the left and right of the read start that may be considered. The model tries
to consider only informative positions, so increasing these numbers will
increase training time, but should never have a negative effect the accuracy of
the model.


\subsection*{Prediction}

Once we have trained the seqbias model, we can use it to predict the sequencing
bias across a set of intervals.
<<>>=
bias <- seqbias.predict( sb, I )
@

\subsection*{Adjustment}

To adjust, we will can simply divide the \texttt{counts} vectors by the
\texttt{bias} vectors.
<<>>=
counts.adj <- mapply( FUN = `/`, counts, bias, SIMPLIFY = F )
@

The post-adjustment nucleotide frequencies can then be measured as before,
<<>>=
freqs.adj <- kmer.freq( seqs, counts.adj )
@

And plotted,
<<fig2,fig=TRUE,width=7,height=4>>=
if( require(ggplot2) ) {
  P <- qplot( x = pos,
              y = freq,
              ylim = c(0.15,0.4),
              color = seq,
              data  = freqs.adj,
              geom  = "line" )
  P <- P + facet_grid( seq ~ . )
  print(P)
} else {
  par( mar = c(5,1,1,1), mfrow = c(4,1) )
  with( subset( freqs.adj, seq == "a" ),
        plot( freq ~ pos, ylim = c(0.15,0.4), sub = "a", type = 'l' ) )
  with( subset( freqs.adj, seq == "c" ),
        plot( freq ~ pos, ylim = c(0.15,0.4), sub = "c", type = 'l' ) )
  with( subset( freqs.adj, seq == "g" ),
        plot( freq ~ pos, ylim = c(0.15,0.4), sub = "g", type = 'l' ) )
  with( subset( freqs.adj, seq == "t" ),
        plot( freq ~ pos, ylim = c(0.15,0.4), sub = "t", type = 'l' ) )
}
@

The plot below results,
\begin{center}
\includegraphics[width=\textwidth]{overview-fig2}
\end{center}

Compared to the first figure, the improvement is clear.


\section{Save / Load}

If the model is fit using a large number of reads, it can take several minutes
to train. To avoid repeatedly refitting the model, \texttt{seqbias} provides a
mechanism to save and load the model to a YAML file with the
\texttt{seqbias.save} and \texttt{seqbias.load} functions.

<<>>=
seqbias.save( sb, "my_seqbias_model.yml" )

# load the model sometime later
sb <- seqbias.load( ref_fn, "my_seqbias_model.yml" )
@

Note when loading the model, we need to provide a reference sequence. The
\texttt{seqbias} object keeps track of the reference sequence to make
\texttt{seqbias.predict} more convenient.


\section{Session Info}
<<>>=
sessionInfo()
@


\bibliographystyle{plain}
\bibliography{overview}

\end{document}