%\VignetteIndexEntry{Working with the ShortRead Package} %\VignetteDepends{Genominator, ShortRead, yeastRNASeq} %\VignettePackage{Genominator} \documentclass[letterpaper,12pt]{article} <>= options(width=70) @ %%%%%%%%%%%%%%%%%%%%%%%% Standard Packages %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \usepackage{epsfig} \usepackage{graphicx} \usepackage{graphics} \usepackage{amssymb} \usepackage{amsmath} \usepackage{mathrsfs} \usepackage{fancyvrb} \usepackage{caption} \usepackage{comment} \usepackage{fancyhdr} \usepackage{hyperref} \usepackage{color} %%%%%%%%%%%%%%%%%%%%%%%% Adapted from Sweave %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \DefineVerbatimEnvironment{Rcode}{Verbatim}{fontshape=sl, frame=single, framesep=2mm, fontsize=\small, baselinestretch=.5} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%%%%% My macros (which of course are borrowed from a million ... %% \def\argmax{\operatornamewithlimits{arg\,max}} \def\argmin{\operatornamewithlimits{arg\,min}} \newcommand{\Rfunction}[1]{{\texttt{#1}}} \newcommand{\Robject}[1]{{\texttt{#1}}} \newcommand{\Rpackage}[1]{{\textit{#1}}} \newcommand{\myurl}[1]{\href{http://#1}{\textcolor{red}{\texttt{#1}}}} \newcommand{\myem}[1]{\structure{#1}} %%%%%%%%%%%%%%%%%%%%%%%% Page and Document Setup %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \addtolength{\oddsidemargin}{-0.875in} \addtolength{\topmargin}{-0.875in} \addtolength{\textwidth}{1.75in} \addtolength{\textheight}{1.75in} \captionsetup{margin=15pt,font=small,labelfont=bf} \renewcommand{\topfraction}{0.9} % max fraction of floats at top \renewcommand{\bottomfraction}{0.8} % max fraction of floats at bottom % Parameters for TEXT pages (not float pages): \setcounter{topnumber}{2} \setcounter{bottomnumber}{2} \setcounter{totalnumber}{4} % 2 may work better \setcounter{dbltopnumber}{2} % for 2-column pages \renewcommand{\dbltopfraction}{0.9} % fit big float above 2-col. text \renewcommand{\textfraction}{0.07} % allow minimal text w. figs % Parameters for FLOAT pages (not text pages): \renewcommand{\floatpagefraction}{0.7} % require fuller float pages % N.B.: floatpagefraction MUST be less than topfraction !! \renewcommand{\dblfloatpagefraction}{0.7} % require fuller float pages %%%%%%%%%%%%%%%%%%%%%%% options for sweave %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \SweaveOpts{prefix.string=plots/withshortread} %%%%%%%%%%%%%%%%%%%%%% headers and footers %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \pagestyle{fancy} \renewcommand{\footrulewidth}{\headrulewidth} %%%%%%%%%%%%%%%%%%%%%%%%% bibliography %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \bibliographystyle{plainnat} %%%%%%%%%%%%%%%%%%%%%%% opening %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \title{Working with the ShortRead Package} \author{James Bullard, Kasper Daniel Hansen} \begin{document} \maketitle In this document we show how to use the Genominator package with ShortRead. In this document, we will demonstrate some analysis -- maybe even some useful first steps when analyzing RNA-Seq data. <>= require(Genominator) require(ShortRead) require(yeastRNASeq) @ We will use some data sets from the yeastRNASeq data package. Lets first see what this package contains: <>= data(package = "yeastRNASeq")[[3]][, c("Item", "Title")] @ At first, we will focus on the AlignedRead list. This object is a list of 4 \Rfunction{AlignedRead}'s, each representing the alignment of .5 million reads. The 4 elements of the list corresponds to 2 samples (mut, wt) each sequenced on 2 lanes. <>= data("yeastAligned") names(yeastAligned) yeastAligned[[1]] @ The first step is to make a database: <>= eData <- importFromAlignedReads(yeastAligned, chrMap = levels(chromosome(yeastAligned[[1]])), filename = "my.db", tablename = "raw", overwrite = TRUE) head(eData) @ \Rfunction{importFromAlignedReads} takes exactly such a (named) list of \Rfunction{AlignedReads} and aggregates each experiment and finally joins them together. Certainly, the most cryptic portion of this call is the chromosome map (the \texttt{chrMap} argument). The Genominator package requires that all chromosomes be stored as integer values. Therefore, we must to convert the ``chromosome names'' to integers. We do this by specifying a rule described as a named vector of integer values, like <>= levels(chromosome(yeastAligned[[1]])) @ Sometimes the call to \Rfunction{levels} is not enough and we need to do some reordering and perhaps the addition of extra chromosomes. The result is an ExpData object which has been aggregated over, i.e.\ each position in the genome which was represented by > 0 reads at its 5' most end will have that number of reads. Note how we deal with reads mapping to the reverse strand of the genome (by using the 5' end of the read as its location). We can now take advantage of all of the functionality of the Genominator package, using (parts of) the yeast annotation from SGD. <>= data(yeastAnno) head(yeastAnno) @ Once again, we have to deal with the fact that people like to represent chromosomes in different ways. In this case, SGD represents chromosomes with Roman numerals. Additionally, we prepare the annotation object to be consumable by the Genominator functions by making sure that we have columns \texttt{start}, \texttt{end} present and that \texttt{strand} takes values in $\{-1,0,1\}$. <>= yeastAnno$chr <- match(yeastAnno$chr, c(as.character(as.roman(1:16)), "MT", "2-micron")) yeastAnno$start <- yeastAnno$start_position yeastAnno$end <- yeastAnno$end_position rownames(yeastAnno) <- yeastAnno$ensembl yeastAnno <- yeastAnno[, c("chr", "start", "end", "strand", "gene_biotype")] head(yeastAnno) @ With this amount of processing we are now able to do some high level analysis. First we compute, for each annotated region, the number of reads hitting that region. Note that we may double count reads, if two regions overlap (which they often do in yeast). <>= geneCounts <- summarizeByAnnotation(eData, yeastAnno, ignoreStrand = TRUE) head(geneCounts) @ We can see how ``good'' the replicates are by assessing whether it fits the Poisson model of constant gene expression across lanes with variable sequencing effort. <>= plot(regionGoodnessOfFit(geneCounts, groups = gsub("_[0-9]_f", "", colnames(geneCounts))), chisq = TRUE) @ \end{document}