%\VignetteIndexEntry{ArrayExpressHTS: RNA-Seq Pipeline for transcription profiling experiments} %\VignetteDepends{} %\VignetteKeywords{HTS, transcription, profiling, RNA-Seq, pipeline} %\VignettePackage{ArrayExpressHTS} \documentclass[a4paper]{article} \usepackage{times} \usepackage{a4wide} \usepackage{verbatim} \newcommand{\Robject}[1]{\texttt{#1}} \newcommand{\Rpackage}[1]{\textit{#1}} \newcommand{\Rclass}[1]{\textit{#1}} \newcommand{\Rfunction}[1]{{\texttt{#1}}} \newcommand{\Rmethod}[1]{{\texttt{#1}}} \newcommand{\Rfunarg}[1]{{\texttt{#1}}} \newcommand{\Rcode}[1]{{\texttt{#1}}} \newcommand{\software}[1]{\textsf{#1}} \newcommand{\R}{\software{R}} \newcommand{\Rsamtools}{\Rpackage{Rsamtools}} \newcommand{\ArrayExpressHTS}{\Rpackage{ArrayExpressHTS}} \newcommand{\fastq}{\texttt{fastq}} \newcommand{\MAGETAB}{\texttt{MAGE-TAB}} \clearpage \begin{document} \title{ An Introduction to \Rpackage{ArrayExpressHTS}: RNA-Seq Pipeline for transcription profiling experiments } \author{Andrew Tikhonov, Angela Goncalves} \date{Modified: 15 March, 2011. Compiled: \today} \maketitle @ % \section{ArrayExpressHTS Package} \ArrayExpressHTS{} is an R based pipeline for pre-processing, expression estimation and data quality assessment of high throughput sequencing transcriptional profiling (RNA-seq) datasets. The pipeline starts from raw sequence files and produces standard Bioconductor R objects containing transcript measurements for downstream analysis along with web reports for data quality assessment. It may be run locally on a user's own computer or remotely on a distributed R-cloud farm at the European Bioinformatics Institute. It can be used to analyse user's own datasets or public RNA-seq datasets from the ArrayExpress Archive. \section{General Overview} The pipeline starts from raw sequence files and produces standard Bioconductor R objects of class \Rclass{ExpressionSet} containing expression levels over features for downstream analysis, along with HTML reports for data quality assessment. Further informative intermediate data, such as alignment and annotation files, are also available. The steps of the pipeline are: \begin{itemize} \item Prepare the data and experimental metadata \item Assess raw data quality and produce quality report \item Align sequencing data files to a reference \item Filter reads \item Assess alignment quality and produce quality report \item Estimate expression \item Generate a compared report \end{itemize} \section{Running AEHTS on the EBI R-Cloud} \subsection{Find a dataset to analyse or upload your own data} \ArrayExpressHTS{} can be used on the EBI R-Cloud to analyse public and private datasets available through ArrayExpress. You can use the web interface\footnote{http://www.ebi.ac.uk/arrayexpress/} to search for datasets in ArrayExpress. % % The procedure of getting one's data available for analysis on the cloud is to submit it to the AE Archive, where the data will remain password protected. Simple \MAGETAB{} templates for submission can be obtained online\footnote{http://www.ebi.ac.uk/fg/submissions\_overview.html} and curators will assist in file preparation and validation. % \subsection{Launch the R-Cloud Workbench, register and create a new project} The \texttt{EBI R-Cloud} is a new service at the EBI which allows R users to log in and run distributed computational jobs remotely on a powerful 64-bit linux cluster. The \texttt{R-Cloud} is available through a GUI client, the \texttt{R-Cloud Workbench}. Follow the instructions on the page\footnote{http://www.ebi.ac.uk/tools/rcloud} to download or launch the Workbench. % When connecting for the first time you will be requested to register. You will need to provide a username, password and e-mail address to allow you to retrieve your long running projects the next time you log in. Once registered you can log in and create a new project. % To find your way around the workbench visit the documentation online\footnote{http://www.ebi.ac.uk/Tools/rcloud/quick\_start.html}. % \subsection{Run the whole pipeline with default options } Running \ArrayExpressHTS{} within R with default options is straight forward with a call to function \Rfunction{ArrayExpressHTS}. You must provide an ArrayExpress accession number for the dataset you want to analyse. For example, the publicly available dataset \texttt{E-GEOD-16190} comes from a study by Chepelev et al. on detecting single nucleotide variations in expressed exons of the human genome. To re-analyse this data with default options run the following commands on t he Workbench console: <>= library("ArrayExpressHTS") aehts <- ArrayExpressHTS("E-GEOD-16190") @ When the pipeline finishes, \Robject{e} will contain an \Rclass{ExpressionSet} object that can then be used for further analysis. % Running the whole pipeline will take time depending on the size of the dataset and the amount of computing nodes allocated for parallel computation. Reports and intermediate files are available as soon as they are ready and can be visualised through the File Browsing tab. A directory will be automatically created on your working space for each of the samples in the dataset. Any file can be copied by dragging and dropping from the File Browser into, for example, your Desktop. Beware of the file size, which is generally the reason why downloading takes long. % To run the pipeline with options other than the default ones check the 'Configuration options' section. \section{ArrayExpressHTS Local Configuration} \subsection{Pre-requisites} In order to run the pipeline locally you will need do have at least: \begin{itemize} \item Unix based OS (linux, MacOS). \item R environment for statistical computing\footnote{http://www.r-project.org/} \item Bioconductor\footnote{http://www.bioconductor.org/install/index.html} The following packges are required: \Rpackage{Rsamtools}, \Rpackage{IRanges}, \Rpackage{Biostrings}, \Rpackage{ShortRead}, \Rpackage{Hmisc}, \Rpackage{R2HTML}, \Rpackage{XML}, \Rpackage{Biobase}, \Rpackage{svMisc}, \Rpackage{sampling}, \Rpackage{xtable}, \Rpackage{DESeq}, \Rpackage{ArrayExpress}, \Rpackage{RColorBrewer}, \Rpackage{biomaRt} \item SAMtools installed\footnote{http://samtools.sourceforge.net/} \item At least one aligner installed: BWA\footnote{http://bio-bwa.sourceforge.net/} and/or Bowtie\footnote{http://bowtie-bio.sourceforge.net/tutorial.shtml} and/or TopHat\footnote{http://tophat.cbcb.umd.edu/} \item Cufflinks\footnote{http://cufflinks.cbcb.umd.edu/tutorial.html} and/or MMSEQ\footnote{http://www.bgx.org.uk/software/mmseq.html} \item \ArrayExpressHTS{} package installed. \end{itemize} Check out 'Running AEHTS on the EBI R-Cloud' section to try out the pipeline on publicly available datasets without having to install any of these. \subsection{Process public ArrayExpress dataset} It is possible to run \ArrayExpressHTS{} on publicly available ArrayExpress data on your computer. It will take longer as the data will be processed sequentially as opposed to R-Cloud where it's processed in parallel. However this possibillity provides a great benefit and can of course be of use. You need to provide an ArrayExpress accession number for the dataset you want to analyse. For example, the publicly available dataset \texttt{E-GEOD-16190} that was mentioned above by Chepelev et al. on detecting single nucleotide variations in expressed exons of the human genome. To run the pipeline, execute the following R command: <>= library("ArrayExpressHTS") aehts <- ArrayExpressHTS("E-GEOD-16190", usercloud = FALSE) @ The parameter 'usercloud' signals the pipeline that it is running in a non-parallel environment rather than R-Cloud. When the pipeline finishes, \Robject{e} will contain an \Rclass{ExpressionSet} object that can then be used for further analysis. \subsection{Prepare your own data} Users can process their owne datasets, that are non public, experimental or other. Data analysis starts with obtaining the input raw read files and creation of corresponding experimental metadata. Create a directory with the name you want to give your project and a subdirectory named 'data' on that project directory to hold your raw read and metadata files. <>= dir.create("testExperiment") dir.create("testExperiment/data") @ If your data is paired the pipeline is expecting the mates to be separated in two files with the names ending with \_1 and \_2, e.g.: 1974\_1.fastq and 1974\_2.fastq. Place your \fastq{} files into the 'data' directory. The experimental metadata serves to create a set of options used to configure the analysis and includes: \begin{itemize} \item essential sample annotation, including the experimental factors and their values (e.g. sex of the sample, time in a time series experiment); \item essential information about the biological system from which samples were taken, for instance, the organism and organism part (if known) \item experimental protocol information, such as the retaining of strand information and the insert size in paired-end reads; \item experiment design information including the links between files and samples and their experimental factors; \item machine related information, such as the instrument used and the scale of the quality information. \end{itemize} This information can be easily conveyed using the \MAGETAB{} format\footnote{http://tab2mage.sourceforge.net/docs/magetab\_docs.html}, in particular using the \texttt{IDF} (Investigation Description Format) and the \texttt{SDRF} (Sample and Data Relationship Format) files. You can download and adapt the examples online\footnote{http://www.ebi.ac.uk/Tools/rwiki/Wiki.jsp?page=ArrayExpressHTS\%20Data\%20Preparation.} Name these files somename.idf.txt and somename.sdrf.txt. The \texttt{IDF} and \texttt{SDRF} should be placed into the 'data' folder. \subsection{Preparing the References and Annotation data} References are the organism's genome or transcriptome your reads will be aligned to. For the alignment the references also need to be indexed. References can be stored anywhere on the filesystem as later it can be specified in \Rfunction{ArrayExpressHTS} where they are located. If you would like to use a custom reference not available in \texttt{Ensembl} do prepare custom references. To get a reference genome from \texttt{Ensembl} you can use the \Rfunction{prepareReference} command with the organism, version and type. <>= # create directory # # Please note, tempdir() is used for automamtic test # execution. Select directory more appropriate and # suitable for keeping reference data. # referencefolder = paste(tempdir(), "/reference", sep = "") dir.create(referencefolder) # download and prepare reference prepareReference("Homo_sapiens", version = "GRCh37.61", type = "genome", aligner = "bowtie", location = referencefolder ) prepareReference("Homo_sapiens", version = "GRCh37.61", type = "transcriptome", aligner = "bowtie", location = referencefolder ) prepareReference("Mus_musculus", version = "current", type = "genome", location = referencefolder ) prepareReference("Mus_musculus", version = "current", type = "transcriptome", location = referencefolder ) @ % Annotation is used throughout all steps of the pipeline and therefore needs to be prepared as well: <>= # download and prepare annotation prepareAnnotation("Homo_sapiens", "current", location = referencefolder ) prepareAnnotation("Mus_musculus", "NCBIM37.61", location = referencefolder ) @ % \subsection{Running the pipeline} After preparing your own data and references you are ready to launch the pipeline. Start R, load the ArrayExpressHTS package and call the \Rfunction{ArrayExpressHTSFastQ} function. Specify the location of the reference data in 'refdir' parameter. Below is the example experiments that comes with the pipeline, which is a very shortened version of E-GEOD-16190 experiment \footnote{http://www.ebi.ac.uk/arrayexpress/experiments/E-GEOD-16190}. <>= # In ArrayExpressHTS/expdata there is testExperiment, which is # a very short version of E-GEOD-16190 experiment, placed there # for testing reasons. # # Experiment in ArrayExpress: # http://www.ebi.ac.uk/arrayexpress/experiments/E-GEOD-16190 # # the following piece of code will take ~1.5 hours to compute # on local PC and ~10 minutes on R-Cloud # # if executed on a local PC, make sure tools are available # to the pipeline. Check initDefaultEnvironment help page # for instructions. # # create a temporary folder where experiment will be copied # computing experiment in the package folder may cause issues # with file permissions and therefore failures. # # srcfolder <- system.file("expdata", "testExperiment", package="ArrayExpressHTS"); dstfolder <- tempdir(); file.copy(srcfolder, dstfolder, recursive = TRUE); # run the pipeline # # set usercloud = FALSE if executing on local PC, # therefore parallel computation will be disabled # aehts = ArrayExpressHTSFastQ(accession = "testExperiment", organism = "Homo_sappiens", dir = dstfolder); # load the expression set object loadednames = load(paste(dstfolder, "/testExperiment/eset_notstd_rpkm.RData", sep="")); loadednames; get('library')(Biobase); # print out the expression values # head(assayData(eset)$exprs); # print out the experiment meta data experimentData(eset); pData(eset); @ % \section{Configuration options} You can override the default options used by \Rfunction{ArrayExpressHTS} by setting the function's \Rfunarg{options} parameter to a \Robject{list} specifying values for each option. The following options are available: \begin{itemize} \item \emph{stranded}: set to TRUE if a strand specific protocol was used \item \emph{insize}: an integer, which will be automatically determined if set to NULL \item \emph{insizedev}: an integer, which will be automatically determined if set to NULL \item \emph{reference}: The reference should be set to either 'genome' or 'transcriptome'. \item \emph{aligner}: 'tophat', 'bowtie', 'bwa' or 'custom' \item \emph{aligner\_options}: string of options to be directly passed to the aligners according to their own manual pages \item \emph{count\_feature}: count over 'genes' or 'transcripts' \item \emph{count\_method}: 'cufflinks', 'mmseq' or 'count' \emph{count} can only be used with the reference set to transcriptome, though it will estimate gene level counts if count\_feature is set to transcript and the sequence names in the reference include both transcript and gene names (e.g. see fasta files from Ensembl). It involves counting reads overlapping known transcripts. Reads are discarded if they overlap more than one isoform of the same gene or there is some ambiguity as from which gene they originated from. Count values are thus not very useful by themselves but can be used for comparison of expression between conditions. Discarding multi-mapping reads leads to information loss and systematic underestimation of expression. The \software{mmseq} and \software{cufflinks} statistical methods can be used estimate gene and transcript level expression taking into account all reads. \emph{mmseq} can only used with \texttt{SAM/BAM} files generated by the \software{TopHat} or \software{Bowtie} aligners. See also \emph{standardise} for a discussion on the types of values returned by these methods. \item \emph{standardise}: 'TRUE' or 'FALSE' The three supported count methods 'cufflinks', 'mmseq' and 'count' produce different types of values by default: for the first two the expression estimates are in FPKM (Fragments Per Kilobase of exon per Million fragments mapped), and for count the values produced are in number of aligned reads. The type of values returned by the pipeline can be controlled by setting the standardise parameter to 'TRUE' or 'FALSE', regardless of the counting method. They return respectively per feature (gene or transcript) counts/estimates and counts/estimates standardised by feature length and scaled to the number of aligned reads in the sample (FPKM). \item \emph{normalisation}: 'node' or 'tmm' Normalisation is generally required to remove systematic effects that occur in the data. \emph{normalisation} can be set to either \emph{none} or \emph{tmm}, where \emph{tmm} uses the trimmed mean of M-values for normalisation as implemented in the edgeR\footnote{http://www.bioconductor.org/packages/release/bioc/html/edgeR.html}\footnote{http://genomebiology.com/2010/11/3/R25)}. Note: when using 'cufflinks' or 'mmseq' with 'none' or 'tmm' the expression estimates do not correspond one-to-one to read counts. This because, unlike the count method which only uses uniquely mapping reads, both these methods try to estimate transcript abundance from all reads including multi-mapping ones (reads that map to more than one transcript or location). \end{itemize} \section{Downstream analysis} \subsection{Differential Expression} Differential expression for count data can be determined, downstream of the pipeline, with count based DE packages such as edgeR\footnote{http://www.bioconductor.org/packages/release/bioc/html/edgeR.html} and DEseq\footnote{http://www.bioconductor.org/packages/release/bioc/html/DESeq.html}. We do not recommend however using the output of \software{mmseq} and \software{cufflinks} with these methods. Users should instead provide their own test, of which t-tests or likelihood ratio tests would be suitable examples. \end{document}