\name{readFastq}
\alias{readFastq}
\alias{readFastq,character-method}

\title{Read FASTQ-formatted files into compact R representations}

\description{
  \code{readFastq} reads all FASTQ-formatted files in a directory
  \code{dirPath} whose file names match \code{pattern}, returning a
  compact internal representation of the sequences and quality scores
  in the files. Methods read all files into a single R object; a
  typical use is to restrict input to a single FASTQ file.
}

\usage{
readFastq(dirPath, pattern=character(0), ...)
}

\arguments{
  \item{dirPath}{A character vector (or other object; see methods
    defined on this generic) giving the directory path (relative or
    absolute) of FASTQ files to be read.}
  \item{pattern}{The (\code{\link{grep}}-style) pattern describing file
    names to be read. The default (\code{character(0)}) results in
    (attempted) input of all files in the directory.}
  \item{...}{Additional arguments, perhaps used by methods.}
}

\details{

  The fastq format is not precisely defined. The basic definition used
  here parses the following four lines as a single record:

  \preformatted{
@HWI-EAS88_1_1_1_1001_499
GGACTTTGTAGGATACCCTCGCTTTCCTTCTCCTGT
+HWI-EAS88_1_1_1_1001_499
]]]]]]]]]]]]Y]Y]]]]]]]]]]]]VCHVMPLAS
  }

  The first and third lines are identifiers, preceded by \code{@} and
  \code{+} respectively (the identifiers are identical, in the case of
  Solexa). The second line is an upper-case sequence of nucleotides.
  The parser recognizes the IUPAC-standard alphabet (hence ambiguous
  nucleotides), coercing \code{.} to \code{-} to represent missing
  values. The final line is an ASCII-encoded representation of quality
  scores, with one ASCII character per nucleotide.

  The encoding implicit in Solexa-derived fastq files is that each
  character code corresponds to a score equal to the ASCII character
  value minus 64 (e.g., ASCII \code{@} is decimal 64, and corresponds
  to a Solexa quality score of 0).
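
  As a minimal sketch of the offset-64 decoding just described, the
  following base R snippet converts a quality string to integer
  scores. The helper \code{decodeSolexaQuality} is a hypothetical name
  introduced here for illustration and is not part of this package; the
  sketch only assumes that each quality character maps to its ASCII
  value minus a fixed offset.

  \preformatted{
## Hypothetical helper (an assumption, not a ShortRead function):
## ASCII code of each quality character minus the offset.
decodeSolexaQuality <- function(qual, offset = 64L)
    utf8ToInt(qual) - offset

## "]" is ASCII 93, "Y" is ASCII 89, "@" is ASCII 64
decodeSolexaQuality("]Y@")    # 29 25 0
  }

  The offset is left as an argument because other fastq conventions
  use a different value.
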
  The Solexa convention differs from BioPerl, for instance, which
  recovers quality scores by subtracting 33 from the ASCII character
  value (so that, for instance, \code{!}, with decimal value 33,
  encodes value 0).

  The BioPerl description of fastq asserts that the first character of
  line 4 is a \code{!}, but the current parser does not support this
  convention.

}

\value{
  A single R object (e.g., \code{\linkS4class{ShortReadQ}}) containing
  the sequences and qualities from all files in \code{dirPath} matching
  \code{pattern}. The order in which files are read is not guaranteed.
}

\seealso{
  The IUPAC alphabet in Biostrings.

  \url{http://www.bioperl.org/wiki/FASTQ_sequence_format} for the
  BioPerl definition of fastq.

  Solexa documentation `Data analysis - documentation : Pipeline
  output and visualisation'.
}

\author{Martin Morgan}

\examples{
showMethods("readFastq")

sp <- SolexaPath(system.file('extdata', package='ShortRead'))
rfq <- readFastq(analysisPath(sp), pattern="s_1_sequence.txt")
sread(rfq)
id(rfq)
quality(rfq)

## SolexaPath method 'knows' where FASTQ files are placed
rfq1 <- readFastq(sp, pattern="s_1_sequence.txt")
rfq1
}

\keyword{manip}