\name{getSNPlocs} \alias{SNPcount} \alias{all_rsids} \alias{ch1_snplocs} \alias{ch2_snplocs} \alias{ch3_snplocs} \alias{ch4_snplocs} \alias{ch5_snplocs} \alias{ch6_snplocs} \alias{ch7_snplocs} \alias{ch8_snplocs} \alias{ch9_snplocs} \alias{ch10_snplocs} \alias{ch11_snplocs} \alias{ch12_snplocs} \alias{ch13_snplocs} \alias{ch14_snplocs} \alias{ch15_snplocs} \alias{ch16_snplocs} \alias{ch17_snplocs} \alias{ch18_snplocs} \alias{ch19_snplocs} \alias{ch20_snplocs} \alias{ch21_snplocs} \alias{ch22_snplocs} \alias{chX_snplocs} \alias{chY_snplocs} \alias{chMT_snplocs} \alias{getSNPcount} \alias{.loadLoc} \alias{.loadAlleles} \alias{getSNPlocs} \alias{rsid2loc} \alias{rsid2alleles} \alias{rsidsToGRanges} \title{Accessing the data stored in SNPlocs.Hsapiens.dbSNP.20100427} \description{ The data sets stored in SNPlocs.Hsapiens.dbSNP.20100427 and how to access them. } \usage{ ## Datasets: data(SNPcount) data(all_rsids) data(ch1_snplocs) data(ch2_snplocs) data(ch3_snplocs) data(ch4_snplocs) data(ch5_snplocs) data(ch6_snplocs) data(ch7_snplocs) data(ch8_snplocs) data(ch9_snplocs) data(ch10_snplocs) data(ch11_snplocs) data(ch12_snplocs) data(ch13_snplocs) data(ch14_snplocs) data(ch15_snplocs) data(ch16_snplocs) data(ch17_snplocs) data(ch18_snplocs) data(ch19_snplocs) data(ch20_snplocs) data(ch21_snplocs) data(ch22_snplocs) data(chX_snplocs) data(chY_snplocs) data(chMT_snplocs) ## Convenience wrappers for loading the SNP data: getSNPcount() getSNPlocs(seqname, as.GRanges=FALSE, caching=TRUE) ## Extract SNP information for a set of rs ids: rsid2loc(rsids, caching=TRUE) rsid2alleles(rsids, caching=TRUE) rsidsToGRanges(rsids, caching=TRUE) } \arguments{ \item{seqname}{ The name of the sequence for which to get the SNP locations and alleles. If \code{as.GRanges} is \code{FALSE}, only one sequence can be specified (i.e. \code{seqname} must be a single string). If \code{as.GRanges} is \code{TRUE}, an arbitrary number of sequences can be specified (i.e. \code{seqname} can be a character vector of arbitrary length). } \item{as.GRanges}{ \code{TRUE} or \code{FALSE}. If \code{TRUE}, then the SNP locations and alleles are returned in a \link[GenomicRanges]{GRanges} object. Otherwise (the default), they are returned in a data frame (see below). } \item{caching}{ Should the loaded SNPs be cached in memory for faster further retrieval but at the cost of increased memory usage? } \item{rsids}{ A vector of rs ids. Can be integer or character vector, with or without the \code{"rs"} prefix. NAs are not allowed. } } \details{ See \link{SNPlocs.Hsapiens.dbSNP.20100427} for general information about this package. The SNP data are split by chromosome i.e. the package contains one data set per chromosome, each of them being a serialized data frame with 1 row per SNP and the 2 following columns: \itemize{ \item \code{loc}: The 1-based location of the SNP relative to the first base at the 5' end of the plus strand of the reference sequence. \item \code{alleles}: A raw vector with no NAs which can be converted into a character vector containing the alleles for each SNP represented by an IUPAC nucleotide ambiguity code (see \code{?\link[Biostrings]{IUPAC_CODE_MAP}} in the Biostrings package for more information). } Note that those data sets are not intended to be used directly but the user should instead use the \code{getSNPcount} and \code{getSNPlocs} convenience wrappers for loading the SNP data. When used with \code{as.GRanges=FALSE} (the default), \code{getSNPlocs} returns a data frame with 1 row per SNP and the 3 following columns: \itemize{ \item \code{RefSNP_id}: RefSNP ID (aka "rs id") with \code{"rs"} prefix removed. Character vector with no NAs and no duplicates. \item \code{alleles_as_ambig}: A character vector with no NAs containing the alleles for each SNP represented by an IUPAC nucleotide ambiguity code. \item \code{loc}: Same as for the 2-col serialized data frame described previously. } } \value{ \code{getSNPcount} returns a named integer vector containing the number of SNPs for each sequence in the reference genome. By default (\code{as.GRanges=FALSE}), \code{getSNPlocs} returns the 3-col data frame described above containing the SNP data for the specified chromosome. Otherwise (\code{as.GRanges=TRUE}), it returns a \link[GenomicRanges]{GRanges} object with extra columns \code{"RefSNP_id"} and \code{"alleles_as_ambig"}. Note that all the elements (genomic ranges) in this \link[GenomicRanges]{GRanges} object have their strand set to \code{*} and that all the sequence lengths are set to \code{NA}. \code{rsid2loc} and \code{rsid2alleles} both return a named vector (integer vector for the former, character vector for the latter) where each (name, value) pair corresponds to a supplied rs id. For both functions the name in (name, value) is the chromosome of the rs id. The value in (name, value) is the position of the rs id on the chromosome for \code{rsid2loc}, and a single IUPAC code representing the associated alleles for \code{rsid2alleles}. \code{rsidsToGRanges} returns a \link[GenomicRanges]{GRanges} object similar to the one returned by \code{getSNPlocs} (when used with \code{as.GRanges=TRUE}) and where each element corresponds to a supplied rs id. } \author{H. Pages} \seealso{ \link{SNPlocs.Hsapiens.dbSNP.20100427}, \code{\link[Biostrings]{IUPAC_CODE_MAP}}, \link[GenomicRanges]{GRanges-class}, \code{\link[BSgenome]{injectSNPs}} } \examples{ ## --------------------------------------------------------------------- ## A. BASIC USAGE ## --------------------------------------------------------------------- getSNPcount() ## Get the locations and alleles of all SNPs on chromosomes 22: ch22snps <- getSNPlocs("ch22") dim(ch22snps) colnames(ch22snps) head(ch22snps) ## Get the locations and alleles of all SNPs on chromosomes 22 and MT ## as a GRanges object: getSNPlocs(c("ch22", "chMT"), as.GRanges=TRUE) ## --------------------------------------------------------------------- ## B. EXTRACT SNP INFORMATION FOR A SET OF RS IDS... ## --------------------------------------------------------------------- ## ... and return it in a GRanges object: myrsids <- c("rs2639606", "rs75264089", "rs73396229", "rs55871206", "rs10932221", "rs56219727", "rs73709730", "rs55838886", "rs3734153", "rs79381275", "rs75350930", "rs1516535") rsidsToGRanges(myrsids) ## --------------------------------------------------------------------- ## C. INJECTION IN THE REFERENCE GENOME ## --------------------------------------------------------------------- library(BSgenome.Hsapiens.UCSC.hg19) Hsapiens ## Note that the chromosome names in BSgenome.Hsapiens.UCSC.hg19 ## are those used by UCSC and they differ from those used by dbSNP. ## Inject the SNPs in hg19 (injectSNPs() "knows" how to map dbSNP ## chromosome names to UCSC names): Hs2 <- injectSNPs(Hsapiens, "SNPlocs.Hsapiens.dbSNP.20100427") Hs2 alphabetFrequency(unmasked(Hs2$chr22)) alphabetFrequency(unmasked(Hsapiens$chr22)) ## Get the number of nucleotides that were modified by this injection: neditAt(unmasked(Hs2$chr22), unmasked(Hsapiens$chr22)) ## --------------------------------------------------------------------- ## D. SOME BASIC QUALITY CONTROL (WITH SURPRISING RESULTS!) ## --------------------------------------------------------------------- ## Note that dbSNP can assign distinct ids to SNPs located at the same ## position: any(duplicated(ch22snps$RefSNP_id)) # rs ids are all distinct... any(duplicated(ch22snps$loc)) # but some locations are repeated! ch22snps <- ch22snps[order(ch22snps$loc), ] # sort by location which(duplicated(ch22snps$loc))[1] # 639 ch22snps[637:640, ] # rs75258394 and rs78180314 have same locations # and alleles ## Also note that not all SNP alleles are consistent with the hg19 genome ## i.e. the alleles reported for a given SNP are not always compatible ## with the nucleotide found at the SNP location in hg19. ## For example, to get the number of inconsistent SNPs in chr1: ch1snps <- getSNPlocs("ch1") all_alleles <- paste(ch1snps$alleles_as_ambig, collapse="") nchar(all_alleles) # 1369185 SNPs on chr1 neditAt(all_alleles, unmasked(Hsapiens$chr1)[ch1snps$loc], fixed=FALSE) ## ==> 3084 SNPs (0.23%) are inconsistent with hg19 chr1! ## Finally, let's check that no SNP falls in an assembly gap: agaps <- masks(Hsapiens$chr1)$AGAPS agaps # the assembly gaps ## Looping over the assembly gaps: sapply(1:length(agaps), function(i) any(ch1snps$loc >= start(agaps)[i] & ch1snps$loc <= end(agaps)[i])) ## Or, in a more efficient way: length(findOverlaps(ch1snps$loc, agaps)) # 0 } \keyword{package} \keyword{data}