%\VignetteIndexEntry{Introduction to VariantAnnotation} %\VignetteDepends{GenomicRanges, Rsamtools, Biostrings, AnnotationDbi, BSgenome} %\VignetteKeywords{variants, sequence, sequencing, alignments} %\VignettePackage{VariantAnnotation} \documentclass[10pt]{article} \usepackage{times} \usepackage{hyperref} \textwidth=6.5in \textheight=8.5in %\parskip=.3cm \oddsidemargin=-.1in \evensidemargin=-.1in \headheight=-.3in \newcommand{\Rfunction}[1]{{\texttt{#1}}} \newcommand{\Robject}[1]{{\texttt{#1}}} \newcommand{\Rpackage}[1]{{\textit{#1}}} \newcommand{\Rmethod}[1]{{\texttt{#1}}} \newcommand{\Rfunarg}[1]{{\texttt{#1}}} \newcommand{\Rclass}[1]{{\textit{#1}}} \newcommand{\Rcode}[1]{{\texttt{#1}}} \newcommand{\software}[1]{\textsf{#1}} \newcommand{\R}{\software{R}} \newcommand{\Bioconductor}{\software{Bioconductor}} \newcommand{\VariantAnnotation}{\Rpackage{VariantAnnotation}} \title{Introduction to \VariantAnnotation} \author{Valerie Obenchain, Michael Lawrence} \date{\today} \begin{document} \maketitle <>= options(width=72) @ \section{Introduction} In this vignette we introduce methods in the \VariantAnnotation package for the annotation and filtering of genetic variants. Variant locatation with respect to gene function can be determined as well as the amino acid coding changes for nonsynonymous variants. \Rpackage{VariantAnnotation} contains functions to access the SIFT and PolyPhen databases packages. The SIFT and PolyPhen methods predict the possible impact of amino acid coding changes on protein function (SIFT) or function and structure (PolyPhen). To assist with data access and managment the \Rfunction{readVcf} function parses data from Variant Call Format (VCF) files into a \Rclass{summarizedExperiment} object. Filtering functions are available for subsetting of variants on physical location in the gene or membership in dbSNP. \section{Variant Location} As sample data we use the \Robject{GRanges} object, \Robject{variants}, that contains both real and faux variants. The ranges in \Robject{variants} represent the positions in the reference sequence that are altered in the variation. A single basepair replacement (SNP) has equal start and end positions and a width of length 1. A deletion or substitution will have a width of 1 or greater depending on the number of basepairs being deleted or substituted. Insertions are represented with start = end + 1 which results in a width of length 0. Reference and variant alleles are represented as \Robject{DNAStringSet}s in \Robject{refAllele} and \Robject{varAllele} columns. An insertion will have a missing value in the \Robject{refAllele} column and a deletion will have a missing value in the \Robject{varAllele} column. <>= library(VariantAnnotation) data(variants) head(variants) @ \Rfunction{locateVariants} returns a \Robject{DataFrame} with one row for each variant-transcript match. The queryHits column maps back to the variant in the original query. Location categories of `coding', `UTR5', `UTR3', `intron' and `intergenic' are assigned based on the criteria in Table \ref{table:location}. Intergenic variants have two gene ID's, representing the genes preceding and following the variant. <>= library(TxDb.Hsapiens.UCSC.hg18.knownGene) txdb <- TxDb.Hsapiens.UCSC.hg18.knownGene loc <- locateVariants(variants, txdb) head(loc) @ \begin{table}[ht] \begin{center} \begin{tabular}{c|l} \hline Region & Details \\ \hline coding & variant falls \emph{within} a coding region \\ %splicing & * & variant is within 2-bp of a splicing junction \\ %ncRNA & 2 & variant overlaps a transcript without coding annotation in the gene %definition \\ UTR5 & variant falls \emph{within} a 5' untranslated region \\ UTR3 & variant falls \emph{within} a 3' untranslated region \\ intron & variant falls \emph{within} a transcript but not \emph{within} a coding region \\ %upstream & 5 & variant overlaps 1-kb region upstream of transcription start %site \\ %downstream & 5 & variant overlaps 1-kb region downtream of transcription end %site \\ intergenic & variant does not fall \emph{within} a transcript \\ \hline \end{tabular} \end{center} \caption{Variant Location} \label{table:location} \end{table} Multiple alignments for a single variant should be separate rows in the \Rcode{query} object. Below is an example snp from dbSNP Build 130 which is aligned to different locations for different assemblies. Each alignment is entered as a separate row in the \Rcode{singleSNP} object. <>= singleSNP <- GRanges(seqnames="chr16", ranges=IRanges(start=c(23004226, 22316999, 24133766), width=c(1,1,1)), varAllele=DNAStringSet("T"), comments=c("Celera assembly", "HuRef assembly", "reference assembly")) singleSNP @ The result from \Rfunction{locateVariants} shows the Celera assembly location matching two transcripts in the same gene, both intron regions. The HuRef assembly falls in an intergenic region. The reference assembly hits two different transcripts in the same gene, both in the 3' UTR region. \begin{verbatim} > locateVariants(singleSNP, txdb) DataFrame with 5 rows and 4 columns queryHits txID geneID Location 1 1 52091 57478 intron 2 1 52092 57478 intron 3 2 NA 1039,57478 intergenic 4 3 50256 5579 3'UTR 5 3 50259 5579 3'UTR \end{verbatim} \section{Amino Acid Coding} Amino acid coding changes can be determined for non-synonymous variants with \Rfunction{predictCoding}. This function identifies overlaps between the variants in \Robject{query} and the coding regions of a \Robject{TranscriptDb} object. Sequences for the reference codons are retrieved from either a \Robject{BSgenome} or fasta file. Codon sequences for the variants are constructed by substituting, inserting or deleting the \Robject{varAllele} values into the reference sequence. Amino acid codes are computed for the resulting variant codon sequence when the length is a multiple of 3. Examples of various coding situations are shown in Table \ref{table:aacoding}. Positions of substitution, insertion or deletion in Table \ref{table:aacoding} are purely random and were chosen for example purposes only. \begin{table}[ht] \begin{center} \begin{tabular}{c|c|c|c|c|c} \hline Type & refAllele & varAllele & refSeq & varSeq & AA coding of varSeq \\ \hline substitution & G & T & aag & aaT & yes\\ substitution & G & TG & tga & tTGa & no\\ substitution & G & TGCG & gtc & TGCGtc & yes\\ insertion & `' & G & cgg & Gcgg & no\\ insertion & `' & TTG & gaa & gaTTGa & yes\\ deletion & A & `' & atc & tc & no\\ deletion & GGCCTA & `' & acggcctaa & aca & yes\\ \hline \end{tabular} \end{center} \caption{Amino Acid Coding} \label{table:aacoding} \end{table} Variants that fall within coding regions are in chromosomes 1, 2 and 16. We subset the \Robject{GRanges} of coding regions by transcripts to retain only the chromosomes of interest. Alternatively one could use the full \Robject{TranscriptDb}. <>= library(BSgenome.Hsapiens.UCSC.hg18) cdsByTx <- cdsBy(txdb) grl <- keepSeqlevels(cdsByTx, c("chr1", "chr2", "chr16")) aaCoding <- predictCoding(variants, grl, seqSource=Hsapiens, varAllele="varAllele") head(aaCoding) @ The result is a \Robject{DataFrame} containing only the variants in coding regions. Columns include queryHits, txID, refSeq, varSeq, refAA, varAA, Consequence and any metadata columns that were present in the \Robject{subject}. The result has one row for each transcript matched by a variant. The queryHits column is the map back to the variants in the original \Robject{query}. Multiple rows are returned for variants that match multiple transcripts. For example, the variant in the sixth row of \Robject{variants} fell within a coding region and matched four transcripts. The first four rows of the aaCoding result are the data for this variant. The Consequence column indicates `synonymous' or `nonsynonymous' for variants for which the codons could be translated. If translation was not possible the Consequence was considered a `frameshift'. Variants for which no amino acid codes were computed have a missing in value in the varAA column. <>= aaCoding[width(aaCoding$varAA) == 0,] @ \section{Filtering} Identifying subsets of variants that meet a specific criteria can be accomplished using the \Rfunction{dbSNPfilter} or \Rfunction{regionFilter} functions. Filters are created with the appropriate reference object such as a \Robject{SNPlocs} or \Robject{TranscriptDb} object. <>= library("SNPlocs.Hsapiens.dbSNP.20090506") ## create filter to identify variants present in dbSNP snpFilt <- dbSNPFilter("SNPlocs.Hsapiens.dbSNP.20090506") ## create filter to identify variants present in introns regionFilt <- regionFilter(txdb, region="intron") regionFilt @ When filtering data with \Rfunction{dbSNPFilter} or \Rfunction{regionsFilt} two options are available for the return object. If the argument \Rcode{subset=TRUE} (default) a \Robject{GRanges} is returned that contains only the records that passed the filter. If \Rcode{subset=FALSE}, the result is a \Robject{VAFilterResult} object. This object contains a logical vector indicating which records passed the filter and some stats on the number of records removed and what filter was applied. <>= ## return the subset of records that passed the filter snpResult <- snpFilt(variants) snpResult metadata(snpResult) ## return a VAFilterResult object snpFilt(variants, subset=FALSE) @ Filters can be combined to act in unison, \begin{verbatim} # return a subset of data passing both filters > combofilt <- compose(snpFilt, regionFilt) > combofilt(variants) region= intron GRanges with 2 ranges and 3 elementMetadata values seqnames ranges strand | refAllele varAllele | [1] chr1 [161003087, 161003087] * | C T [2] chr1 [ 84647761, 84647761] * | C T comments [1] rs1000050, SNP [2] rs6576700 SNP seqlengths chr1 chr13 chr16 chr2 NA NA NA NA \end{verbatim} \section{Variant Call Format (VCF) files} The \Rcode{readVcf} function reads VCF files and parses them into a \Rcode{SummarizedExperiment} object. When the compressed VCF file has a tabix file index a \Robject{GRanges} can be used to specify a subset of ranges. The sample VCF in \Rcode{VariantAnnotation} is not compressed so in this example we compress on the fly, create a tabix index and retrieve a subset of ranges. <>= ## compress vcf file , create index vcfFile <- system.file("extdata", "ex1.vcf", package="VariantAnnotation") from <- vcfFile to <- tempfile() compressVcf <- bgzip(from, to) idx <- indexTabix(compressVcf, "vcf") tab <- TabixFile(compressVcf, idx) head <- headerTabix(tab) ## select ranges with param gr <- GRanges(seqnames="1", ranges=IRanges(start=0, end=100000)) vcf <- readVcf(tab, param=gr) @ All data are parsed according to the header information in the VCF file. The header is stored as a \Rcode{SimpleList} in the \Rcode{exptData} slot and can be viewed by querying the elements of the list. <>= vcf exptData(vcf)[["HEADER"]] exptData(vcf)[["HEADER"]][["META"]] @ The eight required fields of the VCF file are parsed into a \Rcode{GRanges} object in the \Rcode{rowData} slot. The \Rcode{ALT} alleles are provided as a \Rcode{DataFrameList} to accomodate for multiple values per variant. All fields in the \Rcode{INFO} column are parsed into individual elementMetadata columns. << readVCF_rowData>>= exptData(vcf)[["HEADER"]][["FILTER"]] exptData(vcf)[["HEADER"]][["INFO"]] rowData(vcf) ## ALT as a DNAStringSetList head(as.list(values(rowData(vcf))[["ALT"]]), 4) @ The genotype data described in the \Rcode{FORMAT} fields are parsed into matrices in the \Rcode{assays} slot. <>= exptData(vcf)[["HEADER"]][["FORMAT"]] assays(vcf) head(assays(vcf)$GT) @ \section{SIFT and PolyPhen Databases} SIFT (Sorting Intolerant From Tolerant) and PolyPhen (Polymorphism Phenotyping) are methods for predicting the impact of amino acid substitution on a human protein. The SIFT method uses sequence homology and the physical properties of amino acids to make predictions about protein function. PolyPhen uses sequence- based features and structural information characterizing the substitution to make predicitons about the structure and function of the protein. \noindent Both methods offer interactive tools and documentation on their web sites, \noindent \url{http://sift.bii.a-star.edu.sg/} \noindent \url{http://genetics.bwh.harvard.edu/pph2/} Collated predictions for specific dbSNP builds are available as downloads from the SIFT and PolyPhen web sites. These results have been stored as sqlite databases and made available through the \Rcode{Bioconductor} packages \Rpackage{SIFT.Hsapiens.dbSNP132.db} and \Rpackage{PolyPhen.Hapiens.dbSNP131.db}. The packages are designed to be serached by rsid. <>= library(SIFT.Hsapiens.dbSNP132) library(PolyPhen.Hsapiens.dbSNP131) ## rsids in the package head(keys(SIFT.Hsapiens.dbSNP132)) ## list available columns cols(SIFT.Hsapiens.dbSNP132) rsids <- c("rs2142947", "rs8692231", "rs3026284") subst <- c("RSID", "PREDICTION", "SCORE", "AACHANGE", "PROTEINID") select(SIFT.Hsapiens.dbSNP132, keys=rsids, cols=subst) select(PolyPhen.Hsapiens.dbSNP131, keys="rs3026284", cols=c("TRAININGSET", "PREDICTION", "PPH2PROB", "PPH2FPR", "PPH2FDR")) @ Detailed column descriptions can be found at ?SIFTDbColumns and ?PolyPhenDbColumns. \section{References} Wang K, Li M, Hakonarson H, (2010), ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data. Nucleic Acids Research, Vol 38, No. 16, e164.\\ \noindent McLaren W, Pritchard B, RiosD, et. al., (2010), Deriving the consequences of genomic variants with the Ensembl API and SNP Effect Predictor. Bioinformatics, Vol. 26, No. 16, 2069-2070. \section{Session Information} <>= sessionInfo() @ \end{document}