%\VignetteIndexEntry{Entrez Genome queries}
\documentclass[12pt]{article}
\usepackage{Sweave}
\usepackage{fullpage}
\usepackage{hyperref}

\newcommand{\R}{\textsf{R}}
\newcommand{\Rcmd}[1]{\texttt{#1}}
\newcommand{\pkg}[1]{\texttt{#1}}

\title{Entrez database queries}
\author{Chris Stubben}

\begin{document}
\maketitle

%% for cutting and pasting use continue =""
%% change margins on every chunk
<<>>=
library(genomes)
data(lproks)
options(warn=-1, width=75, digits=2, scipen=3, "prompt" = "R> ", "continue" = " ")
options(SweaveHooks=list(fig=function() par(mar=c(5,4.2,1,1))))
@

Genome tables may also be created using two Entrez Utility functions.
The \Rcmd{term2summary} function remotely queries the Genome Project
database at NCBI using any valid combination of Entrez search terms and
returns a genome table. Because detailed taxonomy information is not
stored in the local tables, a typical search lists genome projects by a
taxonomic group such as family, class, or order. In addition, some fields
in the local genome tables, such as sequencing center, may be incomplete,
and many other fields are missing entirely, so a remote query can filter
on information that the local tables lack. For example, the following
query returns the microbial genome projects that have sequence data in
the Short Read Archive.

<<>>=
# genome projects linked to the Short Read Archive, restricted to Bacteria
sra <- term2summary("genomeprj sra[Filter] AND Bacteria[ORGN]")
sra
@

The \Rcmd{term2neighbor} function searches the Genome database, retrieves
links to other genomes for a species (genome neighbors) in the Nucleotide
database, and returns a table listing accession numbers, deflines, release
dates, and taxonomy IDs. Viral genomes typically have one Reference
sequence per species, and other strains are linked as Genome Neighbors.
For example, Nipah virus is listed once in the virus table (NC\_002728)
and has 7 neighbors reported. To download those 7 neighbors, use the
\Rcmd{term2neighbor} function shown in the next example.
With the \Rcmd{derived=TRUE} option, the function also returns the GenBank
sequence from which the reference was derived. Finally, when searching a
large group of viruses, it is often helpful to look up the scientific name
using the taxonomy ID in the table. The \Rcmd{taxid2names} function takes
taxonomy IDs and returns the scientific name and lineage from the Taxonomy
database. Using pattern matching, one can then extract the genus from the
lineage and plot release dates by genus (Figure~\ref{eneigh}).

<<eneigh, fig=TRUE, include=FALSE>>=
data(virus)
# Nipah virus has a single reference sequence in the virus table
subset(virus, name %like% 'Nipah*')
# download its genome neighbors
nipah <- term2neighbor('Nipah virus[orgn]')
nipah[, 1:2]
# all Bunyaviridae neighbors, including the sequences the references derive from
buny <- term2neighbor("Bunyaviridae[ORGN]", derived=TRUE)
nrow(buny)
# look up lineages and extract the genus that follows the family name
taxids <- unique(buny$taxid)
btax <- taxid2names(taxids)
genus <- gsub("(.*Bunyaviridae; )(\\w*)(.*)", "\\2", btax$lineage)
# map each sequence to its genus and plot the accumulated counts by genus
n <- match(buny$taxid, btax$taxid)
plotby(buny, genus[n], log='y', lbty='n', lcex=.7)
@

\begin{figure}[t]
\centering
\includegraphics[height=3in,width=3in]{entrez-queries-eneigh.pdf}
\caption{Accumulated number of genome sequences for vector-borne viruses in the family Bunyaviridae.}
\label{eneigh}
\end{figure}

\end{document}