%\VignetteIndexEntry{How to get pretty HTML output for my gene list} %\VignetteDepends{annotate, biomaRt} %\VignetteKeywords{Expression Analysis, Annotation} %\VignettePackage{biomaRt} \documentclass[11pt]{article} \usepackage{hyperref} \usepackage{url} \usepackage[authoryear,round]{natbib} \bibliographystyle{plainnat} \newcommand{\Rfunction}[1]{{\texttt{#1}}} \newcommand{\Rmethod}[1]{{\texttt{#1}}} \newcommand{\Robject}[1]{{\texttt{#1}}} \newcommand{\Rpackage}[1]{{\textit{#1}}} \newcommand{\Rclass}[1]{{\textit{#1}}} \begin{document} \title{HowTo: get pretty HTML output for my gene list} \author{James W. MacDonald} \maketitle{} \SweaveOpts{keep.source=TRUE} \section{Overview} The intent of this vignette is to show how to make reasonably nice looking HTML tables for presenting the results of a microarray analysis. These tables are a very nice format because you can insert clickable links to various public annotation databases, which facilitates the downstream analysis. In addition, the format is quite compact, can be posted on the web, and can be viewed using any number of free web browsers. One caveat; an HTML table is probably not the best format for presenting the results for \emph{all} of the genes on a chip. For even a small (5000 gene) chip, the file could be 10 Mb or more, which would take an inordinate amount of time to open and view. Also note that the Bioconductor project supplies annotation packages for many of the more popular Affymetrix chips, as well as for many commercial spotted cDNA chips. For chips that have annotation packages, the \Rpackage{annaffy} package is the preferred method for making HTML tables. To make an annotated HTML table, the only requirement is that we have some sort of annotation data for the microarray that we are using. Most manufacturers supply data in various formats that can be read into \Rpackage{R}. For instance, Affymetrix supplies CSV files that can be read into \Rpackage{R} using the \Rmethod{read.csv()} function \url{http://www.affymetrix.com/support/technical/byproduct.affx?cat=arrays}. Another alternative is to annotate using functionality in the \Rpackage{biomaRt} package. This allows one to get the most current annotations interactively. In addition, the output can be used directly with functions in \Rpackage{annotate} to make HTML tables. We will use these functions in this vignette. \section{Data Analysis} I will assume that the reader is familiar with the analysis of microarray data, and has a set of genes that she would like to use. In addition, I will assume that the reader is familiar enough with \Rpackage{R} that she can subset the data based on a list of genes, and reorder based on a particular statistic. For any questions about subsetting or ordering data, please see ``An Introduction to R''. For questions regarding microarray analysis, please consult the vignettes for, say \Rpackage{limma}, \Rpackage{multtest}, or \Rpackage{marray}. \section{Getting Started} We first load the \Rpackage{annotate} package, as well as some data. These data will be from the Affymetrix HG-U95Av2 chip (for which we would normally use \Rpackage{annaffy}). To keep the HTML table small, we will take a subset of fifteen genes as an example. <>= options(width=70) @ <<>>= library("annotate") data(sample.ExpressionSet) igenes <- featureNames(sample.ExpressionSet)[246:260] @ We also have to load the \Rpackage{biomaRt} package and connect to a Biomart database, using the \Rfunction{useMart} function. Note that there are two interfaces that \Rpackage{biomaRt} can use to connect to a Biomart database, using either the \Rpackage{RCurl} package to connect via http protocols, or the \Rpackage{RMySQL} package to connect via database protocols. The default is to use \Rpackage{RCurl} because it can be difficult to get \Rpackage{RMySQL} set up on Windows computers. However, I find the database connectivity to be much faster, so for those who want to annotate a large number of genes, I would recommend using the \Rpackage{RMySQL} interface. See the help file for \Rfunction{useMart} for more information. <<>>= library("biomaRt") mart <- useMart("ensembl", "hsapiens_gene_ensembl") @ \section{Annotation Data} The \Rfunction{htmlpage} function is designed to take two sets of input; data that will be converted to clickable links to various online databases, and data that will simply be put into the HTML table as is. For the clickable links we need an \Robject{list} of character vectors for each database. For the data, we need a \Robject{list} of either \Robject{vectors}, \Robject{data.frames} or a mixture of the two. We will explore this topic more later. First, we will see how to get data using \Rpackage{biomaRt} functionality. We first need to see exactly what sort of data we can get from Ensembl's Biomart. Note that some of the information can be a bit cryptic, so we can parse out a more reasonable description. We also need to see what things we can use as identifiers in our query of the Biomart server. First, a bit of terminology. An 'attribute' is a data type that can be returned from a query of a Biomart server. A 'filter' is the identifier that we use to query the server. For instance, we can get GO terms and Entrez Gene IDs (the attributes) by querying on the Affy Probe ID (the filter). There are too many attributes to list here, so we can parse out things that may be interesting to us. Let's say we want to get Entrez Gene and SwissProt IDs for our set of genes. <<>>= attrbuts <- listAttributes(mart) attrbuts[grep("swiss", attrbuts[,1]),] attrbuts[grep("entrez", attrbuts[,1]),] @ So the attributes we want are uniprot\_swissprot\_accession and entrezgene. The same basic idea can be used to figure out which filter to use. Since we are using the HG-U95av2 chip, we need to figure out what that filter is called. <<>>= fltr <- listFilters(mart) fltr[grep("affy", fltr[,1]),] @ This one is pretty obvious - we want the affy\_hg\_u95av2 filter. Now to get back to the task at hand; we have 15 Affy Probeset IDs that we want to use to create our HTML table. Let's say we want to create an HTML table in which we map the Affy IDs to Entrez Gene, SwissProt, UniGene, and RefSeq IDs (all clickable links) and in addition we want to include the gene description and symbol, the Gene Ontology terms, and chromosome location, as well as the $t$-statistic, $p$-value, fold change and expression values. That would be a nice compact format for presenting the data to someone. We need to collect all this information in two lists; one that will be used to make the hyperlinks, and one that will just be static information. First, we do the hyperlinks. By using the ideas presented above, I figured out which attributes correspond to Entrez Gene, SwissProt, UniGene, and RefSeq IDs, so I will use them here. Instead of writing them all out, I will simply select the correct terms using the \Rfunction{listAttributes} function. <<>>= genelist <- getBM(attributes = c("affy_hg_u95av2","entrezgene","uniprot_swissprot_accession","refseq_dna"), filter = "affy_hg_u95av2", values = igenes, mart = mart, output = "list", na.value = " ") genelist[[1]] <- igenes ## let's look at genelist values lapply(genelist, function(x) x[5:10]) @ Here we can see that genelist is a \Robject{list} of \Robject{lists}, with each sub-list being made up of a character vector. I substituted the 'igenes' vector back into the first position of the list because the Biomart server we are using doesn't have data for all Affy IDs (including the Affy ID itself). Since we want links for all of the Affy IDs, I simply substituted the original vector into the genelist. Two important things to note about the call to \Rfunction{getBM}. First, we have to use the argument output = ``list''. Second, we have to use na.value = ``\ '', which will create an empty table entry for any missing data. This is much nicer than leaving the NAs, which will tend to clutter up the table without adding any information. Making the second \Robject{list} is a bit more complicated. We need to get some annotation from the Biomart database, and append that to some data we have from our experiment. The first step is to get the annotation data. As noted above, we want the gene name and symbol, as well as the GO terms and chromosome location for these probesets. We can figure out which attribute terms to use, following the ideas presented above. <<>>= attrbuts[grep("description", attrbuts[,1]),] attrbuts[grep("symbol", attrbuts[,1]),] attrbuts[grep("go", attrbuts[,1]),] attrbuts[grep("^chrom", attrbuts[,1]),] @ We want "description", "hgnc\_symbol", "go\_biological\_process\_id" and "band" <<>>= annotlist <- getBM(attributes = c("description", "hgnc_symbol", "go_biological_process_id", "band"), filter = "affy_hg_u95av2", values = igenes, mart = mart, output = "list",na.value = " ") @ <>= lapply(annotlist, function(x) x[5:10]) @ <>= prntlist <- vector("list", 4) prntlist <- lapply(annotlist[1:3], function(x) lapply(x[5:10], strwrap)) prntlist[[4]] <- annotlist[[4]][5:10] prntlist @ Now we have the annotation data, it is time to add in the experimental data. As an example, we will only use the first ten samples. We also use the \Rfunction{round} function to truncate the data to a reasonable number of decimal points. <>= dat <- round(exprs(sample.ExpressionSet)[igenes,1:10], 3) FC <- round(rowMeans(dat[igenes,1:5]) - rowMeans(dat[igenes,6:10]), 2) pval <- round(esApply(sample.ExpressionSet[igenes,1:10], 1, function(x) t.test(x[1:5], x[6:10])$p.value), 3) tstat <- round(esApply(sample.ExpressionSet[igenes,1:10], 1, function(x) t.test(x[1:5], x[6:10])$statistic), 2) @ We now need to put all this into one list. <<>>= othernames <- vector("list", length = 8) othernames[1:4] <- annotlist othernames[5:8] <- list( tstat, pval, FC, dat) @ \section{Build the Table} Once we have all our data in \Robject{lists}, it is simple to build the HTML table. <<>>= table.head <- c("Affy ID", "Entrez Gene", "SwissProt", "RefSeq", "Name","Symbol","GO Term", "Band", "t-statistic","p-value","Fold change", sampleNames(sample.ExpressionSet)[1:10]) repository <- list("affy","en","sp","gb") htmlpage(genelist, "Annotated genes.html", "Annotated genes", othernames, table.head, repository = repository) @ The resulting HTML table should be in R\_HOME/library/annotate/doc. If not, you can reproduce it using the \Rfunction{vExplorer} function in the \Rpackage{tkWidgets} package, which will allow you to step through the code in this vignette and examine all the objects that are made. Alternatively, one could use \Rfunction{getwd} to change the working directory to R\_HOME/library/annotate/doc, then use \Rfunction{Stangle} on the vignette and then source the resulting R code (e.g., Stangle("prettyOutput.Rnw") followed by source("prettyOutput.R"). \section{Session Information} The version number of R and packages loaded for generating the vignette were: <>= sessionInfo() @ \end{document}