% % NOTE -- ONLY EDIT THE .Rnw FILE!!! The .tex file is % likely to be overwritten. % %\VignetteIndexEntry{GGtools: exploring genetics of gene expression} %\VignetteDepends{snpStats} %\VignetteKeywords{expression, genetics, high performance computing} %\VignettePackage{GGtools} \documentclass[12pt]{article} \usepackage{amsmath,pstricks} % With MikTeX 2.9, using pstricks also requires using auto-pst-pdf or running % pdflatex will fail. Note that using auto-pst-pdf requires to set environment % variable MIKTEX_ENABLEWRITE18=t on Windows, and to set shell_escape = t in % texmf.cnf for Tex Live on Unix/Linux/Mac. \usepackage{auto-pst-pdf} \usepackage[authoryear,round]{natbib} \usepackage{hyperref} \textwidth=6.2in \textheight=8.5in %\parskip=.3cm \oddsidemargin=.1in \evensidemargin=.1in \headheight=-.3in \newcommand{\scscst}{\scriptscriptstyle} \newcommand{\scst}{\scriptstyle} \newcommand{\Rfunction}[1]{{\texttt{#1}}} \newcommand{\Robject}[1]{{\texttt{#1}}} \newcommand{\Rpackage}[1]{{\textit{#1}}} \newcommand{\Rmethod}[1]{{\texttt{#1}}} \newcommand{\Rfunarg}[1]{{\texttt{#1}}} \newcommand{\Rclass}[1]{{\textit{#1}}} \textwidth=6.2in \bibliographystyle{plainnat} \begin{document} %\setkeys{Gin}{width=0.55\textwidth} \title{GGtools 2011: leaner software for genetics of gene expression} \author{VJ Carey (stvjc at channing.harvard.edu)} \maketitle \tableofcontents \clearpage \section{Introduction; major changes} Since its introduction in 2006, GGtools has provided a number of data structures and tools for exploratory data analysis and hypothesis testing in expression genetics. Since 2006, Bioconductor's facilities for representing genomes and for exploiting advanced ideas in computing and statistical modeling have evolved considerably, and many components of GGtools/GGBase need to be discarded to promote use of new facilities. The following major changes have been made. \begin{itemize} \item \texttt{smlSet} instances should not be used for genotyping panels of more than one million loci. A packaging discipline has been introduced. An expression genetics experiment should be managed in a package in which expression data are held in an \texttt{ExpressionSet} instance in the \texttt{data} folder, and \textit{snpStats} \texttt{SnpMatrix} instances are stored in \texttt{inst/parts}. After installation, the \texttt{getSS} function constructs an \texttt{smlSet} instance on the fly -- typically with modest memory footprint because only a fraction of available loci are held in memory. \texttt{externalize} demonstrates construction of a compliant package for expression genetics data by converting an existing whole- or partial-genome smlSet instance into a package. \item \texttt{gwSnpTests} has been retained for focused in-memory test computation, but it will be phased out. \item \texttt{eqtlTests} is the primary function for inference in expression genetics. This function takes genotype and phenotype information from an \texttt{smlSet}, passes it through testing facilities of \textit{snpStats}, and creates managed out-of-memory storage for voluminous testing results. \item Facilities for metadata handling, e.g., determining SNP locations, gene locations, vocabulary translations, are being phased out. Users must construct location and feature name metadata using up-to-date Bioconductor resources or their own resources. \end{itemize} The remainder of this document illustrates basic activities now supported by the package. \section{A simple exploration of eQTL for a selected gene} \subsection{Filtering an smlSet for a chromosome-wide test} We retain a two-chromosome example with expression data derived from the Wellcome Trust GENEVAR project archives, and genotype data from HapMap phase II: <>= library(GGtools) if (!exists("hmceuB36.2021")) hmceuB36.2021 <- getSS("GGtools", c("20", "21")) hmceuB36.2021 @ A commonly found gene with eQTL is CPNE1 on chromosome 20. We can determine the probe name in use for CPNE1 as follows: <>= library("illuminaHumanv1.db") cpn = get("CPNE1", revmap(illuminaHumanv1SYMBOL)) use = intersect(cpn, featureNames(hmceuB36.2021)) if (length(use) == 0) stop("probe not on array") use = use[1] @ We will now trim down the \texttt{smlSet} instance to this gene and SNPs on chr20: <>= hmlit = hmceuB36.2021[probeId(use),] hmlit = hmlit[ chrnum("20"), ] hmlit @ \subsection{Executing tests and interrogating the results} The preferred testing procedure is: <>= tname = function() gsub(".*/|.*\\\\", "", tempfile()) e1 = eqtlTests(hmlit, ~male, targdir=tname(), runname=tname()) e1 class(e1) @ We support focused queries into the manager of results. For this application we are computing $\chi^2(1)$ statistics for each gene-SNP combination, measuring the fit of an additive genetic model with, in this case, adjustment for gender. <>= e1[rsid("rs6060535"), probeId("GI_23397697-A")] @ We can acquire the most predictive SNP using \texttt{topFeats}. <>= topFeats(probeId(cpn), mgr=e1, ffind=1) @ Here the \texttt{ffind} parameter must be set to pick among files that might be managed by the manager bound to \texttt{mgr}. See the technical appendix for more details. \subsection{Visualization} To visualize the results, using hg18 locations for the SNP, we must acquire location metadata. Note that we are using the 2009 SNPlocs metadata from Bioconductor; for later builds, an as.GRanges parameter can be set to help avoid some of the conversion tasks. <>= library(SNPlocs.Hsapiens.dbSNP.20090506) c20l = getSNPlocs("chr20") c20 = GRanges(IRanges(c20l$loc, width=1), seqnames="chr20") rs20 = paste("rs", c20l$RefSNP_id, sep="") names(c20) = rs20 length(c20) c20[1:3,] @ The names of SNP for which tests were computed are <>= tn = rownames(e1@fflist[[1]]) tn[1:10] @ We can now obtain the manhattan plot: <>= manhPlot(cpn, mgr=e1, ffind=1, locGRanges=c20, cex=3) @ In general we will prefer to have this information in a browser track. To accomplish this, add the test results to the GRanges instance for locations as `score' and export a wig file. We have to align the SNPs with locations with those for which we have tests. First, restrict the locations. <>= okn = intersect(tn, names(c20)) c20ok = c20[okn] @ Now obtain the -log10 p values: <>= chis1 = e1[, probeId(cpn)][[1]][okn,] ml10p = -log10(1-pchisq(chis1, 1)) @ Bind to the location data: <>= elementMetadata(c20ok)$score = ml10p c20ok[1:5,] @ and export: <>= library(rtracklayer) export(c20ok, "cpne1.wig") @ This is imported as a custom track and viewed as: \vspace*{1cm} \setkeys{Gin}{width=1.15\textwidth} \includegraphics{cpne1_2011} \subsection{Checking coincidence with other genomic features} When we store the statistical results in a \texttt{GRanges} instance, we have capacity to quickly assess genomic context of high-scoring SNP. The top 20 SNP and their locations are <>= top200 = c20ok[order(elementMetadata(c20ok)$score, decreasing=TRUE)[1:200]] top200[196:200] @ We see that we have SNP with p-values at most $10^{-5}$. We will now check whether any of these lie in regulatory regions defined by ORegAnno, confined to 10MB near CPNE1 on chr20. Execution suppressed for build failures, 9/15/2011. <>= library(rtracklayer) ss = browserSession() genome(ss) = "hg18" GG = GRanges(seqnames="chr20", IRanges(30000000,40000000)) oq = ucscTableQuery(ss, "oreganno", range=GG) toq = track(oq) ov = subsetByOverlaps(toq, top200) ov @ For further discovery of the regulatory feature characteristics, consider <>= tableName(oq) = "oregannoAttr" featdat = getTable(oq) featdat[ featdat$id %in% ov$name, ] @ Retrieving the table can take a while, so we suppress it in this vignette, but on 8 Mar 2011 the result was: \begin{verbatim} > featdat[ featdat$id %in% ov$name, ] id attribute attrVal 5809 OREG0004040 type REGULATORY REGION 5810 OREG0004040 Gene FER1L4 5811 OREG0004040 Dataset Stanford ENCODE Dataset 5812 OREG0004040 EvidenceSubtype Transient transfection luciferase assay 5833 OREG0004046 type REGULATORY REGION 5834 OREG0004046 Gene RBM12 5835 OREG0004046 Dataset Stanford ENCODE Dataset 5836 OREG0004046 EvidenceSubtype Transient transfection luciferase assay 86472 OREG0025894 type REGULATORY REGION 86473 OREG0025894 TFbs CTCF 86474 OREG0025894 Dataset CTCF ChIP-chip sites (Ren lab) 86475 OREG0025894 EvidenceSubtype ChIP-on-chip (ChIP-chip) \end{verbatim} enabling us to learn more about the annotations of regions in which high-scoring SNP are found. Clearly other UCSC-mediated resources can be used similarly to help interpret SNP scores. \section{Supporting comprehensive testing} \subsection{The default behavior of \texttt{eqtlTests}} \texttt{eqtlTests} will, by default, compute scores for all gene-SNP combinations in the \texttt{smlSet} instance to which it is applied. With a multicore system, some concurrency can be achieved. To illustrate the activity, we take only 20 genes sampled from the illumina platform and compute all eQTL tests for SNP on chromosomes 20 and 21. <>= NG = length(featureNames(hmceuB36.2021)) hmct = hmceuB36.2021[ sample(1:NG, size=20, replace=FALSE), ] applier = lapply chkmult = function() {length(grep("^multicore$", installed.packages()[,1])) > 0} if (chkmult()) { library(multicore) options(cores=min(c(multicore:::volatile$detectedCores, 6))) applier = mclapply } unix.time(e2 <- eqtlTests(hmct, ~male, geneApply=applier, targdir=tempdir(), runname=tname())) @ We are now managing 3.4 million tests. These are stored out of memory using the \textit{ff} system for memory-mapped disk representations of R objects. <>= e2@fflist[[1]] sapply(e2@fflist,dim) @ It is unpleasant to use the at-sign fflist notation but until a richer set of use cases is identified, this primitive idiom will be used. In general we will break up a comprehensive run by gene sets -- typically one run for all genes on a chromosome. In this case, we will generate 22 eqtlTestManager instances; these can be unified with a cisTransDirector or multiCisDirector object. We will discuss that later. \subsection{Reducing the memory footprint for comprehensive `same chromosome' tests} In the example given above, the data for genotypes on chr20 and chr21 are simultaneously in memory. Up to now, we have often stored all genotype data in memory when using smlSets, leading to multiple Gb of RAM consumption, causing conflicts and loss of speed. We now describe an approach to building and using package-based experiment representations, which allow reduced and more flexible use of resources. Because this functionality involves use of the file system along with package building and installation, which can lead to subtle portability issues, we do not evaluate the code chunks here. We can take a unified smlSet instance and turn it into a package using the \texttt{externalize} function. We also install it: <>= edname = paste("exdem", tname(), sep="") if (interactive() & file.exists(edname)) stop(paste("will not remove existing '", edname, "' folder")) externalize(hmceuB36.2021, edname) tar(tna <- paste(edname,".tar.gz", sep=""), edname, compression="gzip") install.packages(tna, repos=NULL, type="source") try(system(paste("rm -rf", tna))) try(system(paste("rm -rf", edname))) @ Now we can use \texttt{getSS} to create tailored smlSet instances on the fly. <>= gc() s1 = getSS(edname, "20") gc() rm(s1) gc() @ This technology is used in the makeDiagDirector function. This function requires a list that maps genes to chromosomes. In this example, we take 20 genes from each of chromosome 20 and 21 to define the map. All same-chromosome tests for these 40 genes will be computed. No concurrency is used. <>= mymap = lapply(c("20", "21"), function(x) get(x, revmap(illuminaHumanv1CHR))[1:20]) mymap = lapply(mymap, function(x) intersect(x, featureNames(hmceuB36.2021))) names(mymap) = c("20", "21") unix.time(md <- makeDiagDirector("GGtools", mymap, rhs=~male, sleeplen=2, runname=tname(), targdir=tname())) md @ It would be typical to save the director object (here named \texttt{md}) to disk, so that the ff files that have been computed can be reused conveniently. \subsection{Acquiring test results within specified intervals} Here we will discuss how to use \texttt{cisProxScores}. The calling sequence is \begin{verbatim} cisProxScores(smlSet, fmla, dradset, direc = NULL, folder, runname, geneApply = mclapply, saveDirector = TRUE, snpGRL=NULL, geneGRL=NULL, snpannopack="SNPlocs.Hsapiens.dbSNP.20090506", ffind=NULL, ...) \end{verbatim} so in fact one can pass an \texttt{smlSet} instance as input. However, if a suitable director instance is available, that can be used, avoiding the computation of the tests. If the director is used, we need only specify \begin{itemize} \item \texttt{dradset}: the collection of nested paired intervals around genes within which scores will be collected \item \texttt{geneGRL}: a list of GRanges, with one element per chromosome, providing coordinates of gene start and stop locations \item \texttt{snpGRL}: a list of GRanges, with one element per chromosome, providing coordinates of SNPs with names \item \texttt{ffind}: an index value into the manager-specific lists for the scores we wish to retrieve; in most cases this will be 1 \end{itemize} The names of probesets scored in \texttt{md} can be obtained: <>= ps = lapply(md@mgrs, function(x) colnames(x@fflist[[1]])) names(ps) = names(md@mgrs) ps @ The GRanges instance is then obtained as <>= getgr = function(x,map) abs(sapply(mget(x, map), "[", 1)) st = lapply(ps, getgr, illuminaHumanv1CHRLOC) en = lapply(ps, getgr, illuminaHumanv1CHRLOCEND) sp = names(ps) grl = list() for (i in 1:length(ps)) { grl[[i]] = GRanges(seqnames=sp[i], IRanges(st[[i]], en[[i]])) names(grl[[i]]) = ps[[i]] } names(grl) = names(ps) @ SNP locations from hg18 can be obtained as follows (we did chr20 above): <>= seqlevels(c20) <- sub("chr", "", seqlevels(c20)) c21l = getSNPlocs("chr21") # hg18 c21 = GRanges(IRanges(c21l$loc, width=1), seqnames="21") rs21 = paste("rs", c21l$RefSNP_id, sep="") names(c21) = rs21 srl = list(`20`=c20, `21`=c21) @ Now compute scores in intervals to 50kb around gene, then from 50kb to 100kb, then from 100kb to 500kb: <>= csc = cisProxScores( direc=md, dradset = c(0,50000,100000, 500000), snpGRL = srl, geneGRL = grl, geneApply=lapply, ffind=1 ) @ This yields a deeply nested structure; a hint of best scores per gene within intervals comes from <>= lapply(csc, lapply, sapply, function(x)max(unlist(x))) @ \section{Working with multiple populations} Here we will discuss how to take multiple population-specific directors and obtain a director representing the sums across populations. \section{Session information} <>= sessionInfo() @ \end{document}