%
% NOTE -- ONLY EDIT THE .Rnw FILE!!!  The .tex file is
% likely to be overwritten.
%

%\VignetteIndexEntry{GGtools: exploring genetics of gene expression}
%\VignetteDepends{snpStats}
%\VignetteKeywords{expression, genetics, high performance computing}
%\VignettePackage{GGtools}

\documentclass[12pt]{article}

\usepackage{amsmath,pstricks}
% With MikTeX 2.9, using pstricks also requires using auto-pst-pdf or running
% pdflatex will fail. Note that using auto-pst-pdf requires to set environment
% variable MIKTEX_ENABLEWRITE18=t on Windows, and to set shell_escape = t in
% texmf.cnf for Tex Live on Unix/Linux/Mac.
\usepackage{auto-pst-pdf}
\usepackage[authoryear,round]{natbib}
\usepackage{hyperref}


\textwidth=6.2in
\textheight=8.5in
%\parskip=.3cm
\oddsidemargin=.1in
\evensidemargin=.1in
\headheight=-.3in

\newcommand{\scscst}{\scriptscriptstyle}
\newcommand{\scst}{\scriptstyle}


\newcommand{\Rfunction}[1]{{\texttt{#1}}}
\newcommand{\Robject}[1]{{\texttt{#1}}}
\newcommand{\Rpackage}[1]{{\textit{#1}}}
\newcommand{\Rmethod}[1]{{\texttt{#1}}}
\newcommand{\Rfunarg}[1]{{\texttt{#1}}}
\newcommand{\Rclass}[1]{{\textit{#1}}}

\textwidth=6.2in

\bibliographystyle{plainnat} 
 
\begin{document}
%\setkeys{Gin}{width=0.55\textwidth}

\title{GGtools 2011: leaner software for genetics of gene expression}
\author{VJ Carey (stvjc at channing.harvard.edu)}
\maketitle

\tableofcontents

\clearpage

\section{Introduction; major changes}

Since its introduction in 2006, GGtools has provided a number of 
data structures and tools for exploratory data analysis and
hypothesis testing in expression genetics.  Since 2006, Bioconductor's
facilities for representing genomes and for exploiting
advanced ideas in computing and statistical modeling have evolved considerably,
and many components of GGtools/GGBase need to be discarded to
promote use of new facilities.

The following major changes have been made.
\begin{itemize}
\item \texttt{smlSet} instances should not be used for genotyping panels
of more than one million loci.  A packaging discipline has been introduced.
An expression genetics experiment should be managed in a package in which
expression data are held in an \texttt{ExpressionSet} instance in the
\texttt{data} folder, and \textit{snpStats} \texttt{SnpMatrix} instances
are stored in \texttt{inst/parts}.  After installation, the \texttt{getSS}
function constructs an \texttt{smlSet} instance on the fly -- typically
with modest memory footprint because only a fraction of available loci
are held in memory.
\texttt{externalize} demonstrates construction of a compliant package for
expression genetics data by converting an existing whole- or partial-genome 
smlSet instance into a package.
\item \texttt{gwSnpTests} has been retained for focused in-memory test
computation, but it will be phased out.
\item \texttt{eqtlTests} is the primary function for inference
in expression genetics.  This function takes genotype and phenotype
information from
an \texttt{smlSet}, passes it through testing facilities of
\textit{snpStats}, and creates managed out-of-memory storage for
voluminous testing results.
\item Facilities for metadata handling, e.g., determining SNP locations,
gene locations, vocabulary translations, are being phased out.  Users
must construct location and feature name metadata using up-to-date
Bioconductor resources or their own resources.
\end{itemize}

The remainder of this document illustrates basic activities now
supported by the package.

\section{A simple exploration of eQTL for a selected gene}

\subsection{Filtering an smlSet for a chromosome-wide test}

We retain a two-chromosome example with expression data
derived from the Wellcome Trust GENEVAR project archives, and
genotype data from HapMap phase II:
<<lkd>>=
library(GGtools)
data(hmceuB36.2021)
hmceuB36.2021
@

A commonly found gene with eQTL is CPNE1 on chromosome 20.  We
can determine the probe name in use for CPNE1 as follows:
<<lkn>>=
library("illuminaHumanv1.db")
cpn = get("CPNE1", revmap(illuminaHumanv1SYMBOL))
use = intersect(cpn, featureNames(hmceuB36.2021))
if (length(use) == 0) stop("probe not on array")
use = use[1]
@
We will now trim down the \texttt{smlSet} instance
to this gene and SNPs on chr20:
<<trim>>=
hmlit = hmceuB36.2021[probeId(use),]
hmlit = hmlit[ chrnum("20"), ]
hmlit
@

\subsection{Executing tests and interrogating the results}

The preferred testing procedure is:
<<doe>>=
tname = function() gsub(".*/|.*\\\\", "", tempfile())
e1 = eqtlTests(hmlit, ~male, targdir=tname(), runname=tname())
e1
class(e1)
@
We support focused queries into the manager of results.  For this
application we are computing $\chi^2(1)$ statistics for
each gene-SNP combination, measuring the fit of an additive
genetic model with, in this case, adjustment for gender.

<<doq>>=
e1[rsid("rs6060535"), probeId("GI_23397697-A")]
@

We can acquire the most predictive SNP using \texttt{topFeats}.
<<dotop>>=
topFeats(probeId(cpn), mgr=e1, ffind=1)
@
Here the \texttt{ffind} parameter must be set to pick among
files that might be managed by the manager bound to \texttt{mgr}.
See the technical appendix for more details.

\subsection{Visualization}

To visualize the results, using hg18 locations for the SNP,
we must acquire location metadata.  Note that we are using
the 2009 SNPlocs metadata from Bioconductor; for later
builds, an as.GRanges parameter can be set to help avoid some
of the conversion tasks.

<<get09>>=
library(SNPlocs.Hsapiens.dbSNP.20090506)
c20 = getSNPlocs("chr20")
c20 = SNPlocs.Hsapiens.dbSNP.20100427:::.SNPlocsAsGranges(c20, "chr20")
rs20 = paste("rs", elementMetadata(c20)$RefSNP_id, sep="")
names(c20) = rs20
length(c20)
c20[1:3,]
@
The names of SNP for which tests were computed are
<<lkn>>=
tn = rownames(e1@fflist[[1]])
tn[1:10]
@
We can now obtain the manhattan plot:
<<lkman,fig=TRUE>>=
manhPlot(cpn, mgr=e1, ffind=1, locGRanges=c20, cex=3)
@

In general we will prefer to have this information in a browser track.  To accomplish
this, add the test results to the GRanges instance for locations as `score'
and export a wig file.

We have to align the SNPs with locations with those for
which we have tests.  First, restrict the locations.
<<dook>>=
okn = intersect(tn, names(c20))
c20ok = c20[okn]
@
Now obtain the -log10 p values:
<<getp>>=
chis1 = e1[, probeId(cpn)][[1]][okn,]
ml10p = -log10(1-pchisq(chis1, 1))
@
Bind to the location data:
<<dob>>=
elementMetadata(c20ok)$score = ml10p
c20ok[1:5,]
@
and export:
<<doex>>=
library(rtracklayer)
export(c20ok, "cpne1.wig")
@

This is imported as a custom track and viewed as:

\vspace*{1cm}

\setkeys{Gin}{width=1.15\textwidth}

\includegraphics{cpne1_2011}

\subsection{Checking coincidence with other genomic features}

When we store the statistical results in a \texttt{GRanges}
instance, we have capacity to quickly assess genomic context
of high-scoring SNP.  The top 20 SNP and their locations are
<<lkgsc>>=
top200 = c20ok[order(elementMetadata(c20ok)$score, decreasing=TRUE)[1:200]]
top200[196:200]
@
We see that we have SNP with p-values at most $10^{-5}$.  We will
now check whether any of these lie in regulatory regions defined
by ORegAnno, confined to 10MB near CPNE1 on chr20.

<<gettab>>=
library(rtracklayer)
ss = browserSession()
genome(ss) = "hg18"
GG = GRanges(seqnames="chr20", IRanges(30000000,40000000))
oq = ucscTableQuery(ss, "oreganno", range=GG)
toq = track(oq)
ov = subsetByOverlaps(toq, top200)
ov
@

For further discovery of the regulatory feature characteristics,
consider 
<<domore,eval=FALSE>>=
tableName(oq) = "oregannoAttr"
featdat = getTable(oq)
featdat[ featdat$id %in% ov$name, ]
@
Retrieving the table can take a while, so we suppress it in this
vignette, but on 8 Mar 2011 the result was:
\begin{verbatim}
> featdat[ featdat$id %in% ov$name, ]
               id       attribute                                 attrVal
5809  OREG0004040            type                       REGULATORY REGION
5810  OREG0004040            Gene                                  FER1L4
5811  OREG0004040         Dataset                 Stanford ENCODE Dataset
5812  OREG0004040 EvidenceSubtype Transient transfection luciferase assay
5833  OREG0004046            type                       REGULATORY REGION
5834  OREG0004046            Gene                                   RBM12
5835  OREG0004046         Dataset                 Stanford ENCODE Dataset
5836  OREG0004046 EvidenceSubtype Transient transfection luciferase assay
86472 OREG0025894            type                       REGULATORY REGION
86473 OREG0025894            TFbs                                    CTCF
86474 OREG0025894         Dataset          CTCF ChIP-chip sites (Ren lab)
86475 OREG0025894 EvidenceSubtype                ChIP-on-chip (ChIP-chip)
\end{verbatim}
enabling us to learn more about the annotations of regions in which high-scoring SNP
are found.

Clearly other UCSC-mediated resources can be used similarly to help interpret SNP scores.


\section{Supporting comprehensive testing}

\subsection{The default behavior of \texttt{eqtlTests}}

\texttt{eqtlTests} will, by default, compute scores for all gene-SNP combinations
in the \texttt{smlSet} instance to which it is applied.  With a multicore
system, some concurrency can be achieved.  To illustrate the activity,
we take only 20 genes sampled from the illumina platform and compute all
eQTL tests for SNP on chromosomes 20 and 21.
<<lke>>=
NG = length(featureNames(hmceuB36.2021))
hmct = hmceuB36.2021[ sample(1:NG, size=20, replace=FALSE), ]
applier = lapply
chkmult = function() {length(grep("^multicore$", installed.packages()[,1])) > 0}
if (chkmult()) {
  library(multicore)
  options(cores=min(c(multicore:::volatile$detectedCores, 6)))
  applier = mclapply
  }
unix.time(e2 <- eqtlTests(hmct, ~male, geneApply=applier, targdir=tempdir(), runname=tname()))
@
We are now managing 3.4 million tests.  These are stored out of memory using
the \textit{ff} system for memory-mapped disk representations of R objects.
<<lkff>>=
e2@fflist[[1]]
sapply(e2@fflist,dim)
@
It is unpleasant to use the at-sign fflist notation but until a richer set
of use cases is identified, this primitive idiom will be used.

In general we will break up a comprehensive run by gene sets -- typically one
run for all genes on a chromosome.  In this case, we will generate 22 eqtlTestManager
instances; these can be unified with a cisTransDirector or multiCisDirector
object.  We will discuss that later.

\subsection{Reducing the memory footprint for comprehensive `same chromosome' tests}

In the example given above, the data for genotypes on chr20 and chr21 are simultaneously
in memory.  Up to now, we have often stored all genotype data in memory when
using smlSets, leading to multiple Gb of RAM consumption, causing conflicts
and loss of speed.  We now describe an approach to building and using package-based
experiment representations, which allow reduced and more flexible use of resources.

We can take a unified smlSet instance and turn it into a package using the
\texttt{externalize} function.  We also install it:
<<doext>>=
edname = paste("exdem", tname(), sep="")
if (interactive() & file.exists(edname)) stop(paste("will not remove existing '", edname, "' folder"))
externalize(hmceuB36.2021, edname)
tar(tna <- paste(edname,".tar.gz", sep=""), edname, compression="gzip")
install.packages(tna, repos=NULL, type="source")
try(system(paste("rm -rf", tna)))
try(system(paste("rm -rf", edname)))
@
Now we can use \texttt{getSS} to create tailored smlSet instances on the fly.
<<lkd>>=
gc()
s1 = getSS(edname, "20")
gc()
rm(s1)
gc()
@
This technology is used in the makeDiagDirector function.  This function requires
a list that maps genes to chromosomes.  In this example, we take 20 genes from
each of chromosome 20 and 21 to define the map.  All same-chromosome tests for these
40 genes will be computed.  No concurrency is used.
<<mymap>>=
mymap = lapply(c("20", "21"), function(x) get(x, revmap(illuminaHumanv1CHR))[1:20])
mymap = lapply(mymap, function(x) intersect(x, featureNames(hmceuB36.2021)))
names(mymap) = c("20", "21")
unix.time(md <-  makeDiagDirector(edname, mymap, rhs=~male, sleeplen=2, runname=tname(),
    targdir=tname()))
md
@
It would be typical to save the director object (here named \texttt{md}) to disk, so that the ff files
that have been computed can be reused conveniently.

\subsection{Acquiring test results within specified intervals}

Here we will discuss how to use cisProxScores with the director computed above.
The calling sequence is
\begin{verbatim}
cisProxScores(smlSet, fmla, dradset, direc = NULL, folder, runname, 
      geneApply = mclapply, saveDirector = TRUE, snpGRL=NULL,
      geneGRL=NULL, snpannopack="SNPlocs.Hsapiens.dbSNP.20090506", ffind=NULL, ...)
\end{verbatim}
so in fact one can pass an \texttt{smlSet} instance as input.  However, if a
\Sexpr{class(md)} instance is available, that can be used, avoiding the computation
of the tests.  If the director is used, we need only specify 
\begin{itemize}
\item \texttt{dradset}: the collection of nested paired intervals around
genes within which scores will be collected
\item \texttt{geneGRL}: a list of GRanges, with one element per chromosome,
providing coordinates of gene start and stop locations
\item \texttt{snpGRL}: a list of GRanges, with one element per chromosome,
providing coordinates of SNPs with names
\item \texttt{ffind}: an index value into the manager-specific lists
for the scores we wish to retrieve; in most cases this will be 1
\end{itemize}

The names of probesets scored in \texttt{md} can be obtained:
<<getb>>=
ps = lapply(md@mgrs, function(x) colnames(x@fflist[[1]]))
names(ps) = names(md@mgrs)
ps
@
The GRanges instance is then obtained as
<<getgr>>=
getgr = function(x,map) abs(sapply(mget(x, map), "[", 1))
st = lapply(ps, getgr, illuminaHumanv1CHRLOC)
en = lapply(ps, getgr, illuminaHumanv1CHRLOCEND)
sp = names(ps)
grl = list()
for (i in 1:length(ps)) {
  grl[[i]] = GRanges(seqnames=sp[i], IRanges(st[[i]], en[[i]]))
  names(grl[[i]]) = ps[[i]]
  }
names(grl) = names(ps)
@
SNP locations from hg18 can be obtained as follows (we did chr20 above):
<<getsr>>=
#allsn = lapply(md@mgrs, function(x) rownames(x@fflist[[1]]))
#names(allsn) = names(md@mgrs)
#c20 = getSNPlocs("ch20", as.GRanges=TRUE)
seqlevels(c20) <- sub("chr", "", seqlevels(c20))
#names(c20) = paste("rs", elementMetadata(c20)$RefSNP_id, sep="")
c21 = getSNPlocs("chr21")
c21 = SNPlocs.Hsapiens.dbSNP.20100427:::.SNPlocsAsGranges(c21, "chr21")
rs21 = paste("rs", elementMetadata(c21)$RefSNP_id, sep="")
names(c21) = rs21
seqlevels(c21) <- sub("chr", "", seqlevels(c21))
srl = list(`20`=c20, `21`=c21)
@
Now compute scores in intervals to 50kb around gene, then from 50kb to 100kb, then from 100kb to 500kb:
<<lksc>>=
csc = cisProxScores( direc=md, dradset = c(0,50000,100000, 500000), snpGRL = srl, geneGRL = grl,
  geneApply=lapply, ffind=1 )
@
This yields a deeply nested structure; a hint of best scores per gene within intervals
comes from
<<lklap>>=
lapply(csc, lapply, sapply,  function(x)max(unlist(x)))
@



\section{Working with multiple populations}

Here we will discuss how to take multiple population-specific directors and obtain a director
representing the sums across populations.


\section{Session information}

<<lks>>=
sessionInfo()
@



\end{document}