%\VignetteIndexEntry{Hypergeometric tests for less common model organisms}
\documentclass[11pt]{article}

\usepackage{times}
\usepackage{hyperref}
\usepackage[authoryear,round]{natbib}
\usepackage{times}
\usepackage{comment}

\textwidth=6.2in
\textheight=8.5in
\oddsidemargin=.1in
\evensidemargin=.1in
\headheight=-.3in

\newlength{\smallfigwidth}
\setlength{\smallfigwidth}{6cm}

\newcommand{\scscst}{\scriptscriptstyle}
\newcommand{\scst}{\scriptstyle}
\newcommand{\Rfunction}[1]{{\texttt{#1}}}
\newcommand{\Robject}[1]{{\texttt{#1}}}
\newcommand{\Rpackage}[1]{{\textsf{#1}}}
\newcommand{\Rclass}[1]{{\textit{#1}}}

\bibliographystyle{plainnat}

\title{How To Use GOstats and Category to do Hypergeometric testing with unsupported model organisms}
\author{M. Carlson}

\begin{document}

\maketitle

\section{Introduction}

This vignette is meant as an extension of what already exists in the
GOstatsHyperG.pdf vignette.  It is intended to explain how a user can
run hypergeometric testing on GO terms or KEGG terms when the organism
in question is not supported by an annotation package.  The 1st thing
for a user to determine then is whether or not their organism is
supported by an organism package through AnnotationForge.  In order to
do that, they need only to call the available.dbschemas() method from
AnnotationForge.

<<available Schemas>>=
library("AnnotationForge")
available.dbschemas()
@ 

If the organism that you are using is listed here, then your organism
is supported.  If not, then you will need to find a source or GO (org
KEGG) to gene mappings.  One source for GO to gene mappings is the
blast2GO project.  But you might also find such mappings at Ensembl or
NCBI.  If your organism is not a typical model organism, then the GO
terms you will find are probably going to be predictions based on
sequence similarity measures instead of direct measurements.  This is
something that you might want to bear in mind when you draw
conclusions later.

In preparation for our subsequent demonstrations, lets get some data
to work with by borrowing from an organism package.  We will assume
that you will use something like \Rfunction{read.table} to load your
own annotation packages into a data.frame object.  The starting object
needs to be a data.frame with the GO Id's in the 1st col, the evidence
codes in the 2nd column and the gene Id's in the 3rd.

<<Acquire annotation data>>=
library("org.Hs.eg.db")
frame = toTable(org.Hs.egGO)
goframeData = data.frame(frame$go_id, frame$Evidence, frame$gene_id)
head(goframeData)
@ 


\subsection{Preparing GO to gene mappings}

When using GO mappings, it is important to consider the data structure
of the GO ontologies.  The terms are organized into a directed acyclic
graph.  The structure of the graph creates implications about the
mappings of less specific terms to genes that are mapped to more
specific terms.  The Category and GOstats packages normally deal with
this kind of complexity by using a special GO2All mapping in the
annotation packages.  You won't have one of those, so instead you will
have to make one.  AnnotationDbi provides some simple tools to
represent your GO term to gene mappings so that this process will be
easy.  First you need to put your data into a GOFrame object.  Then
the simple act of casting this object to a GOAllFrame object will tap
into the GO.db package and populate this object with the implicated
GO2All mappings for you.


<<transformGOFrame>>=
goFrame=GOFrame(goframeData,organism="Homo sapiens")
goAllFrame=GOAllFrame(goFrame)
@ 


In an effort to standardize the way that we pass this kind of custom
information around, we have chosen to support geneSetCollection
objects from GSEABase package.  You can generate one of these objects
in the following way:

<<Make GSC>>=
library("GSEABase")
gsc <- GeneSetCollection(goAllFrame, setType = GOCollection())
@ 


\subsection{Setting up the parameter object}

Now we can make a parameter object for GOstats by using a special
constructor function.  For the sake of demonstration, I will just use
all the EGs as the universe and grab some arbitrarily to be the
geneIds tested.  For your case, you need to make sure that the gene
IDs you use are unique and that they are the same type for both the
universe, the geneIds and the IDs that are part of your
geneSetCollection.

<<<make parameter>>=
library("GOstats")
universe = Lkeys(org.Hs.egGO)
genes = universe[1:500]
params <- GSEAGOHyperGParams(name="My Custom GSEA based annot Params", 
                             geneSetCollection=gsc, 
                             geneIds = genes, 
                             universeGeneIds = universe, 
                             ontology = "MF", 
                             pvalueCutoff = 0.05, 
                             conditional = FALSE, 
                             testDirection = "over")
@

And finally we can call \Rfunction{hyperGTest} in the same way we
always have before.

<<call HyperGTest>>=
Over <- hyperGTest(params)
head(summary(Over))
@ 


\subsection{Preparing KEGG to gene mappings}

This is much the same as what you did with the GO mappings except for
two important simplifications. First of all you will no longer need to
track evidence codes, so the object you start with only needs to hold
KEGG Ids and gene IDs.  Seconly, since KEGG is not a directed graph,
there is no need for a KEGG to All mapping.  Once again I will borrow
some data to use as an example.  Notice that we have to put the KEGG
Ids in the left hand column of our initial two column data.frame.

<<KEGGFrame object>>=
frame = toTable(org.Hs.egPATH)
keggframeData = data.frame(frame$path_id, frame$gene_id)
head(keggframeData)
keggFrame=KEGGFrame(keggframeData,organism="Homo sapiens")
@ 

The rest of the process should be very similar.

<<KEGG Parameters>>=
gsc <- GeneSetCollection(keggFrame, setType = KEGGCollection())
universe = Lkeys(org.Hs.egGO)
genes = universe[1:500]
kparams <- GSEAKEGGHyperGParams(name="My Custom GSEA based annot Params", 
                               geneSetCollection=gsc, 
                               geneIds = genes, 
                               universeGeneIds = universe,  
                               pvalueCutoff = 0.05, 
                               testDirection = "over")
kOver <- hyperGTest(params)
head(summary(kOver))
@ 


<<info, results=tex>>=
toLatex(sessionInfo())

@

\end{document}