%
%\VignetteIndexEntry{ontoTools Primer}
%\VignetteDepends{ontoTools, org.Hs.eg.db, graph, GO.db}
%\VignetteKeywords{ontology}
%\VignettePackage{ontoTools}
%
% NOTE -- ONLY EDIT THE .Rnw FILE!!!  The .tex file is
% likely to be overwritten.
%
\documentclass[12pt]{article}

\usepackage{amsmath}
\usepackage[authoryear,round]{natbib}
\usepackage{hyperref,epsfig}


\newcommand{\Rfunction}[1]{{\texttt{#1}}}
\newcommand{\Robject}[1]{{\texttt{#1}}}
\newcommand{\Rpackage}[1]{{\textit{#1}}}
\newcommand{\Rmethod}[1]{{\texttt{#1}}}
\newcommand{\Rfunarg}[1]{{\texttt{#1}}}
\newcommand{\Rclass}[1]{{\textit{#1}}}


\textwidth=6.2in
\textheight=8.5in
%\parskip=.3cm
\oddsidemargin=.1in
\evensidemargin=.1in
\headheight=-.3in

\newcommand{\scscst}{\scriptscriptstyle}
\newcommand{\scst}{\scriptstyle}

\bibliographystyle{plainnat} 
 
\begin{document}
\title{{\it ontoTools}: software for working with ontologies in Bioconductor}
\author{VJ Carey {\tt stvjc@channing.harvard.edu}}
\maketitle
\tableofcontents

\section{Introduction}

An ontology is a vocabulary arranged as a rooted directed
acyclic graph (rDAG, or just DAG).  This package
provides tools for working with such data structures.
A key concern is the interpretation of sets of objects
mapped to an ontology.

%This package is in development in the absence
%of a truly mature graph data structure system in R.
%We use the {\it graph} package of Gentleman and Whalen,
%and provide some ad hoc extensions such as a {\tt compoundGraph}
%class that can be much more straightforwardly implemented.

This package relies heavily on the \Rpackage{graph}
package from Bioconductor, and the {\it SparseM} package
from UIUC, available from CRAN.  %That package is also
%not fully mature, and in particular seems to have some
%bugs (in version 0.24) when used with Windows R.

\section{Basic structures}


\subsection{Rooted DAG}

The {\tt rootedDAG} class is built using a {\tt graph::graphNEL} and
specifying a root.

This package comes with {\tt litOnto}, which is a {\tt graphNEL}.
<<setupSmall>>=
library(graph)
library(ontoTools)
#library(Rgraphviz)
data(litOnto)
print(litOnto)
print(class(litOnto))
@
Children point to parents by default.
When Rgraphviz is available, the graphical
structure can be plotted using {\tt plot}.
%# should be fig=TRUE
%#plot(litOnto)
@

%\epsfig{file=demo1.ps,width=3.9in}

It is possible to reverse the directions of arcs in such a graph.

The {\tt rootedDAG} object is constructed by hand.
<<showrDAG>>=
g1 <- new("rootedDAG", DAG=litOnto, root="A")
show(DAG(g1))
root(g1)
@


\subsection{ {\tt compoundGraph}}  
This structure was devised to
allow presentations like the following:

%%%% ajr:  BAD VINCE!  \epsfig{file=demoComp.ps,width=4.5in}

\includegraphics[width=4.5in]{demoComp}

The R code that generated the dot code from which this
image was created, and the dot language code for the image, are as follows:
<<compGraph,keep.source=TRUE>>=
data(litObj)
com <- new("compoundGraph", grList=list(litOnto,litObj),
         between= list(c("W","E"), c("X", "K"), c("Y","B"),
        c("Z","D"), c("Z","G")))
compRendList<-list(
      list( prenodes="node [fontsize=28 color=orange fontcolor=orange];",
         preedges="edge [color=black];"),
      list( prenodes="node [fontsize=28 color=green fontcolor=green];",
         preedges="edge [color=black];"),
      betweenRend = list( preedges = "edge [color = red]"))

ff <- "demoComp.dot"
toDot(com, ff, compRendList)
cat(readLines(ff),sep="\n")
@

\subsection{{ \tt ontology}}  
This is just a lightly annotated {\tt rootedDAG}.

<<makeOnto>>=
g1 <- new("rootedDAG", DAG=litOnto, root="A")
o1 <- new("ontology", name="demo", version="0.1",
        rDAG=g1)
show(o1)
@
\subsection{ Mapped key-value list}  

At present this is not a class.

@

<<sparseKVmap>>=
kvlist <- list(W="E", X="K", Y="B", Z=c("D","G"))
litMap <- otkvList2namedSparse( names(kvlist), LETTERS[1:12], kvlist )
print(litMap)
@

\subsection{ OOC: object-ontology complex }
This combines an ontology and a mapped key-value list that
specifies how objects map to terms.

<<makeOOC>>=
ooc1 <- makeOOC( o1, litMap )
show(ooc1)
#coverageMat(ooc1)
@

\section{Basic methods}

\subsection{Matrix conversion of DAG; namedSparse extension}

This should be a generic.  There are two basic sorts
of matrices available: {\tt child2parent} which has
1 in element (i,j) if node i is the child of node j,
and an accessibility matrix that has 1 in element (i,j)
if node i may be reached from node j.

\subsubsection{child2parent}

<<DAG2matrix>>=
g1 <- new("rootedDAG", DAG=litOnto, root="A")
mg1d <- getMatrix(g1, "child2parent", "dense")
print(mg1d)
@

Typically a sparse representation will be needed.

<<sparsifyMap,keep.source=TRUE>>=
ng1 <- getMatrix(g1, "child2parent", "sparse")
ng1 <- new("namedSparse", mat=ng1)
dimnames(ng1) <- list(as.character(1:dim(ng1@mat)[1]),
     as.character(1:dim(ng1@mat)[2]))
print(ng1@mat)

print(as.matrix(ng1))
@

We have not implemented a full set of element subsetting
operations as we expect the SparseM authors to do so.
However, efficient operations using names of rows and columns
are essential.  These are implemented using hashed
environments.

<<namedSparse>>=
dimnames(ng1) <- list(letters[1:12], LETTERS[1:12])
print(class(ng1))
print(getSlots("namedSparse"))
print(as.matrix(ng1))

@
\subsubsection{Accessibility}
Currently this is only handled for an ontology.
<<accessMat>>=
show(accessMat(o1))
@

\subsection{Coverage matrix for an OOC}
This matrix has element 1 in row r, column c if object
corresponding to row r is covered by the term corresponding
to column c.  This means that the term for column c is
accessible from some term to which the object for row r
was explicitly mapped.
<<coverageMat>>=
print(coverageMat(ooc1))

@
\subsection{Depth of terms in an ontology}
We need to be able to get the set of terms at a certain
depth (root is depth 0) and the depth of any given term.

<<depthStruct>>=
print(ontoDepth(g1))
ds1 <- depthStruct(g1)
print(ds1$tag2depth("B"))
print(ds1$depth2tags(3))

@
\section{Applications to large data objects}

\subsection{Key-value list for LocusLink to GO}

The {\tt org.Hs.eg.db} metadata package from
Bioconductor includes org.Hs.egGO,
to regarded as a key-value environment
mapping.  The otkvEnv2namedSparse function can
be used with this environment and the GOMF terms
to create the object to term mapping.
See
ooMapLL2GOMFdemo.rda.

@
\subsection{GO graph and ontology for molecular function}
@
The buildGOGraph function was used to create
goMFgraph.1.15.rda on the basis of the GO package.

With this graph in hand, we construct the rooted DAG
<<rDAGbuild>>=
data(goMFgraph.1.15)
gomfrDAG <- new("rootedDAG", root="GO:0003674", DAG=goMFgraph.1.15)
@
and then the ontology:
<<ontoBuild>>=
GOMFonto <- new("ontology", name="GOMF", version="bioc 1.3.1", rDAG=gomfrDAG)
@

The term accessibility matrix is important for semantic
calculations.  This is a slow calculation and its result
is archived.
\begin{verbatim}
gomfAmat <- accessMat(GOMFonto)
save(gomfAmat,file="gomfAmat.rda", compress=TRUE)
\end{verbatim}

@
\subsection{Mapping from LocusLink to GO}

The LocusLink database is a collection of one-to-many
mappings from genes or gene-associated sequence to
GO tags.  This object-to-term (`ot') association is
provided as an environment-like resource by bioconductor in the org.Hs.egGO package.
We will use a named sparse matrix structure to encode the object
to term mapping.  For this example, the mapping is archived
after computation by the following code:
\begin{verbatim}
library(org.Hs.eg.db)
data(goMFgraph1.15)
obs <- ls(org.Hs.egGO)
tms <- nodes(goMFgraph.1.15)
ooMapLL2GOMFdemo <- otkvEnv2namedSparse( obs, tms, org.Hs.egGO )
save(ooMapLL2GOMFdemo,file="ooMapLL2GOMFdemo.rda")
\end{verbatim}

\section{Applications}

\subsection{Lord's {\it Bioinformatics} 2003 paper}

Lord's paper on semantic similarity measures defines
a probability $p(c)$ for each concept term $c$.
Semantic similarity between terms
$c_1$ and $c_2$ can be measured as $-\log p_{ms}(c_1, c_2)$,
where $p_{ms}$ is the so-called ``probability of the minimum
subsumer'', defined as
\[
p_{ms}(c_1, c_2) = \min\limits_{c \in S(c_1,c_2)} \{p(c)\},
\]
where $S(c_1, c_2)$ is the set of all terms that are
refined by both
$c_1$ and $c_2$.

We indicate how these computations can be performed
using test data in {\it ontoTools}.  Most of the functions defined
here as x.vig are included in the package (sometimes with slight modifications),
lacking the vig suffix.

\subsubsection{Setup of the data}

We begin by attaching the package and
convering the {\tt litOnto} information (encoding
the 12-term ontology) into a {\tt rootedDAG}.
<<smSetup>>=
library(ontoTools)
data(litOnto)
g1 <- new("rootedDAG", DAG=litOnto, root="A")
@
Now we create an ontology object based on this
rooted DAG:
<<smOnto>>=
o1 <- new("ontology", name="demo", version="0.1",
   rDAG=g1)
@
The mapping from objects to the concepts in the
ontology uses a key-value list, with multimappings
represented by vectors.
<<smKVlist>>=
kvlist <- list(W="E", X="K", Y="B", Z=c("D", "G"))
@
A special function compute the object to term
map:
<<smMap>>=
demomap <- otkvList2namedSparse( names(kvlist), LETTERS[1:12],
   kvlist )
@
We obtain the object-ontology complex.
<<smOOC>>=
demoooc <- makeOOC( o1, demomap )
@
The map from objects to terms is:
<<smOOmap>>=
print(OOmap(demoooc))
@
The coverage matrix can be directly computed:
<<smcov>>=
cov1 <- coverageMat(demoooc)
print(cov1)
@
\subsubsection{Concept probabilities}
We begin with the concept probabilities.
The counts of usages per term is
<<smUcounts>>=
#print(pc <- colSums(coverageMat(demoooc)))
ACC <- accessMat(ontology(demoooc))
acctms <- dimnames(ACC)[[2]]
print(acctms)
MAP <- OOmap(demoooc)
nterms <- function(x) length(nodes(DAG(rDAG(x))))
ontoTerms <- function(x) nodes(DAG(rDAG(x)))
print(ontoTerms)

usageCount.vig <- function (MAP, ACC) 
{
    maptms <- dimnames(MAP)[[2]]
    acctms <- dimnames(ACC)[[2]]
    usages <- rep(0,length(maptms))
    names(usages) <- maptms
    for (i in 1:nrow(MAP)) {
        hits <- maptms[as.matrix.ok(MAP[i, ]) == 1]
        usages[hits] <- usages[hits] + 1
        for (j in 1:length(hits)) {
            anctags <- acctms[as.matrix.ok(ACC[hits[j], ]) == 1]
            usages[anctags] <- usages[anctags] + 1
        }
    }
    usages
}
print(usages <- usageCount.vig(MAP,ACC))
    
@
because whenever a term is used, all its ancestors
are implicitly used as well (Lord section 2).
This inherited usage does not recurse, however.
A basic usage event fires up the graph once; the
fact that a descendent is used transitively
does not increment the usage of its ancestors.

The total number of basic usage events is
<<smUev>>=
print(N <- max(usages))
@
The vector of concept probabilities is
<<smCprob>>=
print(usages/N)
@
So we have
<<cprobFunc>>=
conceptProbs.vig <- function(ooc) {
 if (is(occ, "OOC")) stop("arg must have class OOC")
 pc <- usageCount.vig(OOmap(ooc), accessMat(ontology(ooc)))
 pc/max(pc, na.rm=TRUE)
}
@
\subsubsection{Identifying the most informative subsumer}
This task involves the ontology (to identify
subsumers) and then the concept probabilities.
<<subsumFunc>>=
subsumers.vig <- function(c1, c2, ont) {
 if (! is(ont, "ontology")) stop("ont must have class ontology")
 tmp <- colSums(accessMat(ont)[c(c1,c2),])
 names(tmp[tmp==2])
}
print(subsumers.vig("I", "K", ontology(demoooc)))
@
Thus we can use
<<pmsFunc>>=
pms.vig <- function(c1, c2, ooc) {
 if (! is(ooc, "OOC")) stop("arg must have class OOC")
 if (any(!(c(c1,c2) %in% nodes(DAG(rDAG(ontology(ooc))))))) 
     stop("some term not found in ontology DAG nodes")
 S <- subsumers.vig(c1,c2,ontology(ooc))
 pc <- conceptProbs.vig(ooc)
 min(pc[S])
}

print(pms("I", "K", demoooc))
@
Now we can compute semantic similarity relative to an
OOC:
<<semsimFunc>>=
semsim.vig <- function(c1, c2, ooc) 
 -log(pms.vig(c1,c2,ooc))
@


\subsection{Iyer's time-course experiment}

\subsubsection{Setup}

We attach the Iyer517 library and the
IyerAnnotated data structure.  IyerAnnotated
is a data.frame that gives the mapping from
probes in Iyer517 to up to 5 GO MF tags.

<<IySetup>>=
library(ontoTools)
library(Iyer517)
data(IyerAnnotated)
print(IyerAnnotated[1:3,])
@
We also attach the concept probabilities for GO MF,
relative to the LocusLink usages.
<<IyCP>>=
data(LL2GOMFcp.1.15)
print(LL2GOMFcp.1.15[1:3])
@

We also define a function that extracts the most informative
of a set of tags:
<<MostInfFunc>>=
getMostInf <- function (tags,pc=LL2GOMFcp.1.15) 
{
    if (length(tags) == 1) 
        return(tags)
    if (all(is.na(tags))) return(NA)
    tags[pc[tags] == min(pc[tags])][1]
}
@
and apply this to the probe-specific tag sets:
<<IyMostInf,keep.source=TRUE>>=
GOmost <- apply(IyerAnnotated[, 5:9], 1, 
  function(x) getMostInf(as.character(x[!is.na(x)])))
@

We also need an OOC for GOMF relative to LocusLink:

<<GOMFOOC>>=
data(goMFgraph.1.15)
data(LL2GOMFooMap.1.15)
data(goMFamat.1.15)
#
# build the rooted DAG, the ontology, and the OOC objects
#
gomfrDAG <- new("rootedDAG", root="GO:0003674", DAG=goMFgraph.1.15)
GOMFonto <- new("ontology", name="GOMF", version="bioc 1.3.1", rDAG=gomfrDAG)
LLGOMFOOC <- makeOOC(GOMFonto, LL2GOMFooMap.1.15)
#
@
\subsubsection{Filtering the IyerAnnotated tags}
The Iyer data have been clustered.
<<IclustTab>>=
print(table(IyerAnnotated$Iclust))

@
We will work with the four largest clusters,
extracting the most informative tag for each cluster.

<<IyMostInfTags,keep.source=TRUE>>=
getMostInfTags <- function(let) {
grp.GO <- GOmost[ IyerAnnotated$Iclust==let ]
grp.GO[!is.na(grp.GO)]
}

A.GO <- getMostInfTags("A")
B.GO <- getMostInfTags("B")
D.GO <- getMostInfTags("D")
H.GO <- getMostInfTags("H")

@
\subsubsection{Computing pairwise semantic similarity}
The next function
computes the pairwise semantic similarities in
a group of tags.  However, to save time, the
loop is truncated at 30 pairs.
<<pwiseSim>>=
getSim <- function(x,Nchk=30) {
 library(combinat)
 prs <- combn(x,2)
 sim <- rep(NA,ncol(prs))
 for (i in 1:ncol(prs))
   {
   if (i > Nchk) break
   cat(i)
   if (any(!(c(prs[1,i], prs[2,i]) %in% goMFamat.1.15@Dimnames[[1]])))
     sim[i] = NA
   else
     sim[i] <- semsim( prs[1,i], prs[2,i], acc=goMFamat.1.15, pc=LL2GOMFcp.1.15, ooc=LLGOMFOOC )
   }
 sim[1:Nchk]
 }
@
We compute the pairwise similarities for the 4 groups.
<<calcSims>>=
Asim <- getSim(A.GO)
save(Asim,file="Asim.rda")
Bsim <- getSim(B.GO)
save(Bsim,file="Bsim.rda")
Dsim <- getSim(D.GO)
save(Dsim,file="Dsim.rda")
Hsim <- getSim(H.GO)
save(Hsim,file="Hsim.rda")
@
Display the similarity statistics:
<<boxpSims,fig=TRUE,eval=TRUE>>=
boxplot(Asim,Bsim,Dsim,Hsim)
@
\end{document}