---
title: "Creating pangenomes using FindMyFriends"
author: "Thomas Lin Pedersen"
date: "June 12th 2015"
output: 
    rmarkdown::html_vignette:
        css: styles.css
vignette: >
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteIndexEntry{Creating pangenomes using FindMyFriends}
  %\usepackage[utf8]{inputenc}
---

*  *  *

FindMyFriends is an R package for creating pangenomes of large sets of microbial
organisms, together with tools to investigate the results. The number of 
tools to create pangenomes is large and some of the more notable ones are 
OrthoMCL, PanOCT, inParanoid and Sybil. Within the R framework micropan 
(available from CRAN) is currently the only other entry, but the scope of 
FindMyFriends is much broader. 

Within comparative microbiol genomics, a pangenome is defined as a collection of
all the genes present in a set of related organisms and grouped together by some
sort of homology. This homology measure has mainly been some sort of blast 
derived similarity, but as you shall see here, this is not the only possibility.

Pangenomes have many uses, both as organisational tools and as a mean to get
insight into the collective genomic information present within a group of
organisms. An example of the former is to ensure an equal functional annotation
of all genes within a set of genomes by annotating pangenome gene groups rather
than single genes. An example of the latter is to use a pangenome together with 
phenotypic information to discover new links between genes and phenotypes.

Usually pangenomes are calculated from the result
of blasting all genes in the organisms under investigation against each other.
This step is very time-consuming due to the complexity of the BLAST algorithm. 
FindMyFriends is one of the few tools that take a different approach and instead
relies on an alignment free methodology. In FindMyFriends gene similarities are
calculated using kmer feature vectors. A kmer feature vector is a decomposition
of a sequence into substrings of length k.

```
GATTCGATTAG  ->  ATT: 2
                 CGA: 1
                 GAT: 2
                 TAG: 1
                 TCG: 1
                 TTA: 1
                 TTC: 1
```

With this decomposition the problem of calculating sequence similarity is 
reduced to a vector similarity problem, for which there are many different 
solution. The one employed in FindMyFriends is called cosine similarity, which
is effectively the cosine to the angle between the two vectors. If the vector
space is strictly positive then the value is bound between 0 and 1 with 1 being
100 % similarity.

How to convert similarity measures into groupings is again a problem with many 
solutions. FindMyFriends opts for a for a graph theory solution, using the
infomap community detection algorithm. This is similar in theory to OrthoMCL,
though the algorithm differs (Infomap vs. Markov Clustering). It should be noted
that the choice of community detection algorithm is open in FindMyFriends as all
algorithms implemented within the igraph framework is available - Infomap is
highly adviced though.

There is a lot more to FindMyFriends than this though - some of it will be 
discussed in the following.

## Data to use in this vignette
This vignette will use a collection of 10 different *Mycoplasma pneumonia* and
*hyopneumonia* strains. These organisms have an appreciable small genome making 
it easier to distribute and analyse.

```{r, echo=TRUE, eval=TRUE}
# Unpack files
location <- tempdir()
unzip(system.file('extdata', 'Mycoplasma.zip', package='FindMyFriends'),
      exdir=location)
genomeFiles <- list.files(location, full.names=TRUE, pattern='*.fasta')
```

## Creating a Pangenome object from a set of fasta files
The first step in performing a pangenomic analysis with FindMyFriends is to load
gene data into a Pangenome object. This is done with the aptly named `pangenome`
function, which takes a list of fasta files, a boolean indicating whether the
genes are translated and optionally information about the position of each gene
on the chromosome and a boolean indicating whether to read all sequences into 
memory. 

```{r, echo=TRUE, eval=TRUE}
library(FindMyFriends)
library(igraph) # some return values are igraph objects

mycoPan <- pangenome(genomeFiles[1:9], 
                     translated=TRUE, 
                     geneLocation='prodigal', 
                     lowMem=FALSE)
mycoPan
```

The last parameters of the function call warrants some more explanation. Some of 
the algorithms in FindMyFriends uses positional information and this must be
supplied with the sequenceInfo parameter. This can be done in a number of ways: 
As a data.frame with one row per gene, as a function that takes the header info 
of each gene from the fasta file and returns a data.frame with on row per gene, 
or as a string identifying the annotator used to find the gene (currently only 
Prodigal supported). The gene info has the following format (all columns 
required):

```{r, echo=TRUE, eval=TRUE}
head(geneLocation(mycoPan))
```

The contig column holds information on the contig/chromosome on which the gene 
is located, the start and stop columns gives the start and stop position of 
translation and the strand column identifies on which strand the gene is 
located. Additional columns are allowed, but is currently unused.

The last parameter (lowMem) specifies whether all sequences should be read into
memory (the default) or be kept as references to the original fasta files. The 
latter is useful in cases where the size of the sequence data becomes 
prohibitively large but is not needed for our little toy example.

Metadata about the organisms can be added and queried using the `addOrgInfo` 
method. For instance we might add some taxonomy data.

```{r, echo=TRUE, eval=TRUE}
orgMeta <- data.frame(Species = c("hyopneumoniae", "hyopneumoniae", 
                                  "hyopneumoniae", "pneumoniae", 
                                  "pneumoniae", "hyopneumoniae", 
                                  "hyopneumoniae", "hyopneumoniae", 
                                  "pneumoniae"))

mycoPan <- addOrgInfo(mycoPan, orgMeta)
head(orgInfo(mycoPan))
```

The raw sequences are easily accessible as an XStringSet object, or split by
organism into an XStringSetList object:

```{r, echo=TRUE, eval=TRUE}
genes(mycoPan)
genes(mycoPan, split = 'organism')
```

Apart from that the object really needs some grouping before there is anything
interesting to do with it.

## Object defaults
A lot of the methods in FindMyFriends contains similar parameters, and a lot of
these. To save typing when doing the analyses, these parameters can be set on a
per-object manner. At creation a set of defaults is already set, which can be 
queried and modified:

```{r, echo=TRUE, eval=TRUE}
# Query current defaults
head(defaults(mycoPan))

# Set a new default
defaults(mycoPan)$lowerLimit <- 0.6
```

Values supplied as arguments to a method will always override object defaults.

## Calculating pangenomes
FindMyFriends provide two approaches to creating the initial pangenome. Both of 
these are based on comparing normalised kmer counts between sequences, rather
than doing a blast between all combinations of strains. Because kmer 
computations are so much faster than blast, this results in an enormous speed 
increase compared to other current tools.

### All-vs-all kmer comparison
The first approach is very akin to the idea behind blasting all genes against
each other. The major difference is that instead of deriving a similarity score
from blasting two sequences against each other, the cosine similarity for the 
kmer feature vectors of the two sequences is used.

```{r, echo=TRUE, eval=TRUE}
mycoSim <- kmerSimilarity(mycoPan, lowerLimit=0.8, rescale=FALSE)
```

The result of `kmerSimilarity()` will be a similarity matrix which can then be
used to construct a grouping. FindMyFriends adopts a graph based approach, where
the similarity matrix is translated to an undirected, weighted graph and 
community detection is used to split nodes (genes) into groups. The two-pass
approach makes it possible to intercept the grouping and define your own 
algorithm though. Furthermore it makes it possible to define a similarity matrix
based on other computations than kmer cosine similarity (e.g. blast), and still
use the graph grouping from FindMyFriends.

```{r, eval=TRUE, echo=TRUE}
mycoPan <- graphGrouping(mycoPan, mycoSim)
mycoPan
```

FindMyFriends support all community detection algorithms implemented in iGraph,
but the Infomap algorithm in particular has recieved praise for its performance
(in areas other than pangenome grouping though). The MCL algorithm is 
unfortunately not available in iGraph so a direct comparison to OrthoMCL is not
possible, but a recent article comparing different community detection 
algorithms indicated better performance using infomap 
([Aldecoa & Marin, 2013](http://www.nature.com/srep/2013/130717/srep02216/full/srep02216.html)).

### Guided Pairwise Comparison
While using a kmer based approach instead of blast yields a huge speed 
improvement, it doesn't solve one of the biggest problems with pangenome 
calculation, mainly the way it scales. When doing all-vs-all comparisons a 
doubling in the number of elements will result in 4 times the number of 
computations. This means that as you continue to add new genomes the 
computational time will increase at a faster and faster rate. So while time has
been cut drastically by using kmers instead of blast, there will come a time 
when the computer will just give up.

FindMyFriends does away with this by introducing a new approach called Guided
Pairwise Comparison (GPC). It works by building a tree of the organisms in the
set, and recursively combine pangenomes of subtrees into new pangenomes. Each 
time two pangenomes are combined, representatives for each gene group in the two
pangenomes are chosen at random and compared to each other using kmer 
similarity. The resulting grouping is then propagated to genes in the two 
pangenomes. As the number of comparisons are related to the number of branching 
points on the tree, and the number of branching point is equal to the number of
organisms minus one, the algorithm is approximatly linear in scaling. True 
linearity would only exist in the case where the number of gene groups are 
constant for all generated pangenomes. As it is expected that the number of gene
groups increases as more and more distant pangenomes are merged there will be a 
small deviation from linearity.

```{r, eval=TRUE, echo=TRUE}
mycoPan <- gpcGrouping(mycoPan, lowerLimit=0.5)
mycoPan
```

It is possible to supply the tree yourself using the `tree` argument, but 
omitting it will result in the algorithm creating one itself based on kmer count
of the combined proteins for each organism. As can be seen, due to the nature of
the algorithm, grouping is done directly instead of the two-pass approach used
in the standard all-vs-all algorithm.

As the algorithm scales with the number of gene groups it is generally advised
to keep the grouping very liberal by providing a low lowerLimit (e.g. 0.5, the
default). This will result in fewer, more heterogenous groups, but this will be
resolved at a later stage (see Gene group splitting).

### Manual grouping
It is also possible to define the grouping of genes manually and thus import
results from other tools to use in the FindMyFriends framework. The supplied 
information can either be a vector of integers giving the group membership for 
each gene, or a list of vectors giving the indexes of genes for each group.

```{r, echo=TRUE, eval=TRUE}
# Compute grouping based on hierarchical clustering - only for illustrative
# purpose. Not advisable.
distMat <- as.matrix(1/mycoSim-1)
distMat[is.infinite(distMat)] <- 1e5
htree <- hclust(as.dist(distMat), method='ward.D2')
members <- cutree(htree, k=2000)

# Add membership manually
mycoPanMan <- manualGrouping(mycoPan, members)
mycoPanMan
```

## Pangenome post-processing
The initial results of either kmerSimilarity or GPC grouping can be further
processed in a number of ways. Most notably gene groups can be split based on 
the similarity of the neighborhood of each gene within the group. This allows
for splitting of paralogues into separate groups. Apart from that it is possible
to link paralogue gene groups together, and add new organisms while retaining 
the gene group information.

### Gene group splitting
If sequence location information has been provided it is possible to split up 
the gene groups created above into groups with similar neighborhood. The idea of
using neighborhood to split paralogues has been pursued by 
[PanOCT](http://www.ncbi.nlm.nih.gov/pubmed/22904089) too, but in a slightly 
different way. There the blast hit score is weighted by the similarity of the
neightborhood and incorporated into the initial grouping, whereas FindMyFriends
provide it as a separate and optional step. When splitting by neighborhood, 
FindMyFriends builds up a of all genes within each group based on a range of
calculations. Genes in the graph are only connected if they are not from the 
same organism, have a sequence similarity above a certain threshold and a
neighborhood similarity above a certain threshold. Based on this graph, maximal
cliques are detected and sorted by the quality of their edges, so that cliques
with high internal sequence and neighborhood similarity are prefered. Cliques
are then extracted as new groups until all members of the group has been 
assigned.

```{r, echo=TRUE, eval=TRUE}
mycoPan <- neighborhoodSplit(mycoPan, lowerLimit=0.75)
mycoPan
```

### Paralogue linking
While it is generally a boon to split up paralogue gene groups, as the genes 
might have different functionality within the cell (and probably be under 
different regulation), it can be nice to maintain a link between groups with a
certain similarity. Such a link can be obtained by calculating a kmer similarity
between representatives from each gene group. Once paralogue links have been 
created they can be used when extracting raw sequences and the information will 
be taken into account when calculating organism statistics. Furthermore it is 
possible to merge paralogues into true gene groups (effectively undoing the 
neighborhood splitting).

```{r, echo=TRUE, eval=TRUE}
mycoPan <- kmerLink(mycoPan, lowerLimit=0.8)

genes(mycoPan, split='paralogue')[[1]]

mycoPan
collapseParalogues(mycoPan, combineInfo='largest')
```

### Removing genes
Usually genes are detected automatically by programs suchs as Prodigal and 
Glimmer, but while these programs are generally good, they are not perfect. A
pangenome analysis can potentially reveal genes that, while annotated, are no
longer functioning due to frameshifts etc. These will be apparant as members of
gene groups that have vastly different length than the majority of the members.
Removing such genes will in general improve whatever downstream analysis that
the data will be used for. Using the `removeGene()` function it is possible to
remove single or sets of genes from your pangenome object.

```{r, echo=TRUE, eval=TRUE}
# Remove a gene by raw index
removeGene(mycoPan, ind=60)

# Remove the first organism by index
removeGene(mycoPan, organism=1)
# or by name
name <- orgNames(mycoPan)[1]
removeGene(mycoPan, organism=name)

# Remove the second member of the first gene group
removeGene(mycoPan, group=1, ind=2)
```

## Investigating the results
Having a nice structure around your pangenome data is just a part of a great
framework - having a set of analyses at hand for investigating the results is
another part. Luckily, by being part of Bioconductor and using the standard data
representations, a lot of the functionality already provided in Bioconductor is 
at your disposal. The two main data types that you might want to extract in 
order to do further analysis is the pangenome matrix and raw sequences:

```{r, echo=TRUE, eval=TRUE}
# Get the pangenome matrix as an ExpressionSet object
as(mycoPan, 'ExpressionSet')
# or as a regular matrix
as(mycoPan, 'matrix')[1:6, ]

# Get all genes split into gene groups
genes(mycoPan, split='group')
```

Apart from what is already available within Bioconductor, FindMyFriends also
comes with a set of tools to investigate the results further. Two of these 
calculates different statistics on the two main gene groupings in the dataset,
namely gene groups and organisms. These functions are called `groupStat()`and 
`orgStat()` and are very straightforward to use:

```{r, echo=TRUE, eval=TRUE}
groupStat(mycoPan)[[1]]

head(orgStat(mycoPan))
```

To get a very broad overview of your result `plotStat()` can help you:

```{r, echo=TRUE, eval=TRUE, fig.height=18, fig.width=9, fig.align='center'}
plotStat(mycoPan, color='Species', type='qual', palette=6)
```

Three other plot functions exists to let you asses general properties of the 
pangenome. `plotEvolution()` lets you follow the number of singleton, accessory
and core genes as the number of organisms increases. Generally this type of plot
is very biased towards the order of organisms, so while it is possible to supply
a progression of organisms, the default approach is to create bootstraped values
for the plot:

```{r, echo=TRUE, eval=TRUE, fig.height=6, fig.width=9, fig.align='center'}
plotEvolution(mycoPan)
```

`plotSimilarity()` creates a heatplot of the similarity matrix of the organisms
in the pangenome. The similarity of two organisms is here defined as either the
percent of shared genes or the cosine similarity of their total kmer feature
vector. The ordering can be done manually or automatically according to a 
hierachcal clustering:

```{r, echo=TRUE, eval=TRUE, fig.height=9, fig.width=9, fig.align='center'}
# Pangenome matrix similarity
plotSimilarity(mycoPan)
# Kmer similarity
plotSimilarity(mycoPan, type='kmer', kmerSize=5)
# No ordering
plotSimilarity(mycoPan, ordering='none')
```

`plotTree()` Plots a dendrogram of the organisms in the pangenome based either 
on the pangenome matrix or kmer counts. The tree can be augmented with 
additional information from either orgInfo or supplied manually:

```{r, echo=TRUE, eval=TRUE, fig.width=9, fig.height=9, fig.align='center'}
plotTree(mycoPan, clust='ward.D2', dist='minkowski')
plotTree(mycoPan, type='kmer', kmerSize=5, clust='ward.D2', 
         dist='cosine', circular=TRUE, info='Species') + 
    ggplot2::scale_color_brewer(type='qual', palette=6)
```

Apart from these general statistics it is possible to get the neighborhood of 
any given gene group as a weighted graph to further investigate how the up- and 
downstream genes are organised.

```{r, echo=TRUE, eval=TRUE}
library(igraph)
getNeighborhood(mycoPan, group=15, vicinity=5)
```

The resulting graph object can be plotted and handled by the functionality of 
the igraph package or plotted directly using FindMyFriends `plotNeighborhood()`
method, which applies appropriate styling to the plot, making it easier to 
interpret:

```{r, echo=TRUE, eval=TRUE, fig.align='center', fig.height=9, fig.width=9}
plotNeighborhood(mycoPan, group=15, vicinity=5)
```

### Panchromosomal analysis
While pangenomes are often envisioned as presence/absence matrices, this is not
the only way to organise the information. As chromosomal position is available
in most circumstances, the information can also be converted into a 
panchromosomal graph where each gene group is a vertex and chromosomal adjacency
is weighted edges. The graph will generally be very sparse, consisting of long 
strings of gene groups interspersed with regions of higher variability.

```{r, echo=TRUE, eval=TRUE}
panchromosome <- pcGraph(mycoPan)
plot(panchromosome, 
     layout=layout.drl(panchromosome, options = igraph.drl.coarsen),
     vertex.shape='none',
     vertex.label=NA)
```

This graph structure can be used for other things than random line drawing 
though. Often local regions of high variability can point to insertion/deletion
events, frameshifts, problems with the gene grouping (god forbid) or general
regions of high chromosomal plasticity. These regions can of course be detected
in FindMyFriends and investigated accordingly.

```{r, echo=TRUE, eval=TRUE}
localVar <- variableRegions(mycoPan, flankSize=6)
localVar[[1]]
plot(localVar[[1]]$graph)
```

## Dealing with huge datasets
Progress within computational efficiency is often followed by a quest to push
these new boundaries to the fullest. FindMyFriends have been designed with this
in mind and tries to cater to this as much as possible. The first option comes
at the time the pangenome object is created, where it is possible to specify 
that sequences should not be kept in memory. This lowered footprint comes at the
expense of slightly longer computational time as the sequence lookup is slower.
It can become a necessity as the number of organisms increases (to date the full
amount of sequenced E. coli weighs in just below 7 Gb in translated form). When
using the lowMem option some calculations are also repeated instead of keeping 
the result in memory - most notably kmer count during GPC. For the E. coli
example above the collective kmer count information in sparse format would 
occupy just below 50 Gb (with a kmer size of 5). Allocating this amount of 
memory (if possible) would often take longer than repeating a couple of 
computations. When utilizing parallelization (see below) this becomes even more
of an issue as the memory footprint should be multiplied be the number of 
threads used (provided you dont use a cluster but a single multicore computer).

### Parallelization
At certain computationally heavy steps it is possible to use parallelization 
based on the BiocParallel package interface. If a subclass of BiocParallelParam 
is supplied to either `kmerSimilarity`, `gpcGrouping`, `orgTree` or `kmerLink` 
the computations will be split out according to the supplied parameter. All of
these are embarrassingly parallel and thus easy to distribute evenly.

### Caching
When analysing huge number of organisms the computations can easily go on for
days. In that time a lot of things can go wrong. The computer can crash due to
prolonged high memory pressure, the power can be accidentally turned off etc. In
order to avoid having to redo everything from scratch in these instances, 
FindMyFriends employs a caching mechanism courtesy of the filehash package. At 
the moment the caching is only implemented for the GPC algorithm as this 
algorithm will be the natural choice for huge datasets anyway. It works by 
providing a path to a folder where the cached results can be kept to the cacheDB
argument in `gpcGrouping()`. It only works if the other parameters are kept 
across runs!

### Good strategies for long running analyses
There is a few good strategies to optimise the analysis when analysing huge sets
of organisms. In no particular order:

* Precalculate the tree when using GPC instead of letting the algorithm do it.
* Use caching.
* Use `lowMem=TRUE` when creating the pangenome object.
* Lower the lowerLimit compared to what you would normally do, and then rely on
neighborhood splitting to resolve paralogues. This will cut computational time.
* Use parallelization when available.
* Be patient.

## Session
```{r, echo=TRUE, eval=TRUE}
sessionInfo()
```