--- title: "Creating pangenomes using FindMyFriends" author: "Thomas Lin Pedersen" date: "June 12th 2015" output: rmarkdown::html_vignette: css: styles.css vignette: > %\VignetteEngine{knitr::rmarkdown} %\VignetteIndexEntry{Creating pangenomes using FindMyFriends} %\usepackage[utf8]{inputenc} --- * * * FindMyFriends is an R package for creating pangenomes of large sets of microbial organisms, together with tools to investigate the results. The number of tools to create pangenomes is large and some of the more notable ones are OrthoMCL, PanOCT, inParanoid and Sybil. Within the R framework micropan (available from CRAN) is currently the only other entry, but the scope of FindMyFriends is much broader. Within comparative microbiol genomics, a pangenome is defined as a collection of all the genes present in a set of related organisms and grouped together by some sort of homology. This homology measure has mainly been some sort of blast derived similarity, but as you shall see here, this is not the only possibility. Pangenomes have many uses, both as organisational tools and as a mean to get insight into the collective genomic information present within a group of organisms. An example of the former is to ensure an equal functional annotation of all genes within a set of genomes by annotating pangenome gene groups rather than single genes. An example of the latter is to use a pangenome together with phenotypic information to discover new links between genes and phenotypes. Usually pangenomes are calculated from the result of blasting all genes in the organisms under investigation against each other. This step is very time-consuming due to the complexity of the BLAST algorithm. FindMyFriends is one of the few tools that take a different approach and instead relies on an alignment free methodology. In FindMyFriends gene similarities are calculated using kmer feature vectors. A kmer feature vector is a decomposition of a sequence into substrings of length k. ``` GATTCGATTAG -> ATT: 2 CGA: 1 GAT: 2 TAG: 1 TCG: 1 TTA: 1 TTC: 1 ``` With this decomposition the problem of calculating sequence similarity is reduced to a vector similarity problem, for which there are many different solution. The one employed in FindMyFriends is called cosine similarity, which is effectively the cosine to the angle between the two vectors. If the vector space is strictly positive then the value is bound between 0 and 1 with 1 being 100 % similarity. How to convert similarity measures into groupings is again a problem with many solutions. FindMyFriends opts for a for a graph theory solution, using the infomap community detection algorithm. This is similar in theory to OrthoMCL, though the algorithm differs (Infomap vs. Markov Clustering). It should be noted that the choice of community detection algorithm is open in FindMyFriends as all algorithms implemented within the igraph framework is available - Infomap is highly adviced though. There is a lot more to FindMyFriends than this though - some of it will be discussed in the following. ## Data to use in this vignette This vignette will use a collection of 10 different *Mycoplasma pneumonia* and *hyopneumonia* strains. These organisms have an appreciable small genome making it easier to distribute and analyse. ```{r, echo=TRUE, eval=TRUE} # Unpack files location <- tempdir() unzip(system.file('extdata', 'Mycoplasma.zip', package='FindMyFriends'), exdir=location) genomeFiles <- list.files(location, full.names=TRUE, pattern='*.fasta') ``` ## Creating a Pangenome object from a set of fasta files The first step in performing a pangenomic analysis with FindMyFriends is to load gene data into a Pangenome object. This is done with the aptly named `pangenome` function, which takes a list of fasta files, a boolean indicating whether the genes are translated and optionally information about the position of each gene on the chromosome and a boolean indicating whether to read all sequences into memory. ```{r, echo=TRUE, eval=TRUE} library(FindMyFriends) library(igraph) # some return values are igraph objects mycoPan <- pangenome(genomeFiles[1:9], translated=TRUE, geneLocation='prodigal', lowMem=FALSE) mycoPan ``` The last parameters of the function call warrants some more explanation. Some of the algorithms in FindMyFriends uses positional information and this must be supplied with the sequenceInfo parameter. This can be done in a number of ways: As a data.frame with one row per gene, as a function that takes the header info of each gene from the fasta file and returns a data.frame with on row per gene, or as a string identifying the annotator used to find the gene (currently only Prodigal supported). The gene info has the following format (all columns required): ```{r, echo=TRUE, eval=TRUE} head(geneLocation(mycoPan)) ``` The contig column holds information on the contig/chromosome on which the gene is located, the start and stop columns gives the start and stop position of translation and the strand column identifies on which strand the gene is located. Additional columns are allowed, but is currently unused. The last parameter (lowMem) specifies whether all sequences should be read into memory (the default) or be kept as references to the original fasta files. The latter is useful in cases where the size of the sequence data becomes prohibitively large but is not needed for our little toy example. Metadata about the organisms can be added and queried using the `addOrgInfo` method. For instance we might add some taxonomy data. ```{r, echo=TRUE, eval=TRUE} orgMeta <- data.frame(Species = c("hyopneumoniae", "hyopneumoniae", "hyopneumoniae", "pneumoniae", "pneumoniae", "hyopneumoniae", "hyopneumoniae", "hyopneumoniae", "pneumoniae")) mycoPan <- addOrgInfo(mycoPan, orgMeta) head(orgInfo(mycoPan)) ``` The raw sequences are easily accessible as an XStringSet object, or split by organism into an XStringSetList object: ```{r, echo=TRUE, eval=TRUE} genes(mycoPan) genes(mycoPan, split = 'organism') ``` Apart from that the object really needs some grouping before there is anything interesting to do with it. ## Object defaults A lot of the methods in FindMyFriends contains similar parameters, and a lot of these. To save typing when doing the analyses, these parameters can be set on a per-object manner. At creation a set of defaults is already set, which can be queried and modified: ```{r, echo=TRUE, eval=TRUE} # Query current defaults head(defaults(mycoPan)) # Set a new default defaults(mycoPan)$lowerLimit <- 0.6 ``` Values supplied as arguments to a method will always override object defaults. ## Calculating pangenomes FindMyFriends provide two approaches to creating the initial pangenome. Both of these are based on comparing normalised kmer counts between sequences, rather than doing a blast between all combinations of strains. Because kmer computations are so much faster than blast, this results in an enormous speed increase compared to other current tools. ### All-vs-all kmer comparison The first approach is very akin to the idea behind blasting all genes against each other. The major difference is that instead of deriving a similarity score from blasting two sequences against each other, the cosine similarity for the kmer feature vectors of the two sequences is used. ```{r, echo=TRUE, eval=TRUE} mycoSim <- kmerSimilarity(mycoPan, lowerLimit=0.8, rescale=FALSE) ``` The result of `kmerSimilarity()` will be a similarity matrix which can then be used to construct a grouping. FindMyFriends adopts a graph based approach, where the similarity matrix is translated to an undirected, weighted graph and community detection is used to split nodes (genes) into groups. The two-pass approach makes it possible to intercept the grouping and define your own algorithm though. Furthermore it makes it possible to define a similarity matrix based on other computations than kmer cosine similarity (e.g. blast), and still use the graph grouping from FindMyFriends. ```{r, eval=TRUE, echo=TRUE} mycoPan <- graphGrouping(mycoPan, mycoSim) mycoPan ``` FindMyFriends support all community detection algorithms implemented in iGraph, but the Infomap algorithm in particular has recieved praise for its performance (in areas other than pangenome grouping though). The MCL algorithm is unfortunately not available in iGraph so a direct comparison to OrthoMCL is not possible, but a recent article comparing different community detection algorithms indicated better performance using infomap ([Aldecoa & Marin, 2013](http://www.nature.com/srep/2013/130717/srep02216/full/srep02216.html)). ### Guided Pairwise Comparison While using a kmer based approach instead of blast yields a huge speed improvement, it doesn't solve one of the biggest problems with pangenome calculation, mainly the way it scales. When doing all-vs-all comparisons a doubling in the number of elements will result in 4 times the number of computations. This means that as you continue to add new genomes the computational time will increase at a faster and faster rate. So while time has been cut drastically by using kmers instead of blast, there will come a time when the computer will just give up. FindMyFriends does away with this by introducing a new approach called Guided Pairwise Comparison (GPC). It works by building a tree of the organisms in the set, and recursively combine pangenomes of subtrees into new pangenomes. Each time two pangenomes are combined, representatives for each gene group in the two pangenomes are chosen at random and compared to each other using kmer similarity. The resulting grouping is then propagated to genes in the two pangenomes. As the number of comparisons are related to the number of branching points on the tree, and the number of branching point is equal to the number of organisms minus one, the algorithm is approximatly linear in scaling. True linearity would only exist in the case where the number of gene groups are constant for all generated pangenomes. As it is expected that the number of gene groups increases as more and more distant pangenomes are merged there will be a small deviation from linearity. ```{r, eval=TRUE, echo=TRUE} mycoPan <- gpcGrouping(mycoPan, lowerLimit=0.5) mycoPan ``` It is possible to supply the tree yourself using the `tree` argument, but omitting it will result in the algorithm creating one itself based on kmer count of the combined proteins for each organism. As can be seen, due to the nature of the algorithm, grouping is done directly instead of the two-pass approach used in the standard all-vs-all algorithm. As the algorithm scales with the number of gene groups it is generally advised to keep the grouping very liberal by providing a low lowerLimit (e.g. 0.5, the default). This will result in fewer, more heterogenous groups, but this will be resolved at a later stage (see Gene group splitting). ### Manual grouping It is also possible to define the grouping of genes manually and thus import results from other tools to use in the FindMyFriends framework. The supplied information can either be a vector of integers giving the group membership for each gene, or a list of vectors giving the indexes of genes for each group. ```{r, echo=TRUE, eval=TRUE} # Compute grouping based on hierarchical clustering - only for illustrative # purpose. Not advisable. distMat <- as.matrix(1/mycoSim-1) distMat[is.infinite(distMat)] <- 1e5 htree <- hclust(as.dist(distMat), method='ward.D2') members <- cutree(htree, k=2000) # Add membership manually mycoPanMan <- manualGrouping(mycoPan, members) mycoPanMan ``` ## Pangenome post-processing The initial results of either kmerSimilarity or GPC grouping can be further processed in a number of ways. Most notably gene groups can be split based on the similarity of the neighborhood of each gene within the group. This allows for splitting of paralogues into separate groups. Apart from that it is possible to link paralogue gene groups together, and add new organisms while retaining the gene group information. ### Gene group splitting If sequence location information has been provided it is possible to split up the gene groups created above into groups with similar neighborhood. The idea of using neighborhood to split paralogues has been pursued by [PanOCT](http://www.ncbi.nlm.nih.gov/pubmed/22904089) too, but in a slightly different way. There the blast hit score is weighted by the similarity of the neightborhood and incorporated into the initial grouping, whereas FindMyFriends provide it as a separate and optional step. When splitting by neighborhood, FindMyFriends builds up a of all genes within each group based on a range of calculations. Genes in the graph are only connected if they are not from the same organism, have a sequence similarity above a certain threshold and a neighborhood similarity above a certain threshold. Based on this graph, maximal cliques are detected and sorted by the quality of their edges, so that cliques with high internal sequence and neighborhood similarity are prefered. Cliques are then extracted as new groups until all members of the group has been assigned. ```{r, echo=TRUE, eval=TRUE} mycoPan <- neighborhoodSplit(mycoPan, lowerLimit=0.75) mycoPan ``` ### Paralogue linking While it is generally a boon to split up paralogue gene groups, as the genes might have different functionality within the cell (and probably be under different regulation), it can be nice to maintain a link between groups with a certain similarity. Such a link can be obtained by calculating a kmer similarity between representatives from each gene group. Once paralogue links have been created they can be used when extracting raw sequences and the information will be taken into account when calculating organism statistics. Furthermore it is possible to merge paralogues into true gene groups (effectively undoing the neighborhood splitting). ```{r, echo=TRUE, eval=TRUE} mycoPan <- kmerLink(mycoPan, lowerLimit=0.8) genes(mycoPan, split='paralogue')[[1]] mycoPan collapseParalogues(mycoPan, combineInfo='largest') ``` ### Removing genes Usually genes are detected automatically by programs suchs as Prodigal and Glimmer, but while these programs are generally good, they are not perfect. A pangenome analysis can potentially reveal genes that, while annotated, are no longer functioning due to frameshifts etc. These will be apparant as members of gene groups that have vastly different length than the majority of the members. Removing such genes will in general improve whatever downstream analysis that the data will be used for. Using the `removeGene()` function it is possible to remove single or sets of genes from your pangenome object. ```{r, echo=TRUE, eval=TRUE} # Remove a gene by raw index removeGene(mycoPan, ind=60) # Remove the first organism by index removeGene(mycoPan, organism=1) # or by name name <- orgNames(mycoPan)[1] removeGene(mycoPan, organism=name) # Remove the second member of the first gene group removeGene(mycoPan, group=1, ind=2) ``` ## Investigating the results Having a nice structure around your pangenome data is just a part of a great framework - having a set of analyses at hand for investigating the results is another part. Luckily, by being part of Bioconductor and using the standard data representations, a lot of the functionality already provided in Bioconductor is at your disposal. The two main data types that you might want to extract in order to do further analysis is the pangenome matrix and raw sequences: ```{r, echo=TRUE, eval=TRUE} # Get the pangenome matrix as an ExpressionSet object as(mycoPan, 'ExpressionSet') # or as a regular matrix as(mycoPan, 'matrix')[1:6, ] # Get all genes split into gene groups genes(mycoPan, split='group') ``` Apart from what is already available within Bioconductor, FindMyFriends also comes with a set of tools to investigate the results further. Two of these calculates different statistics on the two main gene groupings in the dataset, namely gene groups and organisms. These functions are called `groupStat()`and `orgStat()` and are very straightforward to use: ```{r, echo=TRUE, eval=TRUE} groupStat(mycoPan)[[1]] head(orgStat(mycoPan)) ``` To get a very broad overview of your result `plotStat()` can help you: ```{r, echo=TRUE, eval=TRUE, fig.height=18, fig.width=9, fig.align='center'} plotStat(mycoPan, color='Species', type='qual', palette=6) ``` Three other plot functions exists to let you asses general properties of the pangenome. `plotEvolution()` lets you follow the number of singleton, accessory and core genes as the number of organisms increases. Generally this type of plot is very biased towards the order of organisms, so while it is possible to supply a progression of organisms, the default approach is to create bootstraped values for the plot: ```{r, echo=TRUE, eval=TRUE, fig.height=6, fig.width=9, fig.align='center'} plotEvolution(mycoPan) ``` `plotSimilarity()` creates a heatplot of the similarity matrix of the organisms in the pangenome. The similarity of two organisms is here defined as either the percent of shared genes or the cosine similarity of their total kmer feature vector. The ordering can be done manually or automatically according to a hierachcal clustering: ```{r, echo=TRUE, eval=TRUE, fig.height=9, fig.width=9, fig.align='center'} # Pangenome matrix similarity plotSimilarity(mycoPan) # Kmer similarity plotSimilarity(mycoPan, type='kmer', kmerSize=5) # No ordering plotSimilarity(mycoPan, ordering='none') ``` `plotTree()` Plots a dendrogram of the organisms in the pangenome based either on the pangenome matrix or kmer counts. The tree can be augmented with additional information from either orgInfo or supplied manually: ```{r, echo=TRUE, eval=TRUE, fig.width=9, fig.height=9, fig.align='center'} plotTree(mycoPan, clust='ward.D2', dist='minkowski') plotTree(mycoPan, type='kmer', kmerSize=5, clust='ward.D2', dist='cosine', circular=TRUE, info='Species') + ggplot2::scale_color_brewer(type='qual', palette=6) ``` Apart from these general statistics it is possible to get the neighborhood of any given gene group as a weighted graph to further investigate how the up- and downstream genes are organised. ```{r, echo=TRUE, eval=TRUE} library(igraph) getNeighborhood(mycoPan, group=15, vicinity=5) ``` The resulting graph object can be plotted and handled by the functionality of the igraph package or plotted directly using FindMyFriends `plotNeighborhood()` method, which applies appropriate styling to the plot, making it easier to interpret: ```{r, echo=TRUE, eval=TRUE, fig.align='center', fig.height=9, fig.width=9} plotNeighborhood(mycoPan, group=15, vicinity=5) ``` ### Panchromosomal analysis While pangenomes are often envisioned as presence/absence matrices, this is not the only way to organise the information. As chromosomal position is available in most circumstances, the information can also be converted into a panchromosomal graph where each gene group is a vertex and chromosomal adjacency is weighted edges. The graph will generally be very sparse, consisting of long strings of gene groups interspersed with regions of higher variability. ```{r, echo=TRUE, eval=TRUE} panchromosome <- pcGraph(mycoPan) plot(panchromosome, layout=layout.drl(panchromosome, options = igraph.drl.coarsen), vertex.shape='none', vertex.label=NA) ``` This graph structure can be used for other things than random line drawing though. Often local regions of high variability can point to insertion/deletion events, frameshifts, problems with the gene grouping (god forbid) or general regions of high chromosomal plasticity. These regions can of course be detected in FindMyFriends and investigated accordingly. ```{r, echo=TRUE, eval=TRUE} localVar <- variableRegions(mycoPan, flankSize=6) localVar[[1]] plot(localVar[[1]]$graph) ``` ## Dealing with huge datasets Progress within computational efficiency is often followed by a quest to push these new boundaries to the fullest. FindMyFriends have been designed with this in mind and tries to cater to this as much as possible. The first option comes at the time the pangenome object is created, where it is possible to specify that sequences should not be kept in memory. This lowered footprint comes at the expense of slightly longer computational time as the sequence lookup is slower. It can become a necessity as the number of organisms increases (to date the full amount of sequenced E. coli weighs in just below 7 Gb in translated form). When using the lowMem option some calculations are also repeated instead of keeping the result in memory - most notably kmer count during GPC. For the E. coli example above the collective kmer count information in sparse format would occupy just below 50 Gb (with a kmer size of 5). Allocating this amount of memory (if possible) would often take longer than repeating a couple of computations. When utilizing parallelization (see below) this becomes even more of an issue as the memory footprint should be multiplied be the number of threads used (provided you dont use a cluster but a single multicore computer). ### Parallelization At certain computationally heavy steps it is possible to use parallelization based on the BiocParallel package interface. If a subclass of BiocParallelParam is supplied to either `kmerSimilarity`, `gpcGrouping`, `orgTree` or `kmerLink` the computations will be split out according to the supplied parameter. All of these are embarrassingly parallel and thus easy to distribute evenly. ### Caching When analysing huge number of organisms the computations can easily go on for days. In that time a lot of things can go wrong. The computer can crash due to prolonged high memory pressure, the power can be accidentally turned off etc. In order to avoid having to redo everything from scratch in these instances, FindMyFriends employs a caching mechanism courtesy of the filehash package. At the moment the caching is only implemented for the GPC algorithm as this algorithm will be the natural choice for huge datasets anyway. It works by providing a path to a folder where the cached results can be kept to the cacheDB argument in `gpcGrouping()`. It only works if the other parameters are kept across runs! ### Good strategies for long running analyses There is a few good strategies to optimise the analysis when analysing huge sets of organisms. In no particular order: * Precalculate the tree when using GPC instead of letting the algorithm do it. * Use caching. * Use `lowMem=TRUE` when creating the pangenome object. * Lower the lowerLimit compared to what you would normally do, and then rely on neighborhood splitting to resolve paralogues. This will cut computational time. * Use parallelization when available. * Be patient. ## Session ```{r, echo=TRUE, eval=TRUE} sessionInfo() ```