---
title: "RFLOMICS"
output: 
  BiocStyle::html_document:
        toc: true
        toc_depth : 2
        number_sections: true
date: "`r format(Sys.time(), '%B %d, %Y')`"
package: RFLOMICS
vignette: >
  %\VignetteIndexEntry{RFLOMICS}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
 
abstract: > 
  This document will go through the analysis of single and multi-omics data
  using the RFLOMICS package. 
  It will be shown how to set up the 
  statistical framework of the designed protocol: i.e. translate biological 
  condition comparisons into contrasts, perform quality control, data 
  pre-processing and statistical analysis of each omics dataset: i.e. find 
  differentially expressed genes/proteins/metabolites (DEF) per contrast, 
  identify clusters of co-expressed DEF and perform functional annotation 
  (over representation analysis)
  of the lists of DEF or clusters of co-expressed 
  DEF using GO, KEGG or a custom annotation. Then, the vignette will 
  show how to explore the common variance/correlation between the different 
  omics datasets and 
  simultaneously identify key features (genes/proteins/metabolites) that best 
  correlate with biological factors.
---
 

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

# Introduction

The acquisition of multi-omics data with complex experimental designs is a 
widely adopted practice to identify entities and unravel the biological 
processes they are involved in.

Exploring each omics layer usually serves as an initial step to investigate and
extract relevant biological variability. Statistical integration can then focus
on pertinent omics layers and features.

Most existing solutions for the analysis of such data are command-line based, 
which limits accessibility for many users. Handling many packages, methods and 
their results can become a burden even for experienced R-users. Interface solutions 
exist, like [ExploreModelMatrix](https://www.bioconductor.org/packages/ExploreModelMatrix), 
[pcaExplorer](https://www.bioconductor.org/packages/pcaExplorer)
[MatrixQCvis](https://bioconductor.org/packages/MatrixQCvis/)
or [ideal](https://bioconductor.org/packages/ideal/),
which each allows perform one particular step of the analysis, or
[MSstatsShiny](https://bioconductor.org/packages/MSstatsShiny/), 
which is designed for one particular omics type.

Here we present RFLOMICS, an R package with a Shiny interface, designed to ensure 
a guided, comprehensive analysis and visualization of transcriptomics, proteomics,
or metabolomics data. RFLOMICS covers the entire process from defining the statistical
model to multi-omics integration, all within a single application.

RFLOMICS uses extended classes of the [MultiAssayExperiment](https://bioconductor.org/packages/MultiAssayExperiment/) 
and [SummarizedExperiment](https://bioconductor.org/packages/SummarizedExperiment/)  
classes to ensure efficient data management of heterogeneous data.

A flexible HTML report can be generated at any stage of the analysis, providing 
results, methods, and settings to facilitate the analysis reproducibility. 
Additionally, users can download an archive containing all produced results.

# Install and run RFLOMICS

You can install the latest RFLOMICS development from github:

```{r fromGithub, message=FALSE}
# BiocManager::install("RFLOMICS/RFLOMICS")
```

Then you can access the interface with the following commands:

```{r loadRflomics, message=FALSE}
library(RFLOMICS)
```

# Guided tour of the interface

This part of the vignette will follow the different steps of a multi-omics 
analysis trough the RFLOMICS interface which are: 

- Data loading
- Statistical framework setting
- Single omics analysis of each dataset 
- Single omics results summary
- Multi-omics analysis (with the constraint of at least 2 pre-processed 
datasets)

They will appear in the left tab menu of the interface along with the analysis 
progression. Whereas single omics analysis steps will be displayed in tab panel 
associated to each dataset .

The interface can be accessed using:

```{r runInterface}
# runRFLOMICS()
```

## Load Data

Data loading starts by setting a project name (it is mandatory), then load the 
experimental design and finally the different omics datasets.

### Dataset example description

Data used for this vignette have been provided by Pr. Loic Rajjou and Gwendal 
Cueff. It is included in the 'ExamplesFiles' directory of the package.
Briefly, A. thaliana's transcriptome, metabolome and proteome have been obtained 
in the context of the study of seed germination and vigor. In particular, the 
authors were interested in deciphering key entities involved in response to 
environmental stresses (on the mother plant): influences of temperature 
(high, medium and low) and imbibition stage (Dry: DI, early imbibition: 
EI and late imbibition: LI). See(https://www.uibk.ac.at/botany/ecoseed/home/) 


### Load Experimental Design

The experimental design file must contain experimental information/conditions 
for each sample in a tab separated values format (see RFLOMICS-input-data.html).
As soon as this file is loaded, for each design factor, we have to set up the 
factor type (Biological factor: Bio; batch factor: batch; 
metadata information: meta) and a reference level.

* *It is required to have a minimum of 1 biological factor (maximum 3) and 1 batch factor. Order of the columns does not matter.*

* *It is possible to remove factor's conditions from the design. Associated samples will be excluded from the analysis even if they are present in the omics dataset matrix.*

* *It is also possible to re-order factor modalities* 

### Load Omics data

  Experimental data must be loaded as matrix files in a tab separated values format.
  (for more information see RFLOMICS-input-data.html).
  For each dataset it is mandatory to specify the type of omics (transcriptomics, 
  proteomics, metabolomics ) as it will adapt the data processing and analysis 
  methods. By default, datasets will be called set1, set2, set3, ..., respectively.
  A more suitable name can be given by changing 'Dataset name' in the interface.
 
 To load the 3 examples datasets of the Ecoseed project: RNAseq data (read counts), 
 proteomics data (abundance of proteins), and metabolomics data (intensity of metabolites) 
 (see RFLOMICS-input-data.html), please press the 'Load Ecoseed Data'.
 

As soon as the data are loaded the **RFLOMICS MultiAssayExperiment** object is 
created. The data overview panel displays the samples overlap across datasets 
and the number of datasets per combination of biological factors.

![ ](Images/01_loadData.png){width=80%}

## Set the statistical framework

 The interface provides models written from the simplest one to the most complete
 one. Only the second order interaction terms between the biological factors appear. 
 Batch factors never appear in interaction terms. In our example, the complete model 
 is chosen.

All 'a priori' hypotheses (contrasts) are then calculated from the chosen model 
and displayed. The user must select those that correspond to the biological questions.
Contrasts are numbered. These labels will follow for all analysis.

In our example we are studying the effect of temperature on seed germination 
(transition from the state of dormancy to the state of germination vigor). 
For that we select 3 average contrasts (see image):

This statistical framework will be applied on all loaded datasets.  

![ ](Images/02_setModel.png){width=80%}

## Single omics data analysis

Once the contrasts are chosen, 3 Item menus appear in the side bar menu that 
correspond to the 3 loaded omics types.

For each dataset, different steps of data analysis are proposed as a tab panel. 
These analysis steps must be performed sequentially. It is possible to switch 
between datasets.
However, you cannot perform a new task until the previous one is completed. 
A progress bar is displayed indicating the progress status.

### Data processing and quality check

In this tab panel we access a data quality control screen and we perform the 
appropriate data filtering and processing according to the type of omics.

By default, features with 0 count in all conditions are removed from the data.

Specifically for each data we have the possibility to exclude samples or to 
select a subset of samples to analyze, provided the completeness condition is met. 

#### Completeness check

 The design must be complete and as possible balanced. A **Complete design (mandatory)** 
 mean that all possible combinations of biological factors' levels (called groups) 
 are presents. A **Balanced design (recommended)** has an equal number of 
 observations (replicates) for each group.

These constraints are checked and represented as graph (each square represent 1 
group and each color level indicate the number of observations). 

```{r, fig.show="hold", out.width="40%", fig.align = "center", echo=FALSE}
knitr::include_graphics(c("Images/00_completeDesign.png","Images/00_nonBalancedDesign.png"))
```

The effect of data filtering or sample removing step can be explored thanks to 
many QC graphs. For each QC, a graph is built on each raw data and filtered data 
to allow the comparison.


#### Transcriptomics data

For RNAseq data, low expressed genes are filtered so that only genes whose cpm 
expression is greater than **"CPM cutoff"** in x samples are kept. The value of 
x is given by **"Filtering strategy"** (NbOfsample_over_cpm <= NbConditions). 
By default the data are filtered based on NbCondition strategy with CPM cutoff 
equal to 5. In this example, there are 9 conditions. The genes 
with a CPM less than 5 in at least 9 samples are removed.

To correct for differences in sequencing depths (library size), RFLOMICS proposes
one method for normalization, **"TMM method"** from the edgeR package.

For RNAseq data the QC graphs are drawn before and after the filtering and 
normalization steps. The following graphs are displayed:

- The log2(count) distribution per sample: Expect aligned medians. Samples with 
shifted median may be outliers.
- The library size distribution which is the distribution of the total number of 
reads per sample.
- The density distribution of log2(count) per sample: Expect a unimodal gaussian 
density distribution.
- An interactive Principal Component Analysis allowing to color samples by any 
factor of the design to decode the main sources of variation in the data.

![ ](Images/03_DataProcess.png){width=80%}


### Differential expression analysis

In this tab panel, we can run the differential expression analysis. The analysis
is run by contrast.  
For RNAseq data, the **glmFit()** function from the **edgeR** package is used 
whereas for proteomics and metabolomics data, the **lmFit()** function from **limma**
packages is used.


Results will appear in two components: a scrolling menu will give the results of 
the analysis by contrast and, at the bottom of the panel, the intersecting sets of 
the lists of DEG is given thanks to the **UpSetR()** function.  

Each results has to be validated:

- You **have to** to check the Pvalue distribution for each contrast and validate 
it (**ok checkbox**). 
  + If the distribution is uniform: it is ok. A few genes are DE and the FDR will
  find them.
  + If the distribution as a peak near 0 and then is uniform: it is also ok. The
  higher the peak, the higher the number of DE genes. FDR correction can be applied.
  + If the distribution has 2 peaks one near 0 and one near 1 or just one peak near 1, 
  it is not a good news! You have to understand what's happened.

- You can filter differentially expressed genes/proteins/metabolites either by 
fold change (**|FC|**) or by False Discovery Rate (BH) cut-off. Graphs will be 
automatically updated.

- You have to **validate** to save the cut-off thresholds but also the contrasts
result selection  and pass to the co-expression analysis or to the annotation 
enrichment step.
 
![ ](Images/04_diffAnal.png){width=80%}
 
### Coexpression analysis

* In this tab panel, choice is given to merge (**union**) or intersect  
(**intersection**) the lists of differentially expressed genes/proteins/metabolites 
associated to contrasts.

* Co-expression analysis is done using to the **coseq** package with a set of 
optimal parameters.

  - For **RNAseq**, data transformation method is set to the **arcsin** function 
  and normalization method to **TMM**. No scaling is done for these data. 
  **gaussian mixture model** is used to decipher the different profiles of gene 
  expression with a set of parameters to estimate fixed to (Gaussian_pK_Lk_Ck).
  - For **proteomics/metabolomics** data, scaling is done on proteins/metabolites.
  Another sets of parameters estimation is also proposed for the modeling of gene 
  expression profile: (Gaussian_pK_Lk_Bk). It has to be used, if the other one 
  doesn't fit.


The number of technical replicates to perform (**iter**) for each **K** which is 
the number of expressions' profile into which
the DEG have to be cut, can be first set to 2 to precise the range of K and then 
put to a minimum of 20 replicates per K.


* Results description (see [CoSeq](https://www.bioconductor.org/packages/release/bioc/vignettes/coseq/inst/doc/coseq.html))

  - Results can be explored with several plots proposed by **coseq**. 
  The ICL graph has been slightly modified to show all the range of the ICL 
  values for a given K. 
   
  - A table of jobs summary is also given. It groups error messages by K.  

![ ](Images/05_coExpAnal.png){width=80%}

### Annotation and Enrichment analysis using clusterProfiler

 We select the lists of DE genes or clusters to annotate. All available lists 
 are selected by default. 

 The enrichment analysis is performed using the [clusterProfiler R-package](https://www.sciencedirect.com/science/article/pii/S2666675821000667).
 For more information on the methods and the package, please see 
 [the article](https://www.sciencedirect.com/science/article/pii/S2666675821000667), 
 [Bioconductor Package](https://bioconductor.org/packages/release/bioc/html/clusterProfiler.html) 
 and the [clusterProfiler Vignette](https://yulab-smu.top/biomedical-knowledge-mining-book/). 

You can chose between several domains, depending on several parameters: 

* custom: preferred option. It requires an annotation file (tabulated, at least 
two columns, one for the name of the gene, the other for the name or id of the 
annotation term). A second panel will allow the user to indicate which columns 
indicates what. You can have as many domains/ontology as you want, you just need 
to indicate the columns name that differentiate them in the right area;
* GO (or GO:BP, GO:CC, GO:MF if only one GO domain is of interest). The user will 
have to chose a database library ([org.*.db](https://www.bioconductor.org/packages/release/BiocViews.html#___OrgDb)) 
installed on the libPath and the keytype of the entities (identifier to make the 
connection with the annotation). 
* KEGG: In the setting panel, enter the three letters to identify the organism 
and the keytype of the entities. It requires to have an online connection. 

Once everything is set and the analysis is conducted, several panels are displayed 
to show the results. 

![Results of the analysis for multiple lists](Images/06_ORAAnal.png){width=80%}

### Dataset analysis summary 

 This panel summarizes the effects of the filtering steps on the selected datasets
 and the results of the differential analyses. The first row shows how many 
 samples and entities are left in each table. The integration will be performed 
 on these datasets. The second row shows the number of DE entities for each 
 selected contrast and for each table, with the distinction between up-regulated 
 and down-regulated entities.  


## Integration

 Once at least two tables have passed the pre-processing step, the Integration 
 tab will appear in the side bar menu. Two sub menus are available. For each of 
 them, you will have to run a preprocessing before accessing to the integration 
 panel: select the datasets you want to integrate together and the filtering 
 strategy for each data. A filtering is strongly advised when having large data 
 or unbalanced number of features between them with the goals to homogenize the 
 number of features between the datasets.
 
 You can filter according to the DE entities found in the differential analysis, 
 or based on the variation coefficient, taking a certain number of variables 
 that have the largest coefficient. For this example, we chose to keep only 500
 features from the RNAseq data, based on the variation coefficient.
 
 RNAseq data will be transformed before the integration. This is done with the 
 **voom** transformation from the **limma** R package.   

 Batch effects are removed before applying the function, using 
 **removebatcheffect** from **limma**.
 
 Features are then centered before the integration.

![](Images/07_prepData.png){width=100%}

### MOFA

 This integration analysis is performed using the **MOFA2** R-package. It is an 
 unsupervised method that searches for latent factors to decompose the variance 
 of the data, similarly to a PCA, but applied to a multi-dataset situation. The 
 interpretation of the results is also very similar to those of a PCA: the most
 interesting factors are the one explaining more variance, whether it's on 
 several dataset or just one table. It also gives weights to the features of 
 each table, for each factor. This is interesting in case a factor can be 
 **a posteriori** linked to a covariate (in the case displayed on the figure, 
 the temperature seems linked to the first factor, the features used to build
 this factor are to be considered very relevant in the analysis).  

![](Images/07_MOFA.png){width=80%}

 The user can chose to scale or not each tables, with the scale views argument,
 then set the number of factors to be computed and finally the maximum 
 iterations for the MOFA run. We recommend that the user do not change the 
 number of iterations.  Once everything is set, you can run the analysis by 
 clicking on the run button. 
 
Several figure are provided by the interface once the analysis is done.

### MixOmics

 This integration analysis is performed using the **MixOmics** R-package. 
 In RFLOMICS, we focused on the supervised analyses using the functions 
 block.(s)plda (also called DIABLO) from the package. Only complete cases 
 will be used for the analysis. The user can select all biological factors 
 as response variables. The meta and  batch factors are not available.

 Using Sparse Analysis will set the "s" in the function names. It this case, 
 the tuning 'cases' argument will be considered and passed to tune.splsda or 
 tune.block.splsda: it determines the number of features selection to be tested. 

 In RFLOMICS, we chose to only ask for a number of cases. This number of cases 
 is used to determine the keep.X argument, using a span of nfeatures/number of 
 cases. By example, if the user chose 5 cases and has 2 datasets with 
 respectively 500 and 250 features, the function will test 25 combinations of 
 features per component. For each datasets, features will be picked-up in a 
 cumulative manner ie for the DS1: 100,200,300,400,500 and 
 DS2: 50, 100, 150, 200, 250. An example of combination will be 
 c(DS1=100,DS2=50). The best combination of results by components will be used 
 by mixOmics block.splsda.

 Once everything is set, you can run the analysis by clicking on the run button. 
Several figures are provided by the interface once the analysis is done, for 
each selected response variable.  

![](Images/07_mixOmics.png){width=100%} 


# Exploring results from RFLOMICS

Weather the results are obtained as a .RData object downloaded from the 
interface or directly using the command line functions, the structure of 
the results is the same and visualization functions 
(from RFLOMICS, MultiAssayExperiment or SummarizedExperiment) 
can be used to further investigate the analyses results
and produce suitable figures for an article. 
Most visualization functions from RFLOMICS return a ggplot object and can be
modified using the ggplot2 functions.

Each of the analysis step has their own plot and getters methods. 

# Using RFLOMICS without the interface

All the pipeline, results and figures produced in the interface can be obtained 
by running methods directly in the R console. 

## Load example data

Ecoseed data can be accessed using the following command:

```{r}
data("ecoseed.mae")
```

## Create a RflomicsMAE object

The first step is to create the object containing all the data that can be 
used by RFLOMICS.
This requires a project name, the datasets to be analyzed, a design matrix and 
information for each factor: the type (biological, batch or meta-factor) and a
reference level, all in the form of data.frames. 
The reference level must be set with a particular care, as it will be used to
define all contrasts in the differential analysis. 

```{r}
factorInfo <- data.frame(
  "factorName"   = c("Repeat", "temperature", "imbibition"),
  "factorType"   = c("batch", "Bio", "Bio")
)
```

```{r createRMAE}
testObject <- createRflomicsMAE(
  projectName = "VignetteProject",
  omicsData   = ecoseed.mae,
  omicsTypes  = c("RNAseq","proteomics","metabolomics"),
  factorInfo  = factorInfo
)
```

```{r}
class(testObject)
```

RFLOMICS uses a structure derived from the [MultiAssayExperiment](https://www.bioconductor.org/packages/release/bioc/html/MultiAssayExperiment.html) 
package. 
Each dataset is stored as a RflomicsSE (derived from [SummarizedExperiment](https://bioconductor.org/packages/release/bioc/html/SummarizedExperiment.html))
in a RflomicsMAE object.
All methods from the MultiAssayExperiment package are also available for RFLOMICS objects. 

```{r}
is(testObject, "MultiAssayExperiment")
```

```{r}
upsetSamples(testObject)
```

# Session Info

```{r}
sessionInfo()
```