---
title: "RFLOMICS Command line"
output: 
  BiocStyle::html_document:
        toc: true
        toc_depth : 4
        number_sections: true
date: "`r format(Sys.time(), '%B %d, %Y')`"
package: RFLOMICS
vignette: >
  %\VignetteIndexEntry{RFLOMICS Command line}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, echo = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  warning = FALSE,
  message = FALSE,
  echo = TRUE,
  comment = "#>"
)
```


# Introduction

RFLOMICS is a R package that provides a complete workflow for the analysis 
of multi-omics data within complex experimental designs framework. 
The package was originally designed to be used through a user interface 
(see `RFLOMICS` vignette), but has been developed in an object oriented 
programming with a limited number of classes and methods. 
This allow to run data analysis as command line in a workflow with the "\|\>" 
operator, each step being a method of the extended 
[MultiAssayExperiment](https://bioconductor.org/packages/MultiAssayExperiment/) 
or [SummarizedExperiment](https://bioconductor.org/packages/SummarizedExperiment/) 
classes.


# The RflomicsMAE and RflomicsSE classes

To ensure efficient data management the 
[MultiAssayExperiment](https://bioconductor.org/packages/MultiAssayExperiment/) 
and [SummarizedExperiment](https://bioconductor.org/packages/SummarizedExperiment/) 
classes were extended to fix an organization for the metadata slot. 
The statistical design and analysis parameters of the multi-omics data integration 
are stored in the `metadata` slot of the `RflomicsMAE` class, while the parameters 
and results of the single-omics are stored in the `metadata` slot of `RflomicsSE` 
objects of the `ExperimentList`.

This organization can be illustrated by presenting the workflow of the RFLOMICS 
package.

The workflow is structured in three main parts:

-   The first consists of loading the experimental design and omics data, 
setting up the statistical model and defining the contrasts (hypotheses) 
associated with the biological questions. The design list of the metadata 
slot of the RflomicsMAE object stores this statistical settings 
(in green in Fig 1).

![Fig 1: Organization of the metadata slot of the RflomicsMAE object](Images/DM_RflomicsMAE-design.png){width="100%"}

-   The second part allows a complete single-omics analysis for each dataset. 
The raw and processed omic datasets are stored in the ExperimentList as 
RflomicsSE. For each analysis step, settings and results are stored 
(in green in Fig 2)

    1.  The DataProcessing list in the metadata slot contains the parameters 
    and results of pre-processing including filtering, normalization and 
    transformation

    2.  The DiffExpAnal list in the metadata slot contains the parameters 
    and results of the differential expression analysis.

    3.  The CoExpAnal list in the metadata slot stores the parameters and 
    results of the coexpression analysis on differentially expressed entities.

    4.  DiffExpEnrichAnal and CoExpEnrichAnal lists in the metadata slot store 
    the results and parameters of the functional enrichment analysis for the 
    differential expression and the coexpression analysis, respectively.


![Fig 2: Organization of the metadata slot of the RflomicsSE objects](Images/DM_RflomicsSE-overview.png){width="150%"}


-   The third part is the multi-omics integration, with two proposed methods: a 
supervised approach using the mixOmics package (DIABLO), and an unsupervised 
approach using MOFA2. (settings and results stored in RflomicsMAE object)


![Fig 3: storage of multi-omics integration settings and results in the metadata slot of the RflomicsMAE object](Images/DM_RflomicsMAE-integration.png){width="100%"}


As you will see, this analysis workflow is implemented as methods of the two 
classes (`RflomicsMAE` and `RflomicsSE`), providing a generic interface whatever 
the type of omics data being analyzed (RNAseq, proteomics, metabolomics, etc.). 
These methods return the same class as the input object 
(`RflomicsMAE` and `RflomicsSE`) with the metadata slot enriched with parameters 
and results. Getter and Setter methods have been defined to access the metadata 
objects.


# Methods for RflomicsMAE and RflomicsSE classes

## Instantiating a RflomicsMAE object from a MAE object

In this example we will instantiate a RflomicsMAE object from an existing MAE 
object and a design data.frame. We will use the ecoseed dataset from the RFLOMICS
package (see `data(ecoseed)`).

`ecoseed` is a multi-omics dataset composed of three data matrices: 
transcriptomics (raw RNAseq read count data matrix), metabolomics and proteomics
(relative abundance matrix as XIC) obtained from a study examining the effect of
seed production temperature on the germination potential of Arabidopsis thaliana.

The experimental design includes three factors:

\- Repeat: batch factor: rep1, rep2, rep3

\- temperature: biological factor with three conditions: Low, Medium, Elevated

\- imbibition: biological factor with three conditions: DS (Dry Deed), EI (Early
Imbibition), LI (Late Imbibition)

Crucial information is needed to complete the design and create the RflomicsMAE 
object. Experimental factors must be defined with their type (batch or biological) 
and a reference level. These information are used to automatically generate the 
statistical design.

PCA on raw data is performed during the instantiation of the RflomicsMAE object.

```{r, echo=FALSE}
library(RFLOMICS)
```

```{r createRlomisMAE}
data(ecoseed.mae)

factorInfo <- data.frame(
  "factorName"   = c("Repeat", "temperature", "imbibition"),
  "factorType"   = c("batch", "Bio", "Bio"),
  "factorRef"    = c("rep1", "Low", "DS")
)

# create rflomicsMAE object with ecoseed data
MAE <- createRflomicsMAE(
  projectName = "Tests",
  omicsData   =  ecoseed.mae,
  omicsTypes  = c("RNAseq", "proteomics", "metabolomics"),
  factorInfo  = factorInfo
)

names(metadata(MAE))

head(getDesignMat(MAE))
```


## Overview of the main methods

Omics data is heterogeneous and each omics requires specific methods and 
parameters to take into account of the nature of data. RFLOMICS has a set 
of generic methods than can be applied to the `RflomicsMAE` or `RflomicsSE` 
objects and which adapt the methods and parameters to the type of omics data: 
RNAseq, proteomics, metabolomics.

The main methods are:

| **Methods**             | **Description**                                                                                                     |
|:------------------------|:--------------------------------------------------------------------------------------------------------------------|
| `createRflomicsMAE`       | instantiate a `RflomicsMAE` object from a `MAE` object and a design data.frame                         |
| `runDataProcessing`       | Process data: filtering: features or samples, normalization and/or transformation, PCA                              |
| `runDiffAnalysis`         | perform a Differential features analysis on omics data                                                              |
| `runCoExpression`         | perform a CoExpression analysis from the union/intersection of differentially expressed features list.              |
| `runAnnotationEnrichment` | perform an annotation enrichment analysis (GO, KEGG, custom) on the list of differentially or co-expressed features |
| `runOmicsIntegration`     | perform an integration analysis (MOFA or DIABLO)                                                                    |
| `generateReport`          | generate a report from the RflomicsMAE object and optionally export the report and results                          |

*Table 1:* Main methods of the RflomicsMAE and RflomicsSE classes for running a 
multi-omics analysis workflow.

When applied to a `RflomicsMAE` object, the name of the `RflomicsSE` object must
be specified with the `SE.name` parameter.


## Executing the workflow :

### Define the statistical settings: model and contrasts

The next step is to define the model and associated contrasts by translating the 
biological question into statistical terms. The `generateModelFormulae` function 
is used to find all possible models given the experimental design settings.

Rflomics accepts up to three biological factors and one batch effect. Only 
second-order interaction terms between biological factors will appear in the 
model formulae. Batch factors will never appear in the interaction terms. 
For the rest of the example, we choose the second formula. The model without 
the interaction term.

-   set the model formulae

```{r designMetadata}
form <- generateModelFormulae(MAE)
form


MAE <- setModelFormula(MAE, form[2])


getModelFormula(MAE)
```

Given the formula chosen, all the hypotheses (contrasts) can be obtained using 
the `generateExpressionContrast` function. 
Three types of contrast are expressed: pairwise comparison, average expression 
and interaction expression.

In this example, we will select the first three contrasts from the averaged 
contrasts.

-   set contrasts:

```{r contr}
possibleContrasts <- generateExpressionContrast(MAE)
possibleContrasts$averaged$contrastName

MAE <- setSelectedContrasts(MAE, contrastList = possibleContrasts$averaged[1:3])

getSelectedContrasts(MAE)[, -c(1, 3)]
```

This statistical framework will be applied on all loaded datasets.


### Single-omics analysis

#### Data processing and quality control

Data processing, including normalization and transformation parameters and 
results, can be carried out in a single step using the `runDataProcessing` 
method. Or in several steps using the methods called by `runDataProcessing`, 
which are:`filterLowAbundance` (`RNAseq`), `runNormalization` and `runTransformData`.

In this example, the three omics data sets are processed in a single step.

```{r runNorm}
MAE <- runDataProcessing(
  object = MAE,
  SE.name = "RNAtest",
  samples = colnames(MAE[["RNAtest"]])[-1],
  filterStrategy = "NbReplicates",
  cpmCutoff = 1,
  normMethod = "TMM"
) |>
  runDataProcessing(
    SE.name = "metatest", normMethod = "median",
    transformMethod = "log2"
  ) |>
  runDataProcessing(
    SE.name = "protetest", normMethod = "none",
    transformMethod = "log2"
  ) |>
  runOmicsPCA(SE.name = "metatest")

# Access to the normalization settings for metabolomics data

getNormSettings(object = MAE[["metatest"]])

# Obtain the list of filtered features for the RNAseq data

getFilteredFeatures(object = MAE[["RNAtest"]])[1:10]
```

It is important to note that none of the data is directly transformed in the 
assay, unless the user specifically requests it by setting the `modifyAssay` 
argument to TRUE. This way, you can always change your mind and test several 
normalisation or transformation processes.

You can access to processing results and plot them.

```{r plotPCA, fig.width=10, fig.height=5, fig.show = "hold"}
plotOmicsPCA(MAE[["metatest"]], raw = TRUE)
plotOmicsPCA(MAE[["metatest"]], raw = FALSE)
```

```{r plotDataDistribution, fig.width=10, fig.height=5, fig.show = "hold"}
plotDataDistribution(MAE[["metatest"]], raw = TRUE)
plotDataDistribution(MAE[["metatest"]], raw = FALSE)
```


#### Differential analysis: settings and results

Differential analysis on each contrasts can be performed using the `runDiffAnalysis`
method. This method is specialized for each omics type and uses the `limma` package 
for proteomics and metabolomics data and the `edgeR` package for RNAseq data. 
The method returns the same object with the metadata slot enriched with the 
parameters and results of the analysis.

```{r diff}
MAE <- runDiffAnalysis(MAE,
  SE.name = "RNAtest", p.adj.method = "BH",
  method = "edgeRglmfit", p.adj.cutoff = 0.05, logFC.cutoff = 0
) |>
  runDiffAnalysis(
    SE.name = "protetest", p.adj.method = "BH",
    method = "limmalmFit", p.adj.cutoff = 0.05, logFC.cutoff = 0
  ) |>
  runDiffAnalysis(
    SE.name = "metatest", p.adj.method = "BH",
    method = "limmalmFit", p.adj.cutoff = 0.05, logFC.cutoff = 0
  )

# access to the settings

getDiffSettings(MAE, SE.name = "RNAtest")

# Summary of the differential analysis

getDiffStat(MAE[["RNAtest"]])
```

```{r volcano, fig.width=7, fig.height = 4.5, fig.align="center"}
plotDiffAnalysis(MAE,
  SE.name = "RNAtest",
  contrastName = "(temperatureMedium - temperatureLow) in mean",
  typeofplots = "volcano"
)
```

```{r boxplot, fig.width=7, fig.align="center"}
plotBoxplotDE(MAE[["RNAtest"]],
  feature = "AT4G04810"
)
```


#### Co-expression analysis: settings and results

Co-expression analysis on the list of differentially expressed features can be
performed using the `runCoExpression` method. This method uses methods from 
the `coseq` package with a set of optimal parameters tailored to each omics. 
In this example, we will perform a co-expression analysis on the `proteomics` data.

The results can be plot using the `getCoExpAnalysesSummary` method.

A co-expression profile can be plotted for a given cluster using the 
`plotCoExpressionProfile` function.

```{r coexp}
MAE <- runCoExpression(MAE,
  SE.name = "protetest",
  K = 2:5, replicates = 5,
  merge = "union"
)

getCoExpAnalysesSummary(MAE)

plotCoExpression(MAE[["protetest"]])$ICL

plotCoExpressionProfile(MAE[["protetest"]], cluster = 2)
```


#### Annotation enrichment analysis: settings and results

Annotation enrichment analysis can be performed using the `runAnnotationEnrichment`
method. 
This method uses the `clusterProfiler` package. In this example, the `org.At.tair.db` 
database for Arabidopsis thaliana will be used to perform the GO enrichment analysis 
of the DE protein list.

The results can be viewed using the `getEnrichRes` method, a summary of the 
number of enriched terms by the `sumORA` function and plotted, by example, 
using the `plotEnrichComp` method.

```{r enrich}
MAE <- runAnnotationEnrichment(MAE,
  SE.name = "protetest",
  OrgDb = "org.At.tair.db",
  keyType = "TAIR",
  pvalueCutoff = 0.05,
  from = "DiffExp",
  database = "GO",
  domain = "CC"
)

results <- getEnrichRes(MAE[["protetest"]],
  from = "DiffExp", database = "GO"
)

sumORA(MAE[["protetest"]], from = "DiffExp", database = "GO")

plotEnrichComp(MAE[["protetest"]],
  from = "DiffExp", database = "GO",
  matrixType = "FC"
)
```

### Multi-omics integration analysis

The multi-omics step in RFLOMICS takes two steps: the preparation of the object,
based on the user-selected method, and the actual run of the selected method. 
This is done using `prepareForIntegration` and `runOmicsIntegration`.
In this example, we will use only the proteomics and metabolomics dataset 
with the mixOmics package (block.plsda function).

`prepareForIntegration` does not return a RflomicsMAE object, it returns the 
corresponding type to be an input for either MOFA2 or mixOmics functions.

```{r}
listIntegration <- prepareForIntegration(MAE,
  omicsNames = c("protetest", "metatest"),
  variableLists = rownames(MAE),
  method = "mixOmics"
)

MAE <- runOmicsIntegration(
  object = MAE,
  preparedObject = listIntegration,
  method = "mixOmics",
  selectedResponse = "temperature",
  ncomp = 3
)
```

Visualizing results is made through the original package functions.

```{r, fig.width = 7, fig.align = "center"}
mixOmics::plotIndiv(getMixOmics(MAE, response = "temperature"))
```


# Generating a report from the RflomicsMAE object

The `generateReport` function is used to generate a report from a RflomicsMAE object.
The report is generated in HTML format. 

The MAE object can be saved (export = TRUE), exporting the object in a .RData,
the report in html format, the graphs in png format and the data in txt (tabulated) format.

It's use is straightforward, using the `generateReport(object = MAE, fileName = "ecoseed_report.html", export = FALSE)` command. 
It can take a long time, depending on the amount of analyses performed and the data size.

# Session information

```{r}
sessionInfo()
```