---
title: 'OMICsPCA: An R package for quantitative integration and analysis of multiple omics assays from heterogeneous samples'
author: "Subhadeep Das"
date: "`r Sys.Date()`"
output:
    html_document:
        df_print: tibble
        fig_height: 7
        fig_width: 9
        highlight: pygments
        number_sections: yes
        theme: lumen
        toc: yes
        toc_depth: 6
vignette: >
    %\VignetteEncoding{UTF-8} 
    %\VignetteIndexEntry{OMICsPCA} 
    %\VignetteEngine{knitr::rmarkdown}
---

<style type="text/css">

body, td {
    font-size: 20px;
    text-align: justify
}
code.r{
    font-size: 15px;
}
pre {
    font-size: 15px
}
</style>


```{r include = TRUE, echo = FALSE, message = FALSE, warning = FALSE}
library(OMICsPCA)
library(OMICsPCAdata)
library(kableExtra)
library(rmarkdown)
library(knitr)
library(magick)
library(rgl)
library(grDevices)
library(MultiAssayExperiment)

```

```{r setup, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#",
eval = TRUE
)

```

```{r global_options, include=FALSE}
knitr::opts_chunk$set(fig.pos = 'H')
```


# Introduction
\  

## Background:

OMICsPCA is a statistical framework designed to integrate and
analyse multiple types of heterogeneous assays, factors, data 
types or modalities (e.g. ChIP-seq, RNA-seq) from different 
samples or sources ( e.g. multiple cell lines, time points etc.). 
Biological processes in eukaryotes are complicated molecular
interactions, spanned over multiple layers of regulation.
In order to understand the collective effects of the 
interactions of biological processes, they should be studied 
parallelly. High throughput sequencing has enabled genome-wide 
assays across several disease conditions, cell types, time points
etc. at a much faster rate.  On the other hand, replicated
experiments on diverse samples (e.g. various cell lines) help
us to understand the origin of variation (e.g. the genes, TSSs,
exons etc) observed between the samples.  Therefore,
integrative multi-omics studies on heterogeneous samples are
gaining their importance every day. As a consequence, a huge
number of datasets are being produced and stored in public 
databases, allowing researchers to perform integrative analysis
of multiple data modalities.

Unfortunately, most of such studies are limited to very few 
samples(e.g. cell lines) or data modalities (e.g. 
various ChIP-seq assays) due to the huge cost, ethical issues,
and scarcity of samples associated with the entire experimental 
process. And, most of the data integration methods rely on the 
condition of homogeneity of samples, i.e. the samples
corresponding to each modality should be the same. As a result,
integrative analysis on publicly available data sets is very
much rare and extremely restricted to very few numbers of samples
or modalities (or both). 
\  

## Illustration of the problem with example

Letâ€™s illustrate the problem with an example. Suppose our
objective is to decipher the collective effect of CpG 
methylation and H3k4me3 histone modification on the gene
expression, followed by clustering of the similar genes on 
the basis of the consolidated effect of these 2 data modalities.
Consider we have the following datasets:
\  

1)    CpG methylation on 5 Cell lines A, B, C, D, and E
\  

2)    H3k4me3 histone modification on 10 Cell lines
A,D,E,I,K,Q,P,Z,W and G.

Here, we can not calculate the collective effect directly,
since the cell lines and the number of cell lines are
different in the two experiments. 
\  

## Previous approaches/frameworks towards multi-omics data integration

Several data integration frameworks have been designed over
the past few years to overcome such difficulty. For example, 
Huang (1) proposed a regression-based joint modeling approach
to integrate SNPs, DNA methylation, and gene expression;
He et al. (2)developed a pattern matching method to integrated 
eQTL and GWAS data; Xiong et al. (3) developed a statistical
framework to associate SNP and gene expression information etc.
All such applications are designed to perform in a supervised 
manner, i.e. the integrated data modalities must be associated 
with labelled classes (e.g. disease and control class; pathway 
1 and pathway 2 etc.). Such constraints impose a limitation on 
the algorithms to be used in varieties of applications (e.g., 
canâ€™t be used in clustering).  In contrast, Ha, et al. (4)
developed an unsupervised Gaussian process model for qualitative
(logical combination of binary values, e.g. 0/1 or TRUE/FALSE) 
integration of a large number of epigenetic factors and applied 
it to describe cell lineage-specific gene regulatory programs. 
Although this overcomes the limitation of class association to
some extent, this process is only sufficient for the 
applications where qualitative integration is sufficient.

As mentioned earlier, there is another important application of
data integration, which is associated with the disentanglement
of the source of variation observed between the samples or the
individuals. A common strategy used in such 
studies is to perform a large number of association tests
between the features and the data modalities (5, 6). 
Alternatively, kernel or graph-based methods are also considered
as methods of integration and the integrated data is then used
to build common similarity networks between samples (7, 8). 
Although effective in accomplishing specific tasks, such 
approaches lack proper interpretability (9) and do not suffice
to explain many other relevant queries applicable to integrated 
omics data. Such queries include identification of variably 
controlled genes, genes sharing similar epigenomic state across
various samples and many more. 
\  

## OMICsPCA

OMICsPCA is a multipurpose tool designed to overcome the difficulties
associated with both the applications of data integration. It is 
designed to identify the source of variation among the 
samples/variables (the columns of a table, e.g. cell lines, 
patients etc.) or individuals (the rows of a table, e.g. genes,
exons etc.) associated with each data modality (i.e. the tables, 
e.g. CpG_methylation, ChIP-seq on a protein etc.). It guides the
user through various analyses to decide on similar samples/variables 
(e.g.Cell lines). This is followed by selecting the data modalities
(e.g. the assays) that can be included or excluded leading to data 
integration. The selected data modalities can be further integrated 
and analysed and the data points or individuals (e.g. genes, TSSs, 
exons) may be clustered on the basis of the integrated data coming 
from single or multiple modalities.

In short, OMICsPCA can be used in clustering genes, TSSs, exons 
on the basis of multiple types of heterogeneous experimental or
theoretical data or vice-versa. Such kind of multivariate analyses 
is useful in identification of similar cell lines or assays prior 
to the integration process. In this vignette, we show an example 
of the entire analysis process done on human transcription start
sites.
\  

# DATA

In the following example, we run our analyses on a subset of 3 epigenetic
Assays and one expression assay. All the data are publicly available
at **ENCODE** and **ROADMAP epigenomics project** portal. The data is 
stored in the **MultiAssayExperiment** object named **multi_assay**, which
is stored in the supporting data package **OMICsPCAdata**.

More detailed description of the datasets are provided at the man
page of **multi_assay**, which can be accessed by `?multi_assay`


\  

# Example
\  
In this section we demonstrate the functionality of OMICsPCA with an example.
We divided the GENCODE annotated TSSs into 4 groups on the basis of their 
on-off state in 31 cell lines and mapped Histone ChIP-seq data
with them in order to study the consolidated effect of
these data modalities on their expression. Before calculating the
consolidated effect, we filtered out the outlier samples and identified the data
modalities that show discriminatory density pattern among the expression groups.
Finally, we clustered the similar TSSs together on the basis of the consolidated
effect of discriminatory factors.

## Preparation of data

OMICsPCA has an inbuilt function **prepare_dataset()** that aggregates
the samples (e.g. cell lines) of each data modality (e.g. a histone 
modification ChIP-seq assay) corresponding to each feature of the genome
(notably gene, TSS, exon etc.) and returns a dataframe.

The rows of this dataframe object contain the features (e.g. genes)
and each column contains the values corresponding to a sample
(e.g. Cell line).

The arguments taken by this function is described below:

1.    `factdir`: full path of the directory containing files corresponding
to various cell lines in .bed format. This directory should contain only
the files of the same assay (modality) (e.g. H3k9ac) from different
samples (e.g. cell lines).

2.    `annofile`: full path of the file containing the annotation file
(e.g exons, TSSs, genes etc) in .bed format. 

3.    `annolist`: name of full set or subset of the entries in the 
annotation file. The **.bed** files should have at least 4 columns.
The first column is the chromosome name, the second and third is
the start and end position of the feature (e.g. a gene or a
ChIP-seq peak etc). The fourth column contains the value of the
described feature (e.g. expression of a gene, the ChIP-seq
intensity of a peak or name of a gene etc.).

The `.bed` files should have at least 4 columns. The first column is the
chromosome name, the second and third is the start and end of the feature
(e.g. a gene or a ChIP-seq peak etc) accordingly and the fourth column 
contains the value (e.g. expression of a gene, ChIP-seq intensity of a peak
or name of a gene etc.).

Some examples of the input file formats are given below:
\  

```{r echo = FALSE}

Cell1 <- read.delim(
  system.file("extdata/factors2/demofactor/Cell1.bed",
  package = "OMICsPCAdata"),
  header=FALSE, sep = "\t")

names(Cell1) <- c("chromosome", "start","end","intensity")

annotation <- read.delim(
  system.file("extdata/annotation2/TSS_groups.bed",
  package = "OMICsPCAdata"),
  header=FALSE, sep = "\t")

names(annotation) <- c("chromosome", "start","end","TSS ID") 

TSS_list <- read.delim(
  system.file("extdata/annotation2/TSS_list",
  package = "OMICsPCAdata"),
  header=FALSE, sep = "\t")

names(TSS_list)[1] <- c("TSS ID") 

```

\  

```{r, eval = TRUE, echo = FALSE, result = 'asis'}
#for html

knitr::kable(Cell1[1:5,], 
             caption = "Intensity of demo peaks",
             align = 'c') %>%  kable_styling("bordered",full_width = TRUE, position = "left")
```
\  

```{r, eval = TRUE, echo = FALSE, result = 'asis'}
#for html

knitr::kable(annotation[1:5, 1:4], caption = "demo annotation file(`annofile`)",
             align = 'c') %>% kable_styling("bordered",full_width = TRUE,
                                            position = "left")
```
\  

```{r, eval = TRUE, echo = FALSE, result = 'asis'}
#for html

knitr::kable(TSS_list[1:5,],
             caption = "demo TSSs (`annolist`)",
             align = 'c', col.names = "TSS ID") %>% 
  kable_styling("bordered",full_width = TRUE, position = "left")
```
\  

Here we show the use of **prpareDataset()** by integrating demo 
peaks (.bed format) from 2 files into a data frame named.
The example data sets are packaged inside the `OMICsPCAdata` 
package
\  

```{r echo = TRUE}
anno <- system.file("extdata/annotation2/TSS_groups.bed",
                    package = "OMICsPCAdata")

list <- system.file("extdata/annotation2/TSS_list",
                    package = "OMICsPCAdata")

fact <- system.file("extdata/factors2/demofactor",
                    package = "OMICsPCAdata")


list.files(path = fact)
```
\  

prepare_dataset() will combine these 2 bed files into a dataframe.
The rows of the dataframe will correspond to the entries in file
entered through the variable named `list`  and the columns will
correspond to the 2 files shown above.
\  

```{r eval = TRUE}

all_cells <- prepare_dataset(factdir = fact,
annofile = anno, annolist = list)

head(all_cells[c(1,14,15,16,20),])

```
\  

For a different assay type repeat the process:
\  

```{}

factor_x  <- prepare_dataset(factdir = fact,
                            annofile = anno, annolist = list)

# where `fact` is the directory of the samples corresponding to factor_x

For example,

fact <- system.file("path to cpg")
cpg <- prepare_dataset(factdir = fact, annofile=anno, annolist = list)
```
\  

## Reading data 

OMICsPCA is designed to read various assays done on various cell lines
into an **MultiAssayExperiment** class object and then run various
analyses on this object.
\  

```{r echo = TRUE}

library(MultiAssayExperiment)

datalist <- data(package = "OMICsPCAdata")
datanames <- datalist$results[,3]
data(list = datanames)

assaylist <- list("demofactor"  = all_cells)

demoMultiAssay <- MultiAssayExperiment(experiments = assaylist)


```
\  

For ease of analysis, we compiled some preloaded data into
an **MultiAssayExperiment** object named **multi_assay**.

## Subsetting into groups

Once the dataframes corresponding to the assays/modalities
is prepared and stored in the "MultiAssayExperiment" object",
the user may want to divide the entire annotation set
into smaller groups. Here we show an example to divide 28770 GENCODE
TSS groups into 4 expression groups. The grouping criteria are based
on the on/off status (determined from CAGE experiments) of the TSSs
in 31 cell lines.

OMICsPCA has a function **create_group()** to do this task. This
function takes the following 4 arguments:

1. **group_names** : a vector containing the user-defined names of the
groups to be created

2. **factor** : name of the data frame on which groups will be created

3. **comparison** : a vector of comparison symbols such as >,
<, ==, >=, <=, %in% etc
4. **condition** : a vector of conditions corresponding to `comparison`.
condition should be a vector or range of digits (e.g. c(1,3,7,9).
or 1:5) if %in% is chosen as comparison and a single digit for 
other cases.

5. **name** : name of the **MultiAssayExperiment** object containing the
assay data

6. **grouping_factor** : name of the assay/modality on which grouping will
be done
\  

```{r echo = TRUE}
groupinfo <- create_group(name = multi_assay, group_names = c("WE" ,
"RE" ,"NE" ,"IntE"),
grouping_factor = "CAGE",
comparison = c(">=" ,"%in%" ,"==" ,"%in%"),
condition = c("25" ,"1:5" ,"0" ,"6:24"))
```
\  

In the above example the dataframe CAGE is divided into
4 smaller data frames e.g; WE, RE, NE and IntE. â€œWEâ€ contains TSSs
expressed in >= 25 cell lines, â€œREâ€ represents TSSs expressed in
1 to 5 cell lines and so on.

The CAGE data frame looks like this:
  \  

```{r, eval = TRUE, echo = FALSE, result = 'asis'}
#for html
knitr::kable(assays(multi_assay)[["CAGE"]][1:5,1:5], align = 'c') %>% 
kable_styling("bordered",full_width = TRUE, position = "center")
```
\  

The **groupinfo** object is accessd by subsequent functions.
So this object has to be present in the global environment to
run other functions. If the user do not wish to divide the 
datasets into groups or if the user has predefined groups,
then this information may be read directly from a file or
object having this information. We compiled a predefined
group format in a dataframe **groupinfo_ext** and compiled
it in **OMICsPCAdata**.
\  

```{r, eval = TRUE, echo = FALSE, result = 'asis'}
#for html

knitr::kable(groupinfo_ext[1:5,, drop = FALSE], align = 'c') %>% 
  kable_styling("bordered",full_width = TRUE, position = "center")
```
\  

## Exploratory data analysis (EDA) on created groups

Once the group information is read, we can study the the distribution of
various Assays/factors(e.g. H2az, H3k9ac etc) on the annotations
(e.g. TSSs, genes) of various groups (In the above example, the groups
are created on the basis of on-off state of TSSs in 31 cell lines). 
OMICsPCA has several functions for EDA.
\  

### descriptor()

Descriptor works in two modes depending on the value passed through
the argument `choice`. choice = 1 displays panels of boxplots
corresponding to the Assay(s) or factor(s) or modality(ies) chosen
through argument `factor`. Each panel shows a number of boxplots
corresponding to the  group(s) passed through argument `groups`.
Each boxplot in choice 1  represents the distribution of the 
percentage of cell lines corresponding to a group that is overlapped 
with the selected assay.  

**NOTE :** in `groups`, one group name should be used only once.
\  

**NOTE :** `groups` can handle only the predefined groups or the groups 
created by "create_group()"
\  

```{r}
descriptor(name = multi_assay,
factors = c(
"H2az",
"H3k4me1","H3k9ac"),
groups = c("WE","RE"),
choice = 1,
title = "Distribution of percentage of cell types overlapping
with various factors",
groupinfo = groupinfo)
```
\  

Above boxplot shows that "H2az" and "H3k9ac" overlaps with Widely Expressed
("WE") TSSs in almost 100 % cell lines of respective assays. Thus they can 
be thought of as characteristics of "WE". In addition, almost the opposite
trait is observed in "NE", which in turn, indicates that these 6 properties 
may be the discriminatory factors between the various expression groups.
\  

In `choice` = 2, a value needs to be supplied to the argument 
`choice2group`. This value should be the name of one of the 
groups created by "create_group()".
The plot displayed in this mode represents the distribution 
of the percentage of celll lines corresponding to a
factor/Assay/modality with various combination of other
Assays/factors/modalities supplied through `factors`.
\  

```{r}
descriptor(name = multi_assay,
factors = c("H2az",
"H3k4me1","H3k9ac"),
groups = c("WE","RE"),
choice = 2,
choice2group = "WE",
title = "Distribution of percentage of cell types overlapping with
a factor in various combinations of epigenetic marks in the
selected group",
groupinfo = groupinfo)

```
\  

The above set of boxplots indicates that most of the epigenetic factors in
Widely expressed TSSs ("WE") are present with other factors.
This gives us an indication of the interplay between various epigenetic
marks
\  

### chart_correlation()

This function creates pairwise correlation and scatter plots and
histogram on selected groups. The type of results should be passed 
through the argument `choice`. This is a wrapper function on various
functions and thus can take additional and non-conflicting arguments
specific to them.
\  

```{r, eval = TRUE,echo=FALSE, results='asis'}
#for html
data <- data.frame(choice = c("table","scatter","hist", "all"),
                   functions = c(
"cor {stats}",
"pairs {graphics}",
"ggplot,geom_histogram,facet_wrap{ggplot2}",
"chart.Correlation {PerformanceAnalytics}"),
output = c("correlation table","scatterplot of each pair",
"histogram of each column","all of the above together"))

knitr::kable(data, caption = "Summary of choices", align = 'c') %>% 
  kable_styling("bordered",full_width = FALSE, position = "center")
```
\  

```{r eval = TRUE, echo = TRUE}
groups <- c("WE")
chart_correlation(name = multi_assay, Assay = "H2az",
groups = "WE", choice = "scatter", groupinfo = groupinfo)
```
\  

All the correlation scatterplots of TSSs in "WE" looks similarly shaped
except the ones coming from Hepg2. This is an indication that H2az pattern
of the expression group "WE" is different in Hepg2 than other 4 cell lines.

The correlation value between each pair of cell lines may be calculated
by `choice` = table. The slight difference in Hepg2 is visible here also.
\  

```{r eval = TRUE, echo = TRUE}
chart_correlation(name = multi_assay,
Assay = "H2az",
groups = "WE", choice = "table",
groupinfo = groupinfo)
```
\  

The diversion of hepg2 is more evident when plotted as a histogram.
The mode of "WE" is close to 0 in Hepg2 compared to other cell lines.
This indicates that most of the "WE" in Hepg2 do not have H2az marks on them.
\  

```{r}
chart_correlation(name = multi_assay, Assay = "H2az",
groups = "WE", choice = "hist", bins = 10,
groupinfo = groupinfo)
```
\  

All the above 3 plots can be plotted in a single plot
\  

```{r}
chart_correlation(name = multi_assay, Assay = "H2az",
groups = "WE", choice = "all",
groupinfo = groupinfo)
```
\  

## Integration of Multiple Cell lines

If an assay (e.g. ChIP-seq for the histone modification of type H3k9ac)
is done on multiple cells/conditions/treatment/time, it might be, 
sometimes, necessary to integrate or combine them, in order to
understand the source of variation or to calculate a common metric
to be served as the representative of all variables etc. Such a 
combination may be done by several techniques like Principal
Component Analysis (PCA) or Factor Analysis (FA). OMICsPCA
has a function **integrate_variables()** to integrate data
from multiple cells or conditions using PCA. 

***integrate_variables()*** takes 3 mandatory inputs:
  
1. **Assays**: name of the assays read in the
  **MultiAssayExperiment** object

2. **name**: name of the **MultiAssayExperiment** object
containing the assays to be integrated.

3. **groups**: a full set or subset of groups present in 
**groupinfo** object. The group names should be provided as a 
vector string. e.g. c("WE" ,"RE")

***integrate_variables()*** runs PCA on the assays/data modalities
supplied through the `Assays` argument.This function is a wrapper
around the function PCA() from package FactoMineR and 
thus can take additional arguments specific to the
function PCA()

In this example, we supply two more additional arguments
**scale.unit = FALSE** and **graph = FALSE** 
  
This function returns a list containing the PCA results.
\  

```{r eval = TRUE}
PCAlist <- integrate_variables(Assays = c("H2az","H3k4me1",
"H3k9ac"), name = multi_assay,
groups = c("WE","RE", "NE", "IntE"), groupinfo = groupinfo,
scale.unit = FALSE, graph = FALSE)
```
\  

## Analysis of integrated data

The **analyse_vaiables()** function is designed for a quick 
analysis and visualization of the data integrated by 
integrate_variables() function. It takes 3  compulsory
arguments **1) name**, **2) Assay** and **3) choice**.
The type of analysis should be selected through
the `choice` argument. This function acts as a wrapper around a 
collection of functions of package "factoextra" and "corrplot" and 
thus can take additional arguments specific and non-conflicting to 
such functions. Types of analysis and their corresponding input in
`choice` are listed below:
  \  

```{r echo =FALSE}
choice_table <- data.frame(choice = 1:6, graphical_output = c(
  "Barplot of variance explained by each principal component (PC)
  or dimension",
  "Loadings (in terms of cos2, contrib or coord
  supplied through the argument var_type) of columns/variables
  (cell lines in this example) on PC1 and PC2",
  # Each Cell line(variable) has a loading on each
  #principal components. 
  #Loadings = Eigenvectorsâ‹…sqrt(Eigenvalues).
  "Matrix plot of correlations of each column/variable
  (Cell line in this example) with each PC",
  "Barplot of squarred loadings (or cos2) of each column/variable 
  (cell line in this example) on the PC/dimension of choice",
  "The contributions of each column/variable (cell line in
  this example) in accounting for the variability in a given
  PC/dimension. The contribution (in percentage) is calculated
  as : squared loading of the variable 
  (e.g. a cell line) * 100 / total squared loading of
  the given PC/dimension",
  "Variance explained by each of the
  first few PCs together with barplot explaining total 
  variance explained by the displayed PCs in each assay"
  ),
  
  console_output = c(
    "Table of eigen values and corresponding variance",
    "Table of loadings in terms of coord, cos2 or contrib as
    supplied through  the argument var_type",
    "Table of correlations of variables (Cell lines) with 
    PCs/Dimensions",
    "Table of squarred loadings/cos2 of
    each variable (Cell line) on the PCs/ Dimensions",
    "Table of contribution of each variable on the selected
    PC/Dimension",
    "None"
  ),
  
  function_used = c(
    "fviz_eig",
    "fviz_pca_var",
    "corrplot",
    "fviz_cos2",
    "fviz_contrib",
    "ggplot, plot_grid"
  ),
  
  package = c(
    "factoextra",
    "factoextra",
    "factoextra and corrplot",
    "factoextra",
    "factoextra",
    "ggplot2, cowplot"
  ),
  
  additional_arguments = c(
    "addlabels",
    "var_type",
    "is.corr",
    "PC",
    "PC",
    "various graphical arguments to ggplot2 and cowplot functions"
  )
  )

names(choice_table) <-  c("choice", "graphical\noutput","console\noutput",
"function\nused", "package", "additional\narguments")

library(kableExtra)
```
\  

**Note:** `PC` argument is mandatory for `choice` 4 and 5.
(PC = principal component)

**Note:** User may use the functions listed below directly to supply more
graphical parameters (e.g. `ncp` in choice 1, i.e. in fviz_eig, to restrict
the display of the number of dimensions or use `select.var` in choice 2
to select columns/variables/cell lines satisfying some condition.).
use `?` followed by the function name listed in the above table to 
explore more options available for each choice of "analyse_variables()".
\  

```{r, echo=FALSE, results='asis'}
# for html
knitr::kable(choice_table, caption = "available choices", align = 'c') %>%
  kable_styling(full_width = FALSE, position = "center") %>%
  column_spec(1, width = "1em") %>%
  column_spec(2, width = "18em") %>%
  column_spec(3, width = "8em") %>%
  column_spec(4, width = "5em") %>%
  column_spec(5, width = "6em") %>%
  column_spec(6, width = "5em")
```
\  

Here are some examples of different choices:
\  

### choice = 1 

Barplot of variance explained by each principal component
(PC) or dimension
\  

```{r eval = TRUE, echo = TRUE}
analyse_variables(name = PCAlist, Assay = "H2az", choice = 1,
title = "variance barplot", addlabels = TRUE)
```
\  

83.3 % of variation is captured by PC1 alone, which indicates that 5 cell 
lines are very similar to one another. However, there may be the influence
of the huge number of "NE" (see integrate_variables() step). Therefore
doing this analysis might be worth on each individual expression group.
\  

### choice = 2

```{r eval = TRUE, echo = TRUE}
analyse_variables(name = PCAlist, Assay = "H3k4me1",choice = 2,
title = "Loadings of cell lines on PC1 and PC2", xlab = "PCs")
```
\  

Loadings of each variable (here the cell lines) explain the intensity of their
correlation with principal components. Each Cell line(variable) has a loading
on each principal components, which is calculated as:
  **Loadings = Eigenvectors x sqrt(Eigenvalues)**.

In the above plot, the loading is expressed
as the contribution of each cell lines on PC 1 and 2. Although the contribution
of K562 is very high in PC1, its direction indicates its strong correlation with
other PCs (PC2). On the other hand, Hmec, Huvec, Nha, Nhek and Osteobl are
highly correlated only with PC1. The small value (hence the small arrows)
of the contribution of other cell lines with PC1 indicates they might be 
correlated with other PCs also. This is more clearly explained when the plot
is studied with the table generated by this command simultaneously.
\  

### choice = 3

As explained in "chart_correlation()", Hepg2 is slightly different than other
cell lines, and this is evident in the correlation plot also.

```{r eval = TRUE, echo = TRUE}
analyse_variables(name = PCAlist, Assay = "H2az",
choice = 3,title = "Correlation matrix", xlab = "PCs")
```
\  

### choice = 4

This choice depicts the message interpreted above more clearly.
\  

```{r eval = TRUE, echo = TRUE}
analyse_variables(name = PCAlist, Assay = "H2az",
choice = 4,PC = 1,
title = "Squarred loadings of Cell lines on PC1")
```
\  

### choice = 5

The contribution (in percentage) is calculated as: squared loading of the
variable(e.g. a cell line) * 100 / total squared loading of the given
PC/dimension. This is another way to represent similar message we got from 
choice 2
\  

```{r eval = TRUE, echo = TRUE}
analyse_variables(name = PCAlist, Assay = "H2az",
choice = 5,PC=1,
title = "Contribution of Cell lines on PC1")
```
\  

### choice = 6

Percentage of variance explained by principal components in all the 
selected assays. The number of PCs is automatically reduced to the 
column dimension of the assay having the least number of cell lines
(and thus least number of PCs or dimensions)
\  

```{r eval = TRUE}
analyse_variables(name = PCAlist,
Assay = c("H3k9ac","H2az",
"H3k4me1"), choice = 6)
```
\  

## Analysis of individuals/annotations from integrated data

We can take the projection of each TSS on the principal components and
plot them on the space of PCs.
\  

### choice = 1

```{r eval = TRUE, echo = TRUE}
analyse_individuals(name = PCAlist,
Assay = "H3k9ac", groupinfo = groupinfo,
choice = 1, PC = c(1,2))
```
\  

The TSSs on the PC space appears bimodal, indicating two
broad groups of TSSs in terms of H3k4me1 overlap on them.

Now, let's look at some top rows/TSSs according to some
conditions like their contribution to selected PCs (contrib)
\  

```{r}

# selecting top 40 TSS groups according to their contribution on
# PC 1 and PC 2 using the argument "select.ind"

selected <- analyse_individuals(name = PCAlist,
Assay = "H3k4me1",
choice = 1, PC = c(1,2),
select.ind = list(contrib = 100),
groupinfo = groupinfo)

# plot selected individuals
plot(selected)

# extracted information of the selected individuals
head(selected$data)
```
\ 

Let's mine some information about these selected TSSs.
\  

```{r}

# The TSSs used in this example (each row) are obtained by combining
# many neighboring TSSs together. The combinations should be uncombined
# to find the corresponding annotations.

names  <- gsub(";",",",paste(as.character(selected$data$name),
                             collapse = ","))

names <- as.vector(strsplit(names, ",", fixed = TRUE)[[1]])

## The top 100 combined TSS_groups actually come from 115 TSSs

length(names)

# retrieve details of top 100 individuals
# transcript details contains the GENCODE v 17
# annotation of TSSs.

details <- transcript_details[
transcript_details$transcript_id %in% names,,drop = FALSE]

head(details)

## checking the gene type

table(details$gene_type)
```
\  

We can do various analysis with the gene name or id. Here is
an example:
  \  

```{r eval = FALSE}

# find GO annotation 

library(clusterProfiler)
library(org.Hs.eg.db)

gene <- details$gene_name

gene.df <- bitr(gene, fromType = "SYMBOL",
                toType = c("ENSEMBL", "ENTREZID"),
                OrgDb = org.Hs.eg.db)

ggo <- groupGO(gene     = unique(gene.df$ENTREZID),
               OrgDb    = org.Hs.eg.db,
               ont      = "MF",
               level    = 5,
               readable = TRUE)

# If we want to see the top results, we need to reorder
#the list. So, the following line is strictly optional.

#ggo@result <- ggo@result[order(-ggo@result$Count),]
#head(ggo@result)


# The barplot may not fit to the Rstudio window, therefore
# we may plot it in a different window

#grDevices::windows()
#barplot(ggo, drop=TRUE, showCategory=20)
```
\  

There are many packages that do a similar kind of analysis. We 
encourage to visit the website of **clusterProfiler** to check
the available analyses by this package.
\  

***NOTE: *** If we need to see some top individuals from a specific
expression group, then we should select that group only in 
"integrate_variables()" function call.
\  

We can draw a similar plot in 3D also.
\  

### choice = 2

```{r eval = TRUE, echo = TRUE}
analyse_individuals(name = PCAlist, Assay = "H3k9ac",
choice = 2, PC = c(1,2,3),
col = c("RED", "BLACK"), groupinfo = groupinfo)
```
\  

## Density analysis

The scatter plot of data points (here, rows/TSSs/individuals) 
may be highly disorganized. In such a situation, we may take
the help of density functions to understand the overall structure
of the dataset.
\  

Once the cell lines of an assay are integrated into PCs, we may
check the density distribution of scores on various principal 
components. OMICsPCA has a function ***plot_desnity()*** to 
calculate and plot the density distribution of PC scores. This 
function uses ggplot(), geom_density() and other additional
functions from  package ggplot2. Additional arguments of
geom_density (e.g. `adjust`) may be also supplied through this
function. The returned value is a "gg,ggplot" object.

Such kind of analysis may help us to choose factors
(e.g. H2az, H3k4me1) that can  discriminate between various groups
(e.g. our expression groups "WE", "RE", "NE" and "IntE").

**NOTE :** the `groups` argument should not contain a value that is
not used in **integrate_variables**.
\  

### 1D plots

Here is a discriminatory factor:
  \  

```{r eval = TRUE }
#head(groupinfo)
densityplot <- plot_density(name = PCAlist,
Assay = "H2az", groupinfo = groupinfo,
PC = 1, groups = c("WE","RE","IntE"),
adjust = 1)
```
\  

Density plots can be modified using other graphical functions like xlim(),
theme() etc.
\  

```{r eval = TRUE, echo =TRUE}
library(ggplot2)
library(graphics)

densityplot <- densityplot+xlim(as.numeric(c("-8", "23")))
densityplot
```

and here is a non-discriminatory factor:
\  

```{r eval = TRUE, echo = TRUE}
densityplot <- plot_density(name = PCAlist, Assay = "H3k4me1",
PC = 1, groups = c("WE" ,"RE"), adjust = 2,
groupinfo = groupinfo)

densityplot <- densityplot+xlim(as.numeric(c("-8", "15")))

print(densityplot)
```
\  

### 3D plots

OMICsPCA also has a function to visualize the density the scores on 2 principal
components together in 3D
\  

#### static = FALSE

If `static` is set to "FALSE" the function opens the interactive 3D density plot
in a new window.
\  

```{r eval = TRUE, echo = TRUE}
plot_density_3D(name = PCAlist, Assay = "H2az",
group = "WE", PC1 = 1,PC2 = 2, grid_size = 100,
groupinfo = groupinfo,
static = FALSE)
```
\  

#### static = TRUE

If `static` is set to "TRUE", some additional graphical parameters may
be supplied
\  

```{r eval = TRUE, echo = TRUE}
plot_density_3D(name = PCAlist, Assay = "H2az", group = "WE",
groupinfo = groupinfo,
PC1 = 1,PC2 = 2, grid_size = 100, static = TRUE, theta = -50,
phi = 40, box = TRUE, shade = 0.1, ticktype = "detailed", d = 10)
```
\  

**NOTE :** 2D density is calculated using the **kde2d** function from
package **MASS** which use kernel density estimation (KDE) to calculate
the density of 2D data. Now, if the variance on either or both of the
PCs are 0, the KDE can't be calculated.
\  

## Integration of multiple assays/ modalities

Once the discriminatory Assays between the groups are identified by
previous analyses,  we may integrate all such assays together.
\  

```{r eval = TRUE}
Assays = c("H2az", "H3k9ac")
```
\  

In this process it is possible to observe that some variables (Cell lines)
are not similar to others and thus can be excluded :
\  

```{r eval = TRUE}
exclude <- list(0,c(1,9))
```
\  

The length of the list "exclude" should be equal to the Number of
assays (i.e. length of "Assays""). If no columns are selected for
exclusion, a "0" should be placed in the "exclude" list. In the
above example, we are excluding none of the columns from 
"H2az" and column 1 and 9 from "H3k9ac"

The integration of multiple assays is handled by **integrate_pca**
function in OMICsPCA. It works in 2 modes:

1) runs PCA on selected groups

2) runs PCA on all individuals (e.g. annotations: gene, TSS etc.)

The type of merging the assays should be entered by the argument
`mergetype`

**integrate_pca()** returns a list of 2 elements containing
1) the start and end column of  each Assay in the integrated Assay

2) and the integrated PCA results
\  

```{r eval = TRUE}
int_PCA <- integrate_pca(Assays = c("H2az",
"H3k9ac"),
groupinfo = groupinfo,
name = multi_assay, mergetype = 2,
exclude = exclude, graph = FALSE)
```
\  

if `mergetype` is set to 1, then **integrate_pca** asks for the name of 
groups to be integrated.
\  

```{}
Enter a vector of strings containing group names:
```
\  

Here is an example of the allowed input
\  

```{}
Enter a vector of strings containing group names: c("WE","RE")
```
\  

Let's split the output into different variables

```{r}

start_end = int_PCA$start_end

name = int_PCA$int_PCA
```
\  
## Analysis of integrated data

All the analyses on variables and individuals of integrated Assays
are designed to be done by the functions
**analyse_integrated_variables()** and 
**analyse_integrated_individuals()**. These functions work
similarly as **analyse_variables()** and **analyse_individuals()**
described in section **3.6** and **3.7**. The only difference between
the analyses in section **3.6** and **3.7** is that in those sections
multiple cell lines from a single type of assay were integrated.
Whereas, the analyses described in this section will be on multiple
assays or modalities.
\  

### Analyses of variables

The extra argument needs to be supplied in
**analyse_integrated_variables()** is `start_end` which 
is the returned value by  **integrate_pca()**.

The `Assay` argument works a little bit differently here. it takes
the number corresponding to the order of assays used in the
`Assays` argument of **integrate_pca()**. For example, in the
above example `Assays` = c("H2az", "H3k9ac").
So here `Assay` = 1 indicates to "H2az" and
2 = "H3k9ac"

The default argument set to `Assay` is "all", which show analyses
on full  integrated assays. Here are some examples:
\  

#### choice = 1

```{r echo = FALSE}
start_end = list(start = c(1, 6), end = c(5, 16))
```

```{r eval = TRUE}

analyse_integrated_variables(start_end = start_end, name = name,
choice = 1, title = "variance barplot", Assay = 1, addlabels = TRUE)
```
\  

A large part of the total variance in the 2 assays (H2az
and H3k9ac) are captured by PC1 alone. This indicates that
these two Histone modifications are very similar and this is followed
by most of the cell lines. Remember the `mergetype` was set to 2 in
**integrate_pca()**, which uses all the
groups **("WE", "RE", "NE","IntE")** . This analysis may overshadow
the true structure of data due to a strong influence of "NE", coming
from their large number. So redoing the analysis on smaller groups
(use `mergetype` = 1 in integrate_pca()) may be useful in this
context.
\  

#### choice = 2

```{r eval = TRUE}
analyse_integrated_variables(start_end = start_end, Assay = 1,
name = name, choice = 2 ,
title = "Loadings of cell lines on PC1 and PC2",
var_type = "contrib")
```
\  

We can see that almost all the cell lines are very strongly correlated
with PC1. So PC1 of the integrated assays may be a representation of H2az.
If we repeat this analysis for other assays, we may find out a strong
correlation of multiple assays with a single PC. If this happens,
we may conclude that those Assays are similar to one another. However,
they may differ in various groups and cell lines under study. Therefore,
the `exclude` argument in **integrate_pca** should be carefully chosen 
in order to exclude non-related cell types.
\  

#### choice = 3

```{r eval = TRUE}
analyse_integrated_variables(start_end = start_end, Assay = 1,
name = name, choice = 3,
title = "Correlation of Cell lines with PCs of integrated Assays",
is.cor = TRUE)
```
\  

A Similar message is appearing here as observed in `choice` = 1.
Hepg2 in H2az is relatively less correlated with PC1 
and more with other PCs (Dim), indicating that it should be excluded
by `exclude` argument of integrate_pca()
\  

#### choice = 4

```{r eval = TRUE}
analyse_integrated_variables(start_end = start_end, Assay = 1,
name = name, choice = 4, PC = 1,
title = "Squarred loadings of Cell lines on PC1 of integrated Assays")
```
\  

Most of the variances explained by PC1 is actually contributed
by Hsmm, Gm12878,Osteobl and K562. This again indicates that
Hepg2 is different from others.
\  

#### choice = 5

```{r eval = TRUE}
analyse_integrated_variables(start_end = start_end, Assay = 1,
name = name, choice = 5, PC=1,
title = "Contribution of Cell lines on PC1 of integrated Assays")
```
\  

A total of 16 cell lines are integrated by integrate_pca().
Therefore, the contribution of each of these 16 should be reflected
on PCs, if they are of similar type. Here we see that 4 cell lines
of H2az contribute around 7% to PC1. This means that PC1
is not only the representation of H2az, but the other
assay H3k9ac is also represented well in PC1. However, we 
should redo this analysis on H3k9ac to confirm. 
\  

### Analysis of integrated individuals/annotations

This is the same as the analyses done using "analyse_individuals()".
The downstream analyses shown at section **3.7** are also applicable
here. 
\  

#### choice 1: 2D scatter plot

`PC` should contain 2 values and length(`col`) should match with
length(`PC`)
\  

```{r eval = TRUE, echo = TRUE}
analyse_integrated_individuals(
name = name, choice = 1, PC = c(1,2),
groupinfo = groupinfo)
```
\  

#### choice 2: 3D scatter plot

`PC` should contain 3 values and length(`col`) should match with
length(`PC`)
\  

```{r eval = TRUE}
analyse_integrated_individuals(
name = name,
choice = 2, PC = c(1,2,3),
col = c("RED", "BLACK","GREEN"),
groupinfo = groupinfo)
```
\  

It seems the bimodal structure observed in case of individual assays
also retained in integrated assays. This structure would be more clear
if seen from the angle of density.
\  

### Density analysis of integrated data

The scatter plot of data points (here, rows/TSSs/individuals) 
may be highly disorganized. In such a situation, we may take
the help of density functions to understand the overall structure
of the dataset.
\  

#### 1D plot

```{r}
densityplot <- plot_integrated_density(name = name, PC = 1,
groups = c("WE","RE","IntE","NE"), groupinfo = groupinfo)

# additional graphical functions (e.g. xlim, ylim, theme) may be
#added with densityplot (see section VIII. Density analysis)

densityplot
```
\  

#### 3D plot


##### static = TRUE

```{r}
plot_integrated_density_3D(name = name, PC1 = 1, PC2 = 2,
group = c("RE","RE"), gridsize = 100, static = TRUE,
groupinfo = groupinfo)
```
\  

##### static = FALSE

```{r}
plot_integrated_density_3D(name = name, PC1 = 1, PC2 = 2,
group = c("WE","RE"), gridsize = 100, static = FALSE,
groupinfo = groupinfo)
```
\  

The above analyses clearly indicate the presence of two broad groups
according to combined epigenetic states of various expression groups.
The obvious next job is to find out the TSSs in these groups.
\  

## Cluster Analysis

The principal components obtained by combining multiple assays
represent the overall epigenetic state of each annotation (here TSSs).
Our next objective is to find out groups of similar TSSs in terms of
the combined epigenetic state. We will use a suitable clustering
technique(s) to accomplish this job.

There are various clustering techniques available, each is designed
to perform specific tasks. Most of the partition based clustering
algorithms need the number of clusters as an input. From our density
analysis, we observed two dense regions in the data space of PC1 and PC2.
This suggests that the number of clusters should be set at 2. However,
there are sophisticated indices designed to validate the number of clusters.

OMICsPCA has a function "cluster_parameters()" to perform this job. It runs
two algorithms "clvalid" and "NbClust" in the background, which in turn use
different sets of validation techniques to decide the number of clusters.

"cluster_parameters()" extracts data from the PCA the object object supplied
through `name` argument via the "extract()" function. Both the algorithms
used in "cluster_parameters()" function takes huge memory and CPU for a
higher number of samples. Therefore, it is recommended to use a subset
of original data while running this function. A subset of rows can be
selected randomly through the `rand` argument of "extract()". The user may also
choose whether to use integrated assays or nonintegrated assays through 
the logical argument `integrated` of the function "extract()".
\  

```{r}

# extracting data from integrated PCA
data <- extract(name = name, PC = c(1:4),
groups = c("WE","RE"), integrated = TRUE, rand = 600,
groupinfo = groupinfo)

#or

# extracting PCA data from an individual assay
# if integrated = TRUE, an assay name should be entered by 
# Assay == "assayname"

data <- extract(name = PCAlist, PC = c(1:4),
groups = c("WE","RE"), integrated = FALSE, rand = 600,
Assay = "H2az", groupinfo = groupinfo)
```
\  

"cluster_parameters()" uses the output of "extract()" and runs one of the
algorithms between "clvalid" or "NbClust" to determine the number of
clusters.

The parameters of "clvalid" and "NbClust" are passed through the 
`clusteringMethods`, `validationMethods` and `distance` argument of
"cluster_parameters()" function. The argument pairs are different in
"NbClust" and "clvalid" and thus should be chosen carefully.
\  

```{r eval = FALSE}

#### Using "clValid" ####

clusterstats <- cluster_parameters(name = data,
optimal = FALSE, n = 2:6, comparisonAlgorithm = "clValid",
distance = "euclidean", clusteringMethods = c("kmeans","pam"),
validationMethods = c("internal","stability"))

#### plot indexes vs cluster numbers returned by clValid
#plot(clusterstats)


#### using "NbClust"

data <- extract(name = name, PC = c(1:4),
groups = c("WE","RE"),integrated = TRUE, rand = 400,
groupinfo = groupinfo)

clusterstats <- cluster_parameters(name = data, n = 2:6,
comparisonAlgorithm = "NbClust", distance = "euclidean",
clusteringMethods = "kmeans",
validationMethods = "all")

library(factoextra)

fviz_nbclust(clusterstats, method = c("silhouette",
"wss", "gap_stat"))
```
\  

Now, depending on the output of "cluster_parameters()", we may choose
the appropriate clustering algorithm and number of clusters.

OMICsPCA has a function "cluster()" that is designed for the clustering task.
"cluster()" has many clustering options which should be chosen through 
`choice` argument. The class of the returned value by "cluster()" depends
on the type of clustering method chosen. So, extraction and visualization
of data from cluster objects are different for different choices.
\  

```{r}

data <- extract(name = name, PC = c(1:4),
groups = c("WE","RE"), integrated = TRUE, rand = 1000,
groupinfo = groupinfo)

library(factoextra)

### kmeans clustering

clusters <- cluster(name = data, n = 2, choice = "kmeans",
title = "kmeans on 2 clusters")

# visualization of clusters

print(clusters$plot)

clustered_data <- cbind(data,clusters$cluster$cluster)
plot3d(data[,1:3], col = clusters$cluster$cluster)

### density-based clustering

clusters <- cluster(name = data, choice = "density",
eps = 2, MinPts = 100, graph = TRUE,
title = "eps = 1 and MinPts = 1.5")

# visualization of clusters
print(clusters$plot)

clustered_data <- as.data.frame(cbind(data,clusters$cluster$cluster))

#removing unclustered points
clustered_data <- clustered_data[clustered_data$V5 != 0, ]

library(rgl)
#plotting clusters on first 3 PCs
plot3d(clustered_data[,1:3], col = clustered_data$V5)

```

### Validation of clusters

```{r}

#### Using silhouette index
# calculating the distance matrix
library(cluster)
dist <- daisy(data)

sil = silhouette(clusters$cluster$cluster,dist)

# RStudio sometimes does not display silhouette plots
#correctly. So sometimes, it is required to plot in separate
#window.

#grDevices::windows()
#plot(sil)
```

### Identification of misclassified points

```{r}
# Identification of misclassified data points
# silhouette width of misclassified points are negative
misclassified <- which(sil[,3] < 0)

head(sil[misclassified,])

data[misclassified[3:7],]
```
\  

We may reorganize the misclassified points into appropriate
clusters.
\  

### Exploration of clusters

let's explore the clusters more
\  

```{r}
bp <- cluster_boxplot(name = multi_assay,
Assay = "H2az", clusterobject = clustered_data,
clustercolumn = 5)


bp <- bp+xlab("Cell lines") + ylab(
"value")+ggtitle(
"Distribution of H2az_cell_wise in various clusters")+theme(
plot.title=element_text(hjust=0.5),
legend.position = "top")+guides(
fill=guide_legend("Cell lines"))

bp
```
\  

The overall distribution of H2az is significantly different in
the two clusters, implying a good clustering by density based.
This process may be applied to other factors or other clustering
methods.
\  

# Summary
\  

Most of the multi-omics data integration algorithms are focused on
2 applications:
\  

1) use of integrated data for class prediction (e.g. identification
of diseased individuals) and
\  

2) analysis of the source of variation across samples and individuals.

However, such approaches do not suffice to other potential
applications including quantitative representation of the combined 
effect of multiple data modalities, clustering of individuals on 
the basis of combined epigenetic state etc. We propose here a new 
tool OMICsPCA, designed for integrative analysis of multi-omics data 
including the ones mentioned above.

Integration of multi-omics data involves 2 major tasks:
\  

1) analysis of the source of variation among the samples of
a data modality and
\  

2) calculation of the combined effect of multiple data modalities
on the individuals.

We used a **3 step process** to analyse the variation among
the samples of each data modality. **First**, we used **descriptor()**
and **chart_correlation()** to get an initial clue about the overall
nature of the samples. **Second**, we used **integrate_variables()**,
**analyse_variables()** and **analyse_individuals()** to dissect the
variation among the samples of a modality, as well as the individuals 
in a group-wise manner. **Third**, we used density analysis to study the 
internal structure of the data distribution which in turn used to 
identify the â€œdiscriminatory data modalitiesâ€ or assays that show
different density structures in different groups of individuals. 
These 3 step process is used to study the difference between the 
groups of individuals and to select the similar samples corresponding
to a data modality.

In the second task, we linearly combined all the similar samples 
corresponding to the selected factors (these may be the
discriminatory factors) using Principal Component Analysis (PCA).
Each of the Principal components represents the combined state of
various data modalities across multiple samples. We used these 
combined metrices to identify similar individuals (TSSs) by various
clustering methods.

OMICsPCA overcomes the limitations of the existing tools 
and has a good potential to be used as an multipurpose data 
integration and analysis tool. 
\  

# SessionInfo
\  

```{r}
sessionInfo()
```
\  

# References
\  

1.    Huang YT. Integrative modeling of multiple genomic data from
different types of genetic association studies. Biostatistics.
2014 Oct;15(4):587-602. PubMed PMID: 24705142.

2.    He X, Fuller CK, Song Y, Meng Q, Zhang B, Yang X, et al.
Sherlock: detecting gene-disease associations by matching patterns
of expression QTL and GWAS. American journal of human genetics.
2013 May 2;92(5):667-80. PubMed PMID: 23643380. Pubmed Central 
PMCID: 3644637.

3.    Xiong Q, Ancona N, Hauser ER, Mukherjee S, Furey TS. 
Integrating genetic and gene expression evidence into genome-wide 
association analysis of gene sets. Genome research. 2012 Feb;22(2):
  386-97. PubMed PMID: 21940837. Pubmed Central PMCID: 3266045.

4.    Ha M, Hong S. Gene-regulatory interactions in embryonic stem
cells represent cell-type specific gene regulatory programs. 
Nucleic acids research. 2017 Oct 13;45(18):10428-35.
PubMed PMID: 28977540. Pubmed Central PMCID: 5737473.

5.    Chen L, Ge B, Casale FP, Vasquez L, Kwan T, Garrido-Martin D,
et al. Genetic Drivers of Epigenetic and Transcriptional Variation
in Human Immune Cells. Cell. 2016 Nov 17;167(5):1398-414 e24.
PubMed PMID: 27863251. Pubmed Central PMCID: 5119954.

6.    Consortium GT. Human genomics. The Genotype-Tissue Expression
(GTEx) pilot analysis: multitissue gene regulation in humans. 
Science. 2015 May 8;348(6235):648-60. PubMed PMID: 25954001.
Pubmed Central PMCID: 4547484.

7.    Lanckriet GR, De Bie T, Cristianini N, Jordan MI, Noble WS.
A statistical framework for genomic data fusion. Bioinformatics.
2004 Nov 1;20(16):2626-35. PubMed PMID: 15130933.

8.    Wang B, Mezlini AM, Demir F, Fiume M, Tu Z, Brudno M,
et al. Similarity network fusion for aggregating data types on
a genomic scale. Nature methods. 2014 Mar;11(3):333-7. 
PubMed PMID: 24464287.

9.    Argelaguet R, Velten B, Arnol D, Dietrich S, Zenz T, Marioni
JC, et al. Multi-Omics Factor Analysis-a framework for unsupervised
integration of multi-omics data sets. Molecular systems biology.
2018 Jun 20;14(6):e8124. PubMed PMID: 29925568. Pubmed Central
PMCID: 6010767.