---
title: "L-shaped selection from expression and methylation data"
author: "Berta Miro Cau and Alex Sánchez-Pla"
date: "`r Sys.Date()`"
output:
  rmarkdown::html_vignette: default
vignette: > 
  %\VignetteIndexEntry{L-shaped selection from methylation and expression}
  %\VignetteEncoding{UTF-8}  
  %\VignetteEngine{knitr::rmarkdown}
editor_options: 
  chunk_output_type: console
---

```{r setup, include = FALSE}
options(rmarkdown.html_vignette.check_title = FALSE)
knitr::opts_chunk$set(
    collapse = TRUE,
    comment = "#>"
)
```

# Introduction
This vignette provides examples of how to use the `LHeuristic` package, 
designed to identify genes with an L-shaped pattern in scatterplots 
comparing gene expression and methylation.

The package enables users to:

- Data Input

    - Load gene expression and methylation values from CSV files
    
- Data Preprocessing

    - Set parameters for data usage, choosing either all genes or those meeting 
    specific filters (e.g., $\mbox{corr}< 0$)
    - Select the L-shape detection method and configure its parameters
    
- Data Processing

    - Run computations to identify L-shaped genes
    - Generate scatterplots for all genes or filtered selections
    
- Data Output

    - Save the list of identified genes and scatterplots


# Input data

The data is a subset of methylation and expression arrays from the 
colon adenocarcinoma (COAD) dataset in The Cancer Genome Atlas (TCGA). 
The full dataset is available on the TCGA website.

Methylation and expression data are in two separate files. 
Columns represent patient samples, and rows represent genes.

```{r loadLibs}
library(Lheuristic)

# library(kableExtra)
# library(VennDiagram)

data("TCGAexpression")
data("TCGAmethylation")
```

## Using `lhCreateMAE` based on `MultiAssayExperiment` as data container

To effectively integrate and manage omics data,
we utilize the `MultiAssayExperiment` data structure.
This approach ensures consistency between datasets and 
facilitates downstream analysis.

The `lhCreateMAE` function constructs a MultiAssayExperiment
object using two datasets—typically methylation and
expression data—while enforcing consistency in sample
and feature names. It ensures:

The datasets have matching row and column names.

Experiments are labeled appropriately
(e.g., "methylation" and "expression").

Sample metadata can be incorporated
via the colData argument.

The function includes an internal validation step,
which verifies that both datasets are aligned correctly.
If discrepancies exist in row or column names, the
function will return an error, prompting users to
correct inconsistencies before proceeding.


```{r createMAE, warning=FALSE, message=FALSE}
library(MultiAssayExperiment)

colDat <- DataFrame(sampleID = colnames(TCGAmethylation))
rownames(colDat) <- colDat$sampleID


# Construct MultiAssayExperiment
mae1 <- lhCreateMAE(xDat = TCGAmethylation, yDat = TCGAexpression,
    xName = "methylation", yName = "expression", colData = colDat)

# Display dataset names
print(names(experiments(mae1)))
```

# Data analysis

The `LHeuristic` package implements the heuristic algorithm by 
Sanchez et al. (2019) to detect L-shaped scatterplots. However, since many 
researchers rely on negative correlation to identify genes "regulated by 
methylation," the package includes both methods for completeness.

Each method has various parameters optimized for selecting L-shaped patterns 
and detecting negative correlations in scatterplot data.

The functions also support parameter tuning, allowing for other forms of 
correlation selection.

## Correlation method

The *correlation method* is a simple approach that identifies genes with a
significant negative correlation between their expression and methylation 
values.

```{r corMethod}
cl <- correlationSelection(mae1, type = "Spearman",
    pValCutoff = 0.25, rCutoff = -0.5, adj = TRUE)

correlationL <- cl[!is.na(cl$SigNegCorr) & 
    cl$SigNegCorr, ]
correlationNoL <- cl[!is.na(cl$SigNegCorr)& 
    !cl$SigNegCorr, ]

message(
    "The number of genes selected 
    with the correlation method is: ", sum(correlationL$SigNegCorr),
    "\n"
)
```

We can observe the distribution of the correlation coefficients:

```{r corrDistri,fig.width = 6, fig.height=4}
d <- density(correlationL[, 1])
x2 <- data.frame(x = d$x, y = d$y)

library(ggplot2)
ggplot(x2, aes(x,y)) + geom_line() +
    labs(
    title = "Significant correlations in the TCGA dataset",
    x = "Correlation",
    y = "Density"
    )+
    theme_minimal()

```

The number of genes retained depends on the *p-value* and *r* cutoff.

The result is a table listing "selected" and "unselected" genes, with columns
for the *r coefficient*, *p-value*, *adjusted p-value*, 
*distance correlation*, and a Boolean indicating whether the gene was selected
based on a significantly negative correlation.

```{r showCorrs}
head(correlationL)
```

The selected genes can be plotted and saved as a PDF (leaving the file name 
empty displays the plot on screen). The figure below shows the first four 
selected genes.

```{r plotGenesFromCorr1, fig.height=6, fig.width=6}
genes2plot <- rownames(correlationL)[1:3]

plotGenesMat(mae1, geneNames = genes2plot,
    text4Title = correlationL[rownames(correlationL), ""],
    saveToPDF = FALSE
)
```


To plot scatterplots showing the relationship between methylation and expression
for a list of genes, use the function `plotGenesMat`.

## Heuristic method

The heuristic method identifies L-shaped scatterplots by overlaying a grid 
on the graph and defining specific cells that must contain a minimum 
(or maximum) percentage of points for the scatterplot to qualify as L-shaped.

It computes a score where points in the designated L region receive 
positive values, while points outside this region receive negative values.
Properly adjusting scores and weights results in positive scores for 
L-shaped scatterplots and negative scores for non-L-shaped ones.

This concept can be clarified with a "three-band rule":

1. Overlay a \(3 \times 3\) grid on the scatterplot.

2. Classify the scatterplot as **"L" or "non-L"** based on a few conditions    
    2.1. A *minimum* number of points must be in the upper-left (cell (1,1))
    and lower-right (cell (3,3)) corners.
    2.2. A *maximum* number of points must be in the upper-right (cell (1,3)), 
    as points here indicate hypermethylation and hyperexpression, which are 
    contrary to our goal.
    2.3. A minimum of points in cell (3,1) is usually *not required* 
    unless we strictly want an L-shape; we can also accept diagonals 
    that reflect a negative correlation.

3. Score points in each sub-grid as follows:
    3.1. Points in permitted regions (left margin: cells (1,1), (2,2), (3,1),
    (3,2), (3,3)) score positively if classified as L, or zero 
    if classified as non-L.
    3.2. Points in non-desired regions (outer band: cells (1,2), (1,3), 
    (2,3)) always score negatively.
    3.3. Some regions, like cell (2,2), may be neutral and not score.

4. *Tune scoring parameters either manually (based on experience and dataset*
*characteristics) or through cross-validation* 
(**if a set of positive and negative L-shaped genes is available**).

The scheme can be summarized using the following equation:


$$
S(X) = W_L \circ X \times \mathbf{1}_L(X) + W_{L^C} \circ X \times 
\mathbf{1}_{L^c}(X)
$$ 
where:


- \(X\) is the matrix of *counts*, representing the number of counts in each 
cell of the grid.
- \(W_L\) is the matrix of scores per cell and point if the scatterplot is 
classified as \(L\).
- \(W_{L^c}\) is the matrix of scores per cell and point if the scatterplot is 
classified as non-\(L\) (\(L^c\)). 
The symbol \(\circ\) denotes the Hadamard product of the two
matrices \(W_{L/L^c}\) (i.e., element-wise multiplication), 
and \(\mathbf{1}_{L/L^c}()\) is
the indicator function for \(L\) or \(L^c\).

The classification of the scatterplot as \(L\) or \(L^c\) can also be 
represented by the Hadamard product of three matrices:

\[
\mathbf{1}_L(X) = \bigwedge_{i,j} X \circ C \circ 
\left( mMP \times \sum_{i,j} x_{ij}
\right),
\]

where:
- \(X\) is the matrix of *counts*, indicating the number of counts 
in each grid cell.

- \(C\) is the matrix of conditions to verify if the scatterplot 
should be classified as \(L\).
- \(mMP\) is the matrix of minimum and maximum percentages of points 
required in each cell for classification as \(L\).
- \(\circ\) represents point-wise logical operations, allowing the product
of the three matrices to act as a logical operation.
- \(\bigwedge_{i,j}\) denotes a logical "AND" operation across all cells;
if all cells are TRUE, the result is assigned to \(L\); if any cell fails,
it is assigned to \(L^c\).


```{r setWeights}
sampleSize <- dim(mae1[[2]])[2]
numGenes <- dim(mae1[[2]])[1]

reqPercentages <- matrix(c(2, 20, 5, 5, 40, 20, 3, 3, 2), 
    nrow = 3, byrow = TRUE)
sum(reqPercentages)
(maxminCounts <- toReqMat(sampleSize, reqPercentages))

(theWeightMifL <- matrix(c(2, -2, -sampleSize / 5, 1, 0, -2, 1, 1, 2), 
    nrow = 3, byrow = TRUE))
(theWeightMifNonL <- matrix(c(0, -2, -sampleSize / 5, 0, 0, -2, 0, 0, 0),
    nrow = 3,
    byrow = TRUE
))

heur <- scoreGenesMat(mae1,
    aReqPercentsMat = reqPercentages, aWeightMifL = theWeightMifL,
    aWeightMifNonL = theWeightMifNonL
)

message("Number of scatterplots scored  : ", dim(heur)[1], "\n")
message("Number of L-shape scatterplots : ", sum(heur[, 1]), "\n")

heurL <- heur[heur$logicSc, ]
heurNoL <- heur[!heur$logicSc, ]
```

We can check the results in the following table, which includes a logical value
indicating whether the gene exhibits an L-shape based on our criteria, 
along with the *numerSc* score.


```{r showTrueLs}
knitr::kable(heurL)
```

Next, we can visualize the scatterplots of the selected genes or save them
on a PDF file.

```{r plotTrueLs, fig.height=6, fig.width=6}

genes2plot2 <-   rownames(heurL)[1:4] #  

plotGenesMat(mae1, geneNames = genes2plot2,
    fileName = NULL, text4Title = heurL[genes2plot2, "numeriSc"],
    saveToPDF = FALSE
)

```


In summary, the Heuristic method allows us to select genes with an L-shaped 
scatterplot as follows:

1. **Select datasets:** A pair of row-column matched matrices for expression and
methylation.

2. **Set parameters:**
    2.1. Grid definition
    2.2. Binary scoring
    2.3. Numerical scoring

3. **Score the selected data:** Return scores, the classification 
(L-Shape = TRUE / non-L-shape = FALSE), and plots for each gene.

# Comparison of Selected Gene Lists

After identifying genes using the two methods, we can create the intersection 
of all lists and visualize the results with a Venn Diagram.

We may choose genes selected by two or more methods to ensure greater 
consistency in the selection.


```{r checkCommon}
inCommonL <- intersect(rownames(correlationL), 
    rownames(heurL))
inCorrelationLOnly <- setdiff(rownames(correlationL),
    inCommonL)
inheurLLOnly <- setdiff(rownames(heurL), inCommonL)
```


We can also plot selected genes.
As an example we will plot the genes 1, 2, 3 and 5 by index.

```{r plotSelected, fig.height=6, fig.width=6}
par(mfrow = c(2, 2))
myGene1 <- inCommonL[1]
titleT <- paste(myGene1, "(May be GRM)")
plotGeneSel(mae1,myGene1,
    titleText = titleT,
    x1 = 1/3, x2 = 2/3)

myGene2 <- inCommonL[2]
titleT <- paste(myGene2, "(May be GRM)")
plotGeneSel(mae1, myGene2,
    titleText = titleT,
    x1 = 1/3, x2 = 2/3)


myGene3 <- inCommonL[3]
titleT <- paste(myGene3, "(May be GRM)")
plotGeneSel(mae1,myGene3,
    titleText = titleT,
    x1 = 1/3, x2 = 2/3)

myGene5 <- inCommonL[5]
titleT <- paste(myGene5, "(May be GRM)")
plotGeneSel(mae1, myGene5,
    titleText = titleT,
    x1 = 1/3, x2 = 2/3)
```

# References

Sanchez-Pla A, Miró B, Carmona F et al. 
A heuristic algorithm to select genes potentially regulated by methylation
[version 1;  not peer reviewed]. 
F1000Research 2019, 8:1017.
(https://doi.org/10.7490/f1000research.1116986.1)

# Session info

```{r}
sessionInfo()
```