This vignette provides examples of how to use the LHeuristic
package, designed to identify genes with an L-shaped pattern in scatterplots comparing gene expression and methylation.
The package enables users to:
Data Input
Data Preprocessing
Data Processing
Data Output
The data is a subset of methylation and expression arrays from the colon adenocarcinoma (COAD) dataset in The Cancer Genome Atlas (TCGA). The full dataset is available on the TCGA website.
Methylation and expression data are in two separate files. Columns represent patient samples, and rows represent genes.
library(Lheuristic)
# library(kableExtra)
# library(VennDiagram)
data("TCGAexpression")
data("TCGAmethylation")
lhCreateMAE
based on MultiAssayExperiment
as data containerTo effectively integrate and manage omics data, we utilize the MultiAssayExperiment
data structure. This approach ensures consistency between datasets and facilitates downstream analysis.
The lhCreateMAE
function constructs a MultiAssayExperiment object using two datasets—typically methylation and expression data—while enforcing consistency in sample and feature names. It ensures:
The datasets have matching row and column names.
Experiments are labeled appropriately (e.g., “methylation” and “expression”).
Sample metadata can be incorporated via the colData argument.
The function includes an internal validation step, which verifies that both datasets are aligned correctly. If discrepancies exist in row or column names, the function will return an error, prompting users to correct inconsistencies before proceeding.
library(MultiAssayExperiment)
colDat <- DataFrame(sampleID = colnames(TCGAmethylation))
rownames(colDat) <- colDat$sampleID
# Construct MultiAssayExperiment
mae1 <- lhCreateMAE(xDat = TCGAmethylation, yDat = TCGAexpression,
xName = "methylation", yName = "expression", colData = colDat)
# Display dataset names
print(names(experiments(mae1)))
#> [1] "methylation" "expression"
The LHeuristic
package implements the heuristic algorithm by Sanchez et al. (2019) to detect L-shaped scatterplots. However, since many researchers rely on negative correlation to identify genes “regulated by methylation,” the package includes both methods for completeness.
Each method has various parameters optimized for selecting L-shaped patterns and detecting negative correlations in scatterplot data.
The functions also support parameter tuning, allowing for other forms of correlation selection.
The correlation method is a simple approach that identifies genes with a significant negative correlation between their expression and methylation values.
cl <- correlationSelection(mae1, type = "Spearman",
pValCutoff = 0.25, rCutoff = -0.5, adj = TRUE)
correlationL <- cl[!is.na(cl$SigNegCorr) &
cl$SigNegCorr, ]
correlationNoL <- cl[!is.na(cl$SigNegCorr)&
!cl$SigNegCorr, ]
message(
"The number of genes selected
with the correlation method is: ", sum(correlationL$SigNegCorr),
"\n"
)
#> The number of genes selected
#> with the correlation method is: 27
We can observe the distribution of the correlation coefficients:
d <- density(correlationL[, 1])
x2 <- data.frame(x = d$x, y = d$y)
library(ggplot2)
ggplot(x2, aes(x,y)) + geom_line() +
labs(
title = "Significant correlations in the TCGA dataset",
x = "Correlation",
y = "Density"
)+
theme_minimal()
The number of genes retained depends on the p-value and r cutoff.
The result is a table listing “selected” and “unselected” genes, with columns for the r coefficient, p-value, adjusted p-value, distance correlation, and a Boolean indicating whether the gene was selected based on a significantly negative correlation.
head(correlationL)
#> r (Sp) p (Sp) adj.Spear.Pval distCor SigNegCorr
#> ABCC2 -0.8754171 2.441980e-10 2.441980e-07 0.8541382 TRUE
#> ACOX2 -0.6222469 2.412488e-04 1.855760e-02 0.6833996 TRUE
#> ACSL5 -0.5466073 1.776232e-03 8.073781e-02 0.5711910 TRUE
#> ADCY7 -0.5808676 7.635690e-04 4.242050e-02 0.6024344 TRUE
#> ADGRE3 -0.5675195 1.072547e-03 5.362737e-02 0.5547401 TRUE
#> ADH6 -0.6298109 1.919776e-04 1.599813e-02 0.6159664 TRUE
The selected genes can be plotted and saved as a PDF (leaving the file name empty displays the plot on screen). The figure below shows the first four selected genes.
genes2plot <- rownames(correlationL)[1:3]
plotGenesMat(mae1, geneNames = genes2plot,
text4Title = correlationL[rownames(correlationL), ""],
saveToPDF = FALSE
)
To plot scatterplots showing the relationship between methylation and expression for a list of genes, use the function plotGenesMat
.
The heuristic method identifies L-shaped scatterplots by overlaying a grid on the graph and defining specific cells that must contain a minimum (or maximum) percentage of points for the scatterplot to qualify as L-shaped.
It computes a score where points in the designated L region receive positive values, while points outside this region receive negative values. Properly adjusting scores and weights results in positive scores for L-shaped scatterplots and negative scores for non-L-shaped ones.
This concept can be clarified with a “three-band rule”:
Overlay a \(3 \times 3\) grid on the scatterplot.
Classify the scatterplot as “L” or “non-L” based on a few conditions
2.1. A minimum number of points must be in the upper-left (cell (1,1)) and lower-right (cell (3,3)) corners. 2.2. A maximum number of points must be in the upper-right (cell (1,3)), as points here indicate hypermethylation and hyperexpression, which are contrary to our goal. 2.3. A minimum of points in cell (3,1) is usually not required unless we strictly want an L-shape; we can also accept diagonals that reflect a negative correlation.
Score points in each sub-grid as follows: 3.1. Points in permitted regions (left margin: cells (1,1), (2,2), (3,1), (3,2), (3,3)) score positively if classified as L, or zero if classified as non-L. 3.2. Points in non-desired regions (outer band: cells (1,2), (1,3), (2,3)) always score negatively. 3.3. Some regions, like cell (2,2), may be neutral and not score.
Tune scoring parameters either manually (based on experience and dataset characteristics) or through cross-validation (if a set of positive and negative L-shaped genes is available).
The scheme can be summarized using the following equation:
\[ S(X) = W_L \circ X \times \mathbf{1}_L(X) + W_{L^C} \circ X \times \mathbf{1}_{L^c}(X) \] where:
The classification of the scatterplot as \(L\) or \(L^c\) can also be represented by the Hadamard product of three matrices:
\[ \mathbf{1}_L(X) = \bigwedge_{i,j} X \circ C \circ \left( mMP \times \sum_{i,j} x_{ij} \right), \]
where: - \(X\) is the matrix of counts, indicating the number of counts in each grid cell.
sampleSize <- dim(mae1[[2]])[2]
numGenes <- dim(mae1[[2]])[1]
reqPercentages <- matrix(c(2, 20, 5, 5, 40, 20, 3, 3, 2),
nrow = 3, byrow = TRUE)
sum(reqPercentages)
#> [1] 100
(maxminCounts <- toReqMat(sampleSize, reqPercentages))
#> [,1] [,2] [,3]
#> [1,] 1 6 2
#> [2,] 2 12 6
#> [3,] 1 1 1
(theWeightMifL <- matrix(c(2, -2, -sampleSize / 5, 1, 0, -2, 1, 1, 2),
nrow = 3, byrow = TRUE))
#> [,1] [,2] [,3]
#> [1,] 2 -2 -6
#> [2,] 1 0 -2
#> [3,] 1 1 2
(theWeightMifNonL <- matrix(c(0, -2, -sampleSize / 5, 0, 0, -2, 0, 0, 0),
nrow = 3,
byrow = TRUE
))
#> [,1] [,2] [,3]
#> [1,] 0 -2 -6
#> [2,] 0 0 -2
#> [3,] 0 0 0
heur <- scoreGenesMat(mae1,
aReqPercentsMat = reqPercentages, aWeightMifL = theWeightMifL,
aWeightMifNonL = theWeightMifNonL
)
message("Number of scatterplots scored : ", dim(heur)[1], "\n")
#> Number of scatterplots scored : 1000
message("Number of L-shape scatterplots : ", sum(heur[, 1]), "\n")
#> Number of L-shape scatterplots : 67
heurL <- heur[heur$logicSc, ]
heurNoL <- heur[!heur$logicSc, ]
We can check the results in the following table, which includes a logical value indicating whether the gene exhibits an L-shape based on our criteria, along with the numerSc score.
logicSc | numericSc | |
---|---|---|
A2ML1 | TRUE | 18 |
A4GNT | TRUE | -4 |
ABCB5 | TRUE | 10 |
ACADL | TRUE | 32 |
ACOT2 | TRUE | 32 |
ACOT8 | TRUE | 36 |
ACOX2 | TRUE | 39 |
ACRBP | TRUE | -6 |
ADAM11 | TRUE | 30 |
ADAM19 | TRUE | 33 |
ADAM8 | TRUE | 14 |
ADAMTS1 | TRUE | 17 |
ADAMTS14 | TRUE | 1 |
ADAMTS3 | TRUE | 31 |
ADGRL2 | TRUE | 28 |
ADH6 | TRUE | 31 |
AEBP1 | TRUE | 8 |
AGPAT4 | TRUE | 32 |
AGT | TRUE | 27 |
AGTR1 | TRUE | 26 |
ALDH1A2 | TRUE | 27 |
ALDH1A3 | TRUE | 9 |
ALDH1L1 | TRUE | 34 |
ALDOB | TRUE | 32 |
ALOX5AP | TRUE | 16 |
ALX1 | TRUE | 26 |
AMBN | TRUE | -4 |
AMIGO2 | TRUE | 34 |
AMMECR1L | TRUE | 10 |
ANGPTL5 | TRUE | 34 |
ANK2 | TRUE | 12 |
ANKRD53 | TRUE | 12 |
ANTXR1 | TRUE | 18 |
ANXA9 | TRUE | 31 |
AOX1 | TRUE | -2 |
AP3B2 | TRUE | 36 |
APH1B | TRUE | 31 |
APOC2 | TRUE | 24 |
APOLD1 | TRUE | 35 |
AQP9 | TRUE | 9 |
ARHGDIB | TRUE | 14 |
ARHGDIG | TRUE | 28 |
ARHGEF10 | TRUE | 36 |
ARHGEF25 | TRUE | 12 |
ARHGEF7 | TRUE | 6 |
ARL10 | TRUE | 14 |
ARMC4 | TRUE | 26 |
ARPP21 | TRUE | 10 |
ASB4 | TRUE | 29 |
ASCL1 | TRUE | 29 |
ASIC4 | TRUE | 33 |
ATP6V0D2 | TRUE | 9 |
ATP6V1C2 | TRUE | 40 |
ATP6V1G3 | TRUE | 18 |
AVIL | TRUE | 5 |
B3GALNT1 | TRUE | 36 |
B4GALNT1 | TRUE | 29 |
B4GALNT4 | TRUE | 31 |
BACE2 | TRUE | 33 |
BACH2 | TRUE | 34 |
BAMBI | TRUE | 24 |
BBOX1 | TRUE | 31 |
BCAN | TRUE | 24 |
BCAR1 | TRUE | 9 |
BCHE | TRUE | 34 |
BDKRB1 | TRUE | 20 |
BEST4 | TRUE | 11 |
Next, we can visualize the scatterplots of the selected genes or save them on a PDF file.
genes2plot2 <- rownames(heurL)[1:4] #
plotGenesMat(mae1, geneNames = genes2plot2,
fileName = NULL, text4Title = heurL[genes2plot2, "numeriSc"],
saveToPDF = FALSE
)
In summary, the Heuristic method allows us to select genes with an L-shaped scatterplot as follows:
Select datasets: A pair of row-column matched matrices for expression and methylation.
Set parameters: 2.1. Grid definition 2.2. Binary scoring 2.3. Numerical scoring
Score the selected data: Return scores, the classification (L-Shape = TRUE / non-L-shape = FALSE), and plots for each gene.
After identifying genes using the two methods, we can create the intersection of all lists and visualize the results with a Venn Diagram.
We may choose genes selected by two or more methods to ensure greater consistency in the selection.
inCommonL <- intersect(rownames(correlationL),
rownames(heurL))
inCorrelationLOnly <- setdiff(rownames(correlationL),
inCommonL)
inheurLLOnly <- setdiff(rownames(heurL), inCommonL)
We can also plot selected genes. As an example we will plot the genes 1, 2, 3 and 5 by index.
par(mfrow = c(2, 2))
myGene1 <- inCommonL[1]
titleT <- paste(myGene1, "(May be GRM)")
plotGeneSel(mae1,myGene1,
titleText = titleT,
x1 = 1/3, x2 = 2/3)
myGene2 <- inCommonL[2]
titleT <- paste(myGene2, "(May be GRM)")
plotGeneSel(mae1, myGene2,
titleText = titleT,
x1 = 1/3, x2 = 2/3)
myGene3 <- inCommonL[3]
titleT <- paste(myGene3, "(May be GRM)")
plotGeneSel(mae1,myGene3,
titleText = titleT,
x1 = 1/3, x2 = 2/3)
myGene5 <- inCommonL[5]
titleT <- paste(myGene5, "(May be GRM)")
plotGeneSel(mae1, myGene5,
titleText = titleT,
x1 = 1/3, x2 = 2/3)
Sanchez-Pla A, Miró B, Carmona F et al. A heuristic algorithm to select genes potentially regulated by methylation [version 1; not peer reviewed]. F1000Research 2019, 8:1017. (https://doi.org/10.7490/f1000research.1116986.1)
sessionInfo()
#> R Under development (unstable) (2025-03-13 r87965)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.2 LTS
#>
#> Matrix products: default
#> BLAS: /home/biocbuild/bbs-3.21-bioc/R/lib/libRblas.so
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0 LAPACK version 3.12.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_GB LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: America/New_York
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats4 stats graphics grDevices utils datasets methods
#> [8] base
#>
#> other attached packages:
#> [1] ggplot2_3.5.1 MultiAssayExperiment_1.33.9
#> [3] SummarizedExperiment_1.37.0 Biobase_2.67.0
#> [5] GenomicRanges_1.59.1 GenomeInfoDb_1.43.4
#> [7] IRanges_2.41.3 S4Vectors_0.45.4
#> [9] BiocGenerics_0.53.6 generics_0.1.3
#> [11] MatrixGenerics_1.19.1 matrixStats_1.5.0
#> [13] Lheuristic_0.99.01
#>
#> loaded via a namespace (and not attached):
#> [1] tidyselect_1.2.1 farver_2.1.2 dplyr_1.1.4
#> [4] fastmap_1.2.0 digest_0.6.37 rpart_4.1.24
#> [7] lifecycle_1.0.4 cluster_2.1.8.1 magrittr_2.0.3
#> [10] compiler_4.6.0 rlang_1.1.5 Hmisc_5.2-3
#> [13] sass_0.4.9 tools_4.6.0 yaml_2.3.10
#> [16] data.table_1.17.0 knitr_1.50 ggsignif_0.6.4
#> [19] S4Arrays_1.7.3 labeling_0.4.3 htmlwidgets_1.6.4
#> [22] DelayedArray_0.33.6 abind_1.4-8 foreign_0.8-90
#> [25] withr_3.0.2 purrr_1.0.4 nnet_7.3-20
#> [28] grid_4.6.0 ggpubr_0.6.0 colorspace_2.1-1
#> [31] scales_1.3.0 cli_3.6.4 rmarkdown_2.29
#> [34] crayon_1.5.3 rstudioapi_0.17.1 httr_1.4.7
#> [37] BiocBaseUtils_1.9.0 energy_1.7-12 cachem_1.1.0
#> [40] stringr_1.5.1 XVector_0.47.2 base64enc_0.1-3
#> [43] vctrs_0.6.5 boot_1.3-31 Matrix_1.7-3
#> [46] jsonlite_2.0.0 carData_3.0-5 car_3.1-3
#> [49] rstatix_0.7.2 Formula_1.2-5 htmlTable_2.4.3
#> [52] tidyr_1.3.1 jquerylib_0.1.4 glue_1.8.0
#> [55] cowplot_1.1.3 stringi_1.8.7 gtable_0.3.6
#> [58] UCSC.utils_1.3.1 gsl_2.1-8 munsell_0.5.1
#> [61] tibble_3.2.1 pillar_1.10.1 htmltools_0.5.8.1
#> [64] GenomeInfoDbData_1.2.14 R6_2.6.1 evaluate_1.0.3
#> [67] lattice_0.22-6 backports_1.5.0 broom_1.0.8
#> [70] bslib_0.9.0 Rcpp_1.0.14 gridExtra_2.3
#> [73] SparseArray_1.7.7 checkmate_2.3.2 xfun_0.51
#> [76] pkgconfig_2.0.3