--- title: "A comprehensive framework for precision medicine with omics data" author: - name: Jordi Martorell-Marugán affiliation: - GENYO, Centre for Genomics and Oncological Research - Andalusian Foundation for Biomedical Research in Eastern Andalusia (FIBAO) email: jordi.martorell@genyo.es - name: Iván Ellson affiliation: - GENYO, Centre for Genomics and Oncological Research - name: Raúl López-Domínguez affiliation: - GENYO, Centre for Genomics and Oncological Research - name: Daniel Toro-Domínguez affiliation: - GENYO, Centre for Genomics and Oncological Research - Karolinska Institutet email: danieltorodominguez@gmail.com - name: Pedro Carmona-Sáez affiliation: - Department of Statistics and Operational Research. University of Granada - GENYO, Centre for Genomics and Oncological Research email: pcarmona@ugr.es package: pathMED date: "`r BiocStyle::doc_date()`" abstract: > PathMED is a collection of tools to facilitate precision medicine studies with omics data (e.g. transcriptomics). Among its funcionalities, genesets scores for individual samples may be calculated with several methods. These scores may be used to train machine learning models and to predict clinical features on new data. For this, several machine learning methods are evaluated in order to select the best method based on internal validation and to tune the hyperparameters. Performance metrics and a ready-to-use model to predict the outcomes for new patients are returned. output: BiocStyle::html_document vignette: > %\VignetteIndexEntry{pathMED} %\VignetteEncoding{UTF-8} %\VignetteEngine{knitr::rmarkdown} references: - id: toroDominguez2022 title: "Scoring personalized molecular portraits identify Systemic Lupus Erythematosus subtypes and predict individualized drug responses, symptomatology and disease progression" author: - family: Toro-Domínguez given: D - family: Martorell-Marugán given: J - family: Martínez-Bueno given: M - family: López-Domínguez given: R - family: Carnero-Montoro given: E - family: Barturen given: G - family: Goldman given: D - family: Petri given: M - family: Carmona-Sáez given: P - family: Alarcón-Riquelme given: ME container-title: Briefings in Bioinformatics issued: year: 2022 volume: 23 issue: 5 - id: hanzelman2013 title: "GSVA: gene set variation analysis for microarray and RNA-Seq data" author: - family: Hänzelmann given: S - family: Castelo given: R - family: Guinney given: J container-title: BMC Bioinformatics issued: year: 2013 volume: 14 issue: 7 - id: barbie2009 title: "Systematic RNA interference reveals that oncogenic KRAS-driven cancers require TBK1" author: - family: Barbie given: DA - family: Tamayo given: P - family: Boehm given: JS - family: et al container-title: Nature issued: year: 2009 volume: 462 - id: foroutan2018 title: "Single sample scoring of molecular phenotypes" author: - family: Foroutan given: H - family: Bhuva given: DD - family: Lyu given: R - family: et al container-title: BMC Bioinformatics issued: year: 2018 volume: 19 - id: tomfohr2005 title: "Pathway level analysis of gene expression using singular value decomposition" author: - family: Tomfohr given: J - family: Lu given: J - family: Kepler given: TB container-title: BMC Bioinformatics issued: year: 2005 volume: 6 - id: lee2018 title: "Inferring Pathway Activity toward Precise Disease Classification" author: - family: Lee given: E - family: Chuang given: HY - family: Kim given: JW - family: et al container-title: PLOS Computational Biology issued: year: 2018 volume: 4 issue: 11 - id: aibar2017 title: "SCENIC: single-cell regulatory network inference and clustering" author: - family: Aibar given: S - family: González-Blas given: CB - family: Moerman given: T - family: et al container-title: Nature Methods issued: year: 2017 volume: 14 - id: badia2022 title: "decoupleR: ensemble of computational methods to infer biological activities from omics data" author: - family: Badia-i-Mompel given: P - family: Santiago given: JV - family: Braunger given: J - family: et al container-title: Bioinformatics Advances issued: year: 2022 volume: 2 issue: 1 - id: korotkevich2021 title: "Fast gene set enrichment analysis" author: - family: Korotkevich given: G - family: Sukhov given: V - family: Budin given: N - family: et al container-title: bioRxiv editor_options: markdown: wrap: 72 --- # Introduction Omics technologies have advanced precision medicine by enabling the prediction of clinically relevant outcomes based on individual molecular profiles. However, their diagnostic and prognostic applications remain limited due to variability introduced by different technological platforms, laboratory protocols, and data processing methods. These inconsistencies hinder the generalizability of machine learning (ML) models trained on individual datasets. Single-sample molecular scoring has emerged as a promising solution to improve interstudy reproducibility. By summarizing molecular features (e.g., gene expression) into pathway or gene set activity scores, these methods provide a more standardized and robust representation of biological processes. This enables the development of ML models that generalize across datasets, even when quantification methods differ. Several R packages implement molecular scoring methods, such as `decoupleR` (@badia2022) and `GSVA` (@hanzelman2013). `pathMED` unifies and simplifies these approaches, providing a common input format and allowing users to compute multiple scores efficiently. Additionally, `pathMED` includes our novel method, M-scores, which enhances disease characterization and prediction by leveraging reference datasets (@toroDominguez2022). A key limitation of pathway-based scoring is the heterogeneity within gene sets, where different genes may have opposing effects on biological processes. Direct scoring of such broad gene sets can obscure meaningful patterns. To address this, pathMED incorporates methods to refine gene sets by clustering features with coordinated activity, improving pathway granularity and revealing previously hidden subpathways. Many ML-based clinical prediction models also suffer from overfitting, unreliable performance metrics, and limited reproducibility in external datasets, restricting their clinical applicability. To overcome these challenges, pathMED provides a unified framework for ML model training, validation, and prediction. While it leverages the caret package, pathMED simplifies its usage and implements nested cross-validation by default to enhance model reliability. # Installation You can install pathMED from the GitHub repository: ```{r, eval=FALSE} devtools::install_github("jordimartorell/pathMED") ``` Please note that on some operating systems such as Ubuntu there may be installation errors in the packages "metrica", "factoextra" and "FactoMineR", as it is necessary to install system dependencies with administrator permissions. These system dependencies can be identified the following code (adapted to Ubuntu 20.04) and installed by running the output in a terminal as administrator: ```{r, eval=FALSE} if (!require("pak")) { install.packages("pak")) } pak::pkg_sysreqs( c("metrica", "factoextra", "FactoMineR"), "ubuntu", "20.04" ) ``` # Core pipeline ![](CorePipeline.png) ## (Optional) Decomposing gene sets into coexpressed subsets Gene sets often contain genes associated with a specific biological function, but their expression can be heterogeneous, sometimes even exhibiting opposing directions when the function is altered (e.g., in disease). The `dissectDB()` function clusters the genes within each set into subsets based on coordinated expression patterns. This clustering can be performed using one or more molecular datasets, which may include control samples or not. The same dataset used in the following sections can also be used for this purpose. Both preloaded and custom gene sets can be used (see the next section for details). The following code splits KEGG pathways into subpathways using the example dataset: ```{r, message = FALSE, results='hide', warning=FALSE} library(pathMED) data(pathMEDExampleData) customKEGG <- dissectDB(list(pathMEDExampleData), geneSets = "kegg") ``` The output is a gene set that can be used with the `getScores()` function. In the new gene set, subpathways are labeled with the suffix \_splitx. For example, for the pathway hsa04714: ```{r, message = FALSE, warning=FALSE} # Before splitting data(genesetsData) print(head(genesetsData[["kegg"]][["hsa04714"]])) # After splitting print(customKEGG[grep("hsa04714", names(customKEGG))]) ``` Only the features available in the input dataset are included in this object. ## Scoring The `getScores()` function calculates a score for each gene set and sample. It takes as input a matrix or data frame with molecular features in rows and samples in columns, along with a gene set collection. The function supports 20 different scoring methods, which can be specified using the `method` parameter. The available methods and their references are: - M-score (@toroDominguez2022) - GSVA (@hanzelman2013) - ssGSEA (@barbie2009) - singscore (@foroutan2018) - Plage (@tomfohr2005) - Z-score (@lee2018) - AUCell (@aibar2017) - decoupleR methods (MDT, MLM, ORA, UDT, ULM, WMEAN, norm_WMEAN, corr_WMEAN, WSUM, norm_WSUM and corr_WSUM) (@badia2022) - FGSEA and norm_FGSEA (@korotkevich2021) Gene sets can be provided as a named list, where each entry contains the genes associated with a specific gene set, using the `geneSets` parameter. Alternatively, preloaded databases can be used by specifying their name in the `geneSets` parameter. - Reactome ("reactome") - KEGG ("kegg") - Transcriptional modules associated with immune functions ("tmod") - Gene Ontology BP, MF and CC ("go_bp", "go_mf", "go_cc") - Disgenet ("disgenet") - Human Phenotype Ontology ("hpo") - WikiPathways ("wikipathways") - PharmGKB ("pharmgkb") - LINCS ("lincs") - Comparative Toxicogenomics Database ("ctd") Note that these preloaded gene sets are built with gene symbols. Therefore, the row names of the input data should be gene symbols to use them. If running the script on a multi-core processor, you can use the `cores` parameter to specify the number of workers (most `pathMED` functions include this parameter). Additionally, specific parameters for the scoring functions can be customized (see, for example, `?GSVA::gsvaParam`). To run this function with the example data: ```{r, message = FALSE, warning = FALSE} library(pathMED) data(pathMEDExampleData) scoresExample <- getScores(pathMEDExampleData, geneSets = "kegg", method = "Z-score") print(scoresExample[1:5, 1:5]) ``` The result is a matrix with gene set identifiers as rows and samples as columns. To annotate the identifiers with their corresponding descriptions, you can use the `ann2term()` function. ```{r, message = FALSE} annotatedPathways <- ann2term(scoresExample) head(annotatedPathways) ``` ## Training ML models The next step in the core `pathMED` pipeline is to use the scores calculated in the previous step to train machine learning (ML) models for predicting categorical or numerical clinical variables. The `trainModel()` function fits and tests multiple algorithms, selecting the one with the best performance for predicting a variable. The input objects are `inputData` (the scores matrix obtained in the previous step, or any other numerical matrix), and `metadata` (a data frame containing sample metadata). The `var2predict` parameter indicates the column name of metadata with the variable to be predicted. The function `methodsML()` should be used to prepare a list of ML algorithms to be trained and tested. One or more algorithms can be selected using the `algorithms` parameter (see the function help for details), and the amount of hyperparameters combinations to be tested is specified with the `tuneLength` parameter. Nested cross-validation is performed using `Koutter` folds for the outer cross-validation (CV) and `Kinner` folds for the inner CV. The inner k-fold CV can be repeated multiple times (`repeatsCV` parameter) for hyperparameter tuning. In the example dataset, the metadata include a column called `Response`, indicating whether patients responded to a treatment. The `positiveClass` parameter can be specified to calculate performance metrics based on the chosen positive class (e.g., diseased). ```{r, message=FALSE} data(pathMEDExampleMetadata) modelsList <- methodsML(algorithms = c("rf", "knn"), outcomeClass = "character") set.seed(123) trainedModel <- trainModel(scoresExample, metadata = pathMEDExampleMetadata, var2predict = "Response", positiveClass = "YES", models = modelsList, Koutter = 2, Kinner = 2, repeatsCV = 1 ) print(trainedModel) ``` The output object is a list with the following elements: - **model**: The selected algorithm trained with the entire dataset. - **stats**: Performance metrics for each tested algorithms. These metrics are calculated with the testing samples of the outter CV. - **bestTune**: Optimized hyperparameters for the selected algorithm. - **subsample.preds**: The prediction probability for each class and sample. This object may be used with the `predictExternal()` function to make predictions on new data with the same features than the training data. ```{r, message=FALSE} data(refData) scoresExternal <- getScores(refData$dataset1, geneSets = "kegg", method = "Z-score" ) predictions <- predictExternal(scoresExternal, trainedModel) head(predictions) ``` # Using M-scores to extract disease-relevant information In addition to previously developed molecular scoring methods, pathMED includes the novel M-score method. M-scores were used in a previous study that used personalized scoring to predict drug response, symptoms and disease progression in Systemic Lupus Erythematosus patients based on gene expression data @toroDominguez2022. M-scores are calculated by comparing pathway activities between cases and healthy controls, requiring the availability of control sample data. To use M-scores with the `getScores()` function, the `labels` parameter must be used to indicate control samples (using 0 or "Healthy"). As described previously @toroDominguez2022, M-scores can be used to identify key pathways associated with the studied disease. To achieve this, a reference dataset of patients and healthy controls must be constructed, either from a single dataset or by integrating multiple datasets, following the steps outlined below. This reference will also enable the imputation of M-scores in new cohorts or individual patients where healthy controls are not available. ## Build the reference data object We will load `refData` into the environment, which includes four log2-normalized gene expression datasets and four corresponding metadata datasets containing clinical information. These datasets were extracted from NCBI GEO and contain both lupus and healthy samples, encoded in the metadata as "Healthy_sample" and "Disease_sample". The `buildRefObject()` function can be used to create a `refObject` object with the required structure for the following steps. The function require: - A list of expression datasets (`data`) or a single dataset if only one is available. - A list of metadata files (one per dataset) or a single metadata table containing information for all samples. - The name of the grouping column in the metadata (`groupVar`). - The control group label (`controlGroup`), typically representing healthy samples. If different datasets use different control labels, a separate value should be provided for each metadata file. ```{r, message = FALSE, results='hide', warning=FALSE} data(refData) refObject <- buildRefObject( data = list( refData$dataset1, refData$dataset2, refData$dataset3, refData$dataset4 ), metadata = list( refData$metadata1, refData$metadata2, refData$metadata3, refData$metadata4 ), groupVar = "group", controlGroup = "Healthy_sample" ) ``` To create the `refObject` object, it is recommended to include as many datasets as possible to improve robustness. Ensure that all gene expression datasets are log2 transformed and follow a normal distribution. Additionally, gene names must be annotated as Gene Symbols in both reference and test datasets. ## Calculate the reference M-scores After creating a `refObject` object, the next step is to calculate M-scores for the reference datasets with the `mScores_createReference()` function. ```{r, eval=TRUE, warning=FALSE, message=FALSE, results='hide'} refMscores <- mScores_createReference(refObject, geneSets = "tmod", cores = 1 ) ``` ## Identify key gene sets To identify the most relevant gene sets for the studied phenotype, you can use the `mScores_filterPaths()` function. This function selects pathways based on two key parameters: `perc_samples`, which defines the percentage of samples in which a pathway must be statistically significant, and `min_datasets`, which sets the minimum number of datasets that must meet this th These parameters help identify pathways that may be significant in small patient subgroups—a common scenario in heterogeneous diseases—while also ensuring that the selected pathways appear deregulated across multiple studies, reducing the risk of study-specific biases. The significance of each pathway in each sample is determined using its M-score, as described in the original study @toroDominguez2022. ```{r, warning=FALSE} relevantPaths <- mScores_filterPaths( MRef = refMscores, min_datasets = 3, perc_samples = 10 ) ``` ## Calculate the M-scores without healthy controls Once the reference data is prepared and the relevant gene sets are selected, you can calculate M-scores for datasets without healthy controls using the `mScores_imputeFromReference()` function. The function requires a matrix containing case samples, gene sets (or the output of the `mScores_filterPaths()` function), and the reference data generated by `mScores_createReference()`. M-scores are computed using the **k** most similar samples from the reference data, with **k** specified through the `nk` parameter. The `distance.threshold` parameter defines the maximum allowable Euclidean distance between a sample and the reference for M-score imputation. The output is a list containing a matrix of M-scores, where gene sets are in rows and samples in columns, along with a dataframe of calculated distances between each sample and the reference. ```{r, warning=FALSE} mScoresExample <- mScores_imputeFromReference( inputData = pathMEDExampleData, geneSets = relevantPaths, externalReference = refMscores, distance.threshold = 50 ) print(mScoresExample$Mscores[1:5, 1:5]) print(mScoresExample$Distances[1:5, ]) ``` # Session info ```{r sessionInfo, echo=FALSE} sessionInfo() ``` # References