---
title: "ClustSIGNAL tutorial"
date: "`r Sys.Date()`"
author:
- name: Pratibha Panwar
  affiliation:
  - School of Mathematics and Statistics, The University of Sydney, NSW 2006, Australia;
  - Sydney Precision Data Science Centre, University of Sydney, NSW 2006, Australia;
  - Charles Perkins Centre, The University of Sydney, NSW 2006, Australia
- name: Boyi Guo
  affiliation:
  - Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, MD, USA
- name: Haowen Zhou
  affiliation:
  - Bioinformatics and Systems Biology Graduate Program, University of California San Diego, La Jolla, CA, USA
- name: Stephanie Hicks
  affiliation:
  - Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, MD, USA;
  - Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD, USA;
  - Center for Computational Biology, Johns Hopkins University, Baltimore, MD, USA;
  - Malone Center for Engineering in Healthcare, Johns Hopkins University, MD, USA
- name: Shila Ghazanfar
  affiliation:
  - School of Mathematics and Statistics, The University of Sydney, NSW 2006, Australia;
  - Sydney Precision Data Science Centre, University of Sydney, NSW 2006, Australia;
  - Charles Perkins Centre, The University of Sydney, NSW 2006, Australia
output:
  BiocStyle::html_document:
    toc_float: true
  BiocStyle::pdf_document: default
package: clustSIGNAL
vignette: |
  %\VignetteIndexEntry{ClustSIGNAL tutorial}
  %\VignetteEncoding{UTF-8}
  %\VignetteEngine{knitr::rmarkdown}
editor_options:
  markdown:
    wrap: 80
---

```{r, include = FALSE}
knitr::opts_chunk$set(collapse = FALSE, warning = FALSE)
```

# ClustSIGNAL

The R package ClustSIGNAL performs spatially-informed cell type clustering on
high-resolution spatial transcriptomics data. It uses both the gene expression
and spatial locations of cells to group them into clusters.

## Motivation

ClustSIGNAL aims to: (i) overcome data sparsity using an adaptive smoothing
approach that is guided by the heterogeneity/homogeneity of each individual
cell's neighbourhood; (ii) embed spatial context information into the gene
expression, generating a transformed, adaptively smoothed expression matrix that
can be used for clustering; and (iii) generate entropy data that captures the
heterogeneity/homogeneity information from each cell's neighbourhood and can be
used to create a spatial map of heterogeneity distribution in a sample tissue.

## Overview

In this vignette, we demonstrate how spatially-informed clustering can be
performed with ClustSIGNAL, assessing the clusters using pre-defined metrics
like
[adjusted Rand index (ARI)](https://www.rdocumentation.org/packages/aricode/versions/0.1.1/topics/ARI)
and
[normalized mutual information (NMI)](https://www.rdocumentation.org/packages/aricode/versions/0.1.1/topics/NMI)
from the
[aricode](https://jmlr.csail.mit.edu/papers/volume11/vinh10a/vinh10a.pdf) R
package, as well as spatial plots to visualize them. ClustSIGNAL is a
multisample spatial clustering approach, and we show this using an example
dataset. We also display the use of entropy values, which are generated as part
of the ClustSIGNAL process, in understanding the tissue structure of a sample.

ClustSIGNAL is very flexible in that it allows for (i) user-provided input
values for most parameters (default parameter values are also provided) and
(ii) running ClustSIGNAL step-by-step. This tutorial demonstrates how the
step-by-step clustering can be performed, and what parameters need to be defined
at each step.

```{r load_packages, message = FALSE, warning = FALSE}
# load required packages
library(clustSIGNAL)
library(scater)
library(ggplot2)
library(dplyr)
library(patchwork)
library(aricode)
```

# Single sample analysis with ClustSIGNAL

In this section, we use the seqFISH mouse embryo dataset from
[Lohoff et al., 2021](https://www.nature.com/articles/s41587-021-01006-2),
which contains spatial transcriptomics data from 3 mouse embryos, with 351 genes
and 57,536 cells. For this vignette, we have subset the data by randomly
selecting 5000 cells from Embryo 2, excluding cells that had been manually
annotated as 'Low quality'.

We begin by creating a SpatialExperiment object from the gene expression and
cell information in the data subset, ensuring that the spatial coordinates are
stored in spatialCoords within the SpatialExperiment object. If the data are
already in a SpatialExperiment object, ClustSIGNAL can be run as long as basic
requirements like spatial coordinates, normalized counts, and unique cell names
are met.

```{r embryo_data_prep}
# load me_expr containing gene expression logcounts
# load me_data containing cell metadata including x-y coordinates
data(mEmbryo2)

# to create a SpatialExperiment object we need gene expression, cell metadata,
# and cell locations.
spe <- SpatialExperiment::SpatialExperiment(
    assays = list(logcounts = me_expr), colData = me_data,
    # spatialCoordsNames requires column names in me_data that contain
    # xy-coordinates of cells
    spatialCoordsNames = c("X", "Y"))
spe
```

For running ClustSIGNAL, we need to know the column name in the colData slot of
the SpatialExperiment object that contains the sample labels. Here, the sample
labels are in the 'sample_id' column.

```{r embryo_data_columns}
spe |> colData() |> colnames() # column names in the metadata
```

## Running ClustSIGNAL on one sample

The simplest ClustSIGNAL run requires a SpatialExperiment object, the colData
column name of sample labels, and the type of output to generate.
Other parameters that can be modified include: (i) dimRed - specifies the low
dimension data to use (default 'None'); (ii) batch - when TRUE, ClustSIGNAL
performs batch correction and needs a valid value for batch_by; (iii) batch_by -
name of the metadata column containing sample batches contributing to batch
effect (default 'None'); (iv) NN - specifies the neighbourhood size (default
30); (v) kernel - specifies the distribution to use for weight generation
(default 'G' for Gaussian); (vi) spread - the distribution spread value (default
0.3 for Gaussian); (vii) sort - when TRUE, ClustSIGNAL sorts the neighbourhood;
(viii) threads - specifies the number of CPUs to use in parallel runs (default
1); and (ix) clustParams - list of parameters to use for non-spatial clustering
components.

Furthermore, the adaptively smoothed gene expression data generated by
ClustSIGNAL could be useful for other downstream analyses and is accessible if
the output options 's' or 'a' are selected to return the final SpatialExperiment
object.

```{r ClustSIGNAL_singleRun}
set.seed(100)
samples <- "sample_id" # column name containing sample names
# running ClustSIGNAL requires a SpatialExperiment object, the column name of
# sample labels in the colData slot, and the output type to generate (clusters,
# neighbours, and/or final spe object).
res_emb <- clustSIGNAL(spe, samples, outputs = "a")
```

This returns a list that contains a ClustSIGNAL clusters dataframe (clusters), a
matrix of cell IDs from each cell's neighbourhood (neighbours with NN
neighbourhood size), and a final SpatialExperiment object (spe_final).

```{r embryo_result_list}
res_emb |> names() # names of the outputs generated
```

The cluster dataframe contains cell IDs and their cluster labels assigned by
ClustSIGNAL.

```{r embryo_clusters_head}
res_emb$clusters |> head() # cluster data frame has cell IDs and cluster labels
```

The output SpatialExperiment object contains the adaptively smoothed gene
expression data as an additional assay (smoothed), as well as initial clusters
and subclusters, entropy values, and ClustSIGNAL clusters.

```{r embryo_final_spe}
# for convenience with downstream analyses, we will replace the original spe
# object with the one generated by ClustSIGNAL. This does not lead to any loss
# of information as ClustSIGNAL only adds information to the input spe object.
spe <- res_emb$spe_final
spe
spe |> colData() |> colnames()
```

## Visualising ClustSIGNAL clusters

We use spatial coordinates of cells and their ClustSIGNAL cluster labels and
entropy values to visualize the clustering output.

```{r colors}
colors <- c("#635547", "#8EC792", "#9e6762", "#FACB12", "#3F84AA", "#0F4A9C",
            "#ff891c", "#EF5A9D", "#C594BF", "#DFCDE4", "#139992", "#65A83E",
            "#8DB5CE", "#005579", "#C9EBFB", "#B51D8D", "#532C8A", "#8870ad",
            "#cc7818", "#FBBE92", "#EF4E22", "#f9decf", "#c9a997", "#C72228",
            "#f79083", "#F397C0", "#DABE99", "#c19f70", "#354E23", "#C3C388",
            "#647a4f", "#CDE088", "#f7f79e", "#F6BFCB", "#7F6874", "#989898",
            "#1A1A1A", "#FFFFFF", "#e6e6e6", "#77441B", "#F90026", "#A10037",
            "#DA5921", "#E1C239", "#9DD84A")
```

```{r embryo_spatialPlots1}
# for plotting with scater R package, we need to add the spatial coordinates
# to the reduced dimension slot of the spe object
reducedDim(spe, "spatial") <- spatialCoords(spe)
```

```{r embryo_spatialPlots2}
# spatial plot
spt_clust <- scater::plotReducedDim(
    spe, colour_by = "ClustSIGNAL", dimred = "spatial", point_alpha = 1,
    point_size = 4, scattermore = TRUE) +
    ggtitle("A. Spatial plot of clusters") +
    scale_color_manual(values = colors) +
    guides(colour = guide_legend(title = "Clusters",
                                 override.aes = list(size = 5))) +
    theme(text = element_text(size = 12))
```

```{r embryo_spatialPlots3}
# entropy distribution plotted at cluster-level can indicate which clusters
# have cells from homogeneous/heterogeneous space.
df_met <- spe |> colData() %>% as.data.frame()
ct_ent <- df_met %>%
    mutate(ClustSIGNAL = as.character(ClustSIGNAL)) %>%
    group_by(ClustSIGNAL) %>%
    # calculating median entropy of each cluster category
    summarise(mdEntropy = median(entropy)) %>%
    # reordering clusters by their median entropy value
    arrange(mdEntropy)
df_met$ClustSIGNAL <- factor(df_met$ClustSIGNAL, levels = ct_ent$ClustSIGNAL)
col_ent <- colors[as.numeric(as.character(ct_ent$ClustSIGNAL))]
box_clust <- df_met %>%
    ggplot(aes(x = ClustSIGNAL, y = entropy, fill = ClustSIGNAL)) +
    geom_boxplot() +
    scale_fill_manual(values = col_ent) +
    ggtitle("B. Entropy distribution of clusters") +
    labs(x = "ClustSIGNAL clusters", y = "Entropy", name = "Clusters") +
    theme_classic() +
    theme(legend.position = "none",
          text = element_text(size = 12),
          axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1),
          plot.title = element_text(face = "bold"))
```

```{r embryo_spatialPlots4}
spt_clust + box_clust +
    patchwork::plot_layout(guides = "collect", widths = c(2, 3))
```

The spatial location and entropy distribution of the clusters provide spatial
context of the cells and their neighbourhoods, as well as the compositions of
the neighbourhoods. For example, in panel (B) the low entropy clusters are
generally found in space that is more homogeneous, whereas the high entropy
clusters belong to neighbourhoods that have more cell diversity. This can also
be visualized in the spatial plot in panel (A).

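To make the link between neighbourhood composition and entropy concrete, here is
a minimal sketch, assuming Shannon entropy (log base 2) computed on the
proportions of cell labels in a neighbourhood. The helper `shannon_entropy()`
and the toy proportions are hypothetical and not part of clustSIGNAL; the
package's exact calculation may differ.

```r
# Hypothetical helper (not a clustSIGNAL function): Shannon entropy of a vector
# of label proportions, assuming log base 2.
shannon_entropy <- function(props, base = 2) {
    p <- props[props > 0] # zero proportions contribute nothing to the sum
    -sum(p * log(p, base = base))
}

# toy neighbourhoods described by the proportions of three labels
homogeneous <- c(a = 1, b = 0, c = 0)       # all neighbours share one label
mixed <- c(a = 1 / 3, b = 1 / 3, c = 1 / 3) # labels are evenly mixed

shannon_entropy(homogeneous) # zero entropy for a completely homogeneous space
shannon_entropy(mixed)       # maximal entropy for three labels, log2(3)
```

Under this assumption, a neighbourhood made up of a single label has an entropy
of 0, and the value grows as labels become more evenly mixed.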
## Assessing clustering accuracy

We assess the clustering accuracy of ClustSIGNAL using the commonly used
clustering metrics ARI and NMI, which are usable only when prior cell
annotations are available. Here, ARI and NMI measure the similarity or agreement
(respectively) between cluster labels obtained from ClustSIGNAL and manual cell
annotations.

```{r embryo_clusterMetrics}
# to assess the accuracy of clustering, the cluster labels are often compared to
# prior annotations. Here, we compare ClustSIGNAL cluster labels to annotations
# available with this public data.
spe |> colData() %>%
    as.data.frame() %>%
    summarise(
        ARI = aricode::ARI(celltype_mapped_refined, ClustSIGNAL), # calculate ARI
        NMI = aricode::NMI(celltype_mapped_refined, ClustSIGNAL)) # calculate NMI
```

## Entropy spread and distribution

The entropy values generated through the ClustSIGNAL process can be useful in
analyzing the sample structure.

```{r embryo_entropyMetrics}
# we can assess the overall entropy distribution of the dataset
spe |> colData() %>%
    as.data.frame() %>%
    summarise(min_Entropy = min(entropy),
              min_Entropy_count = sum(entropy == 0),
              max_Entropy = max(entropy),
              mean_Entropy = mean(entropy))
```

The entropy range can indicate whether the tissue sample contains any
homogeneous regions. For example, a min_Entropy of 0 means that some cells are
placed in completely homogeneous space when looking at a neighbourhood size of
30 cells (NN = 30 was used for generating the entropy values). The
min_Entropy_count gives us an idea of the total number of such low entropy
neighbourhoods in the sample.

```{r entropyPlots1}
# we can also visualize the distribution and spread of the entropy values
hst_ent <- spe |> colData() %>%
    as.data.frame() %>%
    ggplot(aes(entropy)) +
    geom_histogram(binwidth = 0.05) +
    ggtitle("A. Entropy spread") +
    labs(x = "Entropy", y = "Number of neighbourhoods") +
    theme_classic() +
    theme(text = element_text(size = 12),
          plot.title = element_text(face = "bold"))
```

```{r entropyPlots2}
spt_ent <- scater::plotReducedDim(spe, colour_by = "entropy",
                                  # specify spatial low dimension
                                  dimred = "spatial", point_alpha = 1,
                                  point_size = 4, scattermore = TRUE) +
    ggtitle("B. Entropy spatial distribution") +
    scale_colour_gradient2("Entropy", low = "grey", high = "blue") +
    scale_size_continuous(range = c(0, max(spe$entropy))) +
    theme(text = element_text(size = 12))
```

```{r entropyPlots3}
hst_ent + spt_ent
```

The spread and spatial distribution of neighbourhood entropies can be useful in
visually assessing and comparing tissue compositions in samples - low entropy
neighbourhoods are more homogeneous and likely contain cell type-specific
niches, whereas high entropy neighbourhoods are heterogeneous with more uniform
distribution of different cell types.

# Multisample analysis with ClustSIGNAL

Here, we use the MERFISH mouse hypothalamus preoptic region dataset from
[Moffitt et al., 2018](https://www.science.org/doi/10.1126/science.aau5324),
which contains spatial transcriptomics data from 181 samples, with 155 genes and
1,027,080 cells. For this vignette, we have subset the data by selecting 6000
random cells from only 3 samples - Animal 1 Bregma -0.09 (2080 cells), Animal 7
Bregma 0.16 (1936 cells), and Animal 7 Bregma -0.09 (1984 cells), excluding
cells that were manually annotated as 'Ambiguous' and 20 genes for which
expression was generated using a different technology.

We start the analysis by creating a SpatialExperiment object from the gene
expression and cell information in the data subset, ensuring that the spatial
coordinates are stored in the spatialCoords slot within the spe object.

```{r hypothal_data_prep}
# load mh_expr containing gene expression logcounts
# load mh_data containing cell metadata and cell x-y coordinates
data(mHypothal)

# create spe object using gene expression, cell metadata, and cell locations
spe2 <- SpatialExperiment(assays = list(logcounts = mh_expr),
                          colData = mh_data,
                          # spatialCoordsNames requires column names in
                          # mh_data that contain xy-coordinates of cells
                          spatialCoordsNames = c("X", "Y"))
spe2
```

Next, we identify the sample labels column in the SpatialExperiment object.

```{r hypothal_data_columns}
spe2 |> colData() |> str() # metadata summary
```

Here, the sample labels are in the 'samples' column of the object.

## ClustSIGNAL run

An important concept to take into account when running multisample analysis is
batch effects. When gathering samples from different sources or through
different technologies/procedures, some technical batch effects might be
introduced into the dataset. We can run ClustSIGNAL in batch correction mode
simply by setting batch = TRUE and batch_by = "group", where group will be the
name of the colData column of the spe object that contains the batch
information. ClustSIGNAL then uses
[harmony](https://portals.broadinstitute.org/harmony/) internally for batch
correction.

```{r ClustSIGNAL_multiRun}
set.seed(110)
# ClustSIGNAL can be run on a dataset with multiple samples. As before, we need
# the SpatialExperiment object and column name of sample labels in the object.
# The method can be run in parallel through the threads option. Here we use
# threads = 4 to use 4 cores.
# Since no batch effects were observed in this data subset, we have not used
# the batch and batch_by options.
samples <- "samples" # column name containing sample names
res_hyp <- clustSIGNAL(spe2, samples, threads = 4, outputs = "a")
```

```{r hypothal_final_spe}
# for convenience with downstream analyses, we replace the original spe object
# with the one generated by ClustSIGNAL.
spe2 <- res_hyp$spe_final
spe2
```

## Clustering metrics

Clustering and entropy results can be calculated and visualized for each sample.

```{r hypothal_samples}
samplesList <- spe2[[samples]] |> levels() # get sample names
samplesList
```

```{r hypothal_clusterMetrics}
spe2 |> colData() %>%
    as.data.frame() %>%
    group_by(samples) %>%
    summarise(
        # Comparing ClustSIGNAL cluster labels to annotations available with the
        # public data to assess its accuracy.
        ARI = aricode::ARI(Cell_class, ClustSIGNAL),
        NMI = aricode::NMI(Cell_class, ClustSIGNAL),
        # Assessing the overall entropy distribution of the samples in the
        # dataset.
        min_Entropy = min(entropy),
        min_Entropy_count = sum(entropy == 0),
        max_Entropy = max(entropy),
        mean_Entropy = mean(entropy))
```

As before, the entropy range can tell us a lot about the tissue structure of the
samples. Unlike the seqFISH subset data, where the minimum entropy of the sample
was 0, here the minimum entropy is higher, indicating that the tissue doesn't
really have any cell type-specific niches when looking at a neighbourhood size
of 30 cells. Moreover, the relatively high mean entropy value indicates that the
tissue slices are quite heterogeneous.

## Visualizing ClustSIGNAL clusters

ClustSIGNAL performs clustering on all cells in the dataset in one run, thereby
generating the same clusters across multiple samples. The cluster labels do not
need to be mapped between samples. For example, cluster 1 represents the same
cell type in all three samples, without needing explicit mapping between
samples.

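Because the labels are shared across samples, a sample-by-cluster count table is
a quick way to see which clusters occur in which samples. The following is a
minimal sketch with made-up labels; the toy vectors are hypothetical, and with
the real object one would tabulate the `samples` and `ClustSIGNAL` columns of
the cell metadata instead.

```r
# Hypothetical toy labels standing in for the sample and cluster columns
toy_samples <- c("s1", "s1", "s1", "s2", "s2", "s3", "s3", "s3")
toy_clusters <- c("1", "2", "2", "1", "1", "2", "3", "3")

# cross-tabulate samples (rows) against cluster labels (columns)
tab <- table(sample = toy_samples, cluster = toy_clusters)
tab

# clusters with a zero count are not represented in that sample
colnames(tab)[tab["s2", ] == 0]
```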
```{r hypothal_spatialPlots1}
# for plotting with scater R package, we need to add the spatial coordinates
# to the reduced dimension section
reducedDim(spe2, "spatial") <- spatialCoords(spe2)
```

```{r hypothal_spatialPlots2}
# spatial plot - ClustSIGNAL clusters
spt_clust2 <- scater::plotReducedDim(spe2, colour_by = "ClustSIGNAL",
                                     # specify spatial low dimension
                                     dimred = "spatial", point_alpha = 1,
                                     point_size = 4, scattermore = TRUE) +
    scale_color_manual(values = colors) +
    facet_wrap(vars(spe2[[samples]]), scales = "free", nrow = 1) +
    guides(colour = guide_legend(title = "Clusters",
                                 override.aes = list(size = 3))) +
    theme(text = element_text(size = 12))
```

```{r hypothal_spatialPlots3}
# For visualising cluster-level entropy distribution, we reorder the clusters
# by their median entropy value in each sample
df_met2 <- spe2 |> colData() %>% as.data.frame()
box_clust2 <- list()
for (s in samplesList) {
    df_met_sub <- df_met2[df_met2[[samples]] == s, ]
    # calculating median entropy of each cluster in a sample
    ct_ent2 <- df_met_sub %>%
        mutate(ClustSIGNAL = as.character(ClustSIGNAL)) %>%
        group_by(ClustSIGNAL) %>%
        summarise(mdEntropy = median(entropy)) %>%
        # reordering clusters by their median entropy
        arrange(mdEntropy)
    df_met_sub$ClustSIGNAL <- factor(df_met_sub$ClustSIGNAL,
                                     levels = ct_ent2$ClustSIGNAL)
    # box plot of cluster entropy
    col_ent2 <- colors[as.numeric(ct_ent2$ClustSIGNAL)]
    box_clust2[[s]] <- df_met_sub %>%
        ggplot(aes(x = ClustSIGNAL, y = entropy, fill = ClustSIGNAL)) +
        geom_boxplot() +
        scale_fill_manual(values = col_ent2) +
        facet_wrap(vars(samples), nrow = 1) +
        labs(x = "ClustSIGNAL clusters", y = "Entropy") +
        ylim(0, NA) +
        theme_classic() +
        theme(strip.text = element_blank(),
              legend.position = "none",
              text = element_text(size = 12),
              axis.text.x = element_text(angle = 90, vjust = 0.5))
}
```

```{r hypothal_spatialPlots4}
spt_clust2 / (patchwork::wrap_plots(box_clust2[1:3], nrow = 1) +
                  plot_layout(axes = "collect")) +
    plot_layout(guides = "collect", heights = c(5, 3)) +
    plot_annotation(
        title = "Spatial (top) and entropy (bottom) distributions of clusters",
        theme = theme(plot.title = element_text(hjust = 0.5, face = "bold")))
```

The spatial location and entropy distribution of the clusters can be compared in
a multisample analysis, providing spatial context of the cluster cells and their
neighbourhood compositions in the different samples within the dataset. Since
the clusters were generated in a single run, they are the same across the
different samples. Therefore, if a cluster is not represented in a sample, this
would mean that its respective cell type is not present in that sample.

## Visualising entropy spread and distribution

In multisample analysis, tissue structure of the different samples in the
dataset can be compared using the spread and spatial distribution of the
neighbourhood entropy measures.

```{r hypothal_entropyPlots1}
hst_ent2 <- spe2 |> colData() %>%
    as.data.frame() %>%
    ggplot(aes(entropy)) +
    geom_histogram(binwidth = 0.05) +
    facet_wrap(vars(samples), nrow = 1) +
    labs(x = "Entropy", y = "Number of neighbourhoods") +
    theme_classic() +
    theme(text = element_text(size = 12))
```

```{r hypothal_entropyPlots2}
spt_ent2 <- scater::plotReducedDim(spe2, colour_by = "entropy",
                                   # specify spatial low dimension
                                   dimred = "spatial", point_alpha = 1,
                                   point_size = 4, scattermore = TRUE) +
    scale_colour_gradient2("Entropy", low = "grey", high = "blue") +
    scale_size_continuous(range = c(0, max(spe2$entropy))) +
    facet_wrap(vars(spe2[[samples]]), scales = "free", nrow = 1) +
    theme(strip.text = element_blank(),
          text = element_text(size = 12))
```

```{r hypothal_entropyPlots3}
hst_ent2 / spt_ent2 +
    plot_layout(heights = c(4, 5)) +
    plot_annotation(
        title = "Entropy spread (top) and spatial distribution (bottom)",
        theme = theme(plot.title = element_text(hjust = 0.5, face = "bold")))
```

Together, these plots help in visually assessing tissue compositions of the
samples - all 3 samples have high entropy neighbourhoods, indicating that they
mainly have heterogeneous regions with uniform distribution of different cell
types.

# ClustSIGNAL step-by-step run

ClustSIGNAL has five main functions, one for each distinct step in its
algorithm. These functions are accessible and can be run sequentially to
generate data from intermediate steps, if needed. For example, ClustSIGNAL can
be run step-by-step up to the entropy measurement component, without having to
run the complete method. The entropy values will be added to the
SpatialExperiment object and can be used for assessing tissue structure in terms
of its heterogeneity. Similarly, the adaptively smoothed gene expression can be
obtained by running ClustSIGNAL up to the adaptive smoothing step.

Here, we describe how individual ClustSIGNAL functions can be used sequentially.

```{r ClustSIGNALseq_data}
# load logcounts and metadata to the environment
data(mEmbryo2)
# as before, we read the data into a SpatialExperiment object
spe <- SpatialExperiment(assays = list(logcounts = me_expr),
                         colData = me_data,
                         spatialCoordsNames = c("X", "Y"))
```

```{r ClustSIGNALseq_prep}
set.seed(100)
# first we need to generate low dimension data for initial clustering
spe <- scater::runPCA(spe)
```

## Step 1: Initial clustering and subclustering

The first step in the ClustSIGNAL algorithm is initial clustering and
subclustering. For this, we need to provide a spe object with low-dimensional
embedding information. Other parameters have default values: batch = FALSE and
batch_by = "None" (if no batch correction needs to be performed), threads = 1,
clustParams = list(clust_c = 0, subclust_c = 0, iter.max = 30, k = 10,
cluster.fun = "louvain"). Among the clustering parameters, clust_c and
subclust_c refer to the number of centers to use for clustering and
sub-clustering with KmeansParam. By default clust_c is set to 0, in which case
the method uses either 5000 centers or 1/5th of the total cells in the data as
the number of centers, whichever is lower.
Similarly, subclust_c is set to 0 by default, in which case the method uses
either 1 center or half of the total cells in the initial cluster as the number
of centers, whichever is higher. For all other values of clust_c and subclust_c,
the input is treated as the number of centers.

```{r ClustSIGNALseq_step1}
spe <- clustSIGNAL::p1_clustering(spe, dimRed = "PCA")
```

Here, two columns are added to the spe object under the cell metadata: (i) the
initial cluster labels,

```{r ClustSIGNALseq_step1_out1}
spe$initCluster |> head() # clustering output
```

(ii) the initial subcluster labels.

```{r ClustSIGNALseq_step1_out2}
spe$initSubcluster |> head() # subclustering output
```

## Step 2: Neighbourhood detection

The next step involves detecting the neighbourhood of all cells. We need the spe
object containing the initial cluster and initial subcluster labels and sample
IDs for this. By default, ClustSIGNAL identifies 30 nearest neighbours (NN =
30), sorts the neighbourhood (sort = TRUE), and does not use parallel runs
(threads = 1).

ClustSIGNAL allows the use of external cell labels generated through other
methods, in place of the initial clusters and subclusters. For this, the cluster
and subcluster labels of each cell must be stored in the colData of the spe
object as "initCluster" and "initSubcluster", respectively.

```{r ClustSIGNALseq_step2}
# This step generates a list of neighbourhood information.
outReg <- clustSIGNAL::neighbourDetect(spe, samples = "sample_id")
```

This generates a list containing: (i) a neighbourhood matrix containing cell
IDs,

```{r ClustSIGNALseq_step2_out1}
outReg$nnCells[1:3, 1:3]
```

(ii) a list of arrays containing initial subcluster proportions.

```{r ClustSIGNALseq_step2_out2}
outReg$regXclust[[1]]
```

## Step 3: Entropy measure

Now that we know the neighbourhood of each cell, we can calculate the entropy of
each cell's neighbourhood. For this, we need the spe object and initial
subcluster proportions.
This step can run in parallel, but by default we use 1 CPU core.

```{r ClustSIGNALseq_step3}
spe <- clustSIGNAL::entropyMeasure(spe, outReg$regXclust)
```

The entropy values are added to the spe object under cell metadata.

```{r ClustSIGNALseq_step3_out}
spe$entropy |> head() # entropy values
```

## Step 4: Adaptive smoothing

Using the entropy values, we can perform adaptive smoothing. This requires the
spe object containing the entropy values as well as the neighbourhood matrix of
cell IDs generated during neighbourhood detection. Other parameters for which
default values are provided include the number of neighbours (NN = 30), weight
distribution type (kernel = "G" for Gaussian), distribution spread (spread =
0.05 representing standard deviation for Gaussian distribution; for exponential
distribution we recommend using a spread of 5 indicating rate of the
distribution), and number of cores (threads = 1) to use for parallel runs.

```{r ClustSIGNALseq_step4}
spe <- clustSIGNAL::adaptiveSmoothing(spe, outReg$nnCells)
```

The adaptively smoothed gene expression data are added to the spe object under
assays as 'smoothed'.

```{r ClustSIGNALseq_step4_out}
assay(spe, "smoothed")[1:5, 1:3]
```

## Step 5: Final clustering

The final step involves performing clustering on the adaptively smoothed data.
We only need to provide the spe object containing the adaptively smoothed data.
This step has the same default clustering and batch correction parameters as the
initial clustering in the first step.

```{r ClustSIGNALseq_step5}
spe <- clustSIGNAL::p2_clustering(spe)
```

Cluster labels are added to the colData of the spe object under a ClustSIGNAL
column.

```{r ClustSIGNALseq_step5_out}
spe$ClustSIGNAL |> head() # ClustSIGNAL cluster labels
```
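
The default centre counts described in Step 1 can be sketched in base R as
follows. This is only an illustration of the stated rule: the helper names are
hypothetical, and rounding fractional counts up is an assumption rather than
something taken from the package source.

```r
# Hypothetical helpers (not clustSIGNAL functions) illustrating the default
# rules used when clust_c = 0 and subclust_c = 0, as described in Step 1.
default_clust_centers <- function(n_cells) {
    # the lower of 5000 and one fifth of all cells (rounding up is assumed)
    min(5000, ceiling(n_cells / 5))
}

default_subclust_centers <- function(n_cluster_cells) {
    # the higher of 1 and half of the cells in the initial cluster
    # (rounding up is assumed)
    max(1, ceiling(n_cluster_cells / 2))
}

default_clust_centers(5000)  # the 5000-cell embryo subset: one fifth applies
default_clust_centers(57536) # the full seqFISH dataset: the 5000 cap applies
default_subclust_centers(40) # an initial cluster with 40 cells
```

Any non-zero value passed as clust_c or subclust_c would bypass these rules and
be used directly as the number of centers.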

**Session Information**

```{r}
sessionInfo()
```