---
title: "Clustering and metaclustering"
author: "Timothy Keyes"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
description: > 
  Read this vignette to learn how to identify clusters of cells with shared
  phenotypic characteristics using {tidytof}.
vignette: >
  %\VignetteIndexEntry{07. Clustering and metaclustering}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
    collapse = TRUE,
    comment = "#>"
)

options(
  rmarkdown.html_vignette.check_title = FALSE
)
```

```{r setup, message = FALSE, warning = FALSE}
library(tidytof)
library(dplyr)
```

Often, clustering single-cell data to identify communities of cells with shared characteristics is a major goal of high-dimensional cytometry data analysis.

To do this, `{tidytof}` provides the `tof_cluster()` verb. Several clustering methods are implemented in `{tidytof}`, including the following: 

- [FlowSOM](https://pubmed.ncbi.nlm.nih.gov/25573116/)
- [k-means](https://www.jstor.org/stable/2346830?origin=crossref&seq=1#metadata_info_tab_contents)
- [PhenoGraph](https://pubmed.ncbi.nlm.nih.gov/26095251/)
- [Supervised distance-based clustering](https://pubmed.ncbi.nlm.nih.gov/29505032/)
- [X-shift](https://pubmed.ncbi.nlm.nih.gov/27183440/)

Each of these methods is wrapped by `tof_cluster()`.

## Clustering with `tof_cluster()`

To demonstrate, we can apply the PhenoGraph clustering algorithm to `{tidytof}`'s built-in `phenograph_data`. Note that `phenograph_data` contains 3000 total cells (1000 each from 3 clusters identified in the [original PhenoGraph publication](https://pubmed.ncbi.nlm.nih.gov/26095251/)). For demonstration purposes, we also metacluster our PhenoGraph clusters using k-means clustering.

```{r}
data(phenograph_data)

set.seed(203L)

phenograph_clusters <-
    phenograph_data |>
    tof_preprocess() |>
    tof_cluster(
        cluster_cols = starts_with("cd"),
        num_neighbors = 50L,
        distance_function = "cosine",
        method = "phenograph"
    ) |>
    tof_metacluster(
        cluster_col = .phenograph_cluster,
        metacluster_cols = starts_with("cd"),
        num_metaclusters = 3L,
        method = "kmeans"
    )

phenograph_clusters |>
    dplyr::select(sample_name, .phenograph_cluster, .kmeans_metacluster) |>
    head()
```

The output of both `tof_cluster()` and `tof_metacluster()` is a `tof_tbl` identical to the input tibble, but with one additional column (in this case, ".phenograph_cluster" and ".kmeans_metacluster", respectively) that encodes the cluster id for each cell in the input `tof_tbl`. Note that all output columns added to a tibble or `tof_tbl` by `{tidytof}` begin with a full-stop (".") to reduce the likelihood of collisions with existing column names.
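
To make the metaclustering step above more concrete, the following is a minimal base R sketch of the general idea: summarize each cluster by a centroid, cluster the centroids, and then map each cell to the metacluster of its cluster. This is not `{tidytof}`'s implementation; the toy data, the column names, and the choice of the mean as the cluster summary are all assumptions made for illustration.

```{r}
set.seed(1L)

# Toy data (made up): 60 cells, 2 markers, 6 fine-grained clusters whose
# centers fall into 2 well-separated groups
cluster_id <- rep(1:6, each = 10)
centers <- c(0, 0.5, 1, 10, 10.5, 11)
marker_1 <- rnorm(60, mean = centers[cluster_id], sd = 0.1)
marker_2 <- rnorm(60, mean = centers[cluster_id], sd = 0.1)
cells <- data.frame(cluster_id, marker_1, marker_2)

# Step 1: summarize each cluster by its centroid (here, the mean of each marker)
centroids <- aggregate(
    cbind(marker_1, marker_2) ~ cluster_id,
    data = cells,
    FUN = mean
)

# Step 2: run k-means on the 6 centroids, not on the individual cells
km <- kmeans(centroids[, c("marker_1", "marker_2")], centers = 2, nstart = 10)

# Step 3: map each cell to the metacluster assigned to its cluster
metacluster_of_cluster <- setNames(km$cluster, centroids$cluster_id)
cells$metacluster <-
    unname(metacluster_of_cluster[as.character(cells$cluster_id)])

# Each of the 6 clusters belongs to exactly one of the 2 metaclusters
table(cells$cluster_id, cells$metacluster)
```

Because only the cluster summaries are clustered in step 2, metaclustering is cheap even when the number of cells is large.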

Because the output of `tof_cluster()` is a `tof_tbl`, we can use `dplyr`'s `count()` function to assess the accuracy of our clustering procedure compared to the original clustering from the PhenoGraph paper.

```{r}
phenograph_clusters |>
    dplyr::count(phenograph_cluster, .kmeans_metacluster, sort = TRUE)
```

Here, we can see that our clustering procedure groups most cells from the same PhenoGraph cluster with one another (with a small number of mistakes).
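
If a single number is more convenient than a table of counts, one common summary is cluster purity. The sketch below computes it in base R on toy labels; the labels are made up for illustration and are not taken from `phenograph_data`, and the object names are hypothetical.

```{r}
# Toy labels (made up for illustration; not taken from phenograph_data)
reference <- c("a", "a", "a", "b", "b", "b", "c", "c", "c", "c")
predicted <- c("1", "1", "2", "2", "2", "2", "3", "3", "3", "1")

# Cross-tabulate reference labels (rows) against predicted clusters (columns)
confusion <- table(reference, predicted)

# Purity: credit each predicted cluster with its most common reference label,
# then divide the number of credited cells by the total number of cells
purity <- sum(apply(confusion, 2, max)) / sum(confusion)
purity
```

A purity of 1 means that every predicted cluster contains cells from only one reference label. Note that purity tends to increase as the number of predicted clusters grows, so it is most informative when the cluster counts being compared are similar.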

To change which clustering algorithm `tof_cluster()` uses, alter the `method` argument.

```{r, eval = FALSE}
# use the kmeans algorithm
phenograph_data |>
    tof_preprocess() |>
    tof_cluster(
        cluster_cols = contains("cd"),
        method = "kmeans"
    )

# use the flowsom algorithm
phenograph_data |>
    tof_preprocess() |>
    tof_cluster(
        cluster_cols = contains("cd"),
        method = "flowsom"
    )
```

To change the columns used to compute the clusters, change the `cluster_cols` argument. Finally, if you want to return a one-column `tibble` that only includes the cluster labels (as opposed to the cluster labels added as a new column to the input `tof_tbl`), set `augment` to `FALSE`.

```{r}
# will result in a tibble with only 1 column (the cluster labels)
phenograph_data |>
    tof_preprocess() |>
    tof_cluster(
        cluster_cols = contains("cd"),
        method = "kmeans",
        augment = FALSE
    ) |>
    head()
```
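
A one-column result like this can be reattached to the cells later. The base R sketch below uses toy data and assumes that the labels are returned in the same row order as the input cells; the column name `.kmeans_cluster` is assumed by analogy with `.phenograph_cluster` above.

```{r}
# Toy stand-ins (values are made up): the input cells and a one-column
# table of cluster labels in the same row order
toy_cells <- data.frame(cd34 = c(0.1, 2.3, 1.7), cd38 = c(1.2, 0.4, 0.9))
toy_labels <- data.frame(.kmeans_cluster = c("1", "2", "1"))

# Column-bind the labels back onto the cells, one label per row
toy_augmented <- cbind(toy_cells, toy_labels)
names(toy_augmented)
```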


## Session info

```{r}
sessionInfo()
```