---
title: "HDCytoData data package"
author: 
  - name: Lukas M. Weber
    affiliation: 
      - &id1 "Institute of Molecular Life Sciences, University of Zurich, Zurich, Switzerland"
      - &id2 "SIB Swiss Institute of Bioinformatics, University of Zurich, Zurich, Switzerland"
  - name: Charlotte Soneson
    affiliation: 
      - *id1
      - *id2
package: HDCytoData
output: 
  BiocStyle::html_document
vignette: >
  %\VignetteIndexEntry{HDCytoData data package}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```


# Overview

The `HDCytoData` data package contains a set of publicly available high-dimensional flow cytometry and mass cytometry (CyTOF) data sets, formatted into `SummarizedExperiment` and `flowSet` Bioconductor object formats. The data objects are hosted on the Bioconductor ExperimentHub web resource.

The objects contain the cell-level expression values, as well as row and column meta-data, including sample IDs, group IDs, true cell population labels or cluster labels (where available), channel names, protein marker names, and protein marker classes (cell type or cell state).

These data sets have been used for benchmarking purposes in our previous work and publications, e.g. to evaluate the performance of clustering algorithms. They are provided here in the `SummarizedExperiment` and `flowSet` formats to make them easier to access.


# Data sets

Currently, the package contains the following data sets:

- Bodenmiller_BCR_XL

Additional details on each data set are included in the help files for the data sets. For each data set, this includes a description of the data set (biological context, number of samples, number of cells, number and classes of protein markers, etc.), as well as an explanation of the object structures, and references and raw data sources.

The help files can be accessed by the data set names, e.g. `?Bodenmiller_BCR_XL`.


# How to load data

First, we show how to load the data sets.

The data sets can be loaded either with named functions referring directly to the object names, or by using the `ExperimentHub` interface. Both methods are demonstrated below, using the `Bodenmiller_BCR_XL` data set as an example.

See the help file (`?Bodenmiller_BCR_XL`) for details about the structure of the `SummarizedExperiment` and `flowSet` objects for this data set.

Load the data sets using named functions:

```{r}
suppressPackageStartupMessages(library(HDCytoData))

# Load 'SummarizedExperiment' object using named function
Bodenmiller_BCR_XL_SE()

# Load 'flowSet' object using named function
Bodenmiller_BCR_XL_flowSet()
```


Alternatively, load the data sets using the `ExperimentHub` interface:

```{r}
# Create an ExperimentHub instance
ehub <- ExperimentHub()

# Query ExperimentHub instance to find data sets
query(ehub, "HDCytoData")

# Load 'SummarizedExperiment' object using index of data set
ehub[["EH1119"]]

# Load 'flowSet' object using index of data set
ehub[["EH1120"]]
```


# Example workflow

We demonstrate the use of the `HDCytoData` package with a short example analysis workflow for one of the data sets.


## Load data

We load the data in `SummarizedExperiment` format for the example workflow.

```{r}
# Load data set in 'SummarizedExperiment' format
d_SE <- Bodenmiller_BCR_XL_SE()

# Inspect the object
d_SE
assay(d_SE)[1:6, 1:6]
rowData(d_SE)
colData(d_SE)
```


## Transform data

Flow and mass cytometry data should be transformed before performing any analysis. For mass cytometry (CyTOF), a commonly used standard transformation is the inverse hyperbolic sine (`arcsinh`) with parameter `cofactor = 5`, computed in R as `asinh(x / cofactor)`. (For flow cytometry, `cofactor = 150` can be used.) This brings the marker expression profiles closer to normal distributions, which improves clustering performance and visualizations.

We apply the `arcsinh` transform with `cofactor = 5` to all protein marker columns.

```{r}
# Apply transformation to marker columns
cofactor <- 5
cols <- colData(d_SE)$marker_class != "none"
assay(d_SE)[, cols] <- asinh(assay(d_SE)[, cols] / cofactor)

summary(assay(d_SE)[, 1:10])
```
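To see what the cofactor does, the following minimal sketch applies `asinh(x / cofactor)` to a few made-up intensity values (the vector `x` is illustrative and is not taken from the data set). Under the standard definition of the inverse hyperbolic sine, the transform is approximately linear for values well below the cofactor and approximately logarithmic for values well above it.

```r
# Toy raw intensities (hypothetical values, not from the data set)
x <- c(0, 1, 5, 50, 500, 5000)

# Cofactors mentioned above: 5 for CyTOF, 150 for flow cytometry
y_cytof <- asinh(x / 5)
y_flow  <- asinh(x / 150)

# Near zero the transform is close to linear: asinh(z) ~ z
small <- 0.1
lin_err <- abs(asinh(small / 5) - small / 5)

# For large values it is close to logarithmic: asinh(z) ~ log(2 * z)
large <- 5000
log_err <- abs(asinh(large / 5) - log(2 * large / 5))

# Side-by-side view of the raw and transformed values
data.frame(raw = x, cytof = round(y_cytof, 3), flow = round(y_flow, 3))
```

A larger cofactor widens the near-linear region around zero, which is why the much larger raw intensity range of flow cytometry is usually paired with a larger cofactor.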


## Clustering

We perform clustering to define cell populations, and compare the clustering results with the reference cell population labels. We use the [FlowSOM](http://bioconductor.org/packages/release/bioc/html/FlowSOM.html) clustering algorithm (Van Gassen et al., 2015), available from Bioconductor.

We use t-SNE plots to visualize the clustering results. Note that the t-SNE plots show only a random subset of 1,000 cells, to reduce runtime.
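When subsampling, the same index vector has to be applied to the expression matrix and to every label vector, and any later filtering (such as removing duplicate rows, which `Rtsne` requires) has to be applied to both as well. The following is a minimal sketch on toy data; `m` and `lab` are hypothetical stand-ins for the expression matrix and the cluster labels.

```r
# Toy expression matrix: 8 cells x 2 markers; rows 1/5 and 2/6 are identical
m <- rbind(c(1, 2), c(3, 4), c(5, 6), c(7, 8),
           c(1, 2), c(3, 4), c(9, 10), c(11, 12))
lab <- c("a", "b", "c", "d", "a", "b", "e", "f")

# Subsample cells; the same index is used for rows and labels
set.seed(123)
ix <- sample(seq_len(nrow(m)), 6)
m_sub <- m[ix, , drop = FALSE]
lab_sub <- lab[ix]

# Remove duplicate rows and drop the matching labels
dups <- duplicated(m_sub)
m_sub <- m_sub[!dups, , drop = FALSE]
lab_sub <- lab_sub[!dups]

# Rows and labels should still have matching lengths
nrow(m_sub) == length(lab_sub)
```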

Since we are interested in cell populations (not cell states), we use only 'cell type' markers for the clustering and t-SNE plots.

```{r}
suppressPackageStartupMessages(library(FlowSOM))
suppressPackageStartupMessages(library(Rtsne))
suppressPackageStartupMessages(library(ggplot2))


# --------------
# Run clustering
# --------------

d_FlowSOM <- as.matrix(assay(d_SE)[, colData(d_SE)$marker_class == "type"])
d_FlowSOM <- flowFrame(d_FlowSOM)

set.seed(123)
out <- ReadInput(d_FlowSOM, transform = FALSE, scale = FALSE)
out <- BuildSOM(out, colsToUse = NULL)

labels_pre <- out$map$mapping[, 1]

# number of meta-clusters
k <- 20
out <- metaClustering_consensus(out$map$codes, k = k, seed = 123)

labels <- out[labels_pre]

# check meta-cluster labels
table(labels)


# ------------------------------
# t-SNE plot with cluster labels
# ------------------------------

d_Rtsne <- as.matrix(assay(d_SE)[, colData(d_SE)$marker_class == "type"])
colnames(d_Rtsne) <- colData(d_SE)$marker_name[colData(d_SE)$marker_class == "type"]

# subsampling
n_sub <- 1000
set.seed(123)
ix <- sample(seq_along(labels), n_sub)

d_Rtsne <- d_Rtsne[ix, ]
labels <- labels[ix]

# remove any duplicate rows (required by Rtsne)
dups <- duplicated(d_Rtsne)

d_Rtsne <- d_Rtsne[!dups, ]
labels <- labels[!dups]

# run Rtsne
set.seed(123)
out_Rtsne <- Rtsne(d_Rtsne, pca = FALSE, verbose = TRUE)

d_plot <- as.data.frame(out_Rtsne$Y)
colnames(d_plot) <- c("tSNE_1", "tSNE_2")
d_plot$cluster <- as.factor(labels)

# create plot
ggplot(d_plot, aes(x = tSNE_1, y = tSNE_2, color = cluster)) +
  geom_point() +
  ggtitle("t-SNE plot: clustering") +
  xlab("t-SNE dimension 1") +
  ylab("t-SNE dimension 2") +
  theme_classic()


# -------------------------------------------
# t-SNE plot with reference population labels
# -------------------------------------------

d_plot$population <- rowData(d_SE)$population_id[ix][!dups]

# create plot
ggplot(d_plot, aes(x = tSNE_1, y = tSNE_2, color = population)) +
  geom_point() +
  ggtitle("t-SNE plot: reference populations") +
  xlab("t-SNE dimension 1") +
  ylab("t-SNE dimension 2") +
  theme_classic()
```
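Beyond visual comparison, cluster labels can be cross-tabulated against the reference population labels. The sketch below uses small made-up label vectors (hypothetical stand-ins for `labels` and `population_id`) and computes cluster purity, which is only one simple summary and is an assumption here rather than a metric prescribed by this vignette.

```r
# Toy labels for 10 cells (hypothetical values)
cluster    <- c(1, 1, 1, 2, 2, 2, 2, 3, 3, 3)
population <- c("B", "B", "T", "T", "T", "T", "T", "NK", "NK", "B")

# Contingency table: rows = clusters, columns = reference populations
tbl <- table(cluster = cluster, population = population)
tbl

# Assign each cluster to its most frequent reference population
majority <- colnames(tbl)[apply(tbl, 1, which.max)]
names(majority) <- rownames(tbl)
majority

# Purity: fraction of cells whose reference label matches the
# majority population of their cluster
purity <- sum(apply(tbl, 1, max)) / sum(tbl)
purity
```

With real data, the same table can be built from the full (not subsampled) label vectors, provided both have one entry per cell in the same order.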


## Data exploration using 'iSEE' package

Additional visualizations to explore the data can be generated using the [iSEE](http://bioconductor.org/packages/devel/iSEE) ("Interactive SummarizedExperiment Explorer") package, available from Bioconductor (Soneson, Lun, Marini, and Rue-Albrecht, 2018). `iSEE` provides a Shiny-based graphical user interface for exploring single-cell data sets stored in `SummarizedExperiment` objects.

For more details on exploring data with `iSEE`, see the `iSEE` package vignette.