---
title: "02 -- Reconstruction and analysis of pancreatic islets from IMC data"
author: 
  - name: "Samuel Gunz"
    affiliation:
      - &DMLS Department of Molecular Life Sciences, University of Zurich, Switzerland
      - &SIB SIB Swiss Institute of Bioinformatics, University of Zurich, Switzerland
    email: "samuel.gunz@uzh.ch"
  - name: Mark D. Robinson
    affiliation:
      - *DMLS
      - *SIB
package: "`r BiocStyle::Biocpkg('sosta')`"
output:
  BiocStyle::html_document
abstract: >
  In this vignette we use `sosta` to reconstruct pancreatic islets of different diabetic stages from IMC data [@damondMapHumanType2019]. Based on the reconstruction we calculate structure metrics. Finally, we show how to do staticstical comparison of the metrics accounting for the correlation structure of the dataset.
vignette: >
  %\VignetteIndexEntry{Reconstruction and analysis of pancreatic islets from IMC data}
  %\VignetteEncoding{UTF-8}
  %\VignetteEngine{knitr::rmarkdown}
bibliography: sosta.bib
editor_options: 
  chunk_output_type: console
---

```{r, include = FALSE}
knitr::opts_chunk$set(
    collapse = TRUE,
    comment = "#>"
)
set.seed(8048)
```

# Installation

The `r BiocStyle::Biocpkg('sosta')` package can be installed from Bioconductor as follows:

```{r installation, include = TRUE, eval=FALSE}
if (!requireNamespace("BiocManager")) {
    install.packages("BiocManager")
}
BiocManager::install("sosta")
```

# Setup

For this vignette, we will need several additional packages:

```{r setup, message=FALSE, warning=FALSE}
library("dplyr")
library("ExperimentHub")
library("ggplot2")
library("lme4")
library("lmerTest")
library("sf")
library("SpatialExperiment")
library("sosta")
library("tidyr")
library("ggfortify")

theme_set(theme_bw())
```

# Introduction

In this vignette, we will use an imaging mass cytometry (IMC) dataset of pancreatic islets from human donors at different stages of type 1 diabetes (T1D) and healthy controls [@damondMapHumanType2019].

```{r loading, echo=FALSE, message=FALSE}
# load experiment hub
eh <- ExperimentHub()
oid <- names(eh[eh$title == "Damond_2019_Pancreas - sce - v1 - full"])

# Load single cell experiment object
spe <- eh[[oid]]
# Convert to spatial experiment object
spe <- toSpatialExperiment(spe,
    sample_id = "image_name",
    spatialCoordsNames = c("cell_x", "cell_y")
)
```

First, we plot the data for illustration. As we have multiple images per patient, we will subset to a few slides. As can be seen, the dimensions of the field of view are differing.

```{r plot-data}
df <- cbind(
    colData(spe[, spe$image_name %in% c("E04", "E03", "G01", "J02")]),
    spatialCoords(spe[, spe$image_name %in% c("E04", "E03", "G01", "J02")])
)
df |>
    as.data.frame() |>
    ggplot(aes(x = cell_x, y = cell_y, color = cell_category)) +
    geom_point(size = 0.25) +
    facet_wrap(~image_name, ncol = 2) +
    coord_equal() +
    scale_color_brewer(palette = "Dark2")
```

Our goal is to reconstruct / segment and quantify the pancreatic islets.

# Reconstruction of pancreatic islets

## Reconstruction of pancreatic islets for one image

In this example, we will reconstruct the islets based on the point pattern *density* of the islet cells. We will start with estimating the parameters that we use for reconstruction afterwards. For one image this can be illustrated as follows.

```{r}
shapeIntensityImage(
    spe,
    marks = "cell_category",
    imageCol = "image_name",
    imageId = "G01",
    markSelect = "islet"
)
```

We see the density (pixel-level) image on the left and a histogram of the intensity values on the right. The estimated threshold is roughly the mean between the two modes of the (truncated) pixel intensity distribution.

Note that the above calculation was done for one image. The function `estimateReconstructionParametersSPE` returns two plots with the estimated parameters for each image. The parameter `bndw` is the bandwidth parameter that is used for estimating the intensity profile of the point pattern. The parameter `thres` is the estimated parameter for the density threshold for reconstruction. We subset 25 random images to speed up computation.

```{r}
n <- estimateReconstructionParametersSPE(
    spe,
    marks = "cell_category",
    imageCol = "image_name",
    markSelect = "islet",
    nImages = 25,
    plotHist = TRUE
)
```

We can inspect the relationship of the estimated bandwidth and threshold. 

```{r}
n |>
    ggplot(aes(x = bndw, y = thres)) +
    geom_point()
```

We note that the estimated bandwidth seems to vary a bit more than the estimated threshold. We will use the mean of the two estimated vectors as our parameters.

```{r}
(thresSPE <- mean(n$thres))
(bndwSPE <- mean(n$bndw))
```

## Reconstruction of pancreatic islets for all images

The function `reconstructShapeDensitySPE` reconstructs the islets for all images in the `spe` object. We use the estimated parameters from above. For computational reasons, we will subset to 20 images per patient for the rest of the vignette.

```{r, eval=TRUE, include=TRUE}
# Sample 15 images per patient
sel <- colData(spe) |>
    as.data.frame() |>
    group_by(patient_id) |>
    select(image_name) |>
    sample_n(size = 20, replace = FALSE) |>
    pull(image_name)

# Select sampled images
speSel <- spe[, spe$image_name %in% sel]
speSel$image_name |>
    unique() |>
    length()
```

```{r, eval=TRUE}
# Run on all images
allIslets <- reconstructShapeDensitySPE(
    speSel,
    marks = "cell_category",
    imageCol = "image_name",
    markSelect = "islet",
    bndw = bndwSPE,
    thres = thresSPE,
    nCores = 1
)
```

The result is a [simple feature collection](https://r-spatial.github.io/sf/articles/sf1.html). It contains the polygons (`<GEOMETRY>` column), a structure identifier (`structID`) and the image identifier (`image_name`). Let's add some patient metadata to the object.

```{r}
colsKeep <- c(
    "patient_stage", "tissue_slide", "tissue_region",
    "patient_id", "patient_disease_duration",
    "patient_age", "patient_gender", "sample_id"
)

patientData <- colData(speSel) |>
    as_tibble() |>
    group_by(image_name) |>
    select(all_of(colsKeep)) |>
    unique()

allIslets <- allIslets |>
    dplyr::left_join(patientData, by = "image_name")
```

Using standard operations on data frames we can inspect the number of structures found per patient.

```{r}
allIslets |>
    st_drop_geometry() |> # we are only interested in metadata
    group_by(patient_id) |>
    summarise(n = n()) |>
    ungroup()
```

# Calculation of metrics

## Structure metrics

Now that we have islet structures for all images, we can now use the function `totalShapeMetrics` to calculate a set of metrics related to the shape of the islets.

```{r}
isletMetrics <- totalShapeMetrics(allIslets)
```

The result is a matrix. We will it to our simple feature collection. 

```{r}
# specify factor levels
lv <- c("Non-diabetic", "Onset", "Long-duration")

allIslets <- allIslets |>
    cbind(t(isletMetrics)) |>
    mutate(patient_stage = factor(patient_stage, levels = lv))
```

# Investigate metrics

## Plot structure metrics

We use PCA to get an overview of the different features. Each dot represents one structure.

```{r}
autoplot(
    prcomp(t(isletMetrics), scale. = TRUE),
    x = 1,
    y = 2,
    data = allIslets,
    color = "patient_stage",
    size = 2,
    # shape = 'type',
    loadings = TRUE,
    loadings.colour = "steelblue3",
    loadings.label = TRUE,
    loadings.label.size = 3,
    loadings.label.repel = TRUE,
    loadings.label.colour = "black"
) +
    scale_color_brewer(palette = "Set2") +
    theme_bw() +
    coord_fixed()
```

We can use boxplots to investigate differences of shape metrics between patient stages. We will subset to a few metrics that are not colinear in the PCA plot. Note that the boxplots don't reveal patient specific effects.

```{r, fig.width=8, fig.height=6}
allIslets |>
    sf::st_drop_geometry() |>
    select(patient_stage, rownames(isletMetrics)) |>
    pivot_longer(-patient_stage) |>
    filter(name %in% c("Area", "Compactness", "Curl")) |>
    ggplot(aes(x = patient_stage, y = value, fill = patient_stage)) +
    geom_violin() +
    geom_boxplot(aes(fill = NULL), width = 0.3) +
    facet_wrap(~name, scales = "free") +
    scale_fill_brewer(palette = "Set2") +
    scale_x_discrete(guide = guide_axis(n.dodge = 2)) +
    guides(fill = "none")
```

## Testing using mixed effects models

As the distribution of the area is very skewed we use a 1/6-power transformation. Let's have a look at the transformed area of the islets faceted by patient first.

```{r, fig.width=8, fig.height=6}
allIslets |>
    sf::st_drop_geometry() |>
    select(patient_stage, patient_id, rownames(isletMetrics)) |>
    pivot_longer(-c(patient_stage, patient_id)) |>
    filter(name %in% c("Area")) |>
    ggplot(aes(
        x = as.factor(patient_id),
        y = (value)^(1 / 6),
        fill = patient_stage
    )) +
    geom_violin() +
    geom_boxplot(aes(fill = NULL), width = 0.3) +
    facet_wrap( ~ patient_stage, scales = "free") +
    scale_fill_brewer(palette = "Set2") +
    scale_x_discrete(guide = guide_axis(n.dodge = 2)) +
    guides(fill = "none")
```

We can see that the variability within the patients and the different stages is varying. As the individual structure level metrics are not independent we have to account for dependence between measurements. This dependence can lie on the level of the patient and the slide as we have repeated measurements for each level.

To account for this, we will use mixed linear models with random effects for the patient and the individual slides (`image_name`). We will use the `r BiocStyle::CRANpkg('lme4')` package for fitting linear mixed effects models [@batesFittingLinearMixedEffects2015] and `r BiocStyle::CRANpkg('lmerTest')` for p-value calculation [@kuznetsovaLmerTestPackageTests2017]. 

We can model differences between the patient stages as follows.

```{r}
mod <- lmer(
    (Area)^(1/6) ~ patient_stage +
        (1 | patient_id) + (1 | image_name),
    data = allIslets
)
```

### Model diagnostics

Let's have a look at the model diagnostics. First plot the residuals vs. the fitted values, colored by the patients.

```{r}
plot(
    mod,
    resid(., scaled = TRUE) ~ fitted(.),
    col = allIslets$patient_id,
    pch = 12,
    abline = 0,
    xlab = "Fitted values",
    ylab = "Standardised residuals"
)
```

Next, we'll have a look at the Q-Q plot. The residuals seem to be approximately normally distributed with small deviations in the tails.

```{r}
qqnorm(resid(mod), pch = 16)
qqline(resid(mod))
```

```{r}
summary(mod)
```

As we can see in the fixed effects section in `summary(mod)` there is a significant difference in the transformed islet area of long-duration patients with respect to non-diabetic patients, while the effect for onset patients is not statistically significant at the 5\% level. This results accounts for **correlation at both the patient and image level** as modeled by random intercepts. The somewhat arbitrary transformation (1/6-power) was chosen after inspection of the residual behavior in the model diagnostics and should not serve as a standard. Note that the calculation was performed on a random subset of the patient slides only.

# Session Info

```{r sessionInfo}
sessionInfo()
```

# References