---
title: "Quality control and visualisation with SPIAT"
author: "Yuzhou Feng"
date: "`r Sys.Date()`"
output:
  BiocStyle::html_document:
    self_contained: yes
    toc_float: true
    toc_depth: 4
package: "`r pkg_ver('SPIAT')`"
bibliography: "`r file.path(system.file(package='SPIAT', 'vignettes'), 'introduction.bib')`"    
vignette: >
  %\VignetteIndexEntry{Quality control and visualisation with SPIAT}
  %\VignetteEncoding{UTF-8}
  %\VignetteEngine{knitr::rmarkdown}
editor_options: 
  markdown: 
    wrap: 72
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

First we load the SPIAT library.

```{r message=FALSE}
library(SPIAT)
```

Here we present some quality control steps implemented in SPIAT to check
for the quality of phenotyping, help detect uneven staining, and other
potential technical artefacts.

In this vignette we will use an inForm data file that's already been
formatted for SPIAT with `format_image_to_spe()`, which we can load with
`data()`. We will use `define_celltypes()` to define the cells with certain 
combinations of markers.

```{r}
data("simulated_image")

# define cell types
formatted_image <- define_celltypes(
    simulated_image, 
    categories = c("Tumour_marker","Immune_marker1,Immune_marker2", 
                   "Immune_marker1,Immune_marker3", 
                   "Immune_marker1,Immune_marker2,Immune_marker4", "OTHER"), 
    category_colname = "Phenotype", 
    names = c("Tumour", "Immune1", "Immune2", "Immune3", "Others"),
    new_colname = "Cell.Type")
```

# Visualise marker levels

## Boxplots of marker intensities

Phenotyping of cells can be verified comparing marker intensities of
cells labelled positive and negative for a marker. Cells positive for a
marker should have high levels of the marker. An unclear separation of
marker intensities between positive and negative cells would suggest
phenotypes have not been accurately assigned. We can use
`marker_intensity_boxplot()` to produce a boxplot for cells phenotyped as
being positive or negative for a marker.

```{r, fig.height = 2, fig.width = 3.2}
marker_intensity_boxplot(formatted_image, "Immune_marker1")
```

Note that if phenotypes were obtained from software that uses machine
learning to determine positive cells, which generally also take into
account properties such as cell shape, nucleus size etc., rather than a
strict threshold, some negative cells will have high marker intensities,
and vice versa. In general, a limited overlap of whiskers or outlier
points is tolerated and expected. However, overlapping boxplots suggests
unreliable phenotyping.

## Scatter plots of marker levels

Uneven marker staining or high background intensity can be identified
with `plot_cell_marker_levels()`. This produces a scatter plot of the
intensity of a marker in each cell. This should be relatively even
across the image and all phenotyped cells. Cells that were not
phenotyped as being positive for the particular marker are excluded.

```{r, fig.height=3, fig.width=6, dpi = 72, out.width="100%"}
plot_cell_marker_levels(formatted_image, "Immune_marker1")
```

## Heatmaps of marker levels

For large images, there is also the option of 'blurring' the image,
where the image is split into multiple small areas, and marker
intensities are averaged within each. The image is blurred based on the
`num_splits` parameter.

```{r, fig.height = 2.2, out.width="75%"}
plot_marker_level_heatmap(formatted_image, num_splits = 100, "Tumour_marker")
```

# Identifying incorrect phenotypes

We may see cells with biologically implausible combination of markers
present in the input data when using `unique(spe_object$Phenotype)`. 
For example, cells might be incorrectly typed as positive for two markers known 
to not co-occur in a single cell type. Incorrect cell phenotypes may be present
due to low cell segmentation quality, antibody 'bleeding' from one cell
to another or inadequate marker thresholding.

If the number of incorrectly phenotyped cells is small (\<5%), we advise
simply removing these cells (see below). If it is a higher proportion,
we recommend checking the cell segmentation and phenotyping methods, as
a more systematic problem might be present.

## Removing cells with incorrect phenotypes

If you identify incorrect phenotypes or have any properties (columns) that you 
want to exclude you can use `select_celltypes()`. Set `celltypes` the values 
that you want to keep or exclude for a column (`feature_colname`). Set `keep` 
as `TRUE` to include these cells and `FALSE` to exclude them.

```{r}
data_subset <- select_celltypes(
  formatted_image, keep=TRUE,
  celltypes = c("Tumour_marker","Immune_marker1,Immune_marker3", 
                "Immune_marker1,Immune_marker2",
                "Immune_marker1,Immune_marker2,Immune_marker4"),
  feature_colname = "Phenotype")
# have a look at what phenotypes are present
unique(data_subset$Phenotype)
```

In this vignette we will work with all the original phenotypes present
in `formatted_image`.

## Dimensionality reduction to identify misclassified cells

We can also check for specific misclassified cells using dimensionality
reduction. SPIAT offers tSNE and UMAPs based on marker intensities to
visualise cells. Cells of distinct types should be forming clearly
different clusters.

The generated dimensionality reduction plots are interactive, and users
can hover over each cell and obtain the cell ID. Users can then remove
specific misclassified cells.

```{r}
# First predict the phenotypes (this is for generating not 100% accurate phenotypes)
predicted_image2 <- predict_phenotypes(spe_object = simulated_image,
                                      thresholds = NULL,
                                      tumour_marker = "Tumour_marker",
                                      baseline_markers = c("Immune_marker1", 
                                                           "Immune_marker2", 
                                                           "Immune_marker3", 
                                                           "Immune_marker4"),
                                      reference_phenotypes = FALSE)

# Then define the cell types based on predicted phenotypes
predicted_image2 <- define_celltypes(
  predicted_image2, 
  categories = c("Tumour_marker", "Immune_marker1,Immune_marker2",
                 "Immune_marker1,Immune_marker3", 
                 "Immune_marker1,Immune_marker2,Immune_marker4"), 
  category_colname = "Phenotype",
  names = c("Tumour", "Immune1", "Immune2",  "Immune3"),
  new_colname = "Cell.Type")

# Delete cells with unrealistic marker combinations from the dataset
predicted_image2 <- 
  select_celltypes(predicted_image2, "Undefined", feature_colname = "Cell.Type",
                   keep = FALSE)

# TSNE plot
g <- dimensionality_reduction_plot(predicted_image2, plot_type = "TSNE", 
                                   feature_colname = "Cell.Type")
```

Note that `dimensionality_reduction_plot()` only prints a static version
of the UMAP or tSNE plot. If the user wants to interact with this plot,
they can pass the result to the `ggplotly()` function from the `plotly`
package. Due to the file size restriction, we only show a screenshot of
the interactive tSNE plot.

```{r, echo=TRUE, eval=FALSE}
plotly::ggplotly(g) 
```

```{r, echo=FALSE, fig.width = 4, fig.height = 4, out.width = "70%"}
knitr::include_graphics("tSNE1.jpg")
```

The plot shows that there are four clear clusters based on marker
intensities. This is consistent with the cell definition from the marker
combinations from the "Phenotype" column. (The interactive t-SNE plot
would allow users to hover the cursor on the misclassified cells and see
their cell IDs.) In this example, Cell_3302, Cell_4917, Cell_2297,
Cell_488, Cell_4362, Cell_4801, Cell_2220, Cell_3431, Cell_533,
Cell_4925, Cell_4719, Cell_469, Cell_1929, Cell_310, Cell_2536,
Cell_321, and Cell_4195 are obviously misclassified according to this
plot.

We can use `select_celltypes()` to delete the misclassified cells.

```{r, eval=FALSE}
predicted_image2 <- 
  select_celltypes(predicted_image2, c("Cell_3302", "Cell_4917", "Cell_2297", 
                                       "Cell_488", "Cell_4362", "Cell_4801", 
                                       "Cell_2220", "Cell_3431", "Cell_533", 
                                       "Cell_4925", "Cell_4719", "Cell_469", 
                                       "Cell_1929", "Cell_310", "Cell_2536", 
                                       "Cell_321", "Cell_4195"), 
                   feature_colname = "Cell.ID", keep = FALSE)
```

Then plot the TSNE again (not interactive). This time we see there are
fewer misclassified cells.

```{r, eval=FALSE}
# TSNE plot
g <- dimensionality_reduction_plot(predicted_image2, plot_type = "TSNE", feature_colname = "Cell.Type")

# plotly::ggplotly(g) # uncomment this code to interact with the plot
```

```{r, echo=FALSE, fig.width = 4, fig.height = 4, out.width = "70%"}
knitr::include_graphics("tSNE2.jpg")
```

# Visualising tissues

In addition to the marker level tissue plots for QC, SPIAT has other
methods for visualising markers and phenotypes in tissues.

## Categorical dot plot

We can see the location of all cell types (or any column in the data) in
the tissue with `plot_cell_categories()`. Each dot in the plot corresponds
to a cell and cells are coloured by cell type. Any cell types present in
the data but not in the cell types of interest will be put in the
category "OTHER" and coloured lightgrey.

```{r, fig.height = 2.5}
my_colors <- c("red", "blue", "darkcyan", "darkgreen")
  
plot_cell_categories(spe_object = formatted_image, 
                     categories_of_interest = 
                       c("Tumour", "Immune1", "Immune2", "Immune3"), 
                     colour_vector = my_colors, feature_colname = "Cell.Type")
```

`plot_cell_categories()` also allows the users to plot the categories layer by
layer when there are too many cells by setting `layered` parameter as `TRUE`. 
Then the cells will be plotted in the order of `categories_of_interest` layer 
by layer. Users can also use `cex` parameter to change the size of the points.

## 3D surface plot

We can visualise a selected marker in 3D with `marker_surface_plot()`. The
image is blurred based on the `num_splits` parameter.

```{r, eval=FALSE}
marker_surface_plot(formatted_image, num_splits=15, marker="Immune_marker1")
```

Due to the restriction of the file size, we have disabled the
interactive plot in this vignette. Here only shows a screen shot. (You
can interactively move the plot around to obtain a better view with the
same code).

```{r, echo=FALSE, out.width = "75%"}
knitr::include_graphics("marker_surface1.jpg")
```

## 3D stacked surface plot

To visualise multiple markers in 3D in a single plot we can use
`marker_surface_plot_stack()`. This shows normalised intensity level of
specified markers and enables the identification of co-occurring and
mutually exclusive markers.

```{r, eval=FALSE}
marker_surface_plot_stack(formatted_image, num_splits=15, markers=c("Tumour_marker", "Immune_marker1"))
```

```{r, echo=FALSE, out.width = "75%"}
knitr::include_graphics("marker_surface2.jpg")
```

The stacked surface plots of the Tumour_marker and Immune_marker1 cells
in this example shows how Tumour_marker and Immune_marker1 are mutually
exclusive as the peaks and valleys are opposite. Similar to the previous
plot, we have disabled the interactive plot in the vignette. (You can
interactively move the plot around to obtain a better view with the same
code.)

# You can access the vignettes for other modules of SPIAT here:

 - [Overview of SPIAT](SPIAT.html)
 - [Data reading and formatting](data_reading-formatting.html)
 - [Basic analysis](basic_analysis.html)
 - [Cell colocalisation](cell-colocalisation.html)
 - [Spatial heterogeneity](spatial-heterogeneity.html)
 - [Tissue structure](tissue-structure.html)
 - [Cellular neighborhood](neighborhood.html)
 
# Reproducibility

```{r}
sessionInfo()
```

# Author Contributions

AT, YF, TY, ML, JZ, VO, MD are authors of the package code. MD and YF
wrote the vignette. AT, YF and TY designed the package.