---
title: "Reading in data and data formatting in SPIAT"
author: "Yuzhou Feng"
date: "`r Sys.Date()`"
output:
  BiocStyle::html_document:
    self_contained: yes
    toc_float: true
    toc_depth: 4
package: "`r pkg_ver('SPIAT')`"
bibliography: "`r file.path(system.file(package='SPIAT', 'vignettes'), 'introduction.bib')`"    
vignette: >
  %\VignetteIndexEntry{Reading in data and data formatting in SPIAT}
  %\VignetteEncoding{UTF-8}
  %\VignetteEngine{knitr::rmarkdown}
editor_options: 
  markdown: 
    wrap: 72
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

First we load the SPIAT library.

```{r message=FALSE}
library(SPIAT)
```

# Reading in data

`format_image_to_spe()` is the main function to read in data to
SPIAT. `format_image_to_spe()` creates a `SpatialExperiment` object which is
used in all subsequent functions. The key data points of interest for
SPIAT are cell coordinates, marker intensities and cell phenotypes for
each cell.

`format_image_to_spe()` has specific options to read in data generated
from inForm, HALO, CODEX and cellprofiler.
However, we advise pre-formatting the data before input to SPIAT so that
accepted by the 'general' option (shown below). This is due to often
inconsistencies in the column names or data formats across different
versions or as a result of different user options when using the other
platforms.

## Reading in data through the 'general' option (RECOMMENDED)

Format "general" allows you to input a matrix of intensities
(`intensity_matrix`), and a vector of `phenotypes`, which should be in
the same order in which they appear in the `intensity_matrix`. They must
be of the form of marker combinations (e.g. "CD3,CD8"), as opposed to cell 
names (e.g. "cytotoxic T cells"), as SPIAT does matching with the
marker names. `phenotypes` is an optional parameter and can be set as `NULL` if 
no phenotypes are available. The user also needs to provide separate vectors 
with the X and Y coordinates of the cells (`coord_x` and `coord_y`). The cells 
must be in the same order as in the `intensity_matrix`. If you have `Xmin`, 
`Xmax`,`Ymin` and `Ymax` columns in the raw data, we advise calculating the 
average to obtain a single X and Y coordinate, which you can then use as input 
to `coord_x` and `coord_y`.

Specifically, if `intensity_matrix` is available, please make sure the `colnames` 
of the intensity matrix are the cell IDs as some SPIAT functions 
(like `identify_bordering_cells()`) require the constructed image object to have 
cell IDs as `rownames` of the colData and `colnames` of the intensity matrix.
If `intensity_matrix` is `NULL`, the function will automatically assign IDs to 
the cells.

Here we use some dummy data to illustrate how to read "general" format.

```{r}
# Construct a dummy marker intensity matrix
## rows are markers, columns are cells
intensity_matrix <- matrix(c(14.557, 0.169, 1.655, 0.054,
                             17.588, 0.229, 1.188, 2.074, 
                             21.262, 4.206,  5.924, 0.021), nrow = 4, ncol = 3)
# define marker names as rownames
rownames(intensity_matrix) <- c("DAPI", "CD3", "CD4", "AMACR")
# define cell IDs as colnames
colnames(intensity_matrix) <- c("Cell_1", "Cell_2", "Cell_3") 

# Construct a dummy metadata (phenotypes, x/y coordinates)
# the order of the elements in these vectors correspond to the cell order 
# in `intensity matrix`
phenotypes <- c("OTHER",  "AMACR", "CD3,CD4")
coord_x <- c(82, 171, 184)
coord_y <- c(30, 22, 38)

general_format_image <- format_image_to_spe(format = "general", 
                                            intensity_matrix = intensity_matrix,
                                            phenotypes = phenotypes, 
                                            coord_x = coord_x,coord_y = coord_y)
```


The formatted image now contains phenotypes, locations, and marker
intensity information of 3 cells. Note that if users want to define cell
IDs, the cell IDs should be defined as the colnames of the intensity
matrix. The order of the rows of the metadata should correspond to the order 
of the colnames of the intensity matrix. The function will automatically assign 
rownames to the `spatialCoords()` and `colData()` of the image (now as a 
`spatialExperiment` object).

Use the following code to inspect the formatted SpatialExperiment object.

```{r}
# phenotypes and cell properties (if available)
colData(general_format_image)
# cell coordinates
spatialCoords(general_format_image)
# marker intensities
assay(general_format_image)
```

## Reading in data pre-formatted by other software

If you prefer to use data directly generated from inForm, HALO, CODEX or 
cellprofiler, these can be specified by `format` param in
`format_image_to_spe()`. We will show examples for the inForm and HALO formats.

For reading in input generated with CODEX or cellprofiler see
the documentations (`?format_image_to_spe`).

### Reading in data from inForm

To read in data from inForm, you need the table file generated by
inForm containing the cell IDs, cell locations, phenotypes (if available) and 
marker intensities. You also need to extract a vector of marker names and marker 
locations ("Nucleus", "Cytoplasm", or "Membrane"). `format_image_to_spe()` uses 
the "Cell X Position" and "Cell Y Position" columns and the "Phenotype" column 
in the inForm raw data. The phenotype of a cell can be a single marker, 
for example, "CD3", or a combination of markers, such as "CD3,CD4". As a 
convention, SPIAT assumes that cells marked as "OTHER" in "inForm" refer to 
cells positive for DAPI but no other marker.
The phenotypes must be based on the markers (e.g. CD3,CD4), rather than
names of cells (e.g. cytotoxic T cells). The names of the cells (e.g. cytotoxic 
T cells) can be added later using the `define_celltypes()` function. 
The following cell properties columns are also required to be present in the 
inForm input file: Entire Cell Area (pixels), Nucleus Area (pixels), Nucleus
Compactness, Nucleus Axis Ratio, and Entire Cell Axis Ratio. If not
present in the raw data, these can be columns with NAs.

To read in inForm data, you need to specify the following parameters:

-   `format`: "inForm"
-   `path`: path to the raw inForm image data file
-   `markers`: names of markers used in the OPAL staining. These must be
    in the same order as the marker columns in the input file, and must match 
    the marker names used in the input file. One of the markers must be "DAPI".
-   `locations`: locations of the markers in cells, either "Nucleus",
    "Cytoplasm" or "Membrane." These must be in the same order as `markers`.
    The locations are used to auto-detect the intensity (and dye)
    columns.

A small example of inForm input is included in SPIAT containing dummy
marker intensity values and all the other required columns (see below).
This example file is just for demonstrating importing a raw data file,
later in the [Inspecting the SpaitalExperiment object](#example-data) 
section we will load a larger preformatted dataset. Users are welcome to 
use this formatting option (`format = 'inForm'`) if it is closer to the 
format of their files.

```{r message=FALSE}
raw_inform_data <- system.file("extdata", "tiny_inform.txt.gz", package = "SPIAT")
markers <- c("DAPI", "CD3", "PD-L1", "CD4", "CD8", "AMACR")
locations <- c("Nucleus","Cytoplasm", "Membrane","Cytoplasm","Cytoplasm",
               "Cytoplasm") # The order is the same as `markers`.
formatted_image <- format_image_to_spe(format="inForm", path=raw_inform_data, 
                                       markers=markers, locations=locations)
```

Alternatively, rather than specifying the `locations`, you can also
specify the specific intensity columns with the parameter 
`intensity_columns_interest` as shown below.

```{r message=FALSE}
raw_inform_data <- system.file("extdata", "tiny_inform.txt.gz", package = "SPIAT")
markers <- c("DAPI", "CD3", "PD-L1", "CD4", "CD8", "AMACR")
intensity_columns_interest <- c(
  "Nucleus DAPI (DAPI) Mean (Normalized Counts, Total Weighting)",
  "Cytoplasm CD3 (Opal 520) Mean (Normalized Counts, Total Weighting)", 
  "Membrane PD-L1 (Opal 540) Mean (Normalized Counts, Total Weighting)",
  "Cytoplasm CD4 (Opal 620) Mean (Normalized Counts, Total Weighting)",
  "Cytoplasm CD8 (Opal 650) Mean (Normalized Counts, Total Weighting)", 
  "Cytoplasm AMACR (Opal 690) Mean (Normalized Counts, Total Weighting)"
  ) # The order is the same as `markers`.
formatted_image <- format_inform_to_spe(path=raw_inform_data, markers=markers,
                     intensity_columns_interest=intensity_columns_interest)
class(formatted_image) # The formatted image is a SpatialExperiment object
dim(colData(formatted_image))
dim(assay(formatted_image))
```

### Reading in data from HALO

To read in data from HALO, you need the table file generated by HALO. 
The biggest difference between inForm and HALO formats is the coding of the cell 
phenotypes. While inForm
encodes phenotypes as the combination of positive markers (e.g.
"CD3,CD4"), HALO uses a binary system where 1 means the cell is positive for
the marker and 0 otherwise.

`format_image_to_spe()` for "HALO" format collapses HALO encoded phenotypes 
into an inForm-like format to create the `Phenotype` column. For example, 
if HALO has assigned a cell a marker status of 1 for CD3 and 1 for CD4, SPIAT
will give it the Phenotype "CD3,CD4". Cells that have a marker status of
1 for DAPI but no other marker are given the phenotype "OTHER".

`format_image_to_spe()` takes the average of the HALO X min and X max
columns for each cell to create the `Cell.X.Position` column. It takes the
average of the Y min and Y max to create the `Cell.Y.Position` column.

To read in HALO data, you need to specify the following parameters:

-   `format`: "HALO"
-   `path`: path to the raw HALO image data file
-   `markers`: names of markers used in the OPAL staining. These must be
    in the same order as the marker columns in the input file, and must match 
    the marker names used in the input file. One of the markers must be DAPI.
-   `locations`: locations of the markers in cells, either "Nucleus",
    "Cytoplasm" or "Membrane." These must be in the order of the `markers`.
    The locations are used to auto-detect the intensity (and dye) columns.
-   `intensity_columns_interest` use if `locations` is not specified. Vector 
    with the names of the columns with the level of each marker. Column names 
    must match the order of the `markers` parameter. 
-   `dye_columns_interest` Use if locations is not specified. Vector of names 
     of the columns with the marker status (i.e. those indicating 1 or 0 for 
     whether the cell is positive or negative for the marker). Column names 
     must match the order of the `markers` parameter.

Users can specify the `locations` to auto-detect the columns as shown
above for inForm. Alternatively, if users want to specify the columns
instead, you can do so with `intensity_columns_interest`, as shown in
the example below. Note that then you also must specify
`dye_columns_interest`. The following cell properties columns are also
required to be present in the HALO input file: Cell Area, Nucleus Area,
Cytoplasm Area. If these are not present in the user's data, we
recommend adding these columns with NA values.

```{r message=FALSE}
raw_halo_data <- system.file("extdata", "tiny_halo.csv.gz", package = "SPIAT")
markers <- c("DAPI", "CD3", "PD-L1", "CD4", "CD8", "AMACR")
intensity_columns_interest <- c("Dye 1 Nucleus Intensity",
                                "Dye 2 Cytoplasm Intensity",
                                "Dye 3 Membrane Intensity",
                                "Dye 4 Cytoplasm Intensity",
                                "Dye 5 Cytoplasm Intensity",
                                "Dye 6 Cytoplasm Intensity")
dye_columns_interest <- c("Dye 1 Positive Nucleus",
                          "Dye 2 Positive Cytoplasm",
                          "Dye 3 Positive Membrane",
                          "Dye 4 Positive Cytoplasm",
                          "Dye 5 Positive Cytoplasm",
                          "Dye 6 Positive Cytoplasm")
formatted_image <- format_halo_to_spe(
  path=raw_halo_data, markers=markers,
  intensity_columns_interest=intensity_columns_interest,
  dye_columns_interest=dye_columns_interest)
class(formatted_image) # The formatted image is a SpatialExperiment object
dim(colData(formatted_image))
dim(assay(formatted_image))
```

# Inspecting the SpaitalExperiment object

## Structure of a SPIAT SpatialExperiment object {#example-data}

In this vignette we will use an inForm data file that's already been
formatted for SPIAT with `format_image_to_spe()`, which we can load with
`data`.

```{r}
data("simulated_image")
```

This is in `SpatialExperiment` format.

```{r}
class(simulated_image)
```

This example data has 5 markers and 4951 cells.

```{r}
dim(simulated_image)
```

`assay()` stores the intensity level of every marker (rows) for every cell
(columns).

```{r}
# take a look at first 5 columns
assay(simulated_image)[, 1:5]
```

`colData()` stores the phenotype and cell properties. Note that the 
`sample_id` column was added by `SpatialExperiment` data structure and can be 
ignored here.

```{r}
# take a look at first 5 rows
colData(simulated_image)[1:5, ]
```

`spatialCoords()` stores cell coordinates. 
```{r}
# take a look at first 5 rows
spatialCoords(simulated_image)[1:5, ]
```

We can check what phenotypes are there.

```{r}
unique(simulated_image$Phenotype)
```

The phenotypes in this example data can be interpreted as follows:

- Tumour_marker = cancer cells
- Immune_marker1,Immune_marker2 = immune cell type 1 
- Immune_marker1,Immune_marker3 = immune cell type 2
- Immune_marker1,Immune_marker2,Immune_marker4 = immune cell type 3 
- OTHER = other cell types

## Nomenclature

In SPIAT We define as **markers** proteins whose levels where queried
by OPAL, CODEX or other platforms.

Examples of markers are "AMACR" for prostate cancer cells, "panCK" for
epithelial tumour cells, "CD3" for T cells or "CD20" for B cells.

The combination of markers results in a specific **cell phenotype**. For
example, a cell positive for both "CD3" and "CD4" markers has the
"CD3,CD4" **cell phenotype**. **The phenotype has to be strictly formatted in such**
**way where each positive marker has to be separated by a comma, with no space in** 
**between, and the order of the positive markers has to be the same as the order **
**in `assay()`.**

Finally, we define a **cell type** as a name assigned by the user to a
cell phenotype. For example, a user can name "CD3,CD4" cells as "helper
T cells". We would refer to "helper T cells" therefore as a **cell
type**.

## Splitting images

In the case of large images, or images where there are two independent
tissue sections, it is recommended to split images into sections defined
by the user. This can be performed with `image_splitter()` after
`format_image_to_spe()`.

```{r}
split_image <- image_splitter(simulated_image, number_of_splits=3, plot = FALSE)
```

## Predicting cell phenotypes

SPIAT can predict cell phenotypes using marker intensity levels with
`predict_phenotypes()`. This can be used to check the phenotypes that have
been assigned by inForm and HALO. It can also potentially be used to
automate the manual phenotyping performed with inForm/HALO. The
underlying algorithm is based on the density distribution of marker
intensities. We have found this algorithm to perform best in OPAL data. Further 
phenotyping methods for other data formats are under development. 

This algorithm does not take into account cell shape or size, so if these are 
required for phenotyping, using HALO or inForm or a machine-learning based 
method is recommended.

`predict_phenotypes()` produces a density plot that shows the cutoff for
calling a cell positive for a marker. If the dataset includes phenotypes
obtained through another software, this function prints to the console
the concordance between SPIAT's prediction and pre-defined phenotypes as
the number of true positives (TP), true negatives (TN), false positives
(FP) and false negatives (FN) phenotype assignments. It returns a table
containing the phenotypes predicted by SPIAT and the actual phenotypes
from inForm/HALO (if available).

```{r, fig.height=6, fig.width = 4}
predicted_image <- predict_phenotypes(spe_object = simulated_image,
                                      thresholds = NULL,
                                      tumour_marker = "Tumour_marker",
                                      baseline_markers = c("Immune_marker1", 
                                                           "Immune_marker2", 
                                                           "Immune_marker3", 
                                                           "Immune_marker4"),
                                      reference_phenotypes = TRUE)
```

We can use `marker_prediction_plot()` to plot the predicted cell
phenotypes and the phenotypes generated from the platforms for comparison.

```{r, fig.width = 9, fig.height=2, out.width="100%"}
marker_prediction_plot(predicted_image, marker="Immune_marker1")
```

The plot shows Immune_marker1+ cells in the tissue. On the left are the
Immune_marker1+ cells defined by the simulated image and on the right are the
Immune_marker1+ cells predicted using SPIAT. Since we know that the simulated 
phenotypes are the truth, we leave the phenotypes as they are.

The next example shows how to replace the original phenotypes with the
predicted ones. Note that for this tutorial, we still use the original
phenotypes.

```{r, eval=FALSE}
predicted_image2 <- predict_phenotypes(spe_object = simulated_image,
                                      thresholds = NULL,
                                      tumour_marker = "Tumour_marker",
                                      baseline_markers = c("Immune_marker1", 
                                                           "Immune_marker2", 
                                                           "Immune_marker3", 
                                                           "Immune_marker4"),
                                      reference_phenotypes = FALSE)
```

```{r, include=FALSE}
predicted_image2 <- predict_phenotypes(spe_object = simulated_image,
                                      thresholds = NULL,
                                      tumour_marker = "Tumour_marker",
                                      baseline_markers = c("Immune_marker1", 
                                                           "Immune_marker2", 
                                                           "Immune_marker3", 
                                                           "Immune_marker4"),
                                      reference_phenotypes = FALSE)
```

## Specifying cell types

SPIAT can define cell types with the `define_celltypes()` function. By
default the new column for cell types is called `Cell.Type`. The cell types can 
be defined based on `Phenotype` column, as well as other columns. 

```{r}
formatted_image <- define_celltypes(
    simulated_image, 
    categories = c("Tumour_marker","Immune_marker1,Immune_marker2", 
                   "Immune_marker1,Immune_marker3", 
                   "Immune_marker1,Immune_marker2,Immune_marker4", "OTHER"), 
    category_colname = "Phenotype", 
    names = c("Tumour", "Immune1", "Immune2", "Immune3", "Others"),
    new_colname = "Cell.Type")
```

# You can access the vignettes for other modules of SPIAT here:

 - [Overview of SPIAT](SPIAT.html)
 - [Quality control and visualisation](quality-control_visualisation.html)
 - [Basic analysis](basic_analysis.html)
 - [Cell colocalisation](cell-colocalisation.html)
 - [Spatial heterogeneity](spatial-heterogeneity.html)
 - [Tissue structure](tissue-structure.html)
 - [Cellular neighborhood](neighborhood.html)
 
# Reproducibility

```{r}
sessionInfo()
```

# Author Contributions

AT, YF, TY, ML, JZ, VO, MD are authors of the package code. MD and YF
wrote the vignette. AT, YF and TY designed the package.