---
title: "shinyDSP internal data processing pipeline explained"
author: 
- name: Seung J. Kim
  affiliation: 
  - Interstitial Lung Disease Lab, London Health Sciences Center
  email: skim823@uwo.ca
- name: Marco Mura
  affiliation:
  - Interstitial Lung Disease Lab, London Health Sciences Center
  - Division of Respirology, Department of Medicine, Western University
  email: marco.mura@lhsc.on.ca
package: "`r pkg_ver('shinyDSP')`"
date: "`r doc_date()`"
output:
  BiocStyle::html_document:
    toc: true
    number_sections: true
    toc_float: 
      smooth_scroll: true
    toc_depth: 2
    self_contained: yes
vignette: >
  %\VignetteIndexEntry{shinyDSP_secondary}
  %\VignetteEncoding{UTF-8}
  %\VignetteEngine{knitr::rmarkdown}
bibliography: references.bib
link-citations: true
editor_options: 
  markdown: 
    wrap: 80
---
```{r, include = FALSE}
knitr::opts_chunk$set(
    collapse = FALSE,
    message = FALSE,
    warning = FALSE,
    echo = TRUE,
    eval = TRUE,
    cache = TRUE,
    fig.align = 'center', out.width="65%"
    
)
```
## Introduction

The purpose of this vignette is to look under the hood and explain what a few underlying functions do to make this app work.

## Data import 

The human kidney data from [Nanostring](https://nanostring.com/products/geomx-digital-spatial-profiler/spatial-organ-atlas/human-kidney/) is loaded from `ExperimentHub()`. Two files are required: a count matrix and matching annotation file. Let's read those files.


```{r load}
library(standR)
library(SummarizedExperiment)
library(ExperimentHub)
library(readr)
library(dplyr)
library(stats)

eh <- ExperimentHub()
AnnotationHub::query(eh, "standR")
countFilePath <- eh[["EH7364"]]
sampleAnnoFilePath <- eh[["EH7365"]]

countFile <- readr::read_delim(unname(countFilePath), na = character())
sampleAnnoFile <- readr::read_delim(unname(sampleAnnoFilePath), na = character())
```

## Variable(s) selection

A key step in analysis is to select a biological group(s) of interest. Let's inspect `sampleAnnoFile`. 

```{r}
colnames(sampleAnnoFile)
sampleAnnoFile$disease_status %>% unique()
sampleAnnoFile$region %>% unique()
```

"disease_status" and "region" look like interesting variables. For example, we might be interested in comparing "normal_glomerulus" to "DKD_glomerulus". The app can create a new `sampleAnnoFile` where two or more variables of interest can be combined into one grouped variable (under "Variable(s) of interest" in the main `side bar`).

```{r new_sampleAnnoFile}
new_sampleAnnoFile <- sampleAnnoFile %>%
    tidyr::unite(
        "disease_region", # newly created grouped variable
        c("disease_status","region"), # variables to combine
        sep = "_"
    )

new_sampleAnnoFile$disease_region %>% unique()
```

Finally a spatial experiment object can be created.
```{r readGeoMx}
spe <- standR::readGeoMx(as.data.frame(countFile),
                         as.data.frame(new_sampleAnnoFile))
```

## Selecting groups to analyze

We merged "disease_status" and "region" to create a new group called, "disease_region". The app allows users to pick any subset of groups of interest. In this example, the four possible groups are "DKD_glomerulus", "DKD_tubule","normal_tubule", and "normal_glomerulus". The following script can subset `spe` to keep certain groups.

```{r selectedTypes}
selectedTypes <- c("DKD_glomerulus", "DKD_tubule", "normal_tubule", "normal_glomerulus")
toKeep <- colData(spe) %>%
    tibble::as_tibble() %>%
    pull(disease_region)

spe <- spe[, grepl(paste(selectedTypes, collapse = "|"), toKeep)]
```

## Applying QC filters

Regions of interest(ROIs) can be filtered out based on any quantitative variable in `colData(spe)` (same as `colnames(new_sampleAnnoFile)`). These options can be found under the "QC" `nav panel`'s `side bar`.  Let's say I want to keep ROIs whose "SequencingSaturation" is at least 85.

```{r filter}
filter <- lapply("SequencingSaturation", function(column) {
    cutoff_value <- 85
    if(!is.na(cutoff_value)) {
        return(colData(spe)[[column]] > cutoff_value)
    } else {
        return(NULL)
    }
})

filtered_spe <- spe[,unlist(filter)]

colData(spe) %>% dim()
colData(filtered_spe) %>% dim()
```

We can see that 14 ROIs have been filtered out. 

## PCA

Two PCA plots (colour-coded by biological groups or batch) for three normalization schemes are automatically created in the "PCA" `nav panel`. Let's take Q3 normalization as an example. 

```{r speQ3}
speQ3 <- standR::geomxNorm(filtered_spe, method = "upperquartile")
speQ3 <- scater::runPCA(filtered_spe)

speQ3_compute <- SingleCellExperiment::reducedDim(speQ3, "PCA")
```

Then, `.PCAFunction()` is called to generate the plot.
```{r .PCAFunction, echo=FALSE}
.PCAFunction <- function(spe, precomputed, colourShapeBy, selectedVar,
                         ROIshapes, ROIcolours) {
    standR::drawPCA(spe, precomputed = precomputed) +

        ## really need as.name() plus !! because of factor()
        ggplot2::geom_point(ggplot2::aes(
            shape = factor(!!as.name(colourShapeBy), levels = selectedVar),
            fill = factor(!!as.name(colourShapeBy), levels = selectedVar)
        ), size = 3, colour = "black", stroke = 0.5) +

        ggplot2::scale_shape_manual(colourShapeBy,
            values = as.integer(unlist(ROIshapes)) # as.integer() crucial
        ) +
        ggplot2::scale_fill_manual(colourShapeBy,
            ## colour() is a base function, so must be avoided
            values = unlist(ROIcolours)
        ) +
        ggplot2::scale_y_continuous(
            labels = scales::number_format(accuracy = 0.1)
        ) +
        ggplot2::scale_x_continuous(
            labels = scales::number_format(accuracy = 0.1)
        ) +
        ggplot2::theme(
            panel.grid.minor = ggplot2::element_blank(),
            panel.grid.major = ggplot2::element_blank(),
            axis.text = ggplot2::element_text(color = "black", size = 16),
            axis.line = ggplot2::element_blank(),
            axis.ticks = ggplot2::element_line(colour = "black"),
            axis.title = ggplot2::element_text(size = 16),
            legend.title = ggplot2::element_text(
                size = 16, vjust = 0.5, hjust = 0.5,
                face = "bold", family = "sans"
            ),
            legend.text = ggplot2::element_text(
                size = 16, vjust = 0.5, hjust = 0,
                face = "bold", family = "sans"
            ),
            plot.margin = grid::unit(c(1, 1, 1, 1), "mm"),
            plot.background = ggplot2::element_rect(
                fill = "transparent",
                colour = NA
            ),
            plot.title = ggplot2::element_text(
                size = 16, hjust = 0.5, face = "bold",
                family = "sans"
            ),
            panel.border = ggplot2::element_rect(
                colour = "black",
                linewidth = 0.4
            ),
            panel.background = ggplot2::element_rect(
                fill = "transparent",
                colour = NA
            ),
            legend.background = ggplot2::element_rect(
                fill = "transparent",
                colour = NA
            ),
            legend.box.background = ggplot2::element_rect(
                fill = "transparent",
                colour = NA
            ),
            legend.key = ggplot2::element_rect(
                fill = "transparent",
                colour = NA
            ),
            legend.position = "bottom",
            aspect.ratio = 1
        ) +
        ggplot2::guides(
            fill = ggplot2::guide_legend(
                nrow = length(selectedVar),
                title.position = "top"
            ),
            shape = ggplot2::guide_legend(
                nrow = length(selectedVar),
                title.position = "top"
            )
        )
}
```

```{r PCAFunction}
.PCAFunction(speQ3, speQ3_compute, "disease_region", selectedTypes, c(21,22,23,24), c("blue","blue","black","black"))
```

## Design matrix

Let's proceed with Q3 normalization. A design matrix is created next. 

```{r design}
design <- model.matrix(~0+disease_region, data = colData(speQ3))

# Clean up column name
colnames(design) <- gsub("disease_region", "", colnames(design))
```

If confounding variables are chosen in the main `side bar`, those would be added to `model.matrix` as `~0 + disease_region + confounder1 + confounder2)`.

## Creating a DGEList

`edgeR` is used to convert `SpatialExperiment` into `DGEList`, filter and estimate dispersion.

```{r convert}
library(edgeR)
dge <- SE2DGEList(speQ3)
keep <- filterByExpr(dge, design)

dge <- dge[keep, , keep.lib.sizes = FALSE]

dge <- estimateDisp(dge, design = design, robust = TRUE)
```

## Comparison between all biological groups

Recall our "selectedTypes" from above: "DKD_glomerulus", "DKD_tubule", "normal_tubule", "normal_glomerulus".

The following code creates all pairwise comparisons between them.
```{r contast}
# In case there are spaces
selectedTypes_underscore <- gsub(" ", "_", selectedTypes)

comparisons <- list()

comparisons <- lapply(
    seq_len(choose(length(selectedTypes_underscore), 2)),
    function(i) {
        noquote(
            paste0(
                utils::combn(selectedTypes_underscore, 2,
                    simplify = FALSE
                )[[i]][1],
                "-",
                utils::combn(selectedTypes_underscore, 2,
                    simplify = FALSE
                )[[i]][2]
            )
        )
    }
)

con <- makeContrasts(
    # Must use as.character()
    contrasts = as.character(unlist(comparisons)),
    levels = colnames(design)
)

colnames(con) <- sub("-", "_vs_", colnames(con))

con
```

## Fitting a linear regression model with `limma`

The app uses `duplicateCorrelation()` "[s]ince we need to make comparisons both within and between subjects, it is necessary to treat Patient as a random effect" [limma user's guide](https://www.bioconductor.org/packages/devel/bioc/vignettes/limma/inst/doc/usersguide.pdf) [@ritchie_limma_2015]. `limma-voom` method is used as `standR` package recommends [@liu_standr_2023]. 

```{r limma}
library(limma)
block_by <- colData(speQ3)[["SlideName"]]

v <- voom(dge, design)
corfit <- duplicateCorrelation(v, design,
    block = block_by
)

v2 <- voom(dge, design,
    block = block_by,
    correlation = corfit$consensus
)

corfit2 <- duplicateCorrelation(v, design,
    block = block_by
)

fit <- lmFit(v, design,
    block = block_by,
    correlation = corfit2$consensus
)

fit_contrast <- contrasts.fit(fit,
    contrasts = con
)

efit <- eBayes(fit_contrast, robust = TRUE)
```

## Tables of top differentially expressed genes
For each contrast (a column in `con`), the app creates a table of top differentially expressed genes sorted by their adjusted P value.

```{r tables}
# Keep track of how many comparisons there are
numeric_vector <- seq_len(ncol(con))
new_list <- as.list(numeric_vector)

# This adds n+1th index to new_list where n = ncol(con)
# This index contains seq_len(ncol(con)) 
# ex. new_list[[7]] = 1 2 3 4 5 6
# This coefficient allows ANOVA-like comparison in toptable()
if (length(selectedTypes) > 2) {
    new_list[[length(new_list) + 1]] <- numeric_vector
}

topTabDF <- lapply(new_list, function(i) {
    limma::topTable(efit,
        coef = i, number = Inf, p.value = 0.05,
        adjust.method = "BH", lfc = 1
    ) %>%
        tibble::rownames_to_column(var = "Gene")
})

# Adds names to topTabDF
if (length(selectedTypes) > 2) {
    names(topTabDF) <- c(
        colnames(con),
        colnames(con) %>% stringr::str_split(., "_vs_") %>%
            unlist() %>% unique() %>% paste(., collapse = "_vs_")
    )
} else {
    names(topTabDF) <- colnames(con)
}

```

`topTabDF` is now a list of tables where the list index corresponds to that of columns in `con`.

```{r tableExample}
colnames(con)
head(topTabDF[[1]])
```

These tables are then shown to the user in the "Tables" `nav panel`.

## Volcano plot and heatmap

No further *serious* data processing is performed at this point. The numbers in `topTabDF` is now parsed to create volcano plots and heatmaps in their respective `nav panel`s.