---
title: "Tidyverse Patterns"
author: "Shian Su"
date: "`r Sys.Date()`"
output: BiocStyle::pdf_document
vignette: >
  %\VignetteIndexEntry{Tidyverse Patterns}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(purrr)
library(dplyr)
library(tidyr)
library(ggplot2)
library(CellBench)
```

# Introduction

This vignette will introduce tidyverse patterns that are useful for making full
use of the CellBench framework. CellBench was developed with tidyverse
compatibility as a fundamental goal. `purrr` provides functional programming
tools to manipulate the methods and lists produced in this framework. `dplyr` is
very useful for working with the `tibble`-based structures produced by 
CellBench. Since the outputs are mostly in `tibble` structure, they are very
easily visualised using `ggplot2`.

For detailed explanations of tidyverse packages please see resources at
https://www.tidyverse.org/learn/, in particular 
[R for Data Science](https://r4ds.had.co.nz).

For quick concise references of tidyverse features and functions I recommend
all of the cheatsheets available at 
https://www.rstudio.com/resources/cheatsheets/.

\newpage
# Functional Programming with purrr

## Methods as Function Objects

In CellBench we require methods to take in only a single argument. The idea is
that all methods within a single pipeline step should take the same kind of
input and produce the same type of output. In practice most methods have
additional parameters that can be tuned, and we may use `purrr::partial()` to
help pre-fill these parameters.

`partial()` takes a function along with values for some of its arguments, and
returns a new function with those arguments pre-filled. This reduces the number
of free parameters in your function.

We demonstrate a trivial application of `partial()`:

```{r}
library(CellBench)
library(purrr)

# function to raise number to a power
pow <- function(x, n) {
    x^n
}

pow2 <- partial(pow, n = 2)
pow3 <- partial(pow, n = 3)

pow2(2)
pow3(2)
```

Here `partial()` allowed us to turn a two-parameter function into a single
parameter function. Using `partial()` is good practice compared to the two
alternatives:

1. Writing duplicate functions
2. Writing function wrappers

Consider writing duplicate functions

```{r, eval = FALSE}
pow2 <- function(x) {
    x^2
}

pow3 <- function(x) {
    x^3
}
```

Now say you wanted to add an argument check such as `stopifnot(is.numeric(x))`:
you would need to edit the code in two places. The chance of errors leading to
inconsistencies increases dramatically with the number of duplications and
changes required.

Consider writing function wrappers

```{r, eval = FALSE}
pow <- function(x, n) {
    x^n
}

pow2 <- function(x) {
    pow(x, 2)
}

pow3 <- function(x) {
    pow(x, 3)
}
```

The issue here is that these functions can contain much more in their bodies
than the simple function call. When working in collaboration or sharing the code
in general, it's not immediately clear that the intention of the derived
functions is only to pre-fill certain variables. Using `partial()` is concise
and unambiguous in purpose.
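To make the maintenance benefit concrete, here is the argument check from
earlier added under the `partial()` approach. The check only needs to be
written once, and both derived functions inherit it automatically:

```{r}
# the check lives in a single place; pow2 and pow3 both inherit it
pow <- function(x, n) {
    stopifnot(is.numeric(x))
    x^n
}

pow2 <- partial(pow, n = 2)
pow3 <- partial(pow, n = 3)

pow2(3)
```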

See also:

* `?partial` in the example section for more ways to use partial
* `?fn_arg_seq` for the CellBench utility to construct a list of functions with
varying parameter values
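As a sketch of the `fn_arg_seq()` utility mentioned above (assuming the `pow()`
function defined earlier; see `?fn_arg_seq` for the authoritative usage), it
constructs a list of functions with a parameter varied over a sequence of
values:

```{r, eval = FALSE}
# construct a list of pow functions with n pre-filled to 1, 2 and 3
pow_funcs <- fn_arg_seq(pow, n = 1:3)

# each element behaves like a single-argument function
pow_funcs[[2]](2)
```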

## Function Composition

When functions take just one argument, they can easily be composed to create
new functions. We use `purrr::compose()` for this; it takes a series of
functions as input and returns a function that applies them in right-to-left
order, so that the right-most function is applied first and the left-most
is applied last.

```{r}
# find the maximum absolute value
max_absolute <- compose(max, abs)

max_absolute(rnorm(100))
```
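Since purrr 0.3.0, `compose()` also accepts a `.dir` argument;
`.dir = "forward"` applies the functions left-to-right instead, which some
find easier to read:

```{r}
# equivalent to compose(max, abs), but reads in application order:
# abs is applied first, then max
max_absolute_fwd <- compose(abs, max, .dir = "forward")

max_absolute_fwd(c(-5, 3))
```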

This is useful for stitching together steps of a pipeline manually, for example
if some methods require normalisation but some do not, then you may write
wrappers as follows

```{r, eval = FALSE}
method1 <- function(x) {
    x <- normalise(x)
    method_func1(x)
}

method2 <- function(x) {
    method_func2(x)
}

method3 <- function(x) {
    x <- normalise(x)
    method_func3(x)
}
```

alternatively you could write

```{r, eval = FALSE}
method1 <- compose(method_func1, normalise)
# identity simply returns its argument, used here for code consistency
method2 <- compose(method_func2, identity)
method3 <- compose(method_func3, normalise)
```

which is more succinct and less error-prone. This is especially useful when two
or more steps in a pipeline only work in specific combinations: those
combinations can be fused together to form a single step.

\newpage
## Mapping Over Lists

The majority of `purrr`'s functionality revolves around mapping functions over
a list of inputs. This is represented by the `map()` family of functions, which
usually take a list and a function as arguments and return the result of
applying the function to each element of the list.

`map()` is the primary function: it performs a basic mapping over a list and
returns a list. It behaves almost identically to `lapply()`, but is accompanied
by a family of suffixed variants that are useful in various situations.

```{r}
x <- list(1, 2, 3)

map(x, function(x) { x * 2 })
```
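For example, the typed variants `map_dbl()` and `map_chr()` return atomic
vectors instead of lists, which avoids the need for a separate `unlist()` step:

```{r}
x <- list(1, 2, 3)

# returns a numeric vector rather than a list
map_dbl(x, function(x) { x * 2 })

# returns a character vector
map_chr(x, function(x) { paste0("value: ", x) })
```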

One useful variant is `map2()` which takes two lists and applies a two-argument
function to the first elements of both lists, second elements and so on. This
can be used to pass additional variables into the function.

```{r}
# list of random values from different distributions
x <- list(
    rpois(100, lambda = 5),
    rpois(100, lambda = 5),
    rgamma(100, shape = 5),
    rgamma(100, shape = 5)
)

# list of additional parameters
y <- list(
    "mean",
    "median",
    "mean",
    "median"
)

# function that takes values and a mode argument
centrality <- function(x, mode = c("mean", "median")) {
    mode <- match.arg(mode)

    if (mode == "mean") {
        mean(x)
    } else if (mode == "median") {
        median(x)
    }
}

# using map2 to apply function to two lists
map2(x, y, centrality)
```
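When more than two parallel lists are needed, `pmap()` generalises this
pattern: it takes a list of lists and applies an n-argument function across
the corresponding elements of each:

```{r}
# pmap() applies the function across corresponding elements of all lists,
# matching list names to argument names
pmap(
    list(x = list(1, 2, 3), n = list(2, 3, 2)),
    function(x, n) { x^n }
)
```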

\newpage
# Table Manipulation with dplyr

## Operations on the Benchmark tibble

The fundamental `benchmark_tbl` is derived from the `tibble` class, which
behaves almost identically to a regular `data.frame`. It is therefore
compatible with the `dplyr` family of table manipulation functions.

```{r}
library(dplyr)
# list of data
datasets <- list(
    set1 = rnorm(500, mean = 2, sd = 1),
    set2 = rnorm(500, mean = 1, sd = 2)
)

# list of functions
add_noise <- list(
    none = identity,
    add_bias = function(x) { x + 1 }
)

res <- apply_methods(datasets, add_noise)
class(res)
res
```

From our results we can filter the rows or manipulate the columns with regular
`dplyr` operations.

```{r}
# filtering rows to only data from set 1
res %>%
    filter(data == "set1")

# filtering rows to only add_bias method
res %>%
    filter(add_noise == "add_bias")

# mutating data column to prepend "data" to data set names
res %>%
    mutate(data = paste0("data", data))
```

## Calculating multiple columns of metrics

We often want to plot two or more metrics against each other, and for this purpose it is most useful to have each metric in its own column. The default CellBench output keeps all metric values in a single column, but they can be spread across columns quite easily using `spread()` from the `tidyr` package.

```{r}
metric <- list(
    mean = mean,
    median = median
)

# simply applying the metrics results in a single column
res %>%
    apply_methods(metric)

# spread metrics across columns
res %>%
    apply_methods(metric) %>%
    spread(metric, result)
```
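Note that `spread()` has been superseded by `pivot_wider()` in tidyr 1.0.0.
Assuming the same `res` and `metric` objects from above, the equivalent call
would be:

```{r, eval = FALSE}
# pivot_wider() is the modern replacement for spread()
res %>%
    apply_methods(metric) %>%
    pivot_wider(names_from = metric, values_from = result)
```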

\newpage
# Plotting with ggplot2

## Basic Plotting

Tibble results are easy to use with `ggplot2`. For a more extensive introduction to `ggplot2`, see [R for Data Science: Chapter 3](https://r4ds.had.co.nz/data-visualisation.html). Here we plot the results of our pipelines in a single plot. Because the result column is a list-column, it needs to be unnested to produce a "flat" table; see `?tidyr::unnest` for more explanation of unnesting. For convenience, we also use `pipeline_collapse()` from CellBench to concatenate the method names at each step into a single character string representing a pipeline.

```{r}
library(tidyr)
library(ggplot2)

# I prefer my own theme for ggplot2; the following theme code is optional
theme_set(theme_bw() + theme(
    plot.title = element_text(face = "plain", size = rel(20/12),
                              hjust = 1/2, margin = margin(t = 10, b = 20)),
    axis.text = element_text(size = rel(14/12)),
    strip.text.x = element_text(size = rel(16/12)),
    axis.title = element_text(size = rel(16/12))
))

scale_colour_discrete <- function(...) scale_colour_brewer(..., palette="Set1")
scale_fill_discrete <- function(...) scale_fill_brewer(... , palette="Set1")
```

```{r}
# pipeline collapse constructs a single string from the pipeline steps,
# unnest expands the list-column of results, transforming the result 
# into a flat table.
collapsed_res <- pipeline_collapse(res) %>%
    unnest()

ggplot(collapsed_res, aes(x = pipeline, y = result)) +
    geom_boxplot()
```


## Facetting

We can "facet" the above plot by sectioning it into related sub-plots. This general idea is covered in the [facet section](https://r4ds.had.co.nz/data-visualisation.html#facets) of the previously linked text. This also demonstrates the benefit of having the data in a tibble format.

```{r}
# remember that we have to unnest the data before it's appropriate
# for plotting
ggplot(unnest(res), aes(x = add_noise, y = result)) +
    geom_boxplot() +
    facet_grid(~data)
```