---
title: "20.1 -- Working with Large Data"
author: "Martin Morgan <Martin.Morgan@RoswellPark.org>"
output:
  BiocStyle::html_document:
    toc: true
    toc_depth: 2
vignette: >
  %\VignetteIndexEntry{20.1 -- Working with Large Data}
  %\VignetteEngine{knitr::rmarkdown}
---

```{r style, echo = FALSE, results = 'asis'}
knitr::opts_chunk$set(
    eval=as.logical(Sys.getenv("KNITR_EVAL", "TRUE")),
    cache=as.logical(Sys.getenv("KNITR_CACHE", "TRUE")))
```

Author: [Martin Morgan][]<br/>
Date: 26 July 2019<br/>

[Martin Morgan]: mailto:Martin.Morgan@RoswellPark.org

# Motivating example

Attach the [TENxBrainData][] experiment data package

```{r, message = FALSE}
library(TENxBrainData)
```

Load a very large `SummarizedExperiment`

```{r}
tenx <- TENxBrainData()
tenx
assay(tenx)
```

Quickly perform basic operations

```{r}
log1p(assay(tenx))
```
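
These operations return quickly because they are 'delayed': assuming the assay is an HDF5-backed `DelayedMatrix`, as returned by `TENxBrainData()`, the result is another `DelayedMatrix`, and values are computed only when explicitly realized in memory, e.g.,

```{r}
## realize a small corner of the delayed, log-transformed assay in memory;
## only a small block of values is read from disk and computed
as.matrix(log1p(assay(tenx))[1:5, 1:5])
```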

Subset to 1000 cells and summarize the 'library size' (total count per cell)

```{r}
tenx_subset <- tenx[,1:1000]
lib_size <- colSums(assay(tenx_subset))
hist(log10(lib_size))
```

The data is sparse, with more than 92% of the values in this subset equal to 0

```{r}
sum(assay(tenx_subset) == 0) / prod(dim(tenx_subset))
```

# Write or use efficient _R_ code

- The most important step!

Avoid unnecessary copying

```{r}
n <- 50000

## makes a copy of `res1` on each iteration
set.seed(123)
system.time({
    res1 <- NULL
    for (i in 1:n)
        res1 <- c(res1, rnorm(1))
})

## pre-allocation
set.seed(123)
system.time({
    res2 <- numeric(n)
    for (i in 1:n)
        res2[i] <- rnorm(1)
})
identical(res1, res2)

## no need to think about allocation!
set.seed(123)
system.time({
    res3 <- sapply(1:n, function(i) rnorm(1))
})
identical(res1, res3)
```

_Vectorize_ your own scripts
  
```{r}
n <- 2000000

## iteration: n calls to `rnorm(1)`
set.seed(123)
system.time({
    res1 <- sapply(1:n, function(i) rnorm(1))
})

## vectorize: 1 call to `rnorm(n)`
set.seed(123)
system.time( res2 <- rnorm(n) )

identical(res1, res2)
```

_Reuse_ other people's efficient code.

- E.g., `limma::lmFit()` fits tens of thousands of linear models very quickly (see the sketch below)

Examples in the lab this afternoon, and from Tuesday evening
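
A minimal sketch of the kind of re-use meant here, using simulated (not workshop) data and assuming the `limma` package is available:

```{r, message = FALSE}
library(limma)

## 10,000 simulated 'genes' x 6 samples, in two groups of three
set.seed(123)
exprs <- matrix(rnorm(10000 * 6), nrow = 10000, ncol = 6)
colnames(exprs) <- paste0("sample", 1:6)
group <- factor(rep(c("A", "B"), each = 3))
design <- model.matrix(~ group)

## fit 10,000 linear models at once -- essentially instantaneous
fit <- eBayes(lmFit(exprs, design))
topTable(fit, coef = 2, number = 3)
```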

# Use chunk-wise iteration

Example: GC content of mapped reads

## An example without chunking

Work through example data; it is not really large, so the complete file can also be processed in a single pass for comparison.

```{r, message = FALSE}
library(RNAseqData.HNRNPC.bam.chr14)
fname <- RNAseqData.HNRNPC.bam.chr14_BAMFILES[1]
basename(fname)
Rsamtools::countBam(fname)
```

Read the alignments into a `GAlignments` object, including the `seq`
(sequence) of each read.

```{r, message = FALSE}
library(GenomicAlignments)
param <- ScanBamParam(what = "seq")
galn <- readGAlignments(fname, param = param)
```

Write a function to determine GC content of reads

```{r, message = FALSE}
library(Biostrings)
gc_content <- function(galn) {
    seq <- mcols(galn)[["seq"]]
    gc <- letterFrequency(seq, "GC", as.prob=TRUE)
    as.vector(gc)
}
```

Calculate and display GC content

```{r}
param <- ScanBamParam(what = "seq")
galn <- readGAlignments(fname, param = param)
res1 <- gc_content(galn)
hist(res1)
```

## The same example with chunking

Open the file for reading, specifying the 'yield' (chunk) size

```{r}
bfl <- BamFile(fname, yieldSize = 100000)
open(bfl)
```

Repeatedly read chunks of data and calculate GC content

```{r}
res2 <- numeric()
repeat {
    message(".")
    galn <- readGAlignments(bfl, param = param)
    if (length(galn) == 0)
        break
    ## inefficient copy of res2, but only a few iterations...
    res2 <- c(res2, gc_content(galn))
}
```

Clean up and compare approaches

```{r}
close(bfl)
identical(res1, res2)
```

# Use (classical) parallel evaluation

Many down-sides

- More complicated code, e.g., to distribute data
- Relies on, and requires mastery of, supporting infrastructure

Maximum speed-up

- In theory, proportional to the number of parallel computations
- In reality, reduced by
  - the cost of moving data from the 'manager' to the 'workers'
  - the additional overhead of orchestrating parallel computations

## [BiocParallel][]

```{r, echo=FALSE}
xx <- gc(); xx <- gc()
```

```{r, message = FALSE}
fun <- function(i) {
    Sys.sleep(1)    # a time-consuming calculation
    i               # and then the result
}

system.time({
    res1 <- lapply(1:10, fun)
})

library(BiocParallel)
system.time({
    res2 <- bplapply(1:10, fun)
})

identical(res1, res2)
```

Several back-ends support different scenarios

- 'Forked' processes (non-Windows)
  - No need to distribute data from the main _R_ process to workers
- Independent processes
- Classic clusters, e.g., _slurm_
- Coming: cloud-based solutions
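
Each of these corresponds to a [BiocParallel][] back-end that can be registered for use by `bplapply()` and other parallel functions; a minimal sketch (worker counts are arbitrary, and the chunk is not evaluated here):

```{r, eval = FALSE}
library(BiocParallel)

## forked processes; no data distribution needed, but not available on Windows
register(MulticoreParam(workers = 4))

## independent R processes; works on all platforms
register(SnowParam(workers = 4))

## a classic cluster scheduler such as slurm, via the batchtools package
register(BatchtoolsParam(workers = 4, cluster = "slurm"))
```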

## [GenomicFiles][]

Parallel, chunk-wise iteration through genomic files. Set up:

```{r, message = FALSE}
library(GenomicFiles)
```

Define a `yield` function that provides a chunk of data for processing

```{r}
yield <- function(x) {
    param <- ScanBamParam(what = "seq")
    readGAlignments(x, param = param)
}
```

Define a `map` function that transforms the input data to the desired result

```{r}
map <- function(x) {
    seq <- mcols(x)[["seq"]]
    gc <- letterFrequency(seq, "GC", as.prob = TRUE)
    as.vector(gc)
}
```

Define a `reduce` function that combines two successive results

```{r}
reduce <- c
```

Perform the calculation, chunk-wise and in parallel

```{r}
library(RNAseqData.HNRNPC.bam.chr14)
fname <- RNAseqData.HNRNPC.bam.chr14_BAMFILES[1]
bfl <- BamFile(fname, yieldSize = 100000)

res <- reduceByYield(bfl, yield, map, reduce, parallel = TRUE)
hist(res)
```

[BiocParallel]: https://bioconductor.org/packages/BiocParallel
[GenomicFiles]: https://bioconductor.org/packages/GenomicFiles
[TENxBrainData]: https://bioconductor.org/packages/TENxBrainData

# Query specific values
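
Rather than reading an entire file or data set, restrict input to just the values needed. Indexed file formats (e.g., BAM, or bgzip'd and tabix-indexed VCF) support this directly; on-disk representations such as the HDF5-backed assay in the motivating example behave similarly, reading only the subset queried. A minimal sketch, assuming the indexed BAM file used above and an arbitrary region of chr14:

```{r, message = FALSE}
library(RNAseqData.HNRNPC.bam.chr14)
library(GenomicAlignments)
fname <- RNAseqData.HNRNPC.bam.chr14_BAMFILES[1]

## restrict input to reads overlapping a single region of interest
param <- ScanBamParam(which = GRanges("chr14:21000000-22000000"))
galn <- readGAlignments(fname, param = param)
length(galn)
```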

# Provenance

```{r}
sessionInfo()
```