---
title: "Handling large H5AD datasets"
subtitle: 'On-disk versus in-memory storage'
author: "Developed by [Gabriel Hoffman](http://gabrielhoffman.github.io/)"
documentclass: article
vignette: >
  %\VignetteIndexEntry{Loading large-scale H5AD datasets}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
  %\usepackage[utf8]{inputenc}
output:
  BiocStyle::html_document:
    toc: true
    toc_float: true
---

<!---

cd /Users/gabrielhoffman/workspace/repos/dreamlet/vignettes

# rm -rf h5ad_on_disk_cache/ h5ad_on_disk.html

rmarkdown::render("h5ad_on_disk.Rmd")


--->

<style>
body {
text-align: justify}
</style>

```{r setup, include=FALSE}
knitr::opts_chunk$set(
  echo = TRUE,
  warning = FALSE,
  message = FALSE,
  error = FALSE,
  eval = FALSE,
  tidy = FALSE,
  dev = c("png"),
  cache = TRUE
)
```
# Introduction
As the scale of single cell RNA-seq datasets continues to increase, loading the entire dataset into memory is often impractical.  Dreamlet can take advantage of efficient data storage through the [Bioconductor's](https://bioconductor.org) [SingleCellExperiment](https://bioconductor.org/packages/SingleCellExperiment/) interface to dramatically reduce memory usage.  There are two general methods for storing large single-cell datasets.  [SingleCellExperiment](https://bioconductor.org/packages/SingleCellExperiment/), and therefore `dreamlet`, are compatible with both:

1) __On-disk storage__: The data remains on the physical disk and is only read into memory when requested.  In R there is little data duplication in memory since operations are delayed until the result is requested using `DelayedMatrix`.  Memory usage is minimized at the cost of performance since disk access can be slow.

2) __In-memory storage__:  Single cell read counts are generally sprase with, say, 20% of read counts being non-zero.  The counts can be stored as a `sparseMatrix` that stores only non-zero entries.  Performance is higher since data is in memory, but excessive  memory usage can be a problem.












# On-disk storage: zellkonverter
 [zellkonverter](https://github.com/theislab/zellkonverter) takes advantage of the [H5AD](https://anndata.readthedocs.io/en/latest/fileformat-prose.html) file format built on the [HDF5](https://en.wikipedia.org/wiki/Hierarchical_Data_Format) format in order to dramatically reduce memory usage while still retaining performance.  The [zellkonverter](https://github.com/theislab/zellkonverter) package uses a [DelayedArray](https://bioconductor.org/packages/DelayedArray/) backend to provide a seamless interface to an on-disk H5AD dataset through the interface of the [SingleCellExperiment](https://bioconductor.org/packages/SingleCellExperiment/) class.  

`zellkonverter` is an interface for python code.  Note that the python dependencies are installed the first time `readH5AD()` is run, rather than when the package is installed.  So the first run of `readH5AD()` can take a few minutes, but subsequent runs will be fast.   


## Standard usage 
```{r example, eval=FALSE}
library(zellkonverter)
library(SingleCellExperiment)

# Create SingleCellExperiment object that points to on-disk H5AD file
sce <- readH5AD(h5ad_file, use_hdf5 = TRUE)
```

## Merge multiple datasets
```{r example2, eval=FALSE}
# Read a series of H5AD files into a list
# then combine them into a single merged SingleCellExperiment
sce.lst <- lapply(h5ad_files, function(file) {
  readH5AD(file, use_hdf5 = TRUE)
})
sce <- do.call(cbind, sce.list)
```

## Access alternative data
Some software, including [pegasus](https://pegasus.readthedocs.io/en/stable/), store normalized data in the default portion of the file and save raw counts in the `raw/` field.  Here, load a H5AD file generated by pegasus. 
```{r example3, eval=FALSE}
# read raw/ from H5AD file
# raw = TRUE tells readH5AD() to read alternative data
# Must use zellkonverter >=1.3.3
sce_in <- readH5AD(h5ad_file, use_hdf5 = TRUE, raw = TRUE)

# use `raw` as counts
sce <- swapAltExp(sce_in, "raw") # use raw as main
counts(sce) <- assay(sce, "X") # set counts assay to data in X
assay(sce, "X") <- NULL # free X
```
<!---
# extract raw counts
# include Dimnames and colData
sce = altExp( sce_in, "raw", withDimnames=TRUE, withColData=TRUE)

# convert these alternative data to SingleCellExperiment
sce = as(sce, "SingleCellExperiment")

# Copy over reducedDims too
reducedDims(sce) <- reducedDims(sce_in)
--->


# In-memory storage: Seurat
[Seurat](https://satijalab.org/seurat) can also convert and import H5AD files, and then convert to [SingleCellExperiment](https://bioconductor.org/packages/SingleCellExperiment/).  The resulting `SingleCellExperiment` object stores data as a `sparseMatrix`.

```{r Seurat, eval=FALSE}
library(SingleCellExperiment)
library(SeuratDisk)

# Convert h5ad file to h5seurat format
Convert(h5ad_file, dest = "h5seurat")

# load Seurat file
obj <- LoadH5Seurat(h5seurat_file)

# Convert Seurat object to SingleCellExperiment
sce <- as.SingleCellExperiment(obj)
```

# Comparing interfaces
The `SingleCellExperiment` interface to `zellkonverter` and `Seurat` hides the backend differences from the typical R user.  The usage of `dreamlet` is the same in both cases. For small to medium datasets, the performance differences should be minimal.  However, for large datasets there can be a substantial difference in performance.  Use of `zellkonverter` and `DelayedMatrix` minimize memory usage at the cost of computationl performance.  Use of `Seurat` and `sparseMatrix` can be faster if there is sufficent memory.  

`aggregateToPseudoBulk()` for computing pseudobulk data uses different backend implements for on-disk `DelayedMatrix` and in-memory `sparseMatrix`.  Both versions are highly optimized compared to a naive implementation.


<!---
Note that a [SingleCellExperiment](https://bioconductor.org/packages/SingleCellExperiment/) created by [zellkonverter](https://github.com/theislab/zellkonverter) stores the counts as a `DelayedMatrix` so the data remains on disk and are only accessed as needed.  This reduces memory usage substantailly, and `aggregateToPseudoBulk()` uses methods optimized for this data structure.

On the other hand, a [SingleCellExperiment](https://bioconductor.org/packages/SingleCellExperiment/) created using [Seurat](https://satijalab.org/seurat) and then `as.SingleCellExperiment()` stores count data in memory as a `sparseMatrix`.  This uses less memory than a dense matrix, but substantailly more than a `DelayedMatrix`.  While `aggregateToPseudoBulk()` has been developed to speed up analysis of `sparseMatrix` data, it can be considerablely slower then using the `DelayedMatrix` above.  
--->








# Session Info
<details>
```{r session, echo=FALSE}
sessionInfo()
```
</details>






<!---
on-disk data access to dramatically reduce memory usage.



Storing 18K genes across 2M cells with double precision would require 288 Gb memory.  Even if the dataset is sparse so that 80\% of the entires are zero, loading the entire dataset requires 58 Gb of data.  Since `R` uses copy-by-value, data duplication quickly becomes a problem even for a high memory machine.   

Instead of loading the entire dataset into memory, we adopt an on-disk approach that takes advantage of the [H5AD](https://anndata.readthedocs.io/en/latest/fileformat-prose.html) file format built on the [HDF5](https://en.wikipedia.org/wiki/Hierarchical_Data_Format) format in order to dramatically reduce memory usage while still retaining high performance.  The [zellkonverter](https://github.com/theislab/zellkonverter) package uses a [DelayedArray](https://bioconductor.org/packages/DelayedArray/) backend to provide a seamless interface to an on-disk H5AD dataset through the interface of the [SingleCellExperiment](https://bioconductor.org/packages/SingleCellExperiment/) class.  SingleCellExperiment is [Bioconductor's](https://bioconductor.org) core data class for single cell data, and data from other formats (i.e. [Seurat](https://cran.r-project.org/web/packages/Seurat/index.html)) can easily be [converted](https://rdrr.io/cran/Seurat/man/as.SingleCellExperiment.html) to it.  Importantly,  `R`'s copy-by-value is no longer an issue because the data is not duplicated when it is passed to another function, since it remains on-disk.      

This enables any analysis designed for the [SingleCellExperiment](https://bioconductor.org/packages/SingleCellExperiment/) class to take advantage of seamless on-disk access to large-scale datasets.

--->