---
title: "xenLite: exploration of a class for Xenium demonstration data"
author: "Vincent J. Carey, stvjc at channing.harvard.edu"
date: "`r format(Sys.time(), '%B %d, %Y')`"
vignette: >
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteIndexEntry{xenLite: exploration of a class for Xenium demonstration data}
  %\VignetteEncoding{UTF-8}
output:
  BiocStyle::html_document:
    highlight: pygments
    number_sections: yes
    theme: united
    toc: yes
---

```{css style settings, echo = FALSE}
blockquote {
    padding: 10px 20px;
    margin: 0 0 20px;
    font-size: 14px;
    border-left: 5px solid #eee;
}
```

# Introduction

## Motivation

Numerous groups are working on methodology for analyzing
new data resources emerging from spatial transcriptomics experiments.
The "computing technology stack" necessary to work
with such data is complex.  The following
[comment](https://spatialdata.scverse.org/en/latest/tutorials/notebooks/notebooks/examples/models1.html)
(Aug 16 2024) from the scverse group is characteristic:

> The SpatialData Zarr format, which is described in our design doc, is an extension of the OME-NGFF specification, which makes use of the OME-Zarr, the AnnData Zarr and the Parquet file formats. We need to use these combination of technologies because currently OME-NGFF does not provide all the fundamentals required for storing spatial omics dataset; nevertheless, we try to stay as close as OME-NGFF as possible, and we are contributing to ultimately make spatial omics support available in pure OME-NGFF.

In Bioconductor 3.19, we might be committed to all these technologies (which 
include but do not mention HDF5) plus
R, reticulate, basilisk, and sf to have a widely functional solution.
Attaching the [Voyager](https://bioconductor.org/packages/Voyager) 
package leads to loading over 100 associated packages.
The xenLite package is intended to explore the balance between functionality
and high dependency load specifically for the analysis of outputs
from 10x Xenium experiments.  The XenSPEP class extends
[SpatialExperiment](https://bioconductor.org/packages/SpatialExperiment) to 
include references to parquet
files that manage voluminous geometry data; geometry can also be handled as DataFrame for sufficiently
low volume experiments.

As of 0.0.7 of xenLite, a new class, XenSPEP, is provided to address
parquet representation of cell, nucleus and transcript coordinates.

## Installation

To install in Bioconductor 3.20 or later:
```{r doinst,eval=FALSE}
BiocManager::install("xenLite")
```


## Data

This package is based on publicly available datasets.


- **Human dermal melanoma** [FFPE Human pan tissue dataset](https://www.10xgenomics.com/datasets/xenium-prime-ffpe-human-skin).  For this dataset, we
retrieved the `outs.zip` file from this site and ran `XenSCE::ingestXen`
on the contents, producing a XenSPEP instance which was then saved
in an HDF5-backed representation.  This, along with the parquet
files for cell, nucleus, and transcript coordinates, are zipped together
and placed in an Open Storage Network bucket for retrieval via
`cacheXenPdmelLite`.

- **Human prostate adenocarcinoma** [FFPE Human pan tissue dataset](https://www.10xgenomics.com/datasets/xenium-prime-ffpe-human-prostate).  For this dataset, we
retrieved the `outs.zip` file from this site and ran `XenSCE::ingestXen`
on the contents, producing a XenSPEP instance which was then saved
in an HDF5-backed representation.  This, along with the parquet
files for cell, nucleus, and transcript coordinates, are zipped together
and placed in an Open Storage Network bucket for retrieval via
`cacheXenProstLite`.

- **Human Lung** [FFPE Lung cancer preview](https://www.10xgenomics.com/datasets/preview-data-ffpe-human-lung-cancer-with-xenium-multimodal-cell-segmentation-1-standard), 
Use `example(cacheXenLuad)` to obtain this instance of XenSPEP.

## A quick look

### Hybrid example

In this example, transcript counts are in memory
in a sparse matrix, but geometry information is
handled in parquet.  The `viewSegG2` function
allows selection of two gene symbols, and
cells are colored according to single or double
occupancy.

```{r dolung, fig.width=7, fig.height=7,message=FALSE}
library(xenLite)
pa <- cacheXenLuad()
luad <- restoreZipXenSPEP(pa)
rownames(luad) <- make.names(SummarizedExperiment:::rowData(luad)$Symbol, unique = TRUE)
out <- viewSegG2(luad, c(5800, 6300), c(1300, 1800), lwd = .5, gene1 = "CD4", gene2 = "EPCAM")
legend(5800, 1390, fill = c("purple", "cyan", "pink"), legend = c("CD4", "EPCAM", "both"))
out$ncells
```

### Interactivity

In `inst/app4`, code is provided to work with the primary dermal melanoma
dataset.  A map of the cell coordinates can drive focused exploration.  The
region of interest is shown by a whitened rectangle in the upper left corner.

```{r lkmap,echo=FALSE,fig.cap="Map based on cell centroids."}
knitr::include_graphics("fblmap.png")
```

Cells are colored by quintile of size.  Points where FBL transcripts
are found are plotted as dots.

```{r lkzm,echo=FALSE,fig.cap="Zoom to slider-specified subplot."}
knitr::include_graphics("FBLsketch.png")
```

More work is needed to identify useful exploratory visualizations.

### Large disk-based example

#### HDF5+parquet-backed example: *not ready in xenLite*

We want to be able to accommodate very large numbers of
cells and transcripts without heavy infrastructure
commitments.  

We'll use the prostate adenocarcinoma 5K dataset to demonstrate.
A 900MB zip file will be cached.
```{r doprost1,eval=FALSE}
prost <- cacheXenProstLite()
```
`prost` is the path to a zip file in a BiocFileCache instance.

Create a folder to work in, and unzip.
```{r doprost2,eval=FALSE}
dir.create(xpw <- file.path(tempdir(), "prost_work"))
unzip(prost, exdir = xpw)
dir(xpw)
```

Restore the SpatialExperiment component.
```{r doprost3,eval=FALSE}
prostx <- HDF5Array::loadHDF5SummarizedExperiment(file.path(xpw, "xen_prost"))
prostx
```

# Session information

```{r lksi}
sessionInfo()
```