---
title: "Predicting cell cycle phase using peco"
author: "Joyce Hsiao"
date: "`r Sys.Date()`"
output: 
    rmarkdown::html_vignette
vignette: >
    %\VignetteIndexEntry{An example of predicting cell cycle phase using peco}
    %\VignetteEngine{knitr::rmarkdown}
    %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
    collapse = TRUE,
    comment = "#>"
)
```

## Installation 

To install and load the package, run:

```R
install.packages("devtools")
library(devtools)
install_github("jhsiao999/peco")
```

`peco` uses `SingleCellExperiment` class objects.

```{r}
library(peco)
library(SingleCellExperiment)
library(doParallel)
library(foreach)
```

## Overview

`peco` is a supervised approach for PrEdicting cell cycle phase in a
COntinuum using single-cell RNA sequencing data. The R package provides functions to build training dataset and also functions to use existing training data to predict cell cycle on a continuum.

Our work demonstrated that peco is able to predict continuous cell cylce phase using a small set of cylcic genes:  _CDK1_, _UBE2C_, _TOP2A_, _HISTH1E_, and _HISTH1C_ (identified as cell cycle marker genes in studies of yeast ([Spellman et al., 1998][spellman]) and HeLa cells ([Whitfield et al., 2002][whitfield])).

Below we provide two use cases. Vignette 1 shows how to use the built-training dataset to predict continuous cell cycle. Vignette 2 shows how to make a training datast and build a predictor using training data. 

Users can also view the vigenettes via `browseVignettes("peco")`.


## About the training dataset

`training_human` stores built-in training data of 101 significant cyclic genes. Below are the slots contained in `training_human`:

- `predict.yy`: a gene by sample matrix (101 by 888) that stores predict cyclic expression values. 
- `cellcycle_peco_reordered`: cell cycle phase in a unit circle (angle), ordered from 0 to 2$pi$
- `cellcycle_function`: lists of 101 function corresponding to the top 101 cyclic genes identified in our dataset
- `sigma`: standard error associated with cyclic trends of gene expression
- `pve`: proportion of variance explained by the cyclic trend


```{r}
data("training_human")
```

## Predict cell cycle phase using gene expression data

`peco` is integrated with `SingleCellExperiment` object in Bioconductor. Below shows an example of inputting `SingleCellExperiment` object to perform cell cycle phase prediction. 

`sce_top101genes` includes 101 genes and 888 single-cell samples and one assay slot of `counts`.

```{r}
data("sce_top101genes")
assays(sce_top101genes)
```

Transform the expression values to quantile-normalizesd counts-per-million values. `peco` uses the `cpm_quantNormed` slot as input data for predictions.

```{r}
sce_top101genes <- data_transform_quantile(sce_top101genes)
assays(sce_top101genes)
```

Apply the prediction model using function `cycle_npreg_outsample` and generate prediction results contained in a list object `pred_top101genes`.

```{r}
pred_top101genes <- cycle_npreg_outsample(
    Y_test=sce_top101genes,
    sigma_est=training_human$sigma[rownames(sce_top101genes),],
    funs_est=training_human$cellcycle_function[rownames(sce_top101genes)],
    method.trend="trendfilter",
    ncores=1,
    get_trend_estimates=FALSE)
```

The `pred_top101genes$Y` contains a SingleCellExperiment object with the predict cell cycle phase in the `colData` slot.

```{r}
head(colData(pred_top101genes$Y)$cellcycle_peco)
```

Visualize results of prediction for one gene. Below we choose CDK1 ("ENSG00000170312"). Because
CDK1 is a known cell cycle gene, this visualization serves as a sanity
check for the results of fitting. The fitted function 
`training_human$cellcycle_function[[1]]` was obtained from our training data. 

```{r}
plot(y=assay(pred_top101genes$Y,"cpm_quantNormed")["ENSG00000170312",],
     x=colData(pred_top101genes$Y)$theta_shifted, main = "CDK1",
     ylab = "quantile normalized expression")
points(y=training_human$cellcycle_function[["ENSG00000170312"]](seq(0,2*pi, length.out=100)),
       x=seq(0,2*pi, length.out=100), col = "blue", pch =16)
```

## Visualize cyclic expression trend based on predicted phase

Visualize results of prediction for the top 10 genesone genes. Use `fit_cyclical_many` to estimate cyclic function based on the input data.


```{r}
# predicted cell time in the input data
theta_predict = colData(pred_top101genes$Y)$cellcycle_peco
names(theta_predict) = rownames(colData(pred_top101genes$Y))

# expression values of 10 genes in the input data
yy_input = assay(pred_top101genes$Y,"cpm_quantNormed")[1:6,]

# apply trendfilter to estimate cyclic gene expression trend
fit_cyclic <- fit_cyclical_many(Y=yy_input, 
                                theta=theta_predict)

gene_symbols = rowData(pred_top101genes$Y)$hgnc[rownames(yy_input)]

par(mfrow=c(2,3))
for (i in 1:6) {
plot(y=yy_input[i,],
     x=fit_cyclic$cellcycle_peco_ordered, 
     main = gene_symbols[i],
     ylab = "quantile normalized expression")
points(y=fit_cyclic$cellcycle_function[[i]](seq(0,2*pi, length.out=100)),
       x=seq(0,2*pi, length.out=100), col = "blue", pch =16)
}
```

## Session information

```{r}
sessionInfo()
```

[spellman]: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC25624
[whitfield]: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC117619