---
title: "SeSAMe User Guide"
shorttitle: "sesame guide"
author: "Wanding Zhou, Timothy J Triche Jr, Hui Shen"
package: sesame
output: rmarkdown::html_vignette
fig_width: 8
fig_height: 6
vignette: >
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteIndexEntry{SeSAMe User Guide}
  %\VignetteEncoding{UTF-8}
---

# Installation

From Bioconductor
```{r, eval=FALSE}
if (!requireNamespace("BiocManager", quietly=TRUE))
    install.packages("BiocManager")
BiocManager::install("sesame")
```

Development version can be installed from github
```{r, eval=FALSE}
library(devtools)
install_github('zwdzwd/sesameData')
install_github('zwdzwd/sesame')
```

# Data Structure for Signal Intensities

SeSAMe is designed to process Illumina Infinium DNA methylation data. It
currently supports EPIC, HM450 and HM27 platforms. The design includes a
light-weight full exposure of internal signal intensities (essential 
information for users of Illumina methylation array data, as demonstrated
in Zhou et al 2018), which permits sensitive and specific joint inference
on copy number and DNA methylation.

Central to the SeSAMe platform is the `SigSet` data structure,
an S4 class with slots containing signals for six different classes of probes:

1) `II` - Type-II probes;
2) `IR` - Type-I Red channel probes;
3) `IG` - Type-I Grn channel probes;
4) `oobG` - Out-of-band Grn channel probes (matching Type-I Red channel probes
in number);
5) `oobR` - Out-of-band Red channel probes (matching Type-I Grn channel probes
in number);
6) `ctl` - control probes.

For all save control probes, signal intensities are stored as an `Nx2` numeric
matrix, with `N` representing the number of probes in the class. The two
columns of the matrix represent the methylated probe intensity and the 
unmethylated probe intensity. (Previously, this was implemented in an R6
Reference class, `SignalSet`. The current S4 implementation in `SigSet`
complies with Bioconductor guidelines, and for backwards compatibility, the
`signalR6toS4` function transforms a `SignalSet` to a `SigSet`.

```{r}
library(sesameData)
library(sesame)
sset <- sesameDataGet('EPIC.1.LNCaP')$sset
```

For example, printing the SigSet directly shows its content
```{r}
sset
```

Type-II probe signal can be browsed in
```{r}
head(slot(sset, 'II'))
```

or via the getter function
```{r}
head(II(sset))
```

Similarly, signals for Type-I probes (`sset@IR` and `sset@IG`) and out-of-band
probes (`sset@oobG` and `sset@oobR`) can be found in
```{r}
head(IR(sset))
head(oobG(sset))
```
as one can see the probe names (row names) of `IR` always coincide with the
probe names (row names) of `oobG` (and vice versa). This is because the
out-of-band probe signal for red channel probes is in green channel
(and vice versa). Lastly, Control probes are represented in a data frame with
the last column holding the type of the control.
```{r}
head(ctl(sset))
```

# The openSesame pipeline

The openSesame pipeline is composed of noob, nonlinear dye bias correction
and pOOBAH, achieved through:
```{r}
IDATprefixes <- searchIDATprefixes(
    system.file("extdata/", package = "sesameData"))
betas <- openSesame(IDATprefixes)
```
or equivalently
```{r}
betas <- getBetas(dyeBiasCorrTypeINorm(noob(sset)))
```
behind the scene.

# Functionalities

SeSAMe implements stricter QC and preprocessing standards: comprehensive
probe quality masking, bleed-through correction in background subtraction,
nonlinear dye bias correction, stricter nondetection calling and control for
bisulfite conversion based on C/T-extension probes. The package also provides
convenient, performant implementations of typical analysis steps, such as the
inference of gender, age, ethnicity (based on both internal SNP probes and
channel-switching Type-I probes) directly from the data. This allows users to
infer these common covariates if such information is not provided, and to 
check for potential sample swaps when it is provided.  SeSAMe also provides
functionality for calling differential methylation and segmented copy number.

### Read IDATs into SigSet list
```{r}
ssets <- lapply(
    searchIDATprefixes(system.file("extdata/", package = "sesameData")),
    readIDATpair)
```

A simple list of "SigSet"s are returned. One can also just provide a vector
of file paths prefixes (excluding `_Red.idat` and `_Grn.idat`, one prefix for
a pair of IDATs) and call `readIDATpair` directly.

### Background subtraction

Like many other Infinium Methylation-targeted software, SeSAMe implements the
background subtraction based on normal-exponential deconvolution using
out-of-band probes `noob`
([Triche et al. 2013](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3627582/))
and optionally with extra bleed-through subtraction. Signal bleed-through
happens when measurement from one channel affects the measurement in the other
channel. SeSAMe's `noobsb` further removes residual background by regressing
out the green-to-red and red-to-green relationship using Type-I probes.
```R
sset.nb <- noob(sset)
sset.nb <- noobsb(sset)
```

### Dye bias correction

Dye bias refers to the difference in signal intensity between the two color
channel. SeSAMe offers two flavors of dye bias correction: linear scaling
(`dyeBiasCorr`) and nonlinear scaling (`dyeBiasCorrTypeINorm`). Linear scaling
equalize the mean of all probes from the two color channel.
```{r}
library(sesame)
sset.dbLinear <- dyeBiasCorr(sset)
qqplot(
    slot(sset.dbLinear, 'IR'), slot(sset.dbLinear, 'IG'),
    xlab='Type-I Red Signal', ylab='Type-I Grn Signal',
    main='Linear Correction', cex=0.5)
abline(0,1,lty='dashed')
```

Residual dye bias can be corrected using nonlinear quantile interpolation with
Type-I probes.
```{r}
sset.dbNonlinear <- dyeBiasCorrTypeINorm(sset)
```

Under this correction, Type-I Red probes and Type-I Grn probes have the same
distribution of signal.
```{r}
qqplot(
    slot(sset.dbNonlinear, 'IR'), slot(sset.dbNonlinear, 'IG'),
    xlab='Type-I Red Signal', ylab='Type-I Grn Signal',
    main='Nonlinear Correction', cex=0.5)
abline(0,1,lty='dashed')
```

Note that linear scaling does not shift beta values of Type-I probes while
nonlinear scaling does shift beta values of Type-I probes.

### Get betas

Beta values are defined as `methylated signal`/(`methylated signal`
+ `unmethylated signal`). It can be computed using `getBetas` function.
The output is a named vector with probe ID as name.  There are two options
for `getBetas` that affects probe masking. The first is
`quality.mask=TRUE/FALSE` which switches probe quality masking. The quality
masking includes mapping issues, SNPs and non-uniqueness, and is described
in [Zhou et al 2017](https://academic.oup.com/nar/article/45/4/e22/2290930).
`nondetection.mask = TRUE/FALSE` is used to switch masking of nondetection
based on detection P-value. Both masks are recommended to ensure data quality
and defaulted to TRUE.
```{r}
betas <- getBetas(sset)
head(betas)
```

Beta values for Type-I probes can also be obtained by summing up the two
in-band channel and out-of-band channel. This rescues probes with SNP hitting
the extension base and hence switching color channel. More details can be
found in
[Zhou et al 2017](https://academic.oup.com/nar/article/45/4/e22/2290930).
```{r}
betas <- getBetasTypeIbySumChannels(sset)
```

For such probes, extra SNP allele frequencies can be derived by summing up
methylated and umethylated alleles.
```{r}
extraSNPAFs <- getAFTypeIbySumAlleles(sset)
```

### Sample/experiment QC

SeSAMe implements inference of sex, age, ethnicity. These are valuable
information for checking the integrity of the experiment and detecting sample
swaps.

#### Sex

Sex is inferred based on our curated X-linked probes and Y chromosome probes
excluding pseudo-autosomal regions.
```{r}
inferSex(sset)
inferSexKaryotypes(sset)
```

#### Ethnicity

Ethnicity is inferred using a random forest model trained based on both the
built-in SNPs (`rs` probes) and channel-switching Type-I probes.
```{r}
inferEthnicity(sset)
```

#### Age

SeSAMe provides age regression a la the Horvath 353 model.
```{r}
betas <- sesameDataGet('HM450.1.TCGA.PAAD')$betas
predictAgeHorvath353(betas)
```

#### Mean intensity

The mean intensity of all the probes characterize the quantity of input DNA
and efficiency of probe hybridization.
```{r}
meanIntensity(sset)
```

#### Bisulfite conversion control using GCT scores

Infinium platforms are intrinsically robust to incomplete bisulfite conversion
as non-converted probes would fail to hybridize to the target. Residual
incomplete bisulfite conversion can be quantified using GCT score based on
C/T-extension probes. Details of this method can be found in
[Zhou et al. 2017](https://academic.oup.com/nar/article/45/4/e22/2290930).
The closer the score to 1.0, the more complete the bisulfite conversion.
```{r}
bisConversionControl(sset)
```

### Probe retrieval and $\beta$-value visualization

To visualize all probes from a gene
```{r, message=FALSE, fig.width=6, fig.height=5}
betas <- sesameDataGet('HM450.10.TCGA.PAAD.normal')
visualizeGene('DNMT1', betas, platform='HM450')
```

To visualize probes from arbitrary region
```{r, message=FALSE, fig.width=6, fig.height=5}
visualizeRegion(
    'chr19',10260000,10380000, betas, platform='HM450',
    show.probeNames = FALSE)
```

To visualize by probe names
```{r, message=FALSE, fig.width=6}
visualizeProbes(c("cg02382400", "cg03738669"), betas, platform='HM450')
```

### CNV

SeSAMe performs copy number variation in three steps: 1) normalizes the signal
intensity using a copy-number-normal data set; 2) groups adjacent probes into
bins; 3) runs DNAcopy internally to group bins into segments.
```{r, message=FALSE, fig.width=6}
ssets.normal <- sesameDataGet('EPIC.5.normal')
segs <- cnSegmentation(sset, ssets.normal)
```

To visualize segmentation in SeSAMe,
```{r, message=FALSE, fig.width=6}
visualizeSegments(segs)
```

### Cell Composition Deconvolution

SeSAMe estimates leukocyte fraction using a two-component model.This function
works for samples whose targeted cell-of-origin is not related to white blood
cells.
```{r, message=FALSE}
betas.tissue <- sesameDataGet('HM450.1.TCGA.PAAD')$betas
estimateLeukocyte(betas.tissue)
```