---
title: "Working with Non-human Array"
date: "`r BiocStyle::doc_date()`"
package: sesame
output: BiocStyle::html_document
fig_width: 6
fig_height: 5
vignette: >
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteIndexEntry{2. Non-human Array}
  %\VignetteEncoding{UTF-8}
---

```{r message=FALSE, warning=FALSE, include=FALSE}
library(wheatmap)
library(dplyr)
options(rmarkdown.html_vignette.check_title = FALSE)
```

SeSAMe provides extensive native support for the Illumina Mouse Methylation
BeadChip array (referred to as the MM285 or MMB array) and the Horvath Mammal40
array (refered to as the Mammal40 array). The MM285 array contains ~285,000
probes covering over 20 design categories including gene promoters, enhancers,
CpGs in synteny to human EPIC array as well as other biology. In SeSAMe, MM285
is used as the product abbreviation to distinguish future platforms including
MM320. This documents describe the procedure to process the MM285 and the
Mammal40 array.

We first load required library and perform sesame data caching (only needed
at new SeSAMe installation).

```{r nh1, message=FALSE, warning=FALSE}
library(sesame)
sesameDataCache()
```

# Species Inference

SeSAMe supports automatic inference of the profiled organism. This is achieved
by the `inferSpecies` function. Usually, users need not call this function
explicitly and only need to specify the code `S` as part of the 2nd argument of
the `openSesame` function. See [the Basics Usage vignette](sesame.html#prep)
for more detail.

The following example downloads an example Mammal40 array IDAT pair and call
`openSesame` function with species inference (note the `S` in the `prep=`
argument).

Download test IDATs 
`GSM4411982_Grn.idat.gz` and `GSM4411982_Red.idat.gz`
from
https://github.com/zhou-lab/InfiniumAnnotationV1/tree/main/Test

```{r nh3, message=FALSE, eval=FALSE}
betas = openSesame(sprintf("~/Downloads/GSM4411982", tmp), prep="SHCDPM")
```

The above code is equivalent to

```{r nh4, eval=FALSE}
## equivalent to the above openSesame call
betas = getBetas(matchDesign(pOOBAH(dyeBiasNL(inferInfiniumIChannel(
    prefixMaskButC(inferSpecies(readIDATpair(
        "~/Downloads/GSM4411982"))))))))
```

As can be seen, `inferSpecies` takes a SigDF as input and outputs an updated
SigDF which contains a species-specific masking and color channel
designation. This information is key to proper preprocessing since knowledge of
the color channel designation and probe hybridization success is the
foundational assumption of many preprocessing algorithms. One may instruct the
function to return the inferred species information by using the
`return.species = TRUE` argument. The following example shows this usage:

```{r nh13, message=FALSE, eval=TRUE}
sdf = sesameDataGet("Mammal40.1.SigDF") # an example SigDF
inferSpecies(sdf, return.species = TRUE)
```

Internally, we used a nonparametric scoring method to infer the most likely
species from a pool of 310 candidate species from Ensembl (v101). The
`return.auc = TRUE` argument allows one to peek into the AUC (Area Under the
Curve) score generated in this inference. The higher the score, the more likely
the data is from the corresponding species. Knowing the scores can help
diagnose misclassifications such as when several candidate species are closely
related and hard to discriminate from data.

```{r nh14, message=FALSE}
## showing the candidate species with top scores
head(sort(inferSpecies(sdf, return.auc = TRUE), decreasing=TRUE))
```

If the user believes that automatic inference gives wrong (most often still
close-related) species, one can force species inference to a target species by
using the `updateSigDF` function. For example, the following code forces the
`SigDF` to be treated as a `mus_musculus` sample. Note this doesn't alter
signal intensity but only change the probe masking and color channel spec (the
view of the data, not the data reading itself).

```{r nh15, eval=FALSE}
sdf_mouse <- updateSigDF(sdf, species="mus_musculus")
```

**CRITICAL:** Since `updateSigDF` function resets the whole mask and col column
of SigDF. One should use this function (and `inferSpecies`) before other
preprocessing functions that sets mask and col.


# Mouse Strain Inference

Like species inference, strain-specific masking and preprocessing is important
for mouse array samples. This is achieved by the `inferStrain` function. The
function is represented by the `T` code in `openSesame`/`prepSesame`. The
following example shows how to use `inferStrain` in `openSesame`. Note the use
of `T` in the prep code.

Download test IDATs 
`204637490002_R05C01_Grn.idat` and `204637490002_R05C01_Red.idat`
from
https://github.com/zhou-lab/InfiniumAnnotationV1/tree/main/Test

```{r nh2, message=FALSE, eval=FALSE}
betas = openSesame("~/Downloads/204637490002_R05C01", prep="TQCDPB")
```

Like `inferSpecies`, one need to call the `inferStrain` function before calling
the standard `noob`, `dyeBiasNL`, etc (by having `T` before `QCDPB` when
calling `openSesame`). Also like `inferSpecies()`, `inferStrain()` will return
a new `SigDF` with col and mask updated to reflect the change of
strain. Optionally, one can also specify `return.strain=TRUE`,
`return.pval=TRUE` or `return.probability=TRUE` to return the inferred strain,
the p-value or the probabilities of all 37 strain candidates.  Internally, the
function converts the beta values to variant allele frequencies. It should be
noted that since variant allele frequency is not always measured by the
M-allele of the probe. SeSAMe flips the $\beta$ values for some probes to
calculate variant allele frequency. The following example shows what
`inferStrain` does to a `SigDF`:

```{r nh9, message=FALSE}
sdf = sesameDataGet("MM285.1.SigDF") # an example dataset
inferStrain(sdf, return.strain = TRUE) # return strain as a string
sdf_after = inferStrain(sdf)   # update mask and col by strain inference
sum(sdf$mask) # before strain inference
sum(sdf_after$mask) # after strain inference
```

Let's visualize the probabilities of all candidate strains using the
`return.probabilities` option:

```{r nh10, fig.width=6, fig.height=4, message=FALSE}
library(ggplot2)
p = inferStrain(sdf, return.probability = TRUE)
df = data.frame(strain=names(p), probs=p)
ggplot(data = df,  aes(x = strain, y = probs)) +
            geom_bar(stat = "identity", color="gray") +
            ggtitle("Strain Probabilities") +
            ylab("Probability") + xlab("") +
            scale_x_discrete(position = "top") +
            theme(axis.text.x = element_text(angle = 90, vjust=0.5, hjust=0),
            legend.position = "none")
```

See also [The Supplemental
Vignette](https://zhou-lab.github.io/sesame/v1.16/supplemental.html#SNP) for
heatmap validation of strain inference.

# Quality Masking

In the MM285 array, we designed multimapping probes to allow querying
transposable elements and other biology. We also exposed probes with
potentially design flaws. These suboptimally designed probes take a new probe
ID prefix ("uk") in addition to the "cg"/"ch"/"rs" typically seen.  By default
the repeat and suboptimally designed probes are masked by `NA`. These masking
is done by the `qualityMask` function (or `Q` in prep codes).  To override
masking these probes, one can use the `resetMask` function (the `0` code in
`openSesame`) to remove the default masking. We recommend using it after the
preprocessing function that depends on reliable/uniquely-mapped probes, but
before detection p-value based masking (e.g. pOOBAH). This way, probes that
fail detection can still be flagged (they should be).

```{r nh7, message=FALSE}
sdf = sesameDataGet('MM285.1.SigDF')
sum(is.na(openSesame(sdf, prep="TQCDPB")))
sum(is.na(openSesame(sdf, prep="TQCD0PB")))
```

# Human-mouse Mixture

UNDER CONSTRUCTION

There are other inferences one can do on the nonhuman arrays, e.g., sex, age
(epigenetic clocks), tissue, copy number alteration etc. These will be
elaborated in [The Inference Vignette](inferences.html).

# Session Info

```{r}
sessionInfo()
```