---
title: "Identifying Copy Number Polymorphisms"
author: "Jacob Carey, Steven Cristiano, and Robert Scharpf"
date: \today
output: BiocStyle::pdf_document
bibliography: references.bib
vignette: >
  %\VignetteIndexEntry{Identifying Copy Number Polymorphisms}
  %\VignetteEngine{knitr::rmarkdown}
  %\usepackage[utf8]{inputenc} 
---

# Introduction

Identify consensus start and stop coordinates of a copy number
polymorphism

The collection of copy number variants (CNVs) identified in a study
can be encapulated in a GRangesList, where each element is a GRanges
of the CNVs identified for an individual.  (For a study with 1000
subjects, the GRangesList object would have length 1000 if each
individual had 1 or more CNVs.)  For regions in which CNVs occur in
more than 2 percent of study participants, the start and end
boundaries of the CNVs may differ because of biological differences in
the CNV size as well as due to technical noise of the assay and the
uncertainty of the breakpoints identified by a segmentation of the
genomic data.  Among subjects with a CNV called at a given locus, the
`consensusCNP` function identifies the largest region that is copy
number variant in half of these subjects.

# Simulating CNP data

Included in the CNPBayes package are objects of class `SnpArrayExperiment`
and `GRangesList`. We begin by loading the necessary libraries and data.

```{r prelim}
suppressMessages(library(CNPBayes))
suppressMessages(library(SummarizedExperiment))
se <- readRDS(system.file("extdata", "simulated_se.rds", package="CNPBayes"))
grl <- readRDS(system.file("extdata", "grl_deletions.rds", package="CNPBayes"))
```

The object `se` contains log R ratios and B Allele Frequencies, and the object
`grl` is a `GRangesList` of simulated deletions. 

# Finding CNPs

After reading this saved data, we visualize the CNVs.

```{r plot-cnvs}
cnv.region <- consensusCNP(grl[1], max.width=5e6)
i <- subjectHits(findOverlaps(cnv.region, rowRanges(se)))
xlim <- c(min(start(se)), max(end(se)))
par(las=1)
plot(0, xlim=xlim, ylim=c(0, 26), xlab="Mb", ylab="sample index", type="n",
     xaxt="n")
at <- pretty(xlim, n=10)
axis(1, at=at, labels=round(at/1e6, 1), cex.axis=0.8)
rect(start(grl), seq_along(grl)-0.2, end(grl), seq_along(grl)+0.2,
     col="gray", border="gray")
```

Before further analysis can be performed, the log R ratios in a CNV region must
be summarized to a one dimensional object [@cardin]. Within each CNP locus, 
log R ratios at each SNP are summarized by sample. Below are examples of using 
the median and the first principal component to create a one dimensional 
summary.

## CNP Boundaries with Median Summary
To summarize samples within a CNV locus by the median log R ratios, we define 
the largest region that spans 50 percent of the samples using the function
`consensusCNP`.  Using log R ratios of SNPs contained in this region, the
median is taken across samples.  Because the deletions in this example are 
large ($>$ 2Mb), we specify a large value for `max.width` to avoid filtering
these CNVs.

```{r median}
cnv.region <- consensusCNP(grl, max.width=5e6)
i <- subjectHits(findOverlaps(cnv.region, rowRanges(se)))
med.summary <- matrixStats::colMedians(assays(se)[["cn"]][i, ], na.rm=TRUE)
par(las=1)
plot(0, xlim=xlim, ylim=c(0, 26), xlab="Mb", ylab="sample index", type="n",
     xaxt="n")
at <- pretty(xlim, n=10)
axis(1, at=at, labels=round(at/1e6, 1), cex.axis=0.8)
rect(start(grl), seq_along(grl)-0.2, end(grl), seq_along(grl)+0.2,
     col="gray", border="gray")
abline(v=c(start(cnv.region), end(cnv.region)), lty=2)
```

## CNP Boundaries with PCA Summary
Another method for summarizing the log R ratops is by the first principal 
component on the markers for the entire region [@cardin]. A possible 
disadvantage of this approach is that the scale of the loadings makes it more 
difficult to interpret the copy number of the mixture components. Instead, 
the median log R ratio is adequate and retains the original scale. An 
advantage of the principal component method is that the minimum `start` and 
maximum `end` can be used to define the CNV region. The principal component 
method should downweight markers that are not consistent with the Copy Number 
Variation.

```{r pca}
cnv.region2 <- reduce(unlist(grl))
i.pc <- subjectHits(findOverlaps(cnv.region2, rowRanges(se)))
x <- oligoClasses::lrr(se)[i.pc, ]
nas <- rowSums(is.na(x))
na.index <- which(nas > 0)
x <- x[-na.index, , drop=FALSE]
pc.summary <- prcomp(t(x))$x[, 1]
meds.for.pc <- matrixStats::colMedians(x, na.rm=TRUE)
if(cor(pc.summary, meds.for.pc) < 0) pc.summary <- -1*pc.summary

par(las=1)
plot(0, xlim=xlim, ylim=c(0, 26), xlab="Mb", ylab="sample index", type="n",
     xaxt="n")
at <- pretty(xlim, n=10)
axis(1, at=at, labels=round(at/1e6, 1), cex.axis=0.8)
rect(start(grl), seq_along(grl)-0.2, end(grl), seq_along(grl)+0.2,
     col="gray", border="gray")
abline(v=c(start(cnv.region2), 
           end(cnv.region2)), lty=2)
```

Note that boundaries of created by principal component summary method
are wider than those in the above median summary method. In this
example, the boundaries are not significantly different, but in samples
with ragged starts and ends, the PCA method will provide greater
coverage.

## Summary Plots
Finally, we plot the one dimensional summaries.

```{r summary-plots}
par(mfrow=c(1,2), las=1)
plot(med.summary, main="median summary of\nconsensus CNP", cex.main=0.7, pch=20)
plot(pc.summary, main="PC summary of\nreduced CNV ranges", cex.main=0.7, pch=20)
```

## Filtering
Loci with no additions or deletions can be identified as normal distribution. If a loci has log R ratios drawn from a normal distribution, then the loci should not be included in the model fitting process. Removing these unneeded loci can reduce the computational burden. An easy way to identify such loci is by using the Shapiro-Wilk test as a test of normality [@shapiro]. To be conservative, we suggest retaining only those loci which have a *p* value $< 0.1$. 

```{r shapiro}
shapiro.test(med.summary)
```

## Mixture Model
After deciding which one dimensional summary to use and summary of the data is
complete, one can proceed to fitting a `MixtureModel` to the data. Models can
be hierarchical over the batches, or marginal over the batches.  Please refer 
to the CNPBayes vignette for fitting a `MixtureModel`.

# References