---
title: "omicsPrint: detection of data linkage errors in multiple omics studies"
shorttitle: "Verifying sample relationships"
package: "omicsPrint"
author: 
- name: "Maarten van Iterson"
  affiliation: 
  - &id "Department of Medical Statistics and Bioinformatics, Section Moleculair Epidemiology, Leiden University Medical Center,
  Leiden, The Netherlands"  
  email: mviterson@gmail.com
- name: "Davy Cats"
  affiliation: *id  
  email: davycats.dc@gmail.com
- name: "Paul Hop"
  affiliation: *id    
  email: pjhop@gmail.com  
- name: "Bas T. Heijmans"
  affiliation: *id    
  email: b.t.heijmans@lumc.nl
date: "`r Sys.Date()`"
abstract: >
  The Aanalysis of multiple omics datatypes from the same
  individuals is becoming increasingly common. For example,
  several data repositories contain genetic, transcriptomic
  and epigenetic (DNA methylation) measurements on the same
  individuals, e.g., TCGA, Geuvadis, BBMRI/BIOS, etc. However,
  errors in the data linkage are almost inevitable in such
  complex projects (e.g. due to sample mix-ups) and unknown
  family relationships may be present. If not corrected, such
  errors may introduce false positive or false negative
  findings. To streamline necessary quality control, wWe have
  developed a tool to reliably verify the sample relationships
  between and across omics data types using genotype data that
  is directly measured or inferred from for example RNA-seq or
  DNA methylation array data.    	  
output: 
  html_document:
    toc: true
    toc_float:
      collapsed: false
vignette: >
  %\VignetteIndexEntry{omicsPrint: detection of data linkage errors in multiple omics studies}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
bibliography: omicsPrint.bib
---


```{r setup, include=FALSE}
set.seed(22062017)
library(omicsPrint)
library(BiocStyle)
library(GEOquery)
library(SummarizedExperiment)
```

	
# Within omics sample relationship verification #

Here we illustrate how to detect data linkage errors with the
`r Biocpkg("omicsPrint")` package using both artificially generated data and
experimental data. Several additional vignettes are available that
show the use of the package on further experimental data in different
settings, i.e., 450k DNA methylation and imputed
genotypes.
  

## Create toy data ##

Here we generate a single vector with 100 randomly drawn integers from
the set; 1, 2, 3, representing 100 SNP calls from a single
individual. Three additional individuals are generated by randomly
swapping a certain fraction of the SNPs. Swapping 5 SNPs will
introduce a few mismatches mimicking a situation where the same
individual was measured twice (replicate) but with measurement
error. Swapping 50% of the SNPs will be similar to the difference in
genotypes between parents and offspring. Swapping all SNPs will result
in a situation similar to comparing two unrelated individuals.

```{r, toydata}
swap <- function(x, frac=0.05) {
    n <- length(x)
    k <- floor(n*frac)
    x1 <- sample(1:n,k)
    x2 <- sample(1:n,k) ##could be overlapping
    x[x2] <- x[x1]
    x
}
x1 <- 1 + rbinom(100, size=2, prob=1/3)
x2 <- swap(x1, 0.05) ##strongly related e.g. replicate
x3 <- swap(x1, 0.5) ##related e.g. parent off spring
x4 <- swap(x1, 1) ##unrelated
x <- cbind(x1, x2, x3, x4)
```

Now `x` contains the 100 SNPs for the four individuals using `head` we
can inspect the first six SNPs.
		
```{r, head}
head(x)
```

	
## Running the `allelesharing` algorithm ##

We use Identity by State (IBS) for the set of SNPs to infer sample
relations. See @Abecasis2001, for the description of this approach
applied to genetic data. Briefly, between each sample pair, the
IBS-vector is calculated, which is a measure of genetic
distance between individuals. Next, the vector is summarized by its
mean and variance. A mean of 2 and variance of 0 indicates that the
samples are identical. 

```{r, alleleSharing}
data <- alleleSharing(x, verbose=TRUE)
```

The set of SNPs may contain uninformative SNPs, SNPs of bad quality or
even SNPs could be missing. The following pruning steps are
implemented to yield the most informative set of SNPs (thresholds can
be adjusted see `?alleleSharing`):
       
1. a SNP is remove if missing in >5% of the samples (`callRate = 0.95`)
2. a sample should have at least 2/3 of the SNPs called (`coverageRate = 2/3`)
3. a SNP is can be removed if it violates the Hardy-Weinberg equilibrium (`alpha = 0`).
4. a SNP is can be removed if the minor allele frequency is below the given threshold (`maf = 0`)   

Hardy-Weinberg test-statistics is calculated using a $\chi^2$-test and
Bonferonni multiple testing correction is performed.

	
```{r, data}
data
```
	
By default no relations are assumed except for the self-self
relations.

The output is a `data.frame` containing all pairwise comparisons with
the mean and variance of the IBS over the set of SNPs and the reported
sample relationship, including the identifiers.

## Report mismatches and provide graphical summary ##


Since we provided a list of known relations and assume that the
majority is correct, we can build a classifier to discover
misclassified relations. Linear discriminant analysis is used to
generate a confusion-matrix, which is subsequently used to graphically
represent the classification boundaries and generate an output file
with misclassified sample pairs.
   

```{r, inferrelations, fig.cap = "Scatter-plot of IBS mean and variance with classification boundary for pairwise comparison between the samples without specifying sample relationships using artificially generated data."}
mismatches <- inferRelations(data)
mismatches
```


There is one misclassified sample, namely the replicate that we
introduced but was not a priori specified as an existing
relationship. The true relationship with between sample `x1` and
sample `x2` is an identical relation. Furthermore, we see two sample
pairs with mean IBS of `r round(mismatches$mean, 2)` and variance
`r round(mismatches$var, 2)` which is an indication that also these pairs
share a considerable number of alleles. If known, such relationships
can be specified prior to analysis.
	

```{r, extendedrelations}
relations <- expand.grid(idx = colnames(x), idy= colnames(x))
relations$relation_type <- "unrelated"
relations$relation_type[relations$idx == relations$idy] <- "identical"
relations$relation_type[c(2,5)] <- "identical" ##replicate
relations$relation_type[c(3,7,9,10)] <- "parent offspring"
relations
```

Rerun the allelesharing algorithm now provided with the known
relations.

```{r, addrelations, fig.cap = "Scatter-plot of IBS mean and variance with classification boundaries for pairwise comparison between the samples with specifying sample relationships using artificially generated data."}
data <- alleleSharing(x, relations=relations)
data
mismatches <- inferRelations(data)
mismatches
```

All misclassified relations were resolved.	

# Across omics data type sample relationship verification #


The previous example showed how to perform sample relationship
verification within a single omics data type. If a second set of SNPs
is available obtained from a different omic data type (and the SNPs
are partly overlapping), `r Biocpkg("omicsPrint")` can be used to verify
relationships between omics types, e.g. to establish whether two omics
data types were indeed generated for the same individual in order to
exclude or detect sample mix-ups.

In this artificial example a random subset of 80 SNPs is selected as
the set of SNPs from a different omic type. First running without
providing the known relations.


```{r, xyallelesharing1}
rownames(x) <- paste0("rs", 1:100)
y <- x[sample(1:100, 80),]
data <- alleleSharing(x, y)
```
	
Note that pruning is performed on both data types and automatically a
set of overlapping SNPs (80) is used provided that the rownames of `x`
and `y` are identical (this also holds for sample relations where the
relation identifiers `idx` and `idy` should match with the columnames
of `x` and `y`).

```{r, xyallelesharing2, fig.cap = "Scatter-plot of IBS mean and variance with classification boundary for pairwise comparison between the samples without specifying sample relationships using artificial data."}
data
mismatches <- inferRelations(data)
mismatches
```

Now running with providing the known relations.	

```{r, addrelations2, fig.cap = "Scatter-plot of IBS mean and variance with classification boundaries for pairwise comparison between the samples with specifying sample relationships using artificial data."}
data <- alleleSharing(x, y, relations)
data
mismatches <- inferRelations(data)
mismatches
```

Providing the known, true relationships thus yields no missclassified
sample relationships.

# An example using real world methylation data from a `SummarizedExperiment` #

Here we will show how you could varify sample relationships on
publicly available DNA methylation data. The dataset used here
contains pairs of monozygotic twins. We will extract the beta-value
matrix from GEO
[GSE100940](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE100940),
[paper in
press](http://www.sciencedirect.com/science/article/pii/S1875176817300872).
 
First we extract the data from GEO using the
[*GEOquery*](http://bioconductor.org/packages/GEOquery/)-package.

```{r downloadretry, include=FALSE}
library(GEOquery)
library(SummarizedExperiment)
file <- tempfile(fileext = ".txt.gz")
cnt <- 0
value <- -1
while(value != 0  & cnt < 25) {
    value = download.file("ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE100nnn/GSE100940/matrix/GSE100940_series_matrix.txt.gz", file)
    cnt <- cnt + 1
}
gset <- getGEO(filename=file, getGPL=FALSE)
```
			
```{r downloaddata, eval=FALSE}
library(GEOquery)
library(SummarizedExperiment)
file <- tempfile(fileext = ".txt.gz")
download.file("ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE100nnn/GSE100940/matrix/GSE100940_series_matrix.txt.gz", file)
gset <- getGEO(filename=file, getGPL=FALSE)
```

Next we convert the returned object into a 
[*SummarizedExperiment*](http://bioconductor.org/packages/SummarizedExperiment/):

	
```{r geo2se}
se <- makeSummarizedExperimentFromExpressionSet(gset)
se
```

	
Sample data can be extracted from the `SummarizedExperiment`-object using the
`colData`-function and we can see which pair of twins each sample belongs to 
through the `source_name_ch1` field. Using this knowledge we can construct a 
table of expected relationships:
      
```{r makerelationships}
r <- expand.grid(idx=colnames(se), idy=colnames(se))
r$Xpair <- sapply(strsplit(as.character(colData(se)[r$idx, "source_name_ch1"]),
                           split = "_"), head, 1)
r$Ypair <- sapply(strsplit(as.character(colData(se)[r$idy, "source_name_ch1"]),
                           split = "_"), head, 1)
r$relation_type <- "unrelated"
r$relation_type[r$Xpair == r$Ypair] <- "twin"
r$relation_type[r$idx == r$idy] <- "identical"
head(r)
```

Several probes on the array contain SNPs occurring frequently in different 
populations[@Chen2013; @Zhou2016]. We can use these to verify the expected 
relationships. We have made these data available from inside of this package.

Now we make a selection of CpGs probably affected by polymorphic SNPS
in populations from East Asian, as these samples are from South Korea:

```{r selectcpgs}
data(hm450.manifest.pop.GoNL)
cpgs <- names(hm450.manifest.pop.GoNL[
    mcols(hm450.manifest.pop.GoNL)$MASK.snp5.EAS])
se <- se[cpgs,]
```

Next the beta-values are converted to genotypes using our enhanced
K-means algorithm:

```{r genotyping}
dnamCalls <- beta2genotype(se, assayName = "exprs")
dim(dnamCalls)
dnamCalls[1:5, 1:5]
```

The DNA methylation based genotype calls can be directly supplied to the
allelesharing algorithm to perform the intra-omic sample matching:

```{r allelesharing, dpi=72, fig.cap="Scatter-plot of IBS mean and variance with classification boundaries for pairwise comparison between samples consisting of pairs of monozygotic twins."}
data <- alleleSharing(dnamCalls, relations = r, verbose = TRUE)
mismatches <- inferRelations(data)
mismatches
```

The twins are predicted as being identical to each other. This is not
unexpected as they are monozygotic.

# SessionInfo #

```{r, sessioninfo}
sessionInfo()
```

# Reference #