---
title: "A Brief Introduction to Data Analytics on SeqArray GDS Files"
author: "Xiuwen Zheng (Department of Biostatistics, University of Washington, Seattle)"
date: "Sep 13, 2015"
output:
  html_document:
    theme: spacelab
    toc: yes
  pdf_document:
    toc: yes
    toc_depth: 3
vignette: >
  %\VignetteIndexEntry{SeqArray Data Analytics}
  %\VignetteDepends{gdsfmt}
  %\VignetteKeywords{GWAS, whole-genome, sequencing, SNV}
  %\VignetteEngine{knitr::rmarkdown}
---


# Data Processing

## Functions for Data Analysis

**Table 1**: The key functions in the SeqArray package.

| Function     | Description |
|:-------------|:-------------------------------------------|
| seqSetFilter | Sets a filter to sample or variant (i.e., define a subset of data). [![](link.png)](http://zhengxwen.github.io/SeqArray/release/help/seqSetFilter.html)  |
| seqGetData   | Gets data from a sequence GDS file (from a subset of data). [![](link.png)](http://zhengxwen.github.io/SeqArray/release/help/seqGetData.html)  |
| seqApply     | Applies a user-defined function over array margins. [![](link.png)](http://zhengxwen.github.io/SeqArray/release/help/seqApply.html)  |
| seqParallel  | Applies functions in parallel. [![](link.png)](http://zhengxwen.github.io/SeqArray/release/help/seqParallel.html)  |
| seqParallelSetup  | Setups a cluster environment for parallel computing. [![](link.png)](http://zhengxwen.github.io/SeqArray/release/help/seqParallelSetup.html)  |
| --- |   |
| seqNumAllele   | Numbers of alleles per site. [![](link.png)](http://zhengxwen.github.io/SeqArray/release/help/seqNumAllele.html)  |
| seqMissing     | Missing genotype percentages. [![](link.png)](http://zhengxwen.github.io/SeqArray/release/help/seqMissing.html)  |
| seqAlleleFreq  | Allele frequencies. [![](link.png)](http://zhengxwen.github.io/SeqArray/release/help/seqAlleleFreq.html)  |
| seqAlleleCount | Allele counts. [![](link.png)](http://zhengxwen.github.io/SeqArray/release/help/seqAlleleCount.html)  |
| ...            | [![](link.png)](http://zhengxwen.github.io/SeqArray/release/help/00Index.html)  |


## Parallel Implementation

The default setting for the analysis functions in the SeqArray package is serial implementation, but users can setup a cluster computing environment manually via `seqParallelSetup()` and distribute the calculations to multiple cores even 100 cluster nodes.

```{r}
library(SeqArray)

# 1000 Genomes, Phase 1, chromosome 22
(gds.fn <- seqExampleFileName("KG_Phase1"))

# open a GDS file
genofile <- seqOpen(gds.fn)

seqSummary(genofile)
```

```{r}
# use 2 cores for the demonstration
seqParallelSetup(2)

# numbers of alleles per site
table(seqNumAllele(genofile))

# reference allele frequencies
summary(seqAlleleFreq(genofile, ref.allele=0))

# close the cluster environment
seqParallelSetup(FALSE)
```



# Integration with SNPRelate

The [SNPRelate](http://www.bioconductor.org/packages/release/bioc/html/SNPRelate.html) package is developed to accelerate two key computations in genome-wide association studies: principal component analysis (PCA) and relatedness analysis using identity-by-descent (IBD) measures. The kernels of SNPRelate are written in C/C++ and have been highly optimized for multi-core symmetric multiprocessing computer architectures. The genotypes in SeqArray format are converted to categorical dosages of reference alleles (0,1,2,NA), which are the data format used in the SNPRelate pacakge.

```{r}
library(SNPRelate)
```


## LD-based SNP pruning

It is suggested to use a pruned set of SNPs which are in approximate linkage equilibrium with each other to avoid the strong influence of SNP clusters in principal component analysis and relatedness analysis.

```{r}
set.seed(1000)

# may try different LD thresholds for sensitivity analysis
snpset <- snpgdsLDpruning(genofile, ld.threshold=0.2)
names(snpset)
head(snpset$chr22)  # snp.id

# get all selected snp id
snpset.id <- unlist(snpset)
```


## Principal Component Analysis

```{r}
# Run PCA
pca <- snpgdsPCA(genofile, snp.id=snpset.id, num.thread=2)

# variance proportion (%)
pc.percent <- pca$varprop*100
head(round(pc.percent, 2))
```

Population information are available:
```{r}
pop.code <- factor(seqGetData(genofile, "sample.annotation/Population"))
head(pop.code)

popgroup <- list(
    EastAsia = c("CHB", "JPT", "CHS", "CDX", "KHV", "CHD"),
	European = c("CEU", "TSI", "GBR", "FIN", "IBS"),
	African  = c("ASW", "ACB", "YRI", "LWK", "GWD", "MSL", "ESN"),
	SouthAmerica = c("MXL", "PUR", "CLM", "PEL"),
	India = c("GIH", "PJL", "BEB", "STU", "ITU"))

colors <- sapply(levels(pop.code), function(x) {
	for (i in 1:length(popgroup)) {
		if (x %in% popgroup[[i]])
			return(names(popgroup)[i])
	}
	NA
	})
colors <- as.factor(colors)
legend.text <- sapply(levels(colors), function(x) paste(levels(pop.code)[colors==x], collapse=","))
legend.text
```

```{r fig.width=5, fig.height=5, fig.align='center'}
# make a data.frame
tab <- data.frame(sample.id = pca$sample.id,
    EV1 = pca$eigenvect[,1],    # the first eigenvector
    EV2 = pca$eigenvect[,2],    # the second eigenvector
    Population = pop.code,
    stringsAsFactors = FALSE)
head(tab)

# draw
plot(tab$EV2, tab$EV1, pch=20, cex=0.75, main="1KG Phase 1, chromosome 22",
    xlab="eigenvector 2", ylab="eigenvector 1", col=colors[tab$Population])
legend("topleft", legend=legend.text, col=1:length(legend.text), pch=19, cex=0.75)
```


## Relatedness Analysis

For relatedness analysis, Identity-By-Descent (IBD) estimation in [SNPRelate](http://www.bioconductor.org/packages/release/bioc/html/SNPRelate.html) can be done by the method of moments (MoM) (Purcell et al., 2007).

```{r}
# YRI samples
sample.id <- seqGetData(genofile, "sample.id")
YRI.id <- sample.id[pop.code == "YRI"]
```

```{r fig.width=5, fig.height=5, fig.align='center'}
# Estimate IBD coefficients
ibd <- snpgdsIBDMoM(genofile, sample.id=YRI.id, maf=0.05, missing.rate=0.05, num.thread=2)

# Make a data.frame
ibd.coeff <- snpgdsIBDSelection(ibd)
head(ibd.coeff)

plot(ibd.coeff$k0, ibd.coeff$k1, xlim=c(0,1), ylim=c(0,1), xlab="k0", ylab="k1", main="YRI samples (MoM)")
lines(c(0,1), c(1,0), col="red", lty=2)
```


## Identity-By-State Analysis

For $n$ study individuals, `snpgdsIBS()` can be used to create a $n \times n$ matrix of genome-wide average IBS pairwise identities. To perform cluster analysis on the $n \times n$ matrix of genome-wide IBS pairwise distances, and determine the groups by a permutation score:

```{r fig.width=5, fig.height=5, fig.align='center'}
set.seed(1000)
ibs.hc <- snpgdsHCluster(snpgdsIBS(genofile, num.thread=2))
```

Here is the population information we have known:
```{r fig.width=10, fig.height=5, fig.align='center'}
# Determine groups of individuals by population information
rv <- snpgdsCutTree(ibs.hc, samp.group=as.factor(colors[pop.code]))

plot(rv$dendrogram, leaflab="none", main="1KG Phase 1, chromosome 22",
    edgePar=list(col=rgb(0.5,0.5,0.5,0.75), t.col="black"))
legend("bottomleft", legend=legend.text, col=1:length(legend.text), pch=19, cex=0.75, ncol=4)
```


```{r}
# close the GDS file
seqClose(genofile)
```


# Integration with SeqVarTools

An R/Bioconductor package [SeqVarTools](http://www.bioconductor.org/packages/release/bioc/html/SeqVarTools.html) is available on Bioconductor, which defines S4 classes and methods for other common operations and analyses on SeqArray datasets.



# Resources

1. CoreArray C++ project: [http://corearray.sourceforge.net/](http://corearray.sourceforge.net/)
2. gdsfmt R package: [http://github.com/zhengxwen/gdsfmt](http://github.com/zhengxwen/gdsfmt), [http://www.bioconductor.org/packages/release/bioc/html/gdsfmt.html](http://www.bioconductor.org/packages/release/bioc/html/gdsfmt.html)
3. SeqArray R package: [http://github.com/zhengxwen/SeqArray](http://github.com/zhengxwen/SeqArray), [http://www.bioconductor.org/packages/release/bioc/html/SeqArray.html](http://www.bioconductor.org/packages/release/bioc/html/SeqArray.html)
4. SNPRelate R package: [http://github.com/zhengxwen/SNPRelate](http://github.com/zhengxwen/SNPRelate), [http://www.bioconductor.org/packages/release/bioc/html/SNPRelate.html](http://www.bioconductor.org/packages/release/bioc/html/SNPRelate.html)


# Session Information

```{r}
sessionInfo()
```



# References

1. Purcell S, Neale B, Todd-Brown K, et al., 2007. PLINK: a tool set for whole-genome association and population-based linkage analyses. American Journal of Human Genetics, 81(3):559-75.