--- title: "A Brief Introduction to Data Analytics on SeqArray GDS Files" author: "Xiuwen Zheng (Department of Biostatistics, University of Washington, Seattle)" date: "Sep 13, 2015" output: html_document: theme: spacelab toc: yes pdf_document: toc: yes toc_depth: 3 vignette: > %\VignetteIndexEntry{SeqArray Data Analytics} %\VignetteDepends{gdsfmt} %\VignetteKeywords{GWAS, whole-genome, sequencing, SNV} %\VignetteEngine{knitr::rmarkdown} --- # Data Processing ## Functions for Data Analysis **Table 1**: The key functions in the SeqArray package. | Function | Description | |:-------------|:-------------------------------------------| | seqSetFilter | Sets a filter to sample or variant (i.e., define a subset of data). [](http://zhengxwen.github.io/SeqArray/release/help/seqSetFilter.html) | | seqGetData | Gets data from a sequence GDS file (from a subset of data). [](http://zhengxwen.github.io/SeqArray/release/help/seqGetData.html) | | seqApply | Applies a user-defined function over array margins. [](http://zhengxwen.github.io/SeqArray/release/help/seqApply.html) | | seqParallel | Applies functions in parallel. [](http://zhengxwen.github.io/SeqArray/release/help/seqParallel.html) | | seqParallelSetup | Setups a cluster environment for parallel computing. [](http://zhengxwen.github.io/SeqArray/release/help/seqParallelSetup.html) | | --- | | | seqNumAllele | Numbers of alleles per site. [](http://zhengxwen.github.io/SeqArray/release/help/seqNumAllele.html) | | seqMissing | Missing genotype percentages. [](http://zhengxwen.github.io/SeqArray/release/help/seqMissing.html) | | seqAlleleFreq | Allele frequencies. [](http://zhengxwen.github.io/SeqArray/release/help/seqAlleleFreq.html) | | seqAlleleCount | Allele counts. [](http://zhengxwen.github.io/SeqArray/release/help/seqAlleleCount.html) | | ... | [](http://zhengxwen.github.io/SeqArray/release/help/00Index.html) | ## Parallel Implementation The default setting for the analysis functions in the SeqArray package is serial implementation, but users can setup a cluster computing environment manually via `seqParallelSetup()` and distribute the calculations to multiple cores even 100 cluster nodes. ```{r} library(SeqArray) # 1000 Genomes, Phase 1, chromosome 22 (gds.fn <- seqExampleFileName("KG_Phase1")) # open a GDS file genofile <- seqOpen(gds.fn) seqSummary(genofile) ``` ```{r} # use 2 cores for the demonstration seqParallelSetup(2) # numbers of alleles per site table(seqNumAllele(genofile)) # reference allele frequencies summary(seqAlleleFreq(genofile, ref.allele=0)) # close the cluster environment seqParallelSetup(FALSE) ``` # Integration with SNPRelate The [SNPRelate](http://www.bioconductor.org/packages/release/bioc/html/SNPRelate.html) package is developed to accelerate two key computations in genome-wide association studies: principal component analysis (PCA) and relatedness analysis using identity-by-descent (IBD) measures. The kernels of SNPRelate are written in C/C++ and have been highly optimized for multi-core symmetric multiprocessing computer architectures. The genotypes in SeqArray format are converted to categorical dosages of reference alleles (0,1,2,NA), which are the data format used in the SNPRelate pacakge. ```{r} library(SNPRelate) ``` ## LD-based SNP pruning It is suggested to use a pruned set of SNPs which are in approximate linkage equilibrium with each other to avoid the strong influence of SNP clusters in principal component analysis and relatedness analysis. ```{r} set.seed(1000) # may try different LD thresholds for sensitivity analysis snpset <- snpgdsLDpruning(genofile, ld.threshold=0.2) names(snpset) head(snpset$chr22) # snp.id # get all selected snp id snpset.id <- unlist(snpset) ``` ## Principal Component Analysis ```{r} # Run PCA pca <- snpgdsPCA(genofile, snp.id=snpset.id, num.thread=2) # variance proportion (%) pc.percent <- pca$varprop*100 head(round(pc.percent, 2)) ``` Population information are available: ```{r} pop.code <- factor(seqGetData(genofile, "sample.annotation/Population")) head(pop.code) popgroup <- list( EastAsia = c("CHB", "JPT", "CHS", "CDX", "KHV", "CHD"), European = c("CEU", "TSI", "GBR", "FIN", "IBS"), African = c("ASW", "ACB", "YRI", "LWK", "GWD", "MSL", "ESN"), SouthAmerica = c("MXL", "PUR", "CLM", "PEL"), India = c("GIH", "PJL", "BEB", "STU", "ITU")) colors <- sapply(levels(pop.code), function(x) { for (i in 1:length(popgroup)) { if (x %in% popgroup[[i]]) return(names(popgroup)[i]) } NA }) colors <- as.factor(colors) legend.text <- sapply(levels(colors), function(x) paste(levels(pop.code)[colors==x], collapse=",")) legend.text ``` ```{r fig.width=5, fig.height=5, fig.align='center'} # make a data.frame tab <- data.frame(sample.id = pca$sample.id, EV1 = pca$eigenvect[,1], # the first eigenvector EV2 = pca$eigenvect[,2], # the second eigenvector Population = pop.code, stringsAsFactors = FALSE) head(tab) # draw plot(tab$EV2, tab$EV1, pch=20, cex=0.75, main="1KG Phase 1, chromosome 22", xlab="eigenvector 2", ylab="eigenvector 1", col=colors[tab$Population]) legend("topleft", legend=legend.text, col=1:length(legend.text), pch=19, cex=0.75) ``` ## Relatedness Analysis For relatedness analysis, Identity-By-Descent (IBD) estimation in [SNPRelate](http://www.bioconductor.org/packages/release/bioc/html/SNPRelate.html) can be done by the method of moments (MoM) (Purcell et al., 2007). ```{r} # YRI samples sample.id <- seqGetData(genofile, "sample.id") YRI.id <- sample.id[pop.code == "YRI"] ``` ```{r fig.width=5, fig.height=5, fig.align='center'} # Estimate IBD coefficients ibd <- snpgdsIBDMoM(genofile, sample.id=YRI.id, maf=0.05, missing.rate=0.05, num.thread=2) # Make a data.frame ibd.coeff <- snpgdsIBDSelection(ibd) head(ibd.coeff) plot(ibd.coeff$k0, ibd.coeff$k1, xlim=c(0,1), ylim=c(0,1), xlab="k0", ylab="k1", main="YRI samples (MoM)") lines(c(0,1), c(1,0), col="red", lty=2) ``` ## Identity-By-State Analysis For $n$ study individuals, `snpgdsIBS()` can be used to create a $n \times n$ matrix of genome-wide average IBS pairwise identities. To perform cluster analysis on the $n \times n$ matrix of genome-wide IBS pairwise distances, and determine the groups by a permutation score: ```{r fig.width=5, fig.height=5, fig.align='center'} set.seed(1000) ibs.hc <- snpgdsHCluster(snpgdsIBS(genofile, num.thread=2)) ``` Here is the population information we have known: ```{r fig.width=10, fig.height=5, fig.align='center'} # Determine groups of individuals by population information rv <- snpgdsCutTree(ibs.hc, samp.group=as.factor(colors[pop.code])) plot(rv$dendrogram, leaflab="none", main="1KG Phase 1, chromosome 22", edgePar=list(col=rgb(0.5,0.5,0.5,0.75), t.col="black")) legend("bottomleft", legend=legend.text, col=1:length(legend.text), pch=19, cex=0.75, ncol=4) ``` ```{r} # close the GDS file seqClose(genofile) ``` # Integration with SeqVarTools An R/Bioconductor package [SeqVarTools](http://www.bioconductor.org/packages/release/bioc/html/SeqVarTools.html) is available on Bioconductor, which defines S4 classes and methods for other common operations and analyses on SeqArray datasets. # Resources 1. CoreArray C++ project: [http://corearray.sourceforge.net/](http://corearray.sourceforge.net/) 2. gdsfmt R package: [http://github.com/zhengxwen/gdsfmt](http://github.com/zhengxwen/gdsfmt), [http://www.bioconductor.org/packages/release/bioc/html/gdsfmt.html](http://www.bioconductor.org/packages/release/bioc/html/gdsfmt.html) 3. SeqArray R package: [http://github.com/zhengxwen/SeqArray](http://github.com/zhengxwen/SeqArray), [http://www.bioconductor.org/packages/release/bioc/html/SeqArray.html](http://www.bioconductor.org/packages/release/bioc/html/SeqArray.html) 4. SNPRelate R package: [http://github.com/zhengxwen/SNPRelate](http://github.com/zhengxwen/SNPRelate), [http://www.bioconductor.org/packages/release/bioc/html/SNPRelate.html](http://www.bioconductor.org/packages/release/bioc/html/SNPRelate.html) # Session Information ```{r} sessionInfo() ``` # References 1. Purcell S, Neale B, Todd-Brown K, et al., 2007. PLINK: a tool set for whole-genome association and population-based linkage analyses. American Journal of Human Genetics, 81(3):559-75.