--- title: "yriMulti -- HapMap YRI population, multiassay interfaces" author: "Vincent J. Carey, stvjc at channing.harvard.edu" date: "February 2015" output: BiocStyle::pdf_document: toc: yes number_sections: yes BiocStyle::html_document: highlight: pygments number_sections: yes theme: united toc: yes vignette: > %\VignetteIndexEntry{yriMulti -- HapMap YRI population, multiassay interfaces} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r style, echo = FALSE, results = 'asis'} BiocStyle::markdown() ``` ```{r bib, echo = FALSE, results = 'hide'} library(knitcitations) library(bibtex) allbib = read.bibtex("allbib.bib") suppressPackageStartupMessages({ library(yriMulti) library(dsQTL) library(geuvPack) #library(biocMultiAssay) library(erma) library(VariantAnnotation) library(gQTLstats) library(Homo.sapiens) }) ``` # Introduction The EBV-transformed B-cells from Yoruban donors are assayed for genotype and various genomic features in a number of prominent studies. This package helps work with relevant datasets and data structures as use cases for biocMultiAssay package development. A particular concern is accommodation of distributed genotype data, in this case, based on the 1000 genomes VCF files in an S3 bucket. # Basic data resources ## Expression data We will use the RNA-seq expression data in the `r Biocpkg("geuvPack")` package. ```{r fakeload1,echo=FALSE,results="hide"} if (!exists("geuFPKM")) data(geuFPKM) ``` ```{r lkexpr,eval=FALSE} library(geuvPack) data(geuFPKM) ```{r do1} geuFPKM ``` ## Methylation data We have added 450k data from `r citet(allbib[["Banovich2014"]])` paper to the yriMulti package. ```{r fakeload2,echo=FALSE,results="hide"} if (!exists("banovichSE")) data(banovichSE) ``` ```{r lkmul,eval=FALSE} library(yriMulti) data(banovichSE) ```{r do2} banovichSE ``` ## DnaseI hypersensitivity data ```{r lkdhsf,echo=FALSE,results="hide"} if (!exists("DHStop5_hg19")) data(DHStop5_hg19) ``` ```{r lkdhs,eval=FALSE} library(dsQTL) data(DHStop5_hg19) ``` ```{r lkddd} DHStop5_hg19 ``` ## Genotype data We take advantage of a function (`gtpath`) that generates paths to S3-resident VCF from the 1000 genomes project. ```{r lkgeno,cache=TRUE} litvcf = readVcf(gtpath(20), param=ScanVcfParam(which=GRanges("20", IRanges(3.7e7,3.701e7))), genome="hg19") ``` ```{r outca} litvcf length(colnames(litvcf)) length(intersect(colnames(litvcf), colnames(banovichSE))) length(intersect(colnames(litvcf), colnames(geuFPKM))) length(intersect(colnames(litvcf), colnames(DHStop5_hg19))) ``` # Some computations focused on methylation-expression association The `yriMulti` package is currently a scratch-pad for some integrative infrastructure thoughts. ## Gene-centric selection With `mexGR`, a GRanges instance is formed with methylation scores for CpG near a gene. The assay data are placed in the mcols, with one range devoted to the expression measures. ```{r lkmex,cache=TRUE} m1 = mexGR(banovichSE, geuFPKM, symbol="ORMDL3") m1 mcols(m1)[1:4,1:4] table(mcols(m1)$type) ``` ## DNA-methylation association with expression `bindelms` computes the regressions of the selected gene's expression values on the methylation scores. We have options to transform the expression value (parameter `ytx` is a function) and can indicate the radius around the gene coding region to search for CpG (parameter `gradius`). We'll examine a region around gene BRCA2 for a CpG whose methylation score is negatively associated with BRCA2 expression. ```{r lkbi,cache=TRUE} b1 = bindelms(geuFPKM, banovichSE, symbol="BRCA2", ytx=log, gradius=20000) b1 mcols(b1)[1:3,] summary(mcols(b1)$t) mintind = which.min(mcols(b1)$t) mincpg = names(b1)[mintind] mincpg ``` ```{r lkevm,fig=TRUE} plotEvM(b1) ``` # Using the biocMultiAssay infrastructure ## Construction of the MultiAssayExperiment We will use a unified object design to reproduce the BRCA2 display just obtained. We need a list of relevant objects and a phenodata component. ```{r lkbcma, eval=FALSE} library(biocMultiAssay) myobs = list(geuvRNAseq=geuFPKM, yri450k=banovichSE, yriDHS=DHStop5_hg19) cold = colData(geuFPKM) suppressWarnings({ mm = MultiAssayExperiment(myobs, as.data.frame(cold)) }) mm ``` ## Restriction by range We compute the BRCA2 'gene range'. ```{r lkbrc, eval=FALSE} library(erma) brr = range(genemodel("BRCA2")) brr ``` Subset the multiassay structure to features in the vicinity of this range. ```{r dorsub, eval=FALSE} .subsetByRanges = function(ma, r) { el2 = lapply(elist(ma)@listData, function(x)subsetByOverlaps(x,r)) names(el2) = names(elist(ma)) # needed? ma@elist = new("elist", listData=el2) ma } newmm = .subsetByRanges(mm, brr+20000) newmm ``` We now have all the relevant features and samples. In fact we have more genes than we really wanted. But we will proceed with this selection. ## All pairwise regressions We will introduce a formula idiom to specify a collection of models of interest. `allLM_pw` is defined in `r Biocpkg("biocMultiAssay")`. ```{r lkpwl, eval=FALSE} pp = allLM_pw(geuvRNAseq~yri450k, newmm, ytx=log) names(pp) summary(pp[[1]][[1]]) which.min(unlist(pp[[2]])) # not BRCA2 but FRY ``` The formula idiom can be used to isolate assays and features. ```{r lkf,fig=TRUE, eval=FALSE} pwplot(geuvRNAseq~yri450k, ENSG00000139618.9~cg20073910, newmm, ytx=log) ``` ## Integrating dense genotypes in a VcfStack instance We set up a reference to a collection of VCF. We'll use 1000 genomes VCF for chr21, chr22, and chrY. ```{r mkvs, eval=FALSE} library(gQTLstats) pa = paths1kg(paste0("chr", c(21:22,"Y"))) library(Homo.sapiens) # necessary? stopifnot(requireNamespace("GenomeInfoDb")) ob = VcfStack(pa, seqinfo(Homo.sapiens)) ob ``` Now we set up a region of interest and bind it to the stack. This yields an instance of `RangedVcfStack`, for which we have `samples`, `features`, and `assay` methods defined in gQTLstats. ```{r mkrvs, eval=FALSE} myr = GRanges("22", IRanges(20e6,20.01e6)) ob = bindRanges(ob, myr) hasInternetConnectivity = function() !is.null(nsl("www.r-project.org")) if (hasInternetConnectivity()) lka = assay(ob) myobs = list(geuvRNAseq=geuFPKM, yri450k=banovichSE, yriDHS=DHStop5_hg19, yriGeno=ob) cold = colData(geuFPKM) suppressWarnings({ mm = MultiAssayExperiment(myobs, as.data.frame(cold)) }) mm ```

References

```{r results='asis',echo=FALSE} bibliography() #style="markdown") ```