--- title: "CCAFE Vignette" author: "Hayley Stoneman" date: "`r Sys.Date()`" output: BiocStyle::html_document: toc: true toc_depth: 2 vignette: > %\VignetteIndexEntry{CCAFE Vignette} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", out.width = "100%" ) ``` # Introduction *Motivation* Growth in the field of genetic and genomic research has vastly increased the amount of available data which is often made publicly available through summary statistics. While there are many opportunities and new methods to use summary-level genetic data, it is often limited by the statistics that are made available. Many post-hoc analyses of disease data require case and control allele frequencies (AFs), which are not always published. We present a framework to derive the case and control AFs from Genome Wide Association Study (GWAS) summary statistics using the whole sample (case and control aggregated) AF, odds ratio, and case and control sample sizes, and compare it to a previously published method that uses the standard error (SE), odds ratio, and case and control sample sizes to recapitulate case and control AFs. *Results* In simulations and real data, we find estimating case and control AFs using the whole sample AF is highly accurate across all settings. When applying this method to the Pan-UK Biobank we find high concordance with the known AFs. Conversely, we find that in Pan-UK Biobank and simulations including covariates, deriving case and control AFs from the SE underestimates the minor AF (MAF) for higher MAFs. To enable estimation of case and control AFs using SE, we use gnomAD v3.1.2 AFs as a proxy for true AFs to estimate and correct for bias fit. While estimating the case control AF using the whole sample AF is preferred due to its high accuracy, estimating from the SE can be used more broadly since the SE can be calculated from the p-value and beta estimate, which are more commonly provided. Both methods expand the utility of publicly available genetic summary statistics, can be used when different statistics are reported, and promote the reusability of genomic data.

This document introduces you to the functions in CCAFE and gives small examples of how they can be used with summary statistics.

Find more details and documentation on GitHub: (https://wolffha.github.io/CCAFE_documentation/)

To jump forward to individual function specifics: [**CaseControl_AF**](#CaseControl_AF) -- [fast forward to example](#a-quick-demo-of-CaseControl_AF) [**CaseControl_SE**](#CaseControl_SE) -- [fast forward to example](#a-quick-demo-of-CaseControl_SE)


To view details on using GWAS VCF formatted data and Bioconductor packages to prepare your data for use in CCAFE: [Fast forward to demo](#VCF) # Installation To install this package from BioConductor: ```{r, eval=FALSE} if (!require("BiocManager", quietly = TRUE)) install.packages("BiocManager") BiocManager::install("CCAFE") ``` To install this package from GitHub: ```{r, eval=FALSE} if(!require(devtools, quietly = TRUE)) { install.packages("devtools") } devtools::install_github("https://github.com/wolffha/CCAFE") ``` ## Overview of CCAFE Functions CCAFE contains two functions: * **CaseControl_AF()** * **CaseControl_SE()** The two functions are both used to reconstruct case and control allele frequencies (AFs) from genetic summary statistics. The user will select which function to use based on the available summary statistics. **CaseControl_AF()** has the following required parameters: * Number of cases * Number of controls * Odds Ratio (OR) or beta coefficient * **AF** for the whole sample (cases and controls combined) **CaseControl_SE()** has the follwoing required parameters: * Number of cases * Number of controls * Odds Ratio (OR) or beta coefficient * **SE** of the log(OR) for each variant *Code adapted from ReACt GroupFreq function available here: (https://github.com/Paschou-Lab/ReAct/blob/main/GrpPRS_src/CountConstruct.c)* # CaseControl_AF {#CaseControl_AF} ## CaseControl_AF() input * **data**: a dataframe with each row being a variant and columns for total AF and OR * **N_case**: an integer for the number of case samples * **N_control**: an integer for the number of control samples * **OR_colname**: a string containing the exact column name in *data* with the odds ratios * **AF_total_colname**: a string containing the exact column name in *data* with the total AFs ## CaseControl_AF() output Returns the original *data* dataframe with two additional columns: * AF_case * AF_control The number of rows is equal to the number of variants. # CaseControl_SE {#CaseControl_SE} ## CaseControl_SE() input CaseControl_SE has the following required inputs: * **data**: a dataframe where each row is a variant and columns for the OR, SE, chromosome, and position * **N_case**: an integer for the number of case samples * **N_control**: an integer for the number of control samples * **OR_colname**: a string containing the exact column name in *data* with the odds ratios * **SE_colname**: a string containing the exact column name in *data* with the standard errors * **position_colname**: a string containing the exact column name in *data* with the positions of the variants * **chromosome_colname**: a string containing the exact column name in *data* with the chromosome of the variants. Note, sex chromosomes can be either characters ('X', 'x', 'Y', 'y') or numeric where X=23 and Y=24 * **sex_chromosomes**: boolean, TRUE if variants from sex chromosome(s) are included in the dataset * **do_correction**: boolean, TRUE if data is provided to correct the estimates using proxy MAFs * **remove_sex_chromosomes**: boolean, TRUE if variants on sex chromosomes should be removed. This is only necessary if *sex_chromosomes* == TRUE and the number of XX/XY individuals per case and control sample is NOT known CaseControl_SE has the following optional inputs: If *sex_chromosomes* == TRUE and *remove_sex_chromosomes* == FALSE, the following inputs are required: * **N_XX_case**: the number of XX chromosome case individuals * **N_XX_control**: the number of XX chromosome control individuals * **N_XY_case**: the number of XY chromosome case individuals * **N_XY_control**: the number of XY chromosome control individuals If *do_correction* == TRUE, then data must be provided that includes harmonized data with proxy MAFs * **correction_data**: a dataframe with the following EXACT column names: CHR, POS, proxy_MAF, containing data for variants harmonized between the observed and proxy datasets ## CaseControl_SE() output Returns the *data* dataframe with three additional columns: * MAF_case * MAF_control * MAF_total Each columns contains the estimated MAF in the cases, controls, and whole sample, respectively. The number of rows is equal to the number of variants If *do_correction* == TRUE, three additional columns are included in the dataframe: * MAF_case_adj * MAF_control_adj * MAF_total_adj Containing MAFs adjusted using the the proxy MAFs to model the expected bias. *NOTE:* This method assumes we are estimating the minor allele frequency (MAF). The minor allele may or may not be the effect variant used to calculate AF for the GWAS summary statistics. For additional details see vignette titled "CCAFE Extra Details" # Examples These examples use the provided sample data which is a subset of 500 variants from chromosome 1 of the Pan-UKBB diabetes GWAS in non-Finnish European individuals. ```{r, echo=TRUE,message=FALSE} library(CCAFE) library(tidyverse) ``` ## A quick demo of CaseControl_AF() {#a-quick-demo-of-CaseControl_AF} Here is a quick demonstration of using the CaseControl_AF function in the CCAFE package using the sample data. This code block loads the library. It then loads the sample data and ensures that it is in the required format (dataframe). It then runs the methods using the required parameters (data, N_case, N_control, OR_colname, AF_total_colname) and prints out the first few lines of the resulting returned dataframe. ```{r CaseControl_AF example} # load the data data("sampleDat") sampleDat <- as.data.frame(sampleDat) results_af <- CaseControl_AF(data = sampleDat, N_case = 16550, N_control = 403923, OR_colname = "OR", AF_total_colname = "true_maf_pop") head(results_af) ``` We can then plot the estimates from this example using the provided known AFs. ```{r, eval=TRUE} # plot the results # first need minor allele frequency to compare to true_maf_case results_af$MAF_case <- sapply(results_af$AF_case, function(x) ifelse(x > 0.5, 1-x, x)) results_af <- as.data.frame(results_af) ggplot(results_af, aes(x = true_maf_case, y = MAF_case)) + geom_point() + geom_abline(slope = 1, intercept = 0, linetype = "dashed", color = "red") + theme_bw() + ggtitle("Example CaseControl_AF()") ``` ## A quick demo of CaseControl_SE() {#a-quick-demo-of-CaseControl_SE} Here is a quick demonstration of using the CaseControl_SE function in the CCAFE package using the sample data. This code block loads the library. It then loads the sample data and ensures that it is in the required format (dataframe). It then runs the methods using the required parameters (data, N_case, N_control, OR_colname, SE_colname, chromosome_colname, position_colname, do_correction, sex_chromosomes) and prints out the first few lines of the resulting returned dataframe. ```{r CaseControl_SE example - no correction} # load the data data("sampleDat") # always ensure data is a dataframe before inputting to CCAFE methods sampleDat <- as.data.frame(sampleDat) # First run without correction results_se_noCorr <- CaseControl_SE(data = sampleDat, N_case = 16550, N_control = 403923, OR_colname = "OR", SE_colname = "SE", chromosome_colname = "CHR", position_colname = "POS", do_correction = FALSE, sex_chromosomes = FALSE) head(results_se_noCorr) ``` We can then plot the estimates from this example using the provided known AFs. ```{r, eval=TRUE} # plot the results ggplot(results_se_noCorr, aes(x = true_maf_case, y = MAF_case)) + geom_point() + geom_abline(slope = 1, intercept = 0, linetype = "dashed", color = "red") + theme_bw() + coord_cartesian(ylim = c(0,.5)) + ggtitle("Example CaseControl_SE() No Correction") ``` Here is a quick demonstration of using the CaseControl_SE function in the CCAFE package using the sample data, but also performing the provided bias correction. This code block loads the library. It then loads the sample data and ensures that it is in the required format (dataframe). It then runs the methods using the required parameters (data, N_case, N_control, OR_colname, SE_colname, chromosome_colname, position_colname, do_correction, sex_chromosomes) and prints out the first few lines of the resulting returned dataframe. ```{r CaseControl_SE example - correction} # load the data data("sampleDat") # always ensure data is a dataframe before inputting to CCAFE methods sampleDat <- as.data.frame(sampleDat) corr_data <- data.frame(CHR = sampleDat$CHR, POS = sampleDat$POS, proxy_MAF = sampleDat$gnomad_maf) # now run with correction results_se_corr <- CaseControl_SE(data = sampleDat, N_case = 16550, N_control = 403923, OR_colname = "OR", SE_colname = "SE", chromosome_colname = "CHR", position_colname = "POS", do_correction = TRUE, correction_data = corr_data, sex_chromosomes = FALSE) head(results_se_corr) ``` Again, we can plot the estimates (corrected) and compare them to the provided known AFs and see reduced bias. ```{r, eval=TRUE} # plot the results ggplot(results_se_corr, aes(x = true_maf_case, y = MAF_case_adj)) + geom_point() + geom_abline(slope = 1, intercept = 0, linetype = "dashed", color = "red") + theme_bw() + ggtitle("Example CaseControl_SE() With Correction") ``` # Integration with Bioconductor ## Tabular data This software package aims to increase the utility of existing, publicly available genomic summary data from genome-wide association studies (GWAS). Summary statistics are still widely underutilized, despite their wide availability. As such, Bioconductor packages with methods specifically designed to analyze these statistics are still being developed. The most commonly used repository of these summary statistics is the NHGRI GWAS Catalog (https://www.ebi.ac.uk/gwas/). From the GWAS catalog, a user can download full summary statistics through the FTP sites. The majority of these summary statistics are zipped plain text (usually space or tab delimited) files. The GWAS Catalog itself has recommended a tabular, plain-text format as its recommended GWAS summary statistic format. As such, this tabular data is most simply read into R as a dataframe structure. ## VCF data {#VCF} While the many users will download the GWAS files and load them into R, there are other ways to obtain GWAS summary statistics, including through the IEU Open GWAS Project (https://gwas.mrcieu.ac.uk/). The IEU Open GWAS Project instead uses the GWAS VCF format (https://doi.org/10.1101/2020.05.29.115824) to store GWAS summary statistics and ensure alignment to the hg19 reference genome. We can use an assortment of Bioconductor packages and data structures to work with GWAS VCF data. First we will make use of the Bioconductor package *VariantAnnotation* to handle VCF data ```{r, eval=FALSE} BiocManager::install("VariantAnnotation") ``` ```{r} suppressWarnings(suppressPackageStartupMessages({ library(VariantAnnotation) library(tidyverse) })) ``` We will use the VCF from this GWAS of Type 2 Diabetes https://doi.org/10.1038/s41588-018-0084-1. A subset of 10,000 variants has been stored within the package and can be loaded. ```{r} data("vcf_sample") ``` We can see the name of the GWAS dataset (refers to the IEU GWAS database ID name) ```{r} samples(header(vcf_sample)) ``` To understand what the data format, we can look view the VCF object ```{r} vcf_sample ``` The columns that we want for use in CCAFE are: From the rowRanges object: * seqnames (chromosome) * ranges (position) * REF * ALT * RSID From the geno object: * ES (effect size of ALT) * SE * AF (allele frequency of ALT) We then want to convert this VCF formatted dataset into a simple dataframe to use with CCAFE. ```{r} # first we will get the info from GRanges object (position, RSID) meta <- as.data.frame(ranges(vcf_sample)) meta <- meta[,c(1, 4)] colnames(meta) <- c("Position", "RSID") # get the chromosome (as a) meta$Chromosome <- as.vector(seqnames(rowRanges(vcf_sample))) # now we can also get the meta data (REF, ALT) from the GRanges object meta <- cbind(meta, mcols(vcf_sample)[,c(2,3)]) rownames(meta) <- seq(1, nrow(meta)) # now we will get the info from the geno object geno_dat <- data.frame( beta = unlist(geno(vcf_sample)$ES), SE = unlist(geno(vcf_sample)$SE), AF = unlist(geno(vcf_sample)$AF) ) df_data <- cbind(meta, geno_dat) head(df_data) ``` For CCAFE we will use OR, so let's create this column by exponentiating the effect estimate ```{r} df_data$OR <- exp(df_data$beta) ``` Let's now apply the CCAFE methods to get case and control specific AFs First, since we have total AF, we'll use CaseControl_AF ```{r} df_data <- CaseControl_AF(data = df_data, N_case = 48286, N_control = 250671, OR_colname = "OR", AF_total_colname = "AF") head(df_data) ``` We can look at a comparison of the AFs Create a dataframe to plot ```{r} plotdata_AF <- df_data %>% dplyr::select(AF, AF_case, AF_control) plotdata_AF$AF_total <- (plotdata_AF$AF_case*48286 + plotdata_AF$AF_control*250671)/(298857) colnames(plotdata_AF)[1] <- "AF(known)" plotdata_AF_long <- pivot_longer(plotdata_AF, cols = colnames(plotdata_AF)) ``` Plot the results ```{r} ggplot(plotdata_AF_long, aes(x = value, fill = name)) + geom_boxplot(alpha = 0.5, outliers = FALSE) + facet_wrap(~name, nrow = 4) + theme_bw() ``` These case and control AFs can now be used in further downstream analyses. For example, if the data was harmonized with a reference panel, then Summix2 (https://www.bioconductor.org/packages/release/bioc/html/Summix.html) could be used to estimate substructure in the GWAS sample (total sample, cases, or controls). The AFs could also be used in case-case GWAS (see https://wolffha.github.io/CCAFE_documentation/articles/CCAFE_CCGWAS.html) or certain meta-analysis software (e.g. https://doi.org/10.1038/s41598-022-12185-6). ### VCF Conversion Function For ease of use of CCAFE with VCF files, we have included a function 'CCAFE_convertVCF()'. It performs the steps outlined above, assuming the user has loaded their VCF into R. ```{r CCAFE_convertVCF Demo} df_data_2 <- CCAFE_convertVCF(vcf_sample) ``` ```{r} df_data_2 <- CaseControl_AF(data = df_data_2, N_case = 48286, N_control = 250671, OR_colname = "OR", AF_total_colname = "AF") head(df_data_2) ``` # Session Information ```{r} sessionInfo() ```