--- title: "Workflow example" author: "Christian Arnold, Pooja Bhat, Judith Zaugg" date: "`r BiocStyle::doc_date()`" package: "`r BiocStyle::pkg_ver('SNPhood')`" abstract: > This workflow vignette shows how to use the *SNPhood* package in a real-world example. For this purpose, you will use the *SNPhoodData* package for a more complex analysis to illustrate most of the features from *SNPhood*. Importantly, you will also learn in detail how to work with a *SNPhood* object and what its main functions and properties are. The vignette will be continuously updated whenever new functionality becomes available or when we receive user feedback. vignette: > %\VignetteIndexEntry{Workflow example} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} output: BiocStyle::html_document --- # Example Workflow In the following example, you will use data from the *SNPhoodData* package to address the question how many of the previously identified H3K27ac quantitative trait loci (QTLs) for individuals from the Yoruban (YRI) population [1] show allele-specific signal in individuals of European origin (CEU). ```{r -1 to indicate the number of lines that should be parsed. Since here we work with only 178 hQTLs on chr21 we keep all of them for the analysis. There are a few parameters that we have to adjust: First, we want to pool datasets because there are two replicates for each individual, and combining the datasets will give us more power, for example to detect allelic biases (parameter *poolDatasets*). You also have to carefully check if the start positions in the user regions file are 0-based or 1-based because a shift of one base pair will result in non-sensical results for the genotype distribution of the hQTLs that are determined automatically during the analysis. In this case, start coordinates are 1-based, which is also the default for the parameter *zeroBasedCoordinates*. Our the data are mapped allele-specifically, so perform allele-specific analysis you have to ensure to set the parameter *readGroupSpecific* to TRUE. Lastly, we adjust the size of each bin within the region and select a smaller value than the default one (*binSize* = 25 instead of 50). ```{r changeParameters, echo=TRUE} # Verify that you do not have zero-based coordinates par.l$zeroBasedCoordinates par.l$readGroupSpecific par.l$poolDatasets = TRUE par.l$binSize = 25 ``` You are almost done with the preparation, all that is left is to create a data frame to tell *SNPhood* which data to use for the analysis and some additional meta information. For this, you can use another helper function: *collectFiles*. The argument *patternFiles* specifies the folder and file name of input files; wildcards are allowed. ```{r createFileList, echo=TRUE} patternBAMFiles = paste0(dirname(files[3]), "/*.bam") (files.df = collectFiles(patternFiles = patternBAMFiles, verbose = TRUE)) ``` Finally, you assign the names of the individuals in the column "inidivual" to make pooling of the datasets possible. The column input can be ignored because there is no negative control in this analysis due to the fact that the analysis is done allele-specifically and currently, input normalization does not work in this setting (see the main vignette for details). The column genotype can also be ignored for now because (will be integrated later): ```{r assignIndividualID, echo=TRUE} files.df$individual = c("GM10847", "GM10847", "GM12890", "GM12890") files.df ``` ## Quality control Now, you ready for the analysis. Before executing the full pipeline you will first do a quick quality control step to make sure that our datasets do not have any artefacts. For this, you will investigate the correlation of the read counts for our regions among the different datasets. The correlation coefficients should be very high among replicate samples and relatively high among different samples. For this, you run the main function *analyzeSNPhood* with a special argument and afterwards employ the function *plotCorrelationDatasets*. Note that you temporarily reset the value of the parameter *poolDatasets* to also check the correlation among the replicates samples. ```{r qualityTestPrep, echo=TRUE, results="hide"} par.l$poolDatasets = FALSE SNPhood.o = analyzeSNPhood(par.l, files.df, TRUE, verbose = TRUE) ``` You can now run the correlation analysis on the *SNPhood* object and plot it directly (we could also plot it to a PDF file): ```{r qualityTest, echo=TRUE} SNPhood.o = plotAndCalculateCorrelationDatasets(SNPhood.o, fileToPlot = NULL) corrResults = results(SNPhood.o, type = "samplesCorrelation") mean(corrResults$corTable[lower.tri(corrResults$corTable)]) ``` The correlation values are indeed very high among the replicate samples and also quite high among the individuals. The mean correlation coefficient among the datasets is 0.91, which is very high (note that only the lower triangle matrix of the correlation matrix is used for this, to not bias the analysis). There does not seem to be a problem with the datasets. ## Executing the main function Now you can execute the full pipeline by setting the parameter *onlyPrepareForDatasetCorrelation* to FALSE again (the default). The execution of the function may take a few minutes and may issue some warnings e.g. indicating missing data and "correcting" inconsistent parameter settings: ```{r runAnalysis, echo=TRUE, results="hide"} par.l$poolDatasets = TRUE SNPhood.o = analyzeSNPhood(par.l, files.df, onlyPrepareForDatasetCorrelation = FALSE, verbose = TRUE) ``` The warnings indicate that 4 duplicate regions have been removed as well as that the parameter *normAmongEachOther* has been set to FALSE because, in this case, reads are read group(allele)-specific (parameter *readGroupSpecific* is set to TRUE). ## Working with a *SNPhood* object: Extracting and manipulating information and metadata Now you can start analyzing the results already, as the main pipeline finished. To familiarize ourselves with the *SNPhood* object let's first take a look at some of its properties and helper functions. Generally, there are four "entities" stored in a *SNPhood* object: regions, datasets, bins, and read groups. For all of them, methods to extract and alter the stored information exist (see below). ```{r SNPhoodObj, echo=TRUE, results="hide"} SNPhood.o ## Retrieve number of regions, datasets, bins, and read groups nRegions(SNPhood.o) nDatasets(SNPhood.o) nBins(SNPhood.o) nReadGroups(SNPhood.o) ## Retrieve general annotation of SNPhood object with all its different elements names(annotation(SNPhood.o)) annotation(SNPhood.o)$regions annotation(SNPhood.o)$files ## Retrieve the parameters that were used for the analysis head(parameters(SNPhood.o)) names(parameters(SNPhood.o)) ## Retrieve annotation of regions head(annotationRegions(SNPhood.o)) ## Retrieve annotation of bins annotationBins(SNPhood.o) head(annotationBins2(SNPhood.o,fullAnnotation = TRUE)) SNP_names = c("rs7275860","rs76473124") head(annotationBins2(SNPhood.o, regions = 1:10, fullAnnotation = FALSE)) annotationBins2(SNPhood.o, regions = SNP_names, fullAnnotation = TRUE) ## Retrieve annotation of datasets annotationDatasets(SNPhood.o) annotationReadGroups(SNPhood.o) ## Extract counts after the binning # Extract one count matrix from the paternal read group from the first dataset head(counts(SNPhood.o, type = "binned", readGroup = "paternal", dataset = 1)) # Extract count matrices from all read groups from the first dataset str(counts(SNPhood.o, type = "binned", readGroup = NULL, dataset = 1)) # Extract count matrices from all read groups from the first dataset (using its name) DataSetName <- annotationDatasets(SNPhood.o)[1] str(counts(SNPhood.o, type = "binned", readGroup = NULL, dataset = DataSetName)) # Extract count matrices from all read groups from the all dataset str(counts(SNPhood.o, type = "binned", dataset = NULL)) ## Similarly, you can also extract counts before the binning head(counts(SNPhood.o, type = "unbinned", readGroup = "paternal", dataset = 1)) ## If you had enrichments instead of counts, you would employ the enrichments method in analogy to counts enrichment(SNPhood.o, readGroup = "paternal") ``` You can modify the information stored in a *SNPhood* object: ```{r SNPhoodObj2, echo=TRUE, results="hide"} SNPhood.o ## Rename regions, datasets, bins, and read groups mapping = as.list(paste0(annotationRegions(SNPhood.o),".newName")) names(mapping) = annotationRegions(SNPhood.o) SNPhood_mod.o = renameRegions(SNPhood.o, mapping) mapping = list("Individual1", "Individual2") names(mapping) = annotationDatasets(SNPhood.o) SNPhood_mod.o = renameDatasets(SNPhood.o, mapping) mapping = list("Bin1_NEW") names(mapping) = annotationBins(SNPhood.o)[1] SNPhood_mod.o = renameBins(SNPhood.o, mapping) mapping = list("a", "b", "c") names(mapping) = annotationReadGroups(SNPhood.o) SNPhood_mod.o = renameReadGroups(SNPhood.o, mapping) ## Delete regions, datasets, and read groups (deleting bins is still in development) SNPhood_mod.o = deleteRegions(SNPhood.o, regions = 1:5) SNPhood_mod.o = deleteRegions(SNPhood.o, regions = c("rs9984805", "rs59121565")) SNPhood_mod.o = deleteDatasets(SNPhood.o, datasets = 1) SNPhood_mod.o = deleteDatasets(SNPhood.o, datasets = "GM12890") # For read groups, we currently support only a name referral SNPhood_mod.o = deleteReadGroups(SNPhood.o, readGroups = "paternal") ## Merge read groups SNPhood_merged.o = mergeReadGroups(SNPhood.o) nReadGroups(SNPhood_merged.o) annotationReadGroups(SNPhood_merged.o) ``` ## Visualizing counts and enrichment Let's first visualize the number of overlapping reads for the regions before (*plotRegionCounts*) and after binning (*plotBinCounts*) datasets and read groups. The two functions have a number of arguments to make the visualization as flexible as possible. See the help pages for details, we here only touch upon a few parameters: ```{r visualizeCounts, echo=TRUE, results="hide"} plotBinCounts(SNPhood.o, region = 2) plotBinCounts(SNPhood.o, region = 2, plotGenotypeRatio = TRUE, readGroups = c("paternal","maternal")) plotRegionCounts(SNPhood.o, regions = 1:5, plotRegionBoundaries = TRUE, sizePoints = 2, plotRegionLabels = TRUE, mergeReadGroupCounts = TRUE) plotRegionCounts(SNPhood.o, regions = NULL, plotChr = "chr21", sizePoints = 2) ``` ## Testing for and visualizing allelic biases ```{r allelicBias, echo=TRUE} # Run the analysis, perform no time-consuming background calculation for now SNPhood.o = testForAllelicBiases(SNPhood.o, readGroups = c("paternal", "maternal"), verbose = FALSE) # Extract the results of the analysis, again using the results function names(results(SNPhood.o, type = "allelicBias")) head(results(SNPhood.o, type = "allelicBias", elements = "pValue")[[1]], 4) ``` You can now visualize the results to get an overview of the allelic bias in our dataset. You may start by visualizing the results of the allelic bias analysis across regions or a user-defined genomic range such as the full chromosome 21. For this, the function *plotAllelicBiasResultsOverview* is employed, which either plots the minimum or median p-value for each region in the selected genomic region. Note that p-values are uncorrected for multiple testing at this stage, an issue that you can address by calculating a permutation-based background distribution to assess the false discovery rate (FDR). For more details, see the main vignette and the argument *calcBackgroundDistr* of the function *testForAllelicBiases*. ```{r allelicBias2, echo=TRUE} plotAllelicBiasResultsOverview(SNPhood.o, regions = NULL, plotChr = "chr21", signThreshold = 0.05, pValueSummary = "min") plotAllelicBiasResultsOverview(SNPhood.o, regions = 3:5, plotRegionBoundaries = TRUE, plotRegionLabels = TRUE, signThreshold = 0.01, pValueSummary = "min") ``` The first plot indicates that a considerable amount of regions seems to show an allelic bias. You can then look at the results for the allelic bias analysis in detail for a specific dataset and region with the function *plotAllelicBiasResults*: ```{r allelicBias3, echo=TRUE, fig.height=7} plots = plotAllelicBiasResults(SNPhood.o, region = 2, signThreshold = 0.05, readGroupColors = c("blue", "red", "gray")) plots = plotAllelicBiasResults(SNPhood.o, region = 7, signThreshold = 0.05, readGroupColors = c("blue", "red", "gray")) ``` While the first plot shows an example of a region for which no allelic bias can be found, the second one shows significant allelic bias across many bins. In this case, the allelic bias might be caused by the genotype, as paternal and maternal reads have different genotypes (see the legend, A versus C). ## Cluster analyses *SNPhood* provides some clustering functionalities to cluster SNPs based on their local neighbourhood. Let's try them out. First, you will cluster the counts for the paternal read group (allele) from the first dataset using 2 and 5 clusters, respectively. The function *plotAndClusterMatrix* returns an object of class *SNPhood* so that the clustering results can be accessed directly for subsequent visualization. ```{r clusterCountMatrix, echo=TRUE} SNPhood.o = plotAndClusterMatrix(SNPhood.o, readGroup = "paternal", nClustersVec = 2, dataset = 1, verbose = FALSE) SNPhood.o = plotAndClusterMatrix(SNPhood.o, readGroup = "paternal", nClustersVec = 5, dataset = 1, verbose = FALSE) str(results(SNPhood.o, type = "clustering", elements = "paternal"), list.len = 3) ``` You can also plot only a subset of the clusters to remove clusters with invariant regions. Here, let's only plot clusters 2 to 5 instead of all of them (that is, 1 to 5): ```{r clusterCountMatrix2, echo=TRUE} SNPhood.o = plotAndClusterMatrix(SNPhood.o, readGroup = "paternal", nClustersVec = 5, dataset = 1, clustersToPlot = 2:5, verbose = FALSE) ``` As you can see, most of the regions have now been removed because they belonged to cluster 1. To summarize the cluster results and to more easily detect the patterns, you can also calculate the average enrichment across bins per cluster using the function *plotClusterAverage*. Note that this function returns the plots only: ```{r summarizeClusters, echo=TRUE} p = plotClusterAverage(SNPhood.o, readGroup = "paternal", dataset = 1) ``` The warning can be ignored here, it simply informs us that multiple plots have been generated even though no file name has been specified. Thus, you may only seem the last plot. ## Genotype analyses Next you can integrate the genotype from external sources for our regions. For this, you first create a data frame to specify which datasets to integrate genotypes with, the path to the VCF file that provides the genotypes, and the name of the column in the VCF file that corresponds to the dataset. ```{r associategenotype, echo=TRUE} (mapping = genotypeMapping = data.frame(samples = annotationDatasets(SNPhood.o), genotypeFile = rep(fileGenotypes, 2), sampleName = c("NA10847", "NA12890") )) SNPhood.o = associateGenotypes(SNPhood.o, mapping) ``` Let's take a look at the imported genotypes to check the uniformity across datasets: ```{r plotGenotypesPerSNP, echo=TRUE} p = plotGenotypesPerSNP(SNPhood.o, regions = 1:20) ``` ## Combined cluster and genotype analyses An additional feature that might be useful when performing clustering on several individuals is to group individuals for each SNP according to their genotype, which we divide into "strong" and "weak" genotypes. They are determined based on the signal obtained for the different genotypes. For this, we use the function *plotAndCalculateWeakAndStrongGenotype*. This function can only be executed when read groups have been merged (i.e., when only one read group is present). That's easy to fix: ```{r strongAndWeakGenotype2, echo=TRUE} SNPhood_merged.o = mergeReadGroups(SNPhood.o) SNPhood_merged.o = plotAndCalculateWeakAndStrongGenotype(SNPhood_merged.o, normalize = TRUE, nClustersVec = 3) ``` The first plot shows the results for the strong genotype, the second one for the weak genotype. Note that you now have two *SNPhood* objects in your workspace: One without and one with merged read groups. Because merging read groups is irreversible (see ?mergeReadGroups), we keep both objects in the workspace. You can now combine clustering and genotype one more time and perform the clustering only on the signal averaged across all high-genotype individuals at each SNP to increase the signal to noise ratio (analogous to the analysis in Figure 6 in Grubert et al. [1]): ```{r plotGenotypePerCluster, echo=TRUE} p = plotGenotypesPerCluster(SNPhood_merged.o, printPlot = FALSE, printBinLabels = TRUE) p[[1]] ``` ## How to continue? From here on, possibilities are endless, and you can further investigate patterns and trends in the data! We hope that the *SNPhood* package is useful for your research and encourage you to contact us if you have any question or feature request! # Bug Reports, Feature Requests and Contact Information We value all the feedback that we receive and will try to reply in a timely manner. Please report any bug that you encounter as well as any feature request that you may have to SNPhood@gmail.com. # References [1] Grubert, F., Zaugg, J. B., Kasowski, M., Ursu, O., Spacek, D. V., Martin, A. R., ... & Snyder, M. (2015). Genetic Control of Chromatin States in Humans Involves Local and Distal Chromosomal Interactions. Cell, 162(5), 1051-1065.