--- vignette: > %\VignetteIndexEntry{Integrating TCGA Data - RTCGA.data - The Family of R Packages with Data from The Cancer Genome Atlas Study} %\VignetteEngine{knitr::rmarkdown} title: "RTCGA.data - The Family of R Packages with Data from The Cancer Genome Atlas Study" author: "Marcin Kosinski, Przemyslaw Biecek" date: "`r Sys.Date()`" output: html_document: theme: readable highlight: tango fig_width: 17 fig_height: 10 toc: true toc_depth: 4 keep_md: true number_sections: true --- The following article presents RTCGA.data: a family of R packages with data from The Cancer Genome Atlas Project (TCGA) study. TCGA is a comprehensive and coordinated effort to accelerate our understanding of the molecular basis of cancer through the application of genome analysis technologies, including large-scale genome sequencing [1]. We converted selected datasets from this study into few separate packages that are hosted on one [GitHub repository](https://github.com/mi2-warsaw/RTCGA.data). These R packages make selected datasets easier to access and manage. Data sets in RTCGA.data packages are large and cover complex relations between clinical outcomes and genetic background. These packages will be useful for at least three audiences: biostatisticians that work with cancer data; researchers that are working on large scale algorithms, for them RTGCA data will be a perfect blasting site; teachers that are presenting data analysis method on real data problems. ```{r, echo=FALSE, message=FALSE, warning = FALSE} library(knitr) opts_chunk$set(comment="", message=FALSE, warning = FALSE, tidy.opts=list(keep.blank.line=TRUE, width.cutoff=150),options(width=150), eval = FALSE) ``` ```{r} ``` # Motivation The Cancer Genome Atlas (TCGA) Data Portal provides a platform for researchers to search, download, and analyze data sets generated by TCGA. It contains clinical information, genomic characterization data, and high level sequence analysis of the tumor genomes [1]. TCGA data are available through Firehose Broad GDAC portal [1]. One can select cancer type (cohort) and data type (e.g. clinical, RNA expression, methylation, ..) and download a `tar.gz` file with compressed data. When working with many cancer types we find this approach burdensome: - If one requires to download datasets containing i.e. information about genes' expressions for all available cohorts types (TCGA collected data for more than 30 various cancer types) one would have to go through click-to-download process many times, which is inconvenient and time-consuming. - Clinical datasets from TCGA project are not in a standard tidy data format, which is: one row for one observation and one column for one variable. They are transposed what makes work with those data burdensome. That becomes more onerous when one would like to investigate many clinical datasets. - Datasets containing information on some data types (e.g. gene's mutations) are not in one easy-to-handle file. Every patient has it's own file, what for many potential researchers may be an impassable barrier. - Data governance for many datasets for various cohorts saved in different folders with strange (default after untarring) names may be exhausting and uncomfortable for researchers that are not very skilled in data management or data processing. For these reasons we prepared selected datasets from TCGA project in an easy to handle and process way and embed them in 4 separate R packages. All packages can be installed from BioConductor by evaluating the following code: ```{r, eval=FALSE} source("https://bioconductor.org/biocLite.R") biocLite("RTCGA.clinical") biocLite("RTCGA.rnaseq") biocLite("RTCGA.mutations") biocLite("RTCGA.cnv") ``` # Patient's barcode as a key to merge data A TCGA barcode is composed of a collection of identifiers. Each specifically identifies a TCGA data element. An illustration on what each part of the patient's barcode can be found on \newline \ [https://wiki.nci.nih.gov/display/TCGA/TCGA+barcode](https://wiki.nci.nih.gov/display/TCGA/TCGA+barcode). # How to work with RTCGA.data family **RTCGA.data** family contains 4 packages: - `RTCGA.clinical` package containing clinical datasets from TCGA. Each cohort contains one dataset prepared in a tidy format. Each row, marked with patients' barcode, corresponds to one patient. Clinical data format is explained here https://wiki.nci.nih.gov/display/TCGA/Clinical+Data+Overview - `RTCGA.rnaseq` package containing genes' expressions datasets from TCGA. Each cohort contains one dataset with over 20 thousand of columns corresponding to genes' expression. Rows correspond to patients, that can be matched with patient's barcode. Genes' expressions data format is explained here https://wiki.nci.nih.gov/display/TCGA/RNASeq+Version+2 - `RTCGA.mutations` package containint genes' mutations datsets from TCGA. Each cohort contains one dataset with extra column specifying patient's barcode which enables to distinguish which rows correspond to which patient. Mutations' data format is explained here https://wiki.nci.nih.gov/display/TCGA/Mutation+Annotation+Format+(MAF)+Specification. - `RTCGA.cnv` package containing copy number (the number of copies of a given gene per cell) variation datasets from TCGA. More detailed information about datasets included in **RTCGA.data** family are shown in Table below. ```{r, echo=FALSE, eval = FALSE, results='asis'} library(dplyr) library(XML) library(stringi) readHTMLTable("http://gdac.broadinstitute.org/") -> df info <- df[[39]][,1:3] # this function produces information correspoding to Table nr 1 show_dims_RTCGA <- function( package_name ){ data(package = paste0("RTCGA.", package_name))$results[,3] %>% sapply( function(element){ #get(element, envir = .GlobalEnv) %>% dim() data(list=element, package = paste0("RTCGA.", package_name), envir = .GlobalEnv) get(element, envir = .GlobalEnv) %>% dim() -> res rm(list = element, envir = .GlobalEnv) return(res) }) %>% t -> df df_dims <- data.frame( "Cohort" = stri_extract_all_regex(row.names(df), pattern = paste0("[^\\.",package_name,"]")) %>% lapply( stri_flatten) %>% unlist, package_name = paste0(df[,1]," x ",df[,2]) ) names(df_dims)[2] <- package_name return(df_dims) } library(RTCGA.clinical) library(RTCGA.rnaseq) library(RTCGA.mutations) library(RTCGA.cnv) library(pander) left_join( x = info , y = show_dims_RTCGA("clinical"), by = "Cohort") %>% left_join( y = show_dims_RTCGA("cnv"), by="Cohort") %>% left_join( y = show_dims_RTCGA("mutations"), by="Cohort") %>% left_join( y = show_dims_RTCGA("rnaseq"), by="Cohort") %>% pandoc.table() ``` After installation, one can load any package from **RTCGA.data** family with commands ```{r, eval = FALSE} library(RTCGA.clinical) library(RTCGA.rnaseq) library(RTCGA.mutations) library(RTCGA.cnv) ``` and one can check what datasets are available (or aboved Table) with commands ```{r, eval = FALSE} ?clinical ?rnaseq ?mutations ?cnv ``` The data loading proceeds in a regular way. Simply type ```{r, eval = FALSE} data(cohort.package) ``` where `cohort` corresponds to a specific Cohort of patients and `package` corresponds to the one of four packages from **RTCGA.data** family. # Examples of applications ## The Kaplan-Meier estimate of the survival curves with the clinical data **RTCGA.data** family is excellent when one researches in a field of survival analysis and genomics. Survival times for patients are included in clinical datasets. The following example plots Kaplan-Meier [5] estimates of the survival functions for patients suffering from LUAD cancer, divided into stages of the cancer. ```{r, eval = FALSE} library(dplyr) library(RTCGA.clinical) #library(devtools);biocLite("mi2-warsaw/RTCGA.tools") library(RTCGA.tools) library(survival) library(survMisc) LUAD.clinical %>% mutate( patient.vital_status = ifelse(LUAD.clinical$patient.vital_status %>% as.character() =="dead",1,0), barcode = patient.bcr_patient_barcode %>% as.character(), times = ifelse( !is.na(patient.days_to_last_followup), patient.days_to_last_followup %>% as.character() %>% as.numeric(), patient.days_to_death %>% as.character() %>% as.numeric() ), stage = RTCGA.tools::mergeStages(LUAD.clinical$patient.stage_event.pathologic_stage) ) %>% rename( therapy = patient.drugs.drug.therapy_types.therapy_type ) %>% filter( !is.na(times) ) -> LUAD.clinical.selected LUAD.clinical.selected %>% survfit( Surv(times, patient.vital_status) ~ stage, data = .) %>% survMisc:::autoplot.survfit( titleSize=12, type="CI" ) %>% .[[2]] -> km_plot_luad km_plot_luad ``` The Kaplan-Meier estimate of the survival curve for the LUAD cancer. ## The Cox proportional hazards model with the genes' mutations data In a simple way one can use previously selected data to merge them with genes' mutations data and to compute Cox proportional hazards model [9]. Moreover, the goodness of fit can be checked with the plot of Martingale Residuals. ```{r, eval = FALSE, warning=FALSE} library(RTCGA.mutations) library(ggthemes) LUAD.clinical.selected %>% left_join( y = LUAD.mutations %>% filter( Hugo_Symbol == "TP53") %>% mutate( barcode = barcode %>% as.character %>% tolower %>% substr(1,12) ) %>% select( barcode, Variant_Classification), by = "barcode") %>% mutate( Variant_Classification = divideTP53(Variant_Classification) ) -> LUAD.clinical.mutations.selected (coxph(Surv(times, patient.vital_status)~ as.factor(stage)+Variant_Classification, data = LUAD.clinical.mutations.selected) -> LUAD.coxph) ``` ```{r, eval = FALSE} qplot(predict(LUAD.coxph, type="lp"),residuals(LUAD.coxph))+ theme_tufte(base_size=20)+ xlab("Linear combinations")+ ylab("Martingale residuals")+ geom_hline(yintercept=0, col ="orange", size = 3) ``` Martingale residuals vs. linear combination of the independent variables for the LUAD cancer's Cox proportional hazard model. ## The Principal Components Analysis for the rnaseq data One can also perform a Principal Components Analysis, after binding rnaseq data for few random cancer types like below. It can be seen that genes' expressions amongs those cancers vary and samples group in view of cancer type. ```{r, eval = FALSE} rbind(ACC.rnaseq, CHOL.rnaseq, GBM.rnaseq, PCPG.rnaseq, UVM.rnaseq) -> rnaseq_sample # which columns contain only zeros rnaseq_sample[,-1] %>% colSums() -> rnaseq_col_sums which(rnaseq_col_sums == 0) -> columns_with_only0 # pca rnaseq_sample[, -c(1,columns_with_only0+1)] %>% prcomp( scale = TRUE ) -> PCA # labels for pca lapply(list(ACC.rnaseq, CHOL.rnaseq, GBM.rnaseq, PCPG.rnaseq, UVM.rnaseq), nrow) -> rnaseq_nrow mapply(rep, c("ACC.rnaseq", "CHOL.rnaseq", "GBM.rnaseq", "PCPG.rnaseq", "UVM.rnaseq"), rnaseq_nrow) %>% unlist -> rnaseq_pca_labels # biplot #library(devtools);install_github("vqv/ggbiplot") library(ggbiplot) rownames(PCA$rotation) <- 1:nrow(PCA$rotation) ggbiplot(PCA, obs.scale = 1, var.scale = 1, groups = rnaseq_pca_labels, ellipse = TRUE, circle = TRUE, var.axes=FALSE) + theme(legend.direction = 'horizontal', legend.position = 'top') -> biplot_rnaseq biplot_rnaseq ``` The biplot for 2 main components of the principal component analysis of genes' expressions data for 5 various cancer types ```{r, eval=FALSE} [1] http://cancergenome.nih.gov/ [2] http://gdac.broadinstitute.org/ [3] http://cran.r-project.org/bin/windows/Rtools/ [4] https://wiki.nci.nih.gov/display/TCGA/TCGA+barcode [9] Cox D. R., (1972) \textit{Regression models and life-tables (with discussion)}, Journal of the Royal Statistical Society Series B 34:187-220. [5] Kaplan, E. L.; Meier, P. (1958). "Nonparametric estimation from incomplete observations". J. Amer. Statist. Assn. 53 (282): 457-481. JSTOR 2281868. ```