---
title: "RTCGA package tutorial"
author: "Marcin Kosinski"
date: "`r Sys.Date()`"
output:
  html_document:
    theme: readable
    highlight: tango
    fig_width: 17
    fig_height: 10
    toc: true
    toc_depth: 4
    keep_md: true
    number_sections: true
vignette: >
  %\VignetteIndexEntry{Integrating TCGA Data - RTCGA Tutorial}
  %\VignetteEngine{knitr::rmarkdown}
---

```{r, echo=FALSE}
library(knitr)
opts_chunk$set(comment="", message=FALSE, warning = FALSE, tidy.opts=list(keep.blank.line=TRUE, width.cutoff=150),options(width=150))
```

# Introduction

The Cancer Genome Atlas (TCGA) Data Portal provides a platform for researchers to search, download, and analyze data sets generated by TCGA. It contains clinical information, genomic characterization data, and high-level sequence analysis of the tumor genomes. The goal is to understand genomics in order to improve cancer care.

The RTCGA package offers an easy interface for downloading and integrating a variety of TCGA data using the patient barcode as a key. This makes data acquisition easier, which in turn supports scientific work and the improvement of patients' treatment. Furthermore, the RTCGA package transforms the TCGA data into a tidy form that is convenient to use in R.

# RTCGA package

More detailed information about this package can be found here: [https://github.com/MarcinKosinski/RTCGA](https://github.com/MarcinKosinski/RTCGA).

## Installation of the RTCGA package

To get started, install the latest version of **RTCGA** from Bioconductor:

```{r, eval=FALSE}
source("http://bioconductor.org/biocLite.R")
biocLite("RTCGA")
```

or, for the development version, use:

```{r, eval=FALSE}
if (!require(devtools)) {
    install.packages("devtools")
    library(devtools)
}
biocLite("MarcinKosinski/RTCGA")
```

If you are using devtools on Windows, make sure you have [rtools](http://cran.r-project.org/bin/windows/Rtools/) installed on your computer.
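
Before moving on, it can help to confirm that the installation succeeded. The following is a minimal sketch using only base R; the variable name `rtcga_installed` is introduced here for illustration and is not part of RTCGA.

```r
# requireNamespace() returns TRUE if the package can be loaded and FALSE
# otherwise, without attaching the package to the search path.
rtcga_installed <- requireNamespace("RTCGA", quietly = TRUE)

if (rtcga_installed) {
  # packageVersion() reads the version from the installed package metadata.
  message("RTCGA version ",
          as.character(utils::packageVersion("RTCGA")),
          " is installed.")
} else {
  message("RTCGA is not installed yet; see the installation commands above.")
}
```

This check only reports the state of the local library; it does not install anything.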

# Light data management and manipulations

Below is an example of how to use the `RTCGA` package to download `ACC` cohort data comprising `clinical` data, `mutations` data, and `rnaseq v2` data. It also shows how to unpack those data and read them into a tidy format.

## Adrenal Cortex Cancer (Adrenocortical carcinoma - ACC) data downloading

We will download data from one of the most recent release dates.

```{r}
library(RTCGA)
releaseDate <- tail( checkTCGA('Dates'), 2 )[1]
# if the server doesn't respond, set the date manually, e.g.
# releaseDate <- "2015-06-01"
```

We will need a folder into which we will download the data.

```{r, echo=4}
if(file.exists("data")){
  unlink("data", recursive = TRUE, force = TRUE)
}
dir.create("data")
```

### Clinical data

Let us download the clinical data with the following command:

```{r}
downloadTCGA( cancerTypes = "ACC",
              destDir = "data",
              date = releaseDate )
```

### Rnaseq v2 data

Let us download the rnaseq v2 data with the following command:

```{r}
downloadTCGA( cancerTypes = "ACC",
              destDir = "data",
              date = releaseDate,
              dataSet = "rnaseqv2__illuminahiseq_rnaseqv2__unc_edu__Level_3__RSEM_genes_normalized__data.Level" )
# one can check all available dataSets' names with
# checkTCGA('DataSets')
```

### Mutations data

Let us download the genes' mutations data with the following command:

```{r}
downloadTCGA( cancerTypes = "ACC",
              destDir = "data",
              date = releaseDate,
              dataSet = "Mutation_Packager_Calls.Level" )
```

## `untarFile` and `removeTar` parameters

By default the `untarFile` and `removeTar` parameters are set to `TRUE`, which means that after a requested file is downloaded it is untarred and the no longer needed `*.tar.gz` file is removed. If `downloadTCGA()` was called with those parameters set to `FALSE`, the files can be untarred and the archives removed manually, as described in the two subsections that follow.

### Untarring data

Let us use the `untar()` function to untar all downloaded sets.

```{r, warning=FALSE, results='hide', eval=FALSE}
list.files( "data/") %>%
  file.path( "data", .) %>%
  sapply( untar, exdir = "data/" )
```

### Removing no longer needed `tar.gz` files

After the datasets are untarred, the `tar.gz` files are no longer needed and can be deleted.

```{r, results='hide', eval=FALSE}
list.files( "data/") %>%
  file.path( "data", .) %>%
  grep( pattern = "tar.gz", x = ., value = TRUE) %>%
  sapply( file.remove )
```

## Shortening directories of downloaded files

Because the path to the rnaseq data is longer than 256 characters, we need to shorten the directory name so that R can **notice** the existence of this file.

```{r}
list.files( "data/") %>%
  file.path( "data", .) %>%
  grep("rnaseq", x = ., value = TRUE) %>%
  file.rename( to = substr(.,start=1,stop=50))
```

# Reading TCGA data to the tidy format

## Clinical data

All downloaded clinical datasets for all cohorts are available in the `RTCGA.clinical` package. The process is described here: [http://mi2-warsaw.github.io/RTCGA.data/](http://mi2-warsaw.github.io/RTCGA.data/).

The clinical data format is explained [here](https://wiki.nci.nih.gov/display/TCGA/Clinical+Data+Overview). Below is a short example of how to read the clinical data for the ACC cohort downloaded above.

```{r}
# locate the folder with the clinical data
list.files("data/") %>%
  grep("Clinical", x = ., value = TRUE) %>%
  file.path("data", .) -> folder

# read the merged clinical file from that folder
folder %>%
  list.files() %>%
  grep("clin.merged", x = ., value=TRUE) %>%
  file.path(folder, .) %>%
  readTCGA(path = ., "clinical") -> ACC.clinical

dim(ACC.clinical)
```

## Rnaseq v2 data

All downloaded rnaseq datasets for all cohorts are available in the `RTCGA.rnaseq` package. The process is described here: [http://mi2-warsaw.github.io/RTCGA.data/](http://mi2-warsaw.github.io/RTCGA.data/).

The rnaseq data format is explained [here](https://wiki.nci.nih.gov/display/TCGA/RNASeq+Version+2). Below is a short example of how to read the rnaseq data for the ACC cohort downloaded above.

```{r}
# locate the (shortened) folder with the rnaseq data
list.files("data/") %>%
  grep("rnaseq", x = ., value = TRUE) %>%
  file.path("data", .) -> folder

# read the expression file from that folder
folder %>%
  list.files() %>%
  grep("illumina", x = ., value=TRUE) %>%
  file.path(folder, .) %>%
  readTCGA(path = ., "rnaseq") -> ACC.rnaseq

dim(ACC.rnaseq)
```

## Mutations data

All downloaded mutations datasets for all cohorts are available in the `RTCGA.mutations` package. The process is described here: [http://mi2-warsaw.github.io/RTCGA.data/](http://mi2-warsaw.github.io/RTCGA.data/).

The mutations data format is explained [here](https://wiki.nci.nih.gov/display/TCGA/Mutation+Annotation+Format+(MAF)+Specification). Below is a short example of how to read the mutations data for the ACC cohort downloaded above.

```{r, results='hide'}
# locate the folder with the mutations data and read all files in it
list.files("data/") %>%
  grep("Mutation", x = ., value = TRUE) %>%
  file.path("data", .) -> folder

folder %>%
  readTCGA(path = ., "mutations") -> ACC.mutations
```

```{r}
dim(ACC.mutations)
```

# Information about TCGA project datasets

## Codes and counts for each cohort

```{r, eval = TRUE, results='asis'}
# library(devtools)
# install_github('Rapporter/pander')
if( require(pander) ){
  infoTCGA() %>%
    pandoc.table()
}
```

## Available cohorts names

```{r, eval = TRUE}
(cohorts <- infoTCGA() %>%
   rownames() %>%
   sub("-counts", "", x=.))
```

## Dates of release

```{r, eval = TRUE}
checkTCGA('Dates')
```

## Names of available DataSets

```{r}
checkTCGA('DataSets', 'ACC', releaseDate) %>%
  length()
```

```{r, echo=FALSE, results='hide'}
unlink("data", recursive = TRUE, force = TRUE)
```
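
# Sketch: joining datasets by patient barcode

The introduction mentions integrating datasets using the patient barcode as a key. The following is a minimal sketch of that idea in base R, not the package's own integration code. It assumes two toy data frames (`clinical_toy` and `rnaseq_toy`, with made-up values and hypothetical column names) and relies on the TCGA convention that the first 12 characters of a sample barcode identify the patient.

```r
# Toy data invented for illustration only; these are not real TCGA records.
clinical_toy <- data.frame(
  patient_barcode = c("TCGA-AA-0001", "TCGA-AA-0002"),
  gender          = c("female", "male"),
  stringsAsFactors = FALSE
)

rnaseq_toy <- data.frame(
  sample_barcode = c("TCGA-AA-0001-01A-11R-A29S-07",
                     "TCGA-AA-0002-01A-11R-A29S-07",
                     "TCGA-AA-0003-01A-11R-A29S-07"),
  GENE_X         = c(10.5, 3.2, 7.7),
  stringsAsFactors = FALSE
)

# The patient part of a TCGA sample barcode is its first 12 characters.
rnaseq_toy$patient_barcode <- substr(rnaseq_toy$sample_barcode, 1, 12)

# merge() performs an inner join by default, so only patients present
# in both data frames are kept.
merged_toy <- merge(clinical_toy, rnaseq_toy, by = "patient_barcode")
merged_toy
```

With data read by `readTCGA()`, the barcode columns have different names, and the letter case of barcodes may differ between tables, so normalising with `toupper()` before merging may be necessary.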