---
title: "RTCGA package tutorial"
author: "Marcin Kosinski"
date: "`r Sys.Date()`"
output:
  html_document:
    theme: readable
    highlight: tango
    fig_width: 17
    fig_height: 10
    toc: true
    toc_depth: 4
    keep_md: true
    number_sections: true
vignette: >
  %\VignetteIndexEntry{Integrating TCGA Data - RTCGA Tutorial}
  %\VignetteEngine{knitr::rmarkdown}
---


```{r, echo=FALSE}
library(knitr)
opts_chunk$set(comment="", message=FALSE, warning = FALSE, tidy.opts=list(keep.blank.line=TRUE, width.cutoff=150))
# console width is an R session option, not a chunk option
options(width=150)
```


# Introduction

The Cancer Genome Atlas (TCGA) Data Portal provides a platform for researchers to search, download, and analyze data sets generated by TCGA. It contains clinical information, genomic characterization data, and high-level sequence analysis of the tumor genomes. The goal is to understand cancer genomics in order to improve cancer care.

The RTCGA package offers an easy interface for downloading and integrating a variety of TCGA data, using the patient barcode as a key. This makes data acquisition easier, which in turn supports research and the improvement of patients' treatment. Furthermore, the RTCGA package transforms the TCGA data into a tidy form that is convenient to use in the R statistical environment.


# RTCGA package

More detailed information about this package can be found
here [https://github.com/MarcinKosinski/RTCGA](https://github.com/MarcinKosinski/RTCGA).


## Installation of the RTCGA package
To get started, install the latest version of **RTCGA** from Bioconductor:
```{r, eval=FALSE}
source("http://bioconductor.org/biocLite.R")
biocLite("RTCGA")
```
or install the development version from GitHub:
```{r, eval=FALSE}
if (!require(devtools)) {
    install.packages("devtools")
    library(devtools)
}
# biocLite() must already be available (see the previous chunk)
biocLite("MarcinKosinski/RTCGA")
```

If you use devtools on Windows, make sure that [Rtools](http://cran.r-project.org/bin/windows/Rtools/) is installed on your computer.

# Light data management and manipulations

Below is an example of how to use the `RTCGA` package to download the `ACC` cohort data: `clinical` data, `mutations` data, and `rnaseq v2` data. It also shows how to unzip those data and how to read them into a tidy format.


## Adrenal Cortex Cancer (Adrenocortical carcinoma - ACC) data downloading

We will download data from one of the most recent release dates.

```{r}
library(RTCGA)
# take the second-to-last of the available release dates
releaseDate <- tail( checkTCGA('Dates'), 2 )[1]
# if the server doesn't respond, set the date manually, e.g.
# releaseDate <- "2015-06-01"
```
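
The expression `tail(x, 2)[1]` picks the second-to-last element of `x`. A minimal sketch on a made-up vector, assuming the dates are sorted from oldest to newest (the values below are illustrative only and are not actual output of `checkTCGA('Dates')`):

```{r}
# Hypothetical release dates, oldest first (illustrative values only)
example_dates <- c("2015-02-04", "2015-04-02", "2015-06-01", "2015-08-21")

# tail(., 2) keeps the last two elements; [1] takes the first of those
second_newest <- tail(example_dates, 2)[1]
second_newest
```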


We will need a folder into which we will download data.
```{r, echo=2}
if(file.exists("data")){
    unlink("data", recursive = TRUE, force = TRUE)
}
dir.create("data")
```


### Clinical data

Let us download the clinical data. Since no `dataSet` is given, this call fetches the clinical data set:

```{r}
downloadTCGA( cancerTypes = "ACC", destDir = "data", date = releaseDate )
```

### Rnaseq v2 data

Let us download the rnaseq v2 data by naming the data set explicitly:

```{r}
downloadTCGA( cancerTypes = "ACC", destDir = "data", date = releaseDate,
              dataSet = "rnaseqv2__illuminahiseq_rnaseqv2__unc_edu__Level_3__RSEM_genes_normalized__data.Level" )
# one can check all available dataSets' names with
# checkTCGA('DataSets')
```


### Mutations data

Let us download the gene mutation data in the same way:

```{r}
downloadTCGA( cancerTypes = "ACC", destDir = "data", date = releaseDate,
              dataSet = "Mutation_Packager_Calls.Level" )
```

## `untarFile` and `removeTar` parameters

By default the `untarFile` and `removeTar` parameters are set to `TRUE`, which
means that after a requested file is downloaded it is untarred and the no longer needed `*.tar.gz` file is removed. If `downloadTCGA()` was called with those parameters set to `FALSE`, the files can be untarred and removed manually, as shown below.

### Untarring data

Let us use the `untar()` function to untar all downloaded sets.

```{r, warning=FALSE, results='hide', eval=FALSE}

list.files( "data/") %>% 
   file.path( "data", .) %>%
   sapply( untar, exdir = "data/" )

```

### Removing no longer needed `tar.gz` files

After the datasets are untarred, the `tar.gz` files are no longer needed and can be deleted.

```{r, results='hide', eval=FALSE}
list.files( "data/") %>% 
   file.path( "data", .) %>%
   grep( pattern = "\\.tar\\.gz$", x = ., value = TRUE) %>%
   sapply( file.remove )
```
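
When selecting archives by name, note that `.` in a regular expression matches any character. The sketch below uses made-up file names (they are illustrative only) to compare a loose pattern with an escaped, anchored one:

```{r}
# Hypothetical file names (illustrative only)
example_files <- c("ACC.Merge_Clinical.tar.gz",
                   "ACC.Merge_Clinical",
                   "notes_tarXgz.txt")

# Unescaped dot: also matches "tarXgz"
loose  <- grep("tar.gz", example_files, value = TRUE)

# Escaped dots and end-of-string anchor: matches real archives only
strict <- grep("\\.tar\\.gz$", example_files, value = TRUE)

loose
strict
```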

## Shortening directories of downloaded files

Because the path to the rnaseq data has more than 256 characters, we need to shorten
that directory name so that R can **notice** the existence of this file.

```{r}
list.files( "data/") %>% 
   file.path( "data", .) %>%
   grep("rnaseq", x = ., value = TRUE) %>%    
   file.rename( to = substr(.,start=1,stop=50))
```
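
The same idea can be tried safely in a temporary directory. This is only a sketch: the directory names are made up, and it truncates the directory name to 20 characters, whereas the chunk above truncates the whole `data/...` path to 50 characters.

```{r}
# Work in a fresh temporary directory (hypothetical names throughout)
demo_dir <- tempfile("rename_demo")
dir.create(demo_dir)
long_name <- paste0("rnaseq_", paste(rep("x", 80), collapse = ""))
dir.create(file.path(demo_dir, long_name))

# Keep only the first 20 characters of the directory name
short_name <- substr(long_name, start = 1, stop = 20)
renamed <- file.rename(from = file.path(demo_dir, long_name),
                       to   = file.path(demo_dir, short_name))

# Record what the directory contains now, then clean up
listed <- list.files(demo_dir)
unlink(demo_dir, recursive = TRUE)
listed
```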

# Reading TCGA data to the tidy format

## Clinical data

All downloaded clinical datasets for all cohorts are available in the `RTCGA.clinical` package. The process is described here: [http://mi2-warsaw.github.io/RTCGA.data/](http://mi2-warsaw.github.io/RTCGA.data/). The clinical data format is explained [here](https://wiki.nci.nih.gov/display/TCGA/Clinical+Data+Overview). Below is a single example of how to read the clinical data for the ACC cohort downloaded above.

```{r}
# locate the folder with the clinical data
list.files("data/") %>%
    grep("Clinical", x = ., value = TRUE) %>%
    file.path("data", .)  -> folder

# find the merged clinical file in that folder and read it
folder %>%
    list.files() %>%
    grep("clin.merged", x = ., value=TRUE) %>%
    file.path(folder, .) %>%
    readTCGA(path = ., "clinical") -> ACC.clinical

dim(ACC.clinical)
```
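
The folder and file lookup above can also be written with base R alone, since `list.files()` accepts a `pattern` and can return full paths. A minimal sketch on a temporary directory with made-up file names (the names are assumptions for illustration, not the actual names of TCGA files):

```{r}
# Build a small hypothetical directory tree in a temporary location
demo_root <- tempfile("lookup_demo")
demo_clin_dir <- file.path(demo_root, "ACC.Merge_Clinical.demo")
dir.create(demo_clin_dir, recursive = TRUE)
invisible(file.create(file.path(demo_clin_dir,
                                c("ACC.clin.merged.txt", "MANIFEST.txt"))))

# Step 1: find the folder whose name contains "Clinical"
demo_folder <- list.files(demo_root, pattern = "Clinical", full.names = TRUE)

# Step 2: find the merged clinical file inside that folder
demo_file <- list.files(demo_folder, pattern = "clin\\.merged", full.names = TRUE)
found <- basename(demo_file)

unlink(demo_root, recursive = TRUE)
found
```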


## Rnaseq v2 data

All downloaded rnaseq datasets for all cohorts are available in the `RTCGA.rnaseq` package. The process is described here: [http://mi2-warsaw.github.io/RTCGA.data/](http://mi2-warsaw.github.io/RTCGA.data/). The rnaseq data format is explained [here](https://wiki.nci.nih.gov/display/TCGA/RNASeq+Version+2). Below is a single example of how to read the rnaseq data for the ACC cohort downloaded above.

```{r}
# locate the (shortened) folder with the rnaseq data
list.files("data/") %>%
    grep("rnaseq", x = ., value = TRUE) %>%
    file.path("data", .) -> folder

# find the expression file in that folder and read it
folder %>%
    list.files() %>%
    grep("illumina", x = ., value=TRUE) %>%
    file.path(folder, .) %>%
    readTCGA(path = ., "rnaseq") -> ACC.rnaseq

dim(ACC.rnaseq)
```


## Mutations data

All downloaded mutations datasets for all cohorts are available in the `RTCGA.mutations` package. The process is described here: [http://mi2-warsaw.github.io/RTCGA.data/](http://mi2-warsaw.github.io/RTCGA.data/). The mutations data format is explained [here](https://wiki.nci.nih.gov/display/TCGA/Mutation+Annotation+Format+(MAF)+Specification). Below is a single example of how to read the mutations data for the ACC cohort downloaded above.

```{r, results='hide'}
# locate the folder with the mutations data
list.files("data/") %>%
    grep("Mutation", x = ., value = TRUE) %>%
    file.path("data", .) -> folder

# for mutations, readTCGA() takes the folder itself as the path
folder %>% 
    readTCGA(path = ., "mutations") -> ACC.mutations
```
```{r}
dim(ACC.mutations)
```
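
Once the tables are read, they can be linked through the patient barcode, which is the integration key mentioned in the introduction. The sketch below uses two small made-up tables. The column names, barcodes, and values are assumptions for illustration and do not reproduce the output of `readTCGA()`. It relies on the TCGA convention that the first 12 characters of a sample barcode identify the patient.

```{r}
# Hypothetical clinical table: one row per patient
example_clinical <- data.frame(
    patient_barcode = c("TCGA-OR-A5J1", "TCGA-OR-A5J2"),
    vital_status    = c("dead", "alive"),
    stringsAsFactors = FALSE
)

# Hypothetical expression table: one row per sample
example_expression <- data.frame(
    sample_barcode = c("TCGA-OR-A5J1-01A-11R-A29S-07",
                       "TCGA-OR-A5J2-01A-11R-A29S-07"),
    GENE_A         = c(10.5, 3.2),
    stringsAsFactors = FALSE
)

# The patient part of a sample barcode is its first 12 characters
example_expression$patient_barcode <-
    substr(example_expression$sample_barcode, start = 1, stop = 12)

# Join both tables on the patient barcode
example_merged <- merge(example_clinical, example_expression,
                        by = "patient_barcode")
example_merged
```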

# Information about TCGA project datasets


## Codes and counts for each cohort

```{r, eval = TRUE, results='asis'}
# library(devtools)
# install_github('Rapporter/pander')
if( require(pander) ){
    infoTCGA() %>%
        pandoc.table()
}
```


## Available cohort names

```{r, eval = TRUE}
(cohorts <- infoTCGA() %>% 
   rownames() %>% 
   sub("-counts", "", x=.))
```
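
The `sub()` call strips the `-counts` suffix from each row name, leaving the bare cohort code. A minimal sketch on made-up row names (illustrative values, not actual output of `infoTCGA()`):

```{r}
# Hypothetical row names in the "<cohort>-counts" form (illustrative only)
example_rownames <- c("ACC-counts", "BLCA-counts", "BRCA-counts")

# Remove the suffix to obtain the cohort codes
example_cohorts <- sub("-counts", "", x = example_rownames)
example_cohorts
```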

## Dates of release

```{r, eval = TRUE}
checkTCGA('Dates')
```

## Names of available DataSets


```{r}
checkTCGA('DataSets', 'ACC', releaseDate) %>%
    length()
```
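
To pick a particular data set, its name can be searched for with `grep()`. The sketch below assumes the names are available as a plain character vector. The two names are the ones used in the download examples of this tutorial, not a full listing from `checkTCGA('DataSets')`.

```{r}
# Two data set names taken from the download examples in this tutorial
example_sets <- c(
    "rnaseqv2__illuminahiseq_rnaseqv2__unc_edu__Level_3__RSEM_genes_normalized__data.Level",
    "Mutation_Packager_Calls.Level"
)

# Keep only the names that mention rnaseqv2
rnaseq_sets <- grep("rnaseqv2", example_sets, value = TRUE)
length(rnaseq_sets)
```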


```{r, echo=FALSE, results='hide'}
unlink("data", recursive = TRUE, force = TRUE)
```