---
title: "Integrative analysis workshop with TCGAbiolinks and ELMER - Get DATA"
author: "Tiago Chedraoui Silva, Simon Coetzee, Dennis Hazelett, Ben Berman, Houtan Noushmehr"
date: "`r Sys.Date()`"
output: 
  html_document:
    self_contained: true
    number_sections: no
    theme: flatly
    highlight: tango
    mathjax: null
    toc: true
    toc_float: true
    toc_depth: 2
    css: style.css
bibliography: bibliography.bib    
vignette: >
  %\VignetteIndexEntry{Integrative analysis workshop with TCGAbiolinks and ELMER - Get DATA}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, echo = FALSE,hide=TRUE, message=FALSE,warning=FALSE}
devtools::load_all(".")
```

# Introduction

In this section, we will learn to search and download DNA methylation (epigenetic) and 
gene expression (transcription) data from the newly created 
[NCI Genomic Data Commons (GDC) portal](https://portal.gdc.cancer.gov/) 
and prepare them into a Summarized Experiment object. 

The figure below highlights the workflow part which will be covered in this section.
![Part of the workflow covered in this section](figures/workflow_TGCAbiolinks.png)



# Downloading data

## Loading required libraries

```{r libs, eval=TRUE, message=FALSE,warning=F}
library(TCGAbiolinks)
library(SummarizedExperiment)
library(DT)
library(dplyr)
```

## Gene expression

```{r tcgabiolinks-exp, eval=FALSE}
query.exp <- GDCquery(project = "TCGA-LUSC",
                  data.category = "Transcriptome Profiling",
                  data.type = "Gene Expression Quantification", 
                  workflow.type = "HTSeq - FPKM-UQ",
                   barcode = c("TCGA-34-5231-01","TCGA-77-7138-01"))
GDCdownload(query.exp)
exp <- GDCprepare(query = query.exp,
                          save = TRUE, 
                          save.filename = "Exp_LUSC.rda",
                          summarizedExperiment = TRUE)
```

```{r tcgabiolinks-exp-obj, eval=TRUE}
exp

colData(exp) %>% as.data.frame %>% datatable(options = list(scrollX = TRUE), rownames = TRUE)
assay(exp)[1:5,] %>% datatable (options = list(scrollX = TRUE), rownames = TRUE)
rowRanges(exp)
```

## DNA methylation


This subsection describes how to download DNA methylation using the Bioconductor 
package [TCGAbiolinks](http://bioconductor.org/packages/TCGAbiolinks/) [@TCGAbiolinks]
from [NCI Genomic Data Commons (GDC) portal](https://portal.gdc.cancer.gov/).
In this example, we will download DNA methylation data (Infinium HumanMethylation450 platform)
for two TCGA-LUSC (TCGA Lung Squamous Cell Carcinoma) samples.
GDCquery function will search in the GDC database for the information required to 
download the data, this information is used by the `GDCdownload` function 
which will request the files to GDC, those files will be compacted into a 76 MB tar.gz file.
After the download is completed  `GDCdownload` will uncompress the tar.gz file and move 
its files to a folder; the default is GDCData/(Project)/(source)/(data.category)/(data.type)), 
in our example, it will be `GDCdata/TCGA-LUSC/harmonized/DNA_Methylation/Methylation_Beta_Value/`

![Data saved after GDCdownload is executed](figures/folder_structure.png)


Finally, `GDCprepare` transforms the downloaded data into a
[summarizedExperiment](http://bioconductor.org/packages/SummarizedExperiment/)
object [@huber2015orchestrating] or a data frame. If
*SummarizedExperiment* is set to TRUE, TCGAbiolinks will add to the
object molecular sub-type information, which was defined by The Cancer Genome
Atlas (TCGA) Research Network reports (the full list of papers can be
seen in [TCGAquery\_subtype section](http://bioconductor.org/packages/devel/bioc/vignettes/TCGAbiolinks/inst/doc/tcgaBiolinks.html#tcgaquery_subtype-working-with-molecular-subtypes-data.) in TCGAbiolinks vignette), and clinical information. 

```{r tcgabiolinks-met, eval=FALSE}
query.met <- GDCquery(project = "TCGA-LUSC", 
                      data.category = "DNA Methylation",
                      platform = "Illumina Human Methylation 450", 
                      barcode = c("TCGA-34-5231-01A-21D-1818-05","TCGA-77-7138-01A-41D-2043-05"))
GDCdownload(query.met)
met <- GDCprepare(query = query.met,
                  save = TRUE, 
                  save.filename = "DNAmethylation_LUSC.rda",
                  summarizedExperiment = TRUE)
```

The object created is a Sum
```{r tcgabiolinks-met-obj, eval=TRUE}
met

colData(met) %>% as.data.frame %>% datatable(options = list(scrollX = TRUE), rownames = TRUE)
assay(met)[1:5,] %>% datatable (options = list(scrollX = TRUE), rownames = TRUE)
rowRanges(met)
```

# Session Info

```{r sessioninfo, eval=TRUE}
sessionInfo()
```

# Bibliography