---
title: "Introduction to iGC"
author: "iGC Developers"
date: "`r Sys.Date()`"
output:
  BiocStyle::html_document:
    toc: true
vignette: >
  %\VignetteIndexEntry{Introduction to iGC}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---
This document guides one through all available functions of the `iGC` package. Package iGC aims to analyze gene expression (GE) and copy number alteration (CNA) (iGC) concurrently.


Traditional CNA analysis method is to investigate different types of samples and integrate their results by Venn diagrams. Challenges arise, however, when the low reproducibility and inconsistency are observed across multiple platforms. To address these issues, iGC tests gene expression profiles and copy number variation simultaneously.

For more information about the method iGC uses, please refer to our publication: Yi-Pin Lai, Liang-Bo Wang, Liang-Chuan Lai, Mong-Hsun Tsai, Tzu-Pin Lu, Eric Y Chuang. iGC--an integrated analysis package of Gene expression and Copy number alteration, *Bioinfomatics* (publication pending).


## Installation

iGC is on Bioconductor and can be installed following standard installation procedure.

```{r, eval=FALSE}
source("http://www.bioconductor.org/biocLite.R")
biocLite("iGC")
```

To use,

```{r}
library(iGC)
```


## General Workflow

The general workflow can be summarized as follows,

![](pics/analysis_workflow.png)

Basically there are four steps, corresponding to four R functions, to complete the analysis:

1. Specify sample and their data relations
2. Read and organize gene expression files
3. Read and organize CNA files based on gene regions
4. Identify CNA-driven differentially expressed genes


## Data Source

As shown in the workflow, samples of *paired* CNA and gene expression data are
required for the analysis. Since iGC reads CNA and gene expression data at gene
level or chromosome location, data sources are platform and technology agnostic,
that is, data from either microarrary or next-generation sequencing are
acceptable. However, it also implies some standard preprocessing steps are
required to convert the raw readings into interpretable data.

### Gene expression

For each sample, the data should have at least two columns, gene and expression:

```raw
GENE  Expression
A     0.1
B     -0.5
C     0.4
```

iGC reads all the samples' gene expression by `create_gene_exp` and make a joint
gene expression table,

```raw
GENE  SampleA  SampleB ...
A         0.1     0.5
B        -0.5     2.1
C         0.4      NA
```

That means, if such joint expression table exists, one could skip this step.

### CNA

iGC accepts two kinds of the CNA data format, chromsome location based and gene 
based, which are handled by `create_gene_cna` and `direct_gene_cna` 
respectively. For chromosome location based CNA records, each sample's data
should has at least the following columns,

```raw
Choromsome  Start  End    Expression
1           1000   5030          2.5
10          10     2560         -1.3
X           12345  14200         3.3
```

iGC will convert this format of records into gene based format by looking up the
genome reference, which currently supports hg19 only (see Q&A for more info and
future developemnt plan).

For gene based CNA records, data format resembles the gene expression data,

```raw
GENE   Expression
A             2.5
B             2.5
C            -1.3
```

In the end, iGC reads all samples' CNA records and make a joint CNA table.
Different from the gene expression processing, CNA records are additionally
converted to CNA gain/loss events by given thresholds.

```raw
GENE  SampleA  SampleB  ...
A           0        0
B          -1       -1
C           0        1
```

If such joint table already existed one could skip this step as well.

### Custom reader function

To support various kinds of data formats, all `create_gene_exp`, 
`create_gene_cna`, and `direct_gene_cna` accept an arugment `read_fun`, a 
function to read sample data, to customize how the data should be read. Please 
refer to their documentation to find out the implementation details.


## Usage Example

Here we show an example using the data shipped together with iGC.

### Example Data Source

To demonstrate the usage of the `iGC` package, the package comes with 50 breast
cancer samples, which are selected from a microarray dataset of total 523 breast
cancer samples from TCGA level 3.

For each sample, it contains a paired gene expression (GE) and the copy number
(CN) data. The GE data was conducted by Agilent G4502A platform and the CN data
was generated from Genome Wide SNP platform.


### Sample Description Generation

First, a sample description table is created to connect sample names with their 
data. It can be stored as a CSV file with three columns: 
`Sample`, `CNA_filepath`, and `GE_filepath`.

```{r}
sample_desc_pth <- system.file("extdata", "sample_desc.csv", package = "iGC")
sample_desc <- create_sample_desc(sample_desc_pth)
```

Alternatively, one can pass three separate character vectors to create the same
table,

```{r, eval=FALSE}
sample_desc <- create_sample_desc(
  sample_names = sample_desc$Sample,
  cna_filepaths = sample_desc$CNA_filepath,
  ge_filepaths = sample_desc$GE_filepath
)
```

```{r}
head(sample_desc)
```


### Joint Gene Expression Table

Second, we join all samples' gene expression as one table.

```{r}
gene_exp <- create_gene_exp(sample_desc, progress = FALSE)
```

`create_gene_exp` comes with a builtin reader function to read in the gene
expression data. However for formats that it fails to recognize, one can define
one's own reader function through `read_fun`,

```{r, eval=FALSE}
gene_exp <- create_gene_exp(
  sample_desc,
  read_fun = read.table,
  progress = TRUE, progress_width = 60,
  # arugments passed to the customized read_fun (here is read.table)
  header = FALSE,
  skip = 2,
  na.strings = "null",
  colClasses = c("character", "double")
)
```

Note that arguments `header`, `skip`, `na.srings`, and `colClasses` are not used
by `create_gene_exp` but passed to `read.table`, the custom reader function
defined here, directly.

We select the expression of gene TP53, BRCA1, and NFKB1 for the first 9 samples
(first column contains gene names).

```{r}
gene_exp[GENE %in% c('TP53', 'BRCA1', 'NFKB1'), 1:10, with=FALSE]
```


### Joint CNA Status Table Mapped onto Gene Locations

Thirdly, the CNA data is read, collected, and mapped on to human gene locations
using genome reference hg19. Each gene will be evaluated as CNA-gain (1),
CNA-loss (-1), and neutral (0). Threshold can be set to tune the level of CNA
determined as gain or loss.

Here we set the threshold of 2.4 for gain and 1.6 for loss events.

```{r}
my_cna_reader <- function(cna_filepath) {
  cna <- data.table::fread(cna_filepath, sep = '\t', header = TRUE)
  cna[, .(Chromosome, Start, End, Segment_Mean)]
}

gain_loss = log2(c(2.4, 1.6)) - 1
gene_cna <- create_gene_cna(
  sample_desc,
  gain_threshold = gain_loss[1], loss_threshold = gain_loss[2],
  read_fun = my_cna_reader,
  progress = FALSE
)
gene_cna[GENE %in% c('TP53', 'BRCA1', 'NFKB1'), 1:10, with=FALSE]
```

#### Parallelization

For performance issues, one can enable parallelization to boost the process.

```{r, eval=FALSE}
# Change 4 to match one's total CPU cores
doMC::registerDoMC(cores = 4)
gene_cna <- faster_gene_cna(
  sample_desc, gain_loss[[1]], gain_loss[[2]], parallel = TRUE
)
```

#### Read gene information directly from data

If one's CNA data already contains gene information, try `direct_gene_cna`.

### CNA-driven Differentially Expressed Genes Identification

Lastly, run `find_cna_driven_gene` to identify differentially expressed genes
driven by CNAs from samples with both proprocessed GE `gene_exp` and CNA data
`gene_cna` that were obtained from the previous steps.

A threshold for proportion of the copy number changed samples (gain or loss) is
given to select CNA-driven genes.

```{r}
cna_driven_genes <- find_cna_driven_gene(
  gene_cna, gene_exp,
  gain_prop = 0.15, loss_prop = 0.15,
  progress = FALSE, parallel = FALSE
)
head(cna_driven_genes$gain_driven)
head(cna_driven_genes$loss_driven)
head(cna_driven_genes$both)
```

For example, BRCA1 appears to be CNA loss-driven in this dataset, and its
expression is lower in samples of CNA loss than samples of CNA neutral.

```{r}
cna_driven_genes$loss_driven[GENE %in% c('BRCA1')]
```


## Q and As

#### Q: Why required to use the bundled hg19 human genome reference?

In the early phase of development, iGC required a special data structure for
genome reference hence one is bundled. Now no such special structure is
required, so we plan to relax such constraint in the coming-up release and user
will be able to pass in other references available on Bioconductor.

Currently the modified `hg19DBNM` contains RefSeq transcripts of hg19 from UCSC
Genome Browser. The transcripts with NM marker ID and protein coding, were
selected.