--- title: "fCI" author: "Shaojun Tang" output: BiocStyle::html_document: toc: true --- # `fCI` `fCI` (f-divergence Cutoff Index), identifies DEGs by computing the difference between the distribution of fold-changes for the control-control and remaining (non-differential) case-control gene expression ratio data.: `fCI` provides several advantages compared to existing methods. Firstly, it performed equally well or better in finding DEGs in diverse data types (both discrete and continuous data) from various omics technologies compared to methods that were specifically designed for the experiments. Secondly, it fulfills an urgent need in the omics research arena. The increasingly common proteogenomic approaches enabled by rapidly decreasing sequencing costs facilitates the collection of multi-dimensional (i.e. proteogenomics) experiments, for which no efficient tools have been developed to find co-regulation and dependences of DEGs between treatment conditions or developmental stages. Thirdly, fCI does not rely on statistical methods that require sufficiently large numbers of replicates to evaluate DEGs. Instead fCI can effectively identify changes in samples with very few or no replicates. * **REQUIRED** (Equivalent to an `ERROR` in `R CMD check`). This means the package is missing something critical and it cannot be accepted into Bioconductor until the issue is fixed. * **RECOMMENDED** (Equivalent to a `WARNING` in `R CMD check`). We strongly encourage that these issues be fixed. In the weeks leading up to a Bioconductor release we will ask package authors to fix these issues. * **CONSIDER**: Not necessarily something bad, just something we wanted to point out. package authors don't need to take action on these, but they can. These are analagous to `NOTE`s in `R CMD check`, though unlike `NOTE`s, there is no expectation that they will increase in severity over time. # Using `fCI` `fCI` can also be run interactively: ```{r} library(fCI) suppressWarnings(library(fCI)) suppressWarnings(library(psych)) ``` # Installing `fCI` `fCI` should be installed as follows: ```{r eval=FALSE} source("http://bioconductor.org/biocLite.R") biocLite("fCI") library(fCI) ``` ## Dependency Checks * **Checking if other packages can import this one...** Checks to make sure that there will be no import problems if another package imports your package. (`REQUIRED`) * **Checking to see if we understand object initialization....** Reports if it can't figure out how objects were initialized (`CONSIDER`). # Examples on how to use `fCI` `fCI` can be used as the follows: ## Environment Setup: * **Load the example sample data** provided by fCI package under data/. This data contains gene/protein expression values with columns representing samples/lanes/replicates, and rows representing genes. * **In this example**, a total of 3 control replicates (column 1 through 3) and 3 case replicates (columns 4 through 6) is shown. ## Load the data ```{r, results='asis'} fci.data=data.frame(matrix(sample(3:100, 1043*6, replace=TRUE), 1043,6)) ``` ## Finding Differentially Expressed Genes: * To find differentially expressed genes** using column 1 to 3 as controls and 4 to 6 as case replicates, execute the following function call : ```{r} suppressWarnings(library(gtools)) targets=fCI.call.by.index(c(1,2,3), c(4,5,6), fci.data) head(targets) ``` ## Result Interpretation * The output will be the genes (ids) that are differentially expressed and its frequency (a ratio tells how often each gene is shown to be differentially expressed in all fCI combinations). *In general, the higher the frequency (the ratio) tells us how likely a gene is differentially expressed * For example, A ratio of 75% means the gene under study is shown to be a dysregulated target in 3 out of 4 fCI pairwise analysis. *As fCI is coded using object oritented programming, all computations are based on object manipulation. *You will be able to perform very versatile analysis by changing the software parameters and altering the options in computing the dysresgulated genes. ## Illustrate the fCI analysis in details * First, create an object -- fci, which holds the gene expression levels, default model parameters. ```{r} fci=new("NPCI") ``` * next, you need to provide gene/protein expression values for subsequent analysis. method 1): assign an existing data frame (of gene expression values) in R console directly to the object, for example : ```{r eval=FALSE} fci@sample.data.normalized=fci.data** ``` * or simply provide the pathname for the file (tab-delimited file with rownames) that contains the expression values. ```{r eval=FALSE} **fci@sample.data.file="c://home//fci_data.txt"** ``` ## Assign data to fCI object assign the built-in dataset to the object ```{r} if(dim(fci.data)[1]>0){ fci@sample.data.normalized=fci.data } ``` after you assign the data to the object, you should initialize the object fci (which will remove genes with zero expression values and intialize the object parameters) and then assign the fci's control replicates (fci@wt.index) with two sample column ids (i.e. 1 & 2) for constructing empirical null distribution, and assign fci's control-case replicates (fci@df.index) with two sample column ids (i.e. 1 & 5) for control-case distribution. ```{r} if(dim(fci.data)[1]>0){ fci=initialize(fci) fci@wt.index=c(1,2) fci@df.index=c(1,4) } ``` next, you will perform the formal fCI analysis by calling the following functions to the object. ```{r} if(dim(fci.data)[1]>0){ fci =populate(fci ) fci =compute(fci ) fci =summarize(fci ) } ``` after the divergence was computed, fCI will generate the final results and saved to the 'result' field, which included the total number of differentially expressed genes, optimal fold cutoff, and divergence value. In addition,you will be able to see the differentially expressed gene ids (indicated by row numbers) and figures about the empirical-null vs control-case distribution. since fCI uses object oriented programming, you can easily change the field s of the objects, such as fold cutoff list, the column ids for emprical null distributions (and/or control-case distribution) to evaluate differential expression on the samples of interest. After changes are made, rerun step 6 to compute new analysis results. ```{r} if(dim(fci.data)[1]>0){ fci@fold.cutoff.list=list(seq(from=1.2, to=5, by=0.2)) fci@wt.index=c(2,3) fci@df.index=c(2,5) } ``` ## multi-dimensional data if the dataset contains multiple control replicates and case replicates, you don't need to form the these combinations and perform fCI individually. Instead, you could invoke fci on a top-level function that will automatically perform the analysis on given control replicate column ids and case/experimental column ids. Given the dataset that contains 3 control replicates and 3 case replicates, fci will form possible 3 control-control (empirical null) distributions, namely 1-2, 1-3 and 2-3 to construct empirical null ratio distribution, and 9 control-case ratio distribution, namely 1-4, 1-5, 1-6, 2-4, 2-5, 2-6, 3-4, 3-5, and 3-6. Then fci will pick one distribution from the control-control, and one distribution fro control-case (i.e., 1-2 & 1-5) to form a valid fci analysis. There are a total of 3*9=27 unique fci analysis for this example. ```{r} fci=new("NPCI") filename="" if(file.exists("../inst/extdata/Supp_Dataset_part_2.txt")){ filename="../inst/extdata/Supp_Dataset_part_2.txt" }else if(file.exists("../../inst/extdata/Supp_Dataset_part_2.txt")){ filename="../../inst/extdata/Supp_Dataset_part_2.txt" } if(nchar(filename)>3){ fci=find.fci.targets(fci, c(1,2,3), c(4,5,6), filename, use.normalization=FALSE) result=show.targets(fci) head(result, 20) } ``` results from analysis on step will be identical to results shown in step 1). However, you have more flexibility changing the parameters, and printing the figures for sample of interest. ## Normalization fCI enables users lot of flexibilities to perform differential expression. For example, the users could choose to normalize the each replicate's gene expression based on total-library normalization, trimed-normalization, median normalization and so on. In addition, fCI enables two fundamental options for divergence estimation. The first method is the helliger distance estimation (the default option) which assumes the log-ratio expression values to follow the gaussian distribution. The second method is the cross-entropy which relax the condition. ```{r eval=FALSE} library(fCI) fci=normalization(fci) ``` ## identify a specific sample for differential expression ```{r} if(dim(fci@sample.data.normalized)[1]>100 & dim(fci@sample.data.normalized)[2]>3){ fci@wt.index=c(1,2) fci@df.index=c(1,4) fci@method.option=1 fci =populate(fci ) fci =compute(fci ) fci =summarize(fci ) } ``` ## Time-course analysis besides performing differential expression analysis using transcriptomic and/or proteomic data, fCI enables the users to perform jointly analysis using multi-dimensional data. By multi-dimensional, we refer to data that has been generated for multiple related samples, i.e. time course, different tissues, cell types or in cases where both transcriptomic and proteomic data are available. ```{r} if(nchar(filename)>3){ fci=new("NPCI") fci=find.fci.targets(fci, c(1,2,3), c(4,5,6), "../inst/extdata/Supp_Dataset_part_2.txt", use.normalization=FALSE) result=show.targets(fci) head(result, 20) } ```