\name{gage} \Rdversion{1.1} \alias{gage} \alias{gagePrep} \alias{gageSum} \title{ GAGE (Generally Applicable Gene-set Enrichment) analysis } \description{ Run GAGE analysis to infer gene sets (or pathways, functional groups etc) that are signficantly perturbed relative to all genes considered. GAGE is generally applicable to essentially all microarray dta independent of data attributes including sample size, experimental layout, study design, and all types of heterogeneity in data generation. gage is the main function; gagePrep is the functions for the initial data preparation, especially sample pairing; gageSum carries out the final meta-test summarization. } \usage{ gage(exprs, gsets, ref = NULL, samp = NULL, set.size = c(10, 500), same.dir = TRUE, compare = "paired", rank.test = FALSE, use.fold = TRUE, FDR.adj = TRUE, weights = NULL, full.table = FALSE, saaPrep = gagePrep, saaTest = gs.tTest, saaSum = gageSum, ...) gagePrep(exprs, ref = NULL, samp = NULL, same.dir = TRUE, compare = "paired", rank.test = FALSE, use.fold = TRUE, weights = NULL, full.table = FALSE, ...) gageSum(p.results, ref = NULL, mstat = NULL, setsizes = NULL, same.dir = TRUE, compare = "paired", use.fold = TRUE, weights = NULL, full.table = FALSE, ...) } \arguments{ \item{exprs}{ an expression matrix or matrix-like data structure, with genes as rows and samples as columns. } \item{gsets}{ a named list, each element contains a gene set that is a character vector of gene IDs or symbols. For example, type \code{head(kegg.gs)}. A gene set can also be a "smc" object defined in PGSEA package. Please make sure that the same gene ID system is used for both \code{gsets} and \code{exprs}. } \item{ref}{ a numeric vector of column numbers for the reference condition or phenotype (i.e. the control group) in the exprs data matrix. Default ref = NULL, all columns are considered as target experiments. } \item{samp}{ a numeric vector of column numbers for the target condition or phenotype (i.e. the experiment group) in the exprs data matrix. Default samp = NULL, all columns other than ref are considered as target experiments. } \item{set.size}{ gene set size (number of genes) range to be considered for enrichment test. Tests for too small or too big gene sets are not robust statistically or informative biologically. Default to be set.size = c(10, 500). } \item{same.dir}{ boolean, whether to test for changes in a gene set toward a single direction (all genes up or down regulated) or changes towards both directions simultaneously. For experimentally derived gene sets, GO term groups, etc, coregulation is commonly the case, hence same.dir = TRUE (default); In KEGG, BioCarta pathways, genes frequently are not coregulated, hence it could be informative to let same.dir = FALSE. Although same.dir = TRUE could also be interesting for pathways. } \item{compare}{ character, which comparison scheme to be used: 'paired', 'unpaired', '1ongroup', 'as.group'. 'paired' is the default, ref and samp are of equal length and one-on-one paired by the original experimental design; 'as.group', group-on-group comparison between ref and samp; 'unpaired' (used to be '1on1'), one-on-one comparison between all possible ref and samp combinations, although the original experimental design may not be one-on-one paired; '1ongroup', comparison between one samp column at a time vs the average of all ref columns. For PAGE-like analysis, the default is compare='as.group', which is the only option provided in the original PAGE method. All other comparison schemas are set here for direct comparison to gage. } \item{rank.test}{ rank.test: Boolean, whether do the optional rank based two-sample t-test (equivalent to the non-parametric Wilcoxon Mann-Whitney test) instead of parametric two-sample t-test. Default rank.test = FALSE. This argument should be used with respect to argument saaTest. } \item{use.fold}{ Boolean, whether to use fold changes or t-test statistics as per gene statistics. Default use.fold= TRUE. } \item{FDR.adj}{ Boolean, whether to do adjust for multiple testing as to control FDR (False dicovery rate). Default to be TRUE. } \item{weights}{ a numeric vector to specify the weights assigned to pairs of ref-samp. This is needed for data with both technical replicates and biological replicates as to count for the different contributions from the two types of replicates. This argument is also useful in manually paring ref-samp for unpaired data, as in pairData function. function. Default to be NULL. } \item{full.table}{ Boolean, whether to output the full table of all individual p-values from the pairwise comparisons of ref and samp. This option is only effective for over 10 ref-samp pairs, all individual p-values will be output when the number is less than or equal than 10. Default to be FALSE. } \item{saaPrep}{ function used for data preparation for single array based analysis, including sanity check, sample pairing, per gene statistics calculation etc. Default to be gagePrep, i.e. the default data preparation routine for gage analysis. } \item{saaTest}{ function used for gene set tests for single array based analysis. Default to be gs.tTest, which features a two-sample t-test for differential expression of gene sets. Other options includes: gs.zTest, using one-sample z-test as in PAGE, or gs.KSTest, using the non-parametric Kolmogorov-Smirnov tests as in GSEA. The two non-default options should only be used when rank.test = FALSE. } \item{saaSum}{ function used for summarization of the results from single array analysis (i.e. pairwise comparison between ref and samp). This function should include a meta-test for a global p-value or summary statistis and a FDR adjustment for multi-testing issue. Default to be gageSum, i.e. the default data summarization routine for gage analysis. } \item{p.results }{ a numeric matrix of p-values for up-regulation (greater than) or down-regulation (less than) tests, i.e. the 2nd or 3rd list element returned by calling saaTest function. Gene sets are rows, samp-ref pairs are columns } \item{mstat }{ vector of test statistics mean for individual gene sets, the 4th list element returned by calling saaTest function. } \item{setsizes }{ vector of effective set size (number of genes) individual gene sets, the 5th list element returned by calling saaTest function. } \item{\dots}{ other arguments to be passed into the optional functions for saaPrep, saaTest and saaSum. } } \details{ We proposed a single array analysis (i.e. the one-on-one comparison) approach with GAGE. Here we made single array analysis a general workflow for gene set analysis. Single array analysis has 4 major steps: Step 1 sample pairing, Step 2 per gene tests, Step 3 gene set tests and Step 4 meta-test summarization. Correspondingly, this new main function, gage, is divided into 3 relatively independent modules. Module 1 input preparation covers step 1-2 of single array analysis. Module 2 corresponds to step 3 gene set test, and module 3 to step 4 meta-test summarization. These 3 modules become 3 argument functions to gage, saaPrep, saaTest and saaSum. The modulization made gage open to customization or plug-in routines at each steps and fully realize the general applicability of single array analysis. More examples will be included in a second vignette to demonstrate the customization with these modules. } \value{ The result returned by gage function is either a single data matrix (same.dir = FALSE, test for two-directional changes) or a named list of two data matrix (same.dir = TRUE, test for single-direction changes) for the results of up- ($greater) and down- ($less) regulated gene sets. Each data matrix here has gene sets as rows sorted by global p-values, with multiple statistics columns, including: \item{P.geomean }{geometric mean of the individual p-values from multiple single array based gene set tests} \item{stat.mean }{mean of the individual statistics from multiple single array based gene set tests} \item{P.erlang }{gloal p-value or summary of the individual p-values from multiple single array based gene set tests. This is the default p-value being used.} \item{q.BH }{FDR q-value adjustment of the global p-value using the Benjamini & Hochberg procedure implemented in multtest package. This is the default q-value being used.} \item{set.size }{the effective gene set size, i.e. the number of genes included in the gene set test} \item{other columns }{optional columns of the individual p-values from multiple single array based gene set tests} The result returned by \code{gagePrep} is a data matrix derived from \code{exprs}, but ready for column-wise gene est tests. In the matrix, genes are rows, and columns are the per gene test statistics from the ref-samp pairwise comparison. The result returned by \code{gageSum} is almost identical to the results of \code{gage} function, except without the optional columns of individual p-values and without row sorting by global p-values. } \references{ Luo, W., Friedman, M., Shedden K., Hankenson, K. and Woolf, P GAGE: Generally Applicable Gene Set Enrichment for Pathways Analysis. BMC Bioinformatics 2009, 10:161 More information at: http://sysbio.engin.umich.edu/~luow/downloads.php } \author{ Weijun Luo } \seealso{ \code{\link{gs.tTest}}, \code{\link{gs.zTest}}, and \code{\link{gs.KSTest}} functions used for gene set test; \code{\link{gagePipe}} and \code{\link{heter.gage}} function used for multiple GAGE analysis in a batch or combined GAGE analysis on heterogeneous data } \examples{ data(gse16873) cn=colnames(gse16873) hn=grep('HN',cn, ignore.case =TRUE) dcis=grep('DCIS',cn, ignore.case =TRUE) data(kegg.gs) data(go.gs) #kegg test for 1-directional changes gse16873.kegg.p <- gage(gse16873, gsets = kegg.gs, ref = hn, samp = dcis) #go.gs with the first 1000 entries as a fast example. gse16873.go.p <- gage(gse16873, gsets = go.gs, ref = hn, samp = dcis) str(gse16873.kegg.p) head(gse16873.kegg.p$greater[, 1:5]) head(gse16873.kegg.p$less[, 1:5]) #kegg test for 2-directional changes gse16873.kegg.2d.p <- gage(gse16873, gsets = kegg.gs, ref = hn, samp = dcis, same.dir = FALSE) head(gse16873.kegg.2d.p[, 1:5]) ###alternative ways to do GAGE analysis### #with unpaired samples gse16873.kegg.unpaired.p <- gage(gse16873, gsets = kegg.gs, ref = hn, samp = dcis, compare = "unpaired") #other options to tweak includes: #saaTest, use.fold, rank.test, etc. Check arguments section above for #details and the vignette for more examples. } \keyword{htest} \keyword{multivariate} \keyword{manip}