% \VignetteIndexEntry{KEGGprofile: Application Examples}
% \VignettePackage{KEGGprofile}
\documentclass[12pt]{article}
\textwidth 6.75in
\textheight 9.5in
\topmargin -.875in
\oddsidemargin -.06in
\evensidemargin -.06in

\usepackage{hyperref}

\hypersetup{
    colorlinks=true, %set true if you want colored links
    linktoc=all,     %set to all if you want both sections and subsections linked
    linkcolor=blue,  %choose some color if you want links to stand out
}

\begin{document}
\title{KEGGprofile: Application Examples}
\author{Shilin Zhao}
\maketitle
\begin{abstract}
In this vignette, we demonstrate the application of KEGGprofile
as an annotation and visualization tool for the analysis of multi-type
and multi-group high-throughput expression data. Compared with existing
approaches, KEGGprofile combines the KEGG pathway map with the expression
profiles of the genes in that pathway, which facilitates more detailed analysis
of specific functional changes within a pathway and of temporal correlations
among genes and samples. Here we introduce the data preparation steps
and the functions used to visualize pathway gene expression profiles.
\end{abstract}

\section{Introduction}

KEGG is a database resource for understanding high-level functions
and utilities of the biological system, such as the cell, from genomic
and molecular-level information (\href{http://www.kegg.jp/kegg/}{http://www.kegg.jp/kegg/}).
The KEGG pathway database consists of many pathway maps covering
different biological functions, including metabolism, signal transduction,
cellular processes, and disease. It is now a prominent reference knowledge
base for integration and interpretation of large-scale molecular data
sets generated by high-throughput experimental technologies. 

Many tools have been developed for KEGG pathway mapping or functional
annotation, but most of them are limited to finding significantly enriched
pathways for selected genes. To further analyze functional changes
within a pathway, some tools map selected genes onto the
pathway map, such as Color Pathway in the KEGG Mapper tools\cite{2}.
When two samples are compared directly, such as disease and
normal samples, the gene expression changes in each KEGG pathway can
be visualized, which helps in understanding the functional
changes between samples.

With the development of high-throughput experimental technologies,
the systematic analysis of complicated biological questions, such
as drug stimulation, disease progression and cell differentiation,
often involves multiple types or groups of data, including
expression in the transcriptome and proteome, in disease and normal samples,
and at different time points. However, none of the currently available
software for KEGG pathway mapping can visualize
expression profiles in such complicated analyses. The
only solutions are to generate multiple pathway maps, one per
time point, or to illustrate the expression profile manually, both of which are
inconvenient for visualization and ambiguous for functional analysis.

To address this problem, we developed the R package KEGGprofile, which
provides an easy and automatic pipeline for the analysis and visualization
of multi-type and multi-group expression data. With this package,
the expression profiles of genes and their annotation in KEGG pathway
maps can be integrated. The researcher can then focus directly
on functional changes within a pathway or on expression correlations
between different types of data. This makes it a valuable tool for
systematic profiling or time series data analysis.


\section{Example usage}


\subsection{Data preparation}

NCBI gene IDs (such as 67040 and 93683) are used in the KEGG database
to represent genes in a pathway, so the identifiers
in our expression data must first be converted into NCBI gene IDs. After this conversion,
KEGGprofile is generally applicable to genomics, transcriptomics
and proteomics data. A previously published proteome and phosphoproteome
data set covering different cell cycle phases is used as an example\cite{1}.

Example data are provided in the data directory of the package. They can
be imported into the R environment with:

<<load-the-data, tidy=T,cache=TRUE>>=
library(KEGGprofile)
data(pro_pho_expr)
data(pho_sites_count)
ls()
colnames(pro_pho_expr)
pro_pho_expr[1:3,1:4]
@

pro\_pho\_expr is a data.frame of expression profiles. Columns
1-6 are proteome data and columns 7-12 are phosphoproteome data. The
6 time points are G1, G1/S, Early S, Late S, G2 and Mitosis. When several
phosphorylation sites map to the same gene, the one with the largest
variation across the 6 time points is kept. pho\_sites\_count is a data.frame
with the number of phosphorylation sites quantified for each gene.
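
The site selection rule described above can be sketched in base R. This is only an illustration on toy data, not the code used to build pro\_pho\_expr: the helper 'keep\_most\_variable\_site', the toy values and the use of the variance as the measure of variation are all assumptions.

<<sketch-select-site, tidy=FALSE, eval=TRUE>>=
## Toy phosphosite table: two sites for gene 983, one site for gene 891.
## All values and the helper name are hypothetical, for illustration only.
sites <- data.frame(
  gene   = c("983", "983", "891"),
  G1     = c(0.1, -1.0, 0.3),
  G1S    = c(0.2, 0.5, 0.1),
  EarlyS = c(0.1, 2.0, -0.2),
  stringsAsFactors = FALSE
)
keep_most_variable_site <- function(x, gene_col = "gene") {
  value_cols <- setdiff(colnames(x), gene_col)
  ## variance of each site across the time points (assumed variation measure)
  site_var <- apply(as.matrix(x[, value_cols, drop = FALSE]), 1, var, na.rm = TRUE)
  ## for each gene, the row index of the site with the largest variance
  best <- tapply(seq_len(nrow(x)), x[[gene_col]],
                 function(i) i[which.max(site_var[i])])
  out <- x[as.vector(best), value_cols, drop = FALSE]
  ## gene IDs become the row names, as expected for the expression data
  row.names(out) <- names(best)
  out
}
selected <- keep_most_variable_site(sites)
selected
@

The result has one row per gene, with the gene IDs as row names.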

The NCBI gene IDs should be the row.names of every data.frame. If your expression data do not use NCBI gene IDs, you need to convert them first. We provide a function called 'convertId' for this purpose.

<<convertIdExample, tidy=T,cache=FALSE>>=
example(convertId)
@

In addition, the package requires the original KEGG pathway maps as backgrounds
and the KGML (KEGG XML) files to extract the gene locations in the pathway
maps. These files can be downloaded from the KEGG website (\href{http://www.kegg.jp/kegg/}{http://www.kegg.jp/kegg/}),
and we also provide a function called \textquoteleft{}download\_KEGGfile\textquoteright{}
to do so. The following command downloads the pathway map and KGML file for human pathway
'04110' to the working directory:

<<download_needed_files,tidy=T,cache=TRUE>>=
download_KEGGfile(pathway_id="04110",species='hsa')
@

Here pathway\_id can be set to \textquoteleft{}all\textquoteright{},
in which case all pathway ids for human are extracted from
the KEGG.db package and the related files are downloaded.

\subsection{Find enriched pathways}

The function 'find\_enriched\_pathway' can be used to find enriched
pathways for genes of interest. These genes can be selected
in several ways, such as genes that respond to a specific stimulation
or genes with negative correlation between disease and normal samples.
The results of statistical tests can also be used. The selected
genes are then annotated with the KEGG pathway database, and hypergeometric
tests are used to estimate the significance of enrichment. In addition,
a criterion on the number of annotated genes in a pathway can
be used for pathway selection.
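
As a rough illustration of the hypergeometric test for a single pathway, the upper-tail probability can be computed with the base R function 'phyper'. The counts below are hypothetical, and this sketch is not necessarily identical to the computation inside 'find\_enriched\_pathway'.

<<sketch-hypergeometric, tidy=FALSE, eval=TRUE>>=
## Hypothetical counts, for illustration only
N <- 20000  # background genes with a KEGG annotation
K <- 120    # background genes annotated to the pathway
n <- 300    # selected genes with a KEGG annotation
k <- 8      # selected genes annotated to the pathway
## P(X >= k): probability of seeing at least k pathway genes
## among n genes drawn at random from the background
p_value <- phyper(k - 1, K, N - K, n, lower.tail = FALSE)
p_value
@

The first argument is k - 1 because 'phyper' with lower.tail = FALSE returns the probability of strictly more than its first argument.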

The 'find\_enriched\_pathway' function has an important parameter, 'download\_latest'. As the KEGG.db package was only updated until 2012, setting 'download\_latest' to TRUE downloads the latest gene and pathway links from the KEGG database. This is especially important for users interested in non-model organisms that were imported into KEGG after 2012.

Here we used highly phosphorylated proteins as candidates for
annotation. The criterion was at least 10 quantified
phosphorylation sites.

<<find-enriched-pathways,tidy=T,eval=TRUE,cache=TRUE>>=
genes<-row.names(pho_sites_count)[which(pho_sites_count>=10)]
pho_KEGGresult<-find_enriched_pathway(genes,species='hsa')
pho_KEGGresult[[1]][,c(1,5)]
@

Then we compared the correlations between protein and phosphorylation levels for the pathways enriched in highly phosphorylated proteins.

<<pathways_cprrelation,tidy=T,eval=TRUE,cache=TRUE>>=
plot_pathway_cor(gene_expr=pro_pho_expr,kegg_enriched_pathway=pho_KEGGresult)
@
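
To make the quantity being compared more concrete, the following sketch computes a per-gene Pearson correlation between two blocks of columns on a toy matrix. It only assumes that the two data types are stored side by side, as in pro\_pho\_expr; the toy values are hypothetical, and the sketch does not claim to reproduce the internals of 'plot\_pathway\_cor'.

<<sketch-gene-correlation, tidy=FALSE, eval=TRUE>>=
## Toy matrix: 3 genes, columns 1-3 for one data type and
## columns 4-6 for the other (hypothetical values)
toy <- rbind(
  "983" = c(1, 2, 3, 2, 4, 6),
  "891" = c(1, 2, 3, 3, 2, 1),
  "990" = c(1, 3, 2, 1, 3, 2)
)
## per-gene correlation between the two blocks of time points
gene_cor <- apply(toy, 1, function(x) {
  cor(x[1:3], x[4:6], use = "pairwise.complete.obs")
})
gene_cor
@

For pro\_pho\_expr the two blocks would be columns 1-6 and 7-12.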

As the example data here come from a study of different
cell cycle phases, the Cell cycle pathway (pathway id 04110) was further
visualized.

\subsection{Visualization of expression profile on KEGG maps}

In each KEGG pathway map, genes are represented by polygons, and biological
relations between genes, such as activation or phosphorylation, are
represented by lines. The function 'plot\_pathway' can be used to
draw the expression profiles in the pathway map in place of the
original gene polygons. There are two visualization methods to represent
gene expression profiles: \textquotedblleft{}background\textquotedblright{}
and \textquotedblleft{}lines\textquotedblright{}. The first is
suited to analyses with only one sample or one type of data.
It divides the gene polygon into several sub-polygons, one per
time point, and each sub-polygon has a background
color that represents the expression change at that time point.

We used the phosphoproteome changes at 6 time points as an example.
First, the function 'col\_by\_value' was used to transform the expression
differences between samples into colors. After that, the
'plot\_pathway' function can be used to visualize the gene expression profiles
in the KEGG pathway. A pathway map named 'hsa04110\_profile\_bg.png'
is generated in the working directory.


<<Visualization-bg,tidy=T,fig.keep='none',eval=TRUE,cache=TRUE>>=
## the phosphoproteome data
pho_expr<-pro_pho_expr[,7:12]
temp<-apply(pho_expr,1,function(x) length(which(is.na(x))))
pho_expr<-pho_expr[which(temp==0),]
## transform the expression difference into specific color
col<-col_by_value(pho_expr,col=colorRampPalette(c('green','black','red'))(1024),range=c(-6,6))
## visualization by method 'bg'
temp<-plot_pathway(pho_expr,type="bg",bg_col=col,text_col="white",magnify=1.2,species='hsa',database_dir=system.file("extdata",package="KEGGprofile"),pathway_id="04110")
@
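
The idea behind mapping expression values to colors over a fixed range can be sketched in base R as follows. The helper 'value\_to\_color' is hypothetical and is not the implementation of 'col\_by\_value'; in particular, clamping values outside the range to its limits is an assumption of this sketch.

<<sketch-color-mapping, tidy=FALSE, eval=TRUE>>=
## Hypothetical helper: map numeric values to colors on a fixed range
value_to_color <- function(x, palette, range) {
  ## clamp values outside the range to its limits (assumption)
  x <- pmin(pmax(x, range[1]), range[2])
  ## rescale to 1..length(palette) and round to the nearest index
  idx <- round((x - range[1]) / (range[2] - range[1]) *
                 (length(palette) - 1)) + 1
  palette[idx]
}
pal <- colorRampPalette(c("green", "black", "red"))(1024)
cols <- value_to_color(c(-8, -6, 0, 6), palette = pal, range = c(-6, 6))
cols
@

With this palette, strongly negative values are green, values near zero are close to black and strongly positive values are red.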


The second method plots lines with different colors in the gene polygon
to represent different samples or different types of data. The shape
of each line is determined by the profile of the gene across the
time points. Background colors can also be added to the pathway
map to provide more biological information, such as p values and subcellular
locations.

The proteome and phosphoproteome changes were used as an example for
the method 'lines'. First, the function 'col\_by\_value' was used to
transform the number of phosphorylation sites quantified for each
gene into a color used as the background of each gene polygon.
Then the 'plot\_pathway' function was called, and a pathway
map named 'hsa04110\_profile\_lines.png' is generated in the
working directory.

<<Visualization-lines,tidy=T,fig.keep='none',eval=TRUE,cache=TRUE>>=
## transform the number of phosphorylation sites into specific color
col<-col_by_value(pho_sites_count,col=colorRampPalette(c('white','khaki2'))(4),breaks=c(0,1,4,10,Inf))
## visualization by method 'lines'
temp<-plot_pathway(pro_pho_expr,type="lines",bg_col=col,line_col=c("brown1","seagreen3"),groups=c(rep("Proteome",6),rep("Phosphoproteome",6)),magnify=1.2,species='hsa',database_dir=system.file("extdata",package="KEGGprofile"),pathway_id="04110",max_dist=5)
@


In this section, we only used the background colors of gene polygons
to represent the number of phosphorylation sites. The colors
of the gene name (text\_col) and the gene polygon border (border\_col) can
also be determined by the function 'col\_by\_value' to represent
other important biological information, such as subcellular locations
or correlation between samples.
Here we only demonstrated the application to gene expression data. Compound data are also supported by KEGGprofile; see the examples of the 'plot\_pathway' function for more details.


\section{More details}

To make the visualization process easier, the function 'plot\_pathway'
is a wrapper for the download\_KEGGfile, parse\_XMLfile
and plot\_profile functions. First, the existence of the KEGG pathway
map files (.xml and .png) is checked in database\_dir. If they are
missing, the download\_KEGGfile function is used to download
them. Then the function parse\_XMLfile parses the xml
file into a matrix containing the genes in this pathway with their
names, locations, etc. Finally, the function 'plot\_profile'
generates the pathway map.
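
The first step of this workflow, checking whether the pathway files already exist, can be sketched in base R. The helper 'missing\_pathway\_files' is hypothetical, and the file naming scheme (species code followed by the pathway id, with .xml and .png extensions) is an assumption of this sketch rather than a documented guarantee of the package.

<<sketch-file-check, tidy=FALSE, eval=TRUE>>=
## Hypothetical helper: list the pathway files missing from a directory.
## The "<species><pathway_id>.xml/.png" naming is an assumption.
missing_pathway_files <- function(pathway_id, species, database_dir) {
  files <- file.path(database_dir,
                     paste0(species, pathway_id, c(".xml", ".png")))
  files[!file.exists(files)]
}
## demo in a temporary directory that contains only the xml file
demo_dir <- tempfile("kegg_demo")
dir.create(demo_dir)
invisible(file.create(file.path(demo_dir, "hsa04110.xml")))
still_missing <- missing_pathway_files("04110", "hsa", demo_dir)
basename(still_missing)
@

In a workflow like the one described above, a non-empty result would be the signal to download the files before parsing.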


\bibliographystyle{plain}
\addcontentsline{toc}{section}{\refname}\bibliography{Ref}


\end{document}