%\VignetteIndexEntry{GeneGA}
%\VignetteDepends{hash, GeneRfold, seqinr}
%\VignetteKeywords{GeneGA, GeneFoldGA}
%\VignettePackage{GeneGA}

\documentclass[a4paper]{article}

\usepackage{times}
\usepackage{natbib}

\title{Optimizing gene expression with {\tt GeneGA}}
\author{Zhenpeng Li}

\begin{document}
\bibliographystyle{plainnat}

\maketitle

\tableofcontents

\section{Introduction}

Biological engineering has driven the demand of achieving high-level expression of heterologous genes. There are many factors that can influence the gene expression, and these factors can be divided into two categories, one relating to the synonymous mutaions of gene, such as codon bias, mRNA secondary structure and the other having no relationship with synonymous mutations, such as expression vectors design, gene dosage and promoter strength. Codon bias and folding energy have been deemed as two main mechanisms of synonymous mutations to modulate the protein abundance\citep{tuller2010translation}. A recent study of expression of a diverse library of GFP gene in E.coli concluded that mRNA folding and associated rates of translation initiation play a predominant role in shaping expression levels of individual genes,  whereas codon bias influences global translation efficiency and cellular fitness\citep{kudla2009coding}. Many tools have been developed to optimize gene for increasing its expression level, such as OPTIMIZER, GeneDesign and Gene Designer, while almost all of them merely consider codon bias to optimize genes. Here, we put forward a framework to optimize gene considering both codon bias and mRNA secondary structure using Genetic algorithm. The {\tt GeneGA} package includes the information of highly expressed genes of almost 200 genomes and can be used to optimize the expression level of a gene for heterologous gene expression using rules that have been found or to explore the rules dominating gene expression.
\par
\section{Implementation}

{\tt GeneGA} uses genetic algorithm to optimize the relationship between codon bias and mRNA secondary structure. Codon adaption index(CAI) is used to quantify codon bias,  which can be computed by {\it cai} function in {\tt seqinr} package, while minimum free energy is used to quantify mRNA secondary structure, which can be computed by {\it fold} function in {\tt GeneRfold} package. Certain region can be specified to optimize the relationship between codon bias and mRNA secondary structure, while codons in the other regions will be replaced by their correspondence most preference codons. Meanwhile, {\tt GeneGA} also has the option to let the user specify the ramp region\citep{tuller2010evolutionarily}, i.e. the first 30-50 codons of genes, which has been suggested to have low translation efficiency and serve as an optimal and robust means to reduce ribosomal traffic jams. When ramp and the specified region are intersecting, the intersectant region will be optimized to have lower CAI and higher minimum free energy,  while the other region will be optimized to have higher CAI and higher minimum free energy.
\\
The GA procedure is as follows:
\\
1) Generating a population
\\
At the start, the specified sequence is translated to amino acid sequence, then popSize random sequences are generated by sampling the synonymous codon of each amino acid.
\\
2) Calculating the objective function values
\\
Calculate the value of objective function for each member of the population.
\\
If the ramp is not considered or the end of the ramp region does not lie in the selected region while ramp is considered:
\\
$E=R(CAI)^2+R(MFE)^2$,
\\
If the end of the ramp region lies in the selected region:
\\
$E=R(1/CAI_1)^2+R(CAI_2)^2+R(MFE)^2$,
\\
If the end of the ramp region lies after the position of selected region:
\\
$E=R(1/CAI)^2+R(MFE)^2$,
\\
In the formulas, R(X) represents the rank number of X in the population by increasing order. CAI and MFE denote the CAI value and minimum free energy of the member in the  population respectively, while $CAI_1$ and $CAI_2$ denote the intersectant region of ramp and selected region and the region that is not intersected respectively.
\\
3) Selection
\\
Compute the expect number of each sequence based on the objective function values, the number of that sequence in the new population is determined by the integer part of expect number, while the digit part of expect number will be undergone roulette algorithm to determine its number in the new population.
\\
4) Crossover
\\
With probability crossoverRate, two member of the population exchange their sequences at random chosen point.
\\
5) Mutation
\\
With probability mutationChance, each codon of sequence will change its codon by random sampling from its synonymous codons.
\par
\section{Functions and examples}

Users are free to choose the factors to optimize the gene. The function {\it GeneGA} considers both codon bias and mRNA secondary structure to optimize their relationship, {\it GeneFoldGA} only takes mRNA secondary structure into account and result in the largest minimum free energy of the mRNA or selected region, while {\it GeneCodon} merely optimzes the codon bias of gene. Two {\it show} methods are provided to display the results of {\it GeneGA} and {\it GeneFoldGA}, meanwhile, two {\it plotGeneGA} methods can be used to visualize the variation of maximum(minimum) and mean overall evaluation values and variable values during the optimizing progress. Moreover, the package also contains {\it wSet}, which is a data frame with 200 genomes on 64 codons. Users can also compute w table by themselves using specified highly expressed genes of given species or tissue and use the w table by adding it to {\it wSet}. For example, on the assumption that "EGFP.fasta" is file containing highly expressed gene. By using the  following codes, w table can be computed:
<<echo=TRUE>>=
library(GeneGA)
seqfile=system.file("sequence", "EGFP.fasta", package="GeneGA")
aa=read.fasta(seqfile)
rscu=uco(unlist(aa), index="rscu")
w_value=rscu # w_value is the w table we need computing
names(w_value)=names(rscu)
names=sapply(names(rscu), function(x)translate(s2c(x)))
amino=hash()
for(i in unique(names)){
	amino[[i]]=max(rscu[which(names==i)])
}
for(i in 1:length(names)){
	w_value[i]=rscu[[names(rscu)[i]]]/amino[[translate(s2c(names(rscu)[i]))]]
}
@

Taking Enhanced Green Fluorescent Protein(EGFP) as an example, we use {\it GeneGA} to optimize EGFP by both considering its codon bias and mRNA secondary structure. The procedure is as follows:
\\
1) Input the gene sequence. Users can input the sequence as string directly or read the sequence of fasta format, take fasta format sequence as example:
<<echo=TRUE>>=
seqfile=system.file("sequence", "EGFP.fasta", package="GeneGA")
seq=unlist(getSequence(read.fasta(seqfile), as.string=TRUE))
@
2) Implementation of the {\it GeneGA}, the region is specified between 1 and 60. Users can flexibly adjust the parameters to archive an ideal result. Generally, longer region needs larger popSize and iters, while larger crossoverRate and mutationChance can archive a sooner convergence of results.
<<echo=TRUE>>=
GeneGA.result=GeneGA(sequence=seq, popSize=40, iters=150, crossoverRate=0.4, mutationChance=0.1, region=c(1,60), organism="ec", showGeneration=FALSE)
@
3) Display the results and plot the variation of maximum(minimum) and mean overall evaluation values and variable values during the optimizing progress.
<<echo=TRUE>>=
show(GeneGA.result)
@
\begin{center}
<<fig=TRUE, echo=TRUE>>=
plotGeneGA(GeneGA.result, type=1)
@
\end{center}
@
\begin{center}
<<fig=TRUE, echo=TRUE>>=
plotGeneGA(GeneGA.result, type=2)
@
\end{center}
@
\begin{center}
<<fig=TRUE, echo=TRUE>>=
plotGeneGA(GeneGA.result, type=3)
@
\end{center}
@
\section{Installation notes}
The {\tt GeneGA} package depends on four other R packages: one Bioconductor package and three CRAN packages. Make sure these R packages and Vienna RNA Package(http://www.tbi.univie.ac.at/~ivo/RNA/) be installed before installing {\tt GeneGA}.
\par

\bibliography{GeneGA}

\end{document}