%\VignetteIndexEntry{Validate genomic data with "DupChecker" package}
%\VignettePackage{DupChecker}
%\VignetteEngine{knitr::knitr}

\documentclass[12pt]{article}
\usepackage{hyperref}
\usepackage{graphicx}

\author{Quanhu Sheng$^{*}$, Yu Shyr, Xi Chen \\[1em] \small{Center for 
Quantitative Sciences, Vanderbilt University, Nashville, USA} \\ 
\small{\texttt{$^*$shengqh (at) gmail.com}}}

\title{DupChecker: a bioconductor package for checking high-throughput 
genomic data redundancy in meta-analysis}

\begin{document}

\maketitle

\begin{abstract}
Meta-analysis has become a popular approach for high-throughput genomic data 
analysis because it can often significantly increase the power to detect 
biological signals or patterns in datasets. However, when publicly available 
databases are used for meta-analysis, sample duplication is a frequently 
encountered problem, especially for gene expression data. Failing to remove 
duplicates can lead to false positive findings, misleading clustering 
patterns, or model over-fitting in subsequent data analysis.

We developed a Bioconductor package, DupChecker, that efficiently identifies 
duplicated samples by generating MD5 fingerprints for the raw data files. A 
real data example demonstrates the usage and output of the package.

Researchers may not pay enough attention to checking and removing duplicated 
samples, and the resulting data contamination can make the results or 
conclusions of a meta-analysis questionable. We suggest applying DupChecker 
to examine all gene expression data sets before any data analysis step.

In this vignette, we demonstrate the application of DupChecker for checking 
high-throughput genomic data redundancy in meta-analysis. DupChecker downloads 
GEO/ArrayExpress raw data files from the NCBI/EBI FTP servers, extracts the 
individual data files, calculates an MD5 fingerprint for each data file, and 
validates the redundancy of those data files.

\vspace{1em}

\textbf{DupChecker version:} \Sexpr{packageDescription("DupChecker")$Version}

\vspace{1em}

\begin{center}
    \begin{tabular}{ | l | }
      \hline 
      If you use DupChecker in published research, please cite:  \\
      \\
Quanhu Sheng, Yu Shyr, Xi Chen: DupChecker: a bioconductor package for checking high-throughput genomic \\
data redundancy in meta-analysis. BMC Bioinformatics 2014, 15:323. \\
      \hline 
    \end{tabular}
  \end{center}
\end{abstract}

\tableofcontents

\section{Introduction}

Meta-analysis has become a popular approach for high-throughput genomic data 
analysis because it can often significantly increase the power to detect 
biological signals or patterns in datasets. However, when publicly available 
databases are used for meta-analysis, sample duplication is a frequently 
encountered problem, especially for gene expression data. Failing to remove 
duplicates would make study results questionable. We developed the DupChecker 
package, which efficiently identifies duplicated samples by generating an MD5 
fingerprint for each individual data file.

\section{Standard workflow}

\subsection{Quick start}

Here we show the most basic steps of a validation procedure. You need to 
create a target directory in which to store the data. To illustrate the 
procedure, we use two very small datasets without redundant CEL files. You 
may also try a real example with redundancy in Section~\ref{realexample}. 

<<quick, echo=TRUE, message=FALSE, warning=FALSE, tidy=FALSE>>=
library(DupChecker)

## Create a "DupChecker" directory under temporary directory
rootDir<-paste0(dirname(tempdir()), "/DupChecker")
dir.create(rootDir, showWarnings = FALSE)
message(paste0("Downloading data to ", rootDir, " ..."))

geoDownload(datasets=c("GSE1478"), targetDir=rootDir)
arrayExpressDownload(datasets=c("E-MEXP-3872"), targetDir=rootDir)

datafile<-buildFileTable(rootDir=rootDir, filePattern="cel$")
result<-validateFile(datafile)
if(result$hasdup){
  duptable<-result$duptable
  write.csv(duptable, file=paste0(rootDir, "/duptable.csv"))
}
@

\subsection{GEO/ArrayExpress data download}

First, the functions geoDownload and arrayExpressDownload download raw data 
from the NCBI and EBI FTP servers for the datasets the user provides. Once 
the compressed raw data have been downloaded, the individual data files are 
extracted. Some of these data files may themselves be compressed; since files 
compressed from the same data file by different software will have different 
MD5 fingerprints, DupChecker decompresses such files and validates the file 
redundancy using the decompressed data files.

<<download, echo=TRUE, message=FALSE, warning=FALSE, tidy=FALSE>>=
datatable<-geoDownload(datasets = c("GSE1478"), targetDir=rootDir)
datatable

datatable<-arrayExpressDownload(datasets=c("E-MEXP-3872"), targetDir=rootDir)
datatable
@
The returned datatable is a data frame containing the dataset names and the 
number of files in each dataset.

There are two situations in which you may want to download and decompress the 
data with external tools instead: 1) the download or decompression takes too 
much time within the R environment; 2) for very large files, the internal R 
tools may produce an incomplete or interrupted download, which causes the 
subsequent "untar" or "gunzip" step to fail. In either case you can prepare 
the files yourself and validate them directly, as sketched below.
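
For example, assuming the raw archive has already been fetched and unpacked 
outside of R, you can skip the download functions and point buildFileTable 
directly at the extracted directory. A minimal sketch; the FTP URL in the 
comments follows GEO's usual layout but is only illustrative, and the shell 
commands are placeholders for whatever external tools you prefer:

<<externalTools, eval=FALSE, tidy=FALSE>>=
## Sketch only: fetch and extract outside of R, e.g. in a shell:
##   wget ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE1nnn/GSE1478/suppl/GSE1478_RAW.tar
##   mkdir GSE1478 && tar -xf GSE1478_RAW.tar -C GSE1478
##   gunzip GSE1478/*.gz

## Then validate the already-extracted files directly:
datafile<-buildFileTable(rootDir="GSE1478", filePattern="cel$")
result<-validateFile(datafile)
@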

\subsection{Build file table}

Second, the function buildFileTable finds all files (the default) or only the 
files matching the filePattern parameter in the subdirectories of the root 
directory. The resulting data frame contains two columns: dataset (the 
directory name) and file name. The rootDir argument can also be a vector of 
directories, as sketched after the following chunk. In this example, we focus 
on Affymetrix CEL files only.

<<buildFileTable, message=FALSE, warning=FALSE>>=
datafile<-buildFileTable(rootDir=rootDir, filePattern="cel$")
@
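
Since rootDir accepts a vector of directories, files downloaded to separate 
locations can be combined into a single file table. A minimal sketch, where 
"dirA" and "dirB" are hypothetical directory paths:

<<buildFileTableMulti, eval=FALSE, tidy=FALSE>>=
## Combine CEL files found under two separate download
## locations into one file table; "dirA" and "dirB" are
## placeholder paths.
datafile<-buildFileTable(rootDir=c("dirA", "dirB"),
                         filePattern="cel$")
@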

\subsection{Validate file redundancy}

The function validateFile calculates an MD5 fingerprint for each file in the 
table and then checks whether any two files have the same fingerprint. Files 
with the same fingerprint are treated as duplicates. The function returns a 
table containing all duplicated files and their datasets.

<<validateFile, message=FALSE, warning=FALSE>>=
result<-validateFile(datafile)
if(result$hasdup){
  duptable<-result$duptable
  write.csv(duptable, file=paste0(rootDir, "/duptable.csv"))
}
@
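
The MD5 fingerprint used here is the standard file checksum, so a reported 
duplicate pair can be spot-checked by hand with tools::md5sum from base R. 
A minimal sketch; the two file paths are placeholders, with file names taken 
from the duplicate pair shown in Table \ref{tab:duptable}:

<<md5check, eval=FALSE, tidy=FALSE>>=
## Two files are flagged as duplicates exactly when their MD5
## checksums agree; verify one reported pair manually
## (the paths below are placeholders):
md5a<-tools::md5sum("GSE14333/GSM358397.CEL")
md5b<-tools::md5sum("GSE17538/GSM437169.CEL")
identical(unname(md5a), unname(md5b))
@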

\subsection{Real example}
\label{realexample}

Three colon cancer related datasets (GSE14333, GSE13067 and GSE17538) contain 
a serious data redundancy problem. You may run the following code to perform 
the validation. It may take a few minutes to a few hours, depending on 
network bandwidth and computer performance.

<<realexample, eval=FALSE, tidy=FALSE>>=
library(DupChecker)

rootDir<-paste0(dirname(tempdir()), "/DupChecker_RealExample")
dir.create(rootDir, showWarnings = FALSE)

geoDownload(datasets = c("GSE14333", "GSE13067", 
                         "GSE17538"), targetDir=rootDir)
datafile<-buildFileTable(rootDir=rootDir, filePattern="cel$")
result<-validateFile(datafile)
if(result$hasdup){
  duptable<-result$duptable
  write.csv(duptable, file=paste0(rootDir, "/duptable.csv"))
}
@

Table \ref{tab:duptable} illustrates the duplication among these three 
datasets. GSE13067 {[}64/74{]} means that 64 of the 74 CEL files in the 
GSE13067 dataset were duplicated in the other two datasets.

\begin{table}[ht]
\centering
\caption{Part of duplication table of three datasets} 
\label{tab:duptable}
\resizebox{\columnwidth}{!}{
\begin{tabular}{|c|c|c|c|}
\hline 
MD5 & GSE13067{[}64/74{]}  & GSE14333{[}231/290{]}  & GSE17538{[}167/244{]} \\
\hline 
001ddd757f185561c9ff9b4e95563372  &  & GSM358397.CEL  & GSM437169.CEL \\
00b2e2290a924fc2d67b40c097687404  &  & GSM358503.CEL  & GSM437210.CEL \\
012ed9083b8f1b2ae828af44dbab29f0  & GSM327335.CEL  & GSM358620.CEL  & \\
023c4e4f9ebfc09b838a22f2a7bdaa59  &  & GSM358441.CEL  & GSM437117.CEL \\
\hline 
\end{tabular}
}
\end{table}

\section{Discussion}

We illustrated the application using gene expression data, but the DupChecker 
package can also be applied to other types of high-throughput genomic data, 
including next-generation sequencing data.
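
For instance, once the filePattern argument is adjusted to the raw file 
extension of the platform, the same workflow applies unchanged. A minimal 
sketch for sequencing data, where the directory path is a placeholder; note 
that fingerprinting large FASTQ files will take correspondingly longer:

<<ngsExample, eval=FALSE, tidy=FALSE>>=
## Validate sequencing runs by fingerprinting the raw FASTQ
## files instead of CEL files; the path is a placeholder.
datafile<-buildFileTable(rootDir="/path/to/fastq/runs",
                         filePattern="fastq$")
result<-validateFile(datafile)
@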

\section{Session Info}

<<sessInfo, echo=FALSE, results="asis">>=
toLatex(sessionInfo())
@

\end{document}