% NOTE -- ONLY EDIT THE .Rnw FILE!!!  The .tex file is
% likely to be overwritten.
%
%\VignetteIndexEntry{Basic Functions for Flow Cytometry Data}
%\VignetteDepends{flowCore, flowViz}
%\VignetteKeywords{ncdf,flowCore}
%\VignettePackage{ncdfFlow}
\documentclass[11pt]{article}

\usepackage{times}
\usepackage{hyperref}
\usepackage[authoryear,round]{natbib}
\usepackage{times}

\usepackage{graphicx}
%\usepackage{subfigure}

\textwidth=6.2in
\textheight=8.5in
\oddsidemargin=.1in
\evensidemargin=.1in
\headheight=-.3in

\newcommand{\scscst}{\scriptscriptstyle}
\newcommand{\scst}{\scriptstyle}
\newcommand{\Rfunction}[1]{{\texttt{#1}}}
\newcommand{\Rcode}[1]{{\texttt{#1}}}
\newcommand{\Robject}[1]{{\texttt{#1}}}
\newcommand{\Rpackage}[1]{{\textsf{#1}}}
\newcommand{\Rclass}[1]{{\textit{#1}}}



\title{ncdfFlow: Provides netCDF storage based methods and functions for manipulation of flow cytometry data}
\author{Mike Jiang,Greg Finak,N. Gopalakrishnan}

\begin{document}
\maketitle

\begin{abstract}
\noindent \textbf{Background} The Bioconductor package \Rpackage{flowCore} is the object model 
and a collection of standard tools designed for flow cytometry data analysis. The related R packages 
including data analysis (flowClust, flowMerge, flowMeans,flowTrans ,flowStats), visualization (flowViz) 
and quality control (flowQ) use the flowCore infrastructure to deal with flow cytometry data.
However the flowFrame or flowSet which represent a single or a set of FCS files are memory-resident data 
structures and require the entire data elements to remain in memory in order to perform all kinds of the 
data manipulations.  Hundreds or thousands of datasets generated by high throughput instruments can 
easily hit the memory limit if they are imported as the flowSet or flowFrames in R. It presents a challenge 
to scientists and bioinformaticians who use the R tools described above to perform statistical data analysis 
on a regular computer .We propose a new R object model and related functions to address this problem. 
The new model ncdfFlowSet inherit most of data structures from flowSet. It stores the large volume of 
event-level data on disk and only keeps the file handler and meta data in memory. Thus the memory consumption 
is significantly reduced. NetCDF is used as the data formats because it is self-describing, 
machine-independent and specifically optimized for storing and accessing array-oriented scientific data. 
With the compression and chunking features introduced by the new release of netCDF4, the new model is able 
to maintain high performance of data processing.

Most of the functions and methods including transformation,compensation,gating and subsetting methods for 
flowSet are extended to ncdfFlowSet 
(\Rfunction{spillover},\Rfunction{normalize} and \Rfunction{workflow} methods of \Rpackage{flowCore} are currently not supported yet.).
Thus the change of data structure is almost transparent to the users of flowCore-based R packages. 


\noindent \textbf{keywords} Flow cytometry, high throughput,netCDF, flowSet,ncdfFlowSet
\end{abstract}

<<loadPackage, echo=false,results=hide>>=
library(ncdfFlow)
@


\section{Representing Flow Cytometry Data with ncdfFlowSet}

\Rpackage{ncdfFlow} reppresents a flow cytometry data model that is very similar to 
the \Robject{flowSet} structure.The only difference is that the event-based 2-D data matrices from multiple samples of the same experiment are stored as one single 3D data matrix on disk in ncdf format. 
Each sample can be accessed efficiently from the 3-D matrix as a data chunk or slice and further manipulated in memory.

The basic unit of manipulation in \Rpackage{ncdfFlow} is the
\Rclass{ncdfFlowSet}. It inherites all the slots from \Rclass{flowSet}.
However, the \Robject{flowFrame} objects stored in the "frames" slot of a \Robject{ncdfFlowSet} object 
do not host the data matrix.Instead,their the "exprs" lots are kepted empty and the actual data are 
stored in the 3-D data matrix on disk and only the file name is stored in "file" slot of \Robject{ncdfFlowSet}. 
Thus \Robject{ncdfFlowSet} reduces the memory requirements and meanwhile ensures the consistent data structure with \Rclass{flowSet}.


\section{Creating a \Rclass{ncdfFlowSet}}

We provide a function to read FCS files into a \Rclass{ncdfFlowSet} object:

<<ncdfFlowSet, echo=true, results=verbatim>>=
path<-system.file("extdata","compdata","data",package="flowCore")
files<-list.files(path,full.names=TRUE)[1:3]
nc1 <- read.ncdfFlowSet(files=files,ncdfFile="ncfsTest.nc")
nc1
@

As we see,the contructor function is very similar to the \Rclass{flowSet} execpt that it requires a filename for the ncdf file. 
<<flowSet, echo=true, results=hide>>=
fs1  <- read.flowSet(files=files)
@


Note that an ncdf file that stores the actual data is generated and saved on disk once a \Robject{ncdfFlowSet} is created. 
Users need to explicitely call the \Rfunction{unlink} method to remove the file before delete the object from memory by \Rfunction{rm}. 

<<ncdfFlowSet, echo=true, results=verbatim>>=
unlink(nc1)
rm(nc1)
@

Users can also create an empty ncdfFlowSet first and add data slices later by assigning argument isWriteSlice as FALSE.

<<ncdfFlowSet, echo=true,results=verbatim>>=
nc1  <- read.ncdfFlowSet(files=files,ncdfFile="ncfsTest.nc",isWriteSlice= FALSE)
nc1[[1]]
@

As we see here,before writing the actual flowFrame by \Rfunction{[[<-}, 
the flowFrame object returned by \Rfunction{[[} has 0 events.

<<ncdfFlowSet, echo=true,results=verbatim>>=
targetSampleName<-sampleNames(fs1)[1]
nc1[[targetSampleName]] <- fs1[[1]]
nc1[[1]]
nc1[[2]]
@

Note that it is important to always use sample name to specify the target position in the data matrix where the actual is added.  
Because the sample name is the identifier used to index the data matrix.

Sometime it is helpful to copy the structure from an existing ncdfFlow object and then add the data slice to the empty ncdfFlow cloned by
 \Rfunction{clone.ncdfFlowSet}.
<<ncdfFlowSet, echo=true,results=hide>>=
nc2<-clone.ncdfFlowSet(nc1,"clone.nc")
nc2[[1]]
nc2[[sampleNames(fs1)[1]]] <- fs1[[1]]
nc2[[1]]
@

<<ncdfFlowSet, echo=FALSE,results=hide>>=
unlink(nc2)
rm(nc2)

unlink(nc1)
rm(nc1)
@

We also provide the coerce function to convert the flowSet to a ncdfFlowSet.
<<ncdfFlowSet, echo=true,results=verbatim>>=
data(GvHD)
GvHD <- GvHD[pData(GvHD)$Patient %in% 6:7][1:4]
nc1<-ncdfFlowSet(GvHD)
@

Or coerce a ncdfFlowSet to flowSet
<<ncdfFlowSet, echo=true,results=verbatim>>=
fs1<-as.flowSet(nc1,top=2)
@

Note that \Rclass{ncdfFlowSet} is designed to store large datasets and it is not recommened to 
corece the entire ncdfFlowset to flowSet. Usually users want to select a subset from ncdfFlowSet by \Rfunction{[} and convert the subetted data.
Sometimes it is helpful to randomly select a cerntain number of flowFrames from the entire datasets represented by 
by \Rclass{ncdfFlowSet} to have a preview of the data.The arugment "top" can be used here for this purpose.



\section{Working with metadata}

Like \Rclass{flowSet},\Rclass{ncdfFlowSet} has an associated \Rclass{AnnotatedDataFrame} that provides metadata of experiments. 
This data frame is accessed and modified via the same methods of \Rpackage{flowCore}. :

<<metaData, echo=true, results=hide>>=
phenoData(nc1)
pData(nc1)
varLabels(nc1)
varMetadata(nc1)
sampleNames(nc1)
keyword(nc1,"FILENAME")
identifier(nc1)
colnames(nc1)
colnames(nc1,prefix="s6a01")
length(nc1)
getIndices(nc1,"s6a01")
@

\section{Manipulating a \Rclass{ncdfFlowSet}}

You can extract a \Rclass{flowFrame} from a \Rclass{ncdfFlowSet} object in
the same way as \Rclass{flowSet} by using the \Rfunction{[[} or \Rfunction{\$} extraction operators. 
Note that using the \Rfunction{[} extraction operator returns a new \Rclass{ncdfFlowSet} that points to the same ncdf 
file. SO the original ncdf file serves as a data repository and the \Robject{ncdfFlowSet} works as view of the 
data in this sense.

<<extraction, echo=true,results=verbatim>>=

nm<-sampleNames(nc1)[1]
expr1<-paste("nc1$'",nm,"'",sep="")
eval(parse(text=expr1))
nc1[[nm]]

nm<-sampleNames(nc1)[c(1,3)]
nc2<-nc1[nm]
summary(nc2)
@

\Rclass{flowSet}-specific iterator \Rfunction{fsApply} can also be applied to Robject{ncdfFlowSet}:
<<fsApply, echo=true,results=hide>>=
fsApply(nc1,range)
fsApply(nc1, each_col, median)
@

However, we recmmend to use another iterator \Rfunction{ncfsApply} designed for the function that returns a flowFrame (such as compensate,transform...).
\Rfunction{ncfsApply} works the same as \Rfunction{fsApply} execpt that it returns a ncdfFlowSet object with the acutal data stored in cdf to avoid the huge memory consumption.
Note that the return value of the function applied in \Rfunction{ncfsApply} must be a flowFrame object and it should have the same dimensions(channles and events) as the original data.

\section{Compensation,Transformation and Gating}

\Rfunction{transform} amd \Rfunction{compensate} for \Robject{ncdfFlowSet} work the same as \Robject{flowSet}.
 
<<Transformation and compensation, echo=true,results=hide>>=
cfile <- system.file("extdata","compdata","compmatrix", package="flowCore")
comp.mat <- read.table(cfile, header=TRUE, skip=2, check.names = FALSE)
comp <- compensation(comp.mat)

#compensation
summary(nc1)[[1]]
nc2<-compensate(nc1, comp)
summary(nc2)[[1]]
unlink(nc2)
rm(nc2)

#transformation
asinhTrans <- arcsinhTransform(transformationId="ln-transformation", a=1, b=1, c=1)
nc2 <- transform(nc1,`FL1-H`=asinhTrans(`FL1-H`))
summary(nc1)[[1]]
summary(nc2)[[1]]
unlink(nc2)
rm(nc2)
@

Note that compensation/transformation return the \Robject{ncdfFlowSet} objects that point to 
the new cdf file containing the compensated/transformed data.

\Rfunction{filter} for \Robject{flowSet} also works for \Robject{ncdfFlowSet}:
<<Gating, echo=true,results=hide>>=
rectGate <- rectangleGate(filterId="nonDebris","FSC-H"=c(200,Inf))
fr <- filter(nc1,rectGate)
summary(fr)

rg2 <- rectangleGate(filterId="nonDebris","FSC-H"=c(300,Inf))
rg3 <- rectangleGate(filterId="nonDebris","FSC-H"=c(400,Inf))
flist <- list(rectGate, rg2, rg3)
names(flist) <- sampleNames(nc1[1:3])
fr3 <- filter(nc1[1:3], flist)
summary(fr3[[3]])
@

\section{Subsetting}

The \Rfunction{Subset} and \Rfunction{split} methods for \Rclass{ncdfFlowSet}:

<<Subsetting, echo=true, results=verbatim>>=
nc2 <- Subset(nc1,rectGate)
summary(nc2[[1]])
nc2 <- Subset(nc1, fr)
summary(nc2[[1]])
rm(nc2)


morphGate <- norm2Filter("FSC-H", "SSC-H", filterId = "MorphologyGate",scale = 2)
smaller <- Subset(nc1[c(1,3)], morphGate,c("FSC-H", "SSC-H"))
smaller[[1]]
nc1[[1]]
rm(smaller)
@
Note that \Rfunction{Subset} does not create the new ncdf file (the same as extraction operator \Rfunction{[}).
So we need to be careful about using \Rfunction{unlink} to delete the subsetted data 
since it points to the same ncdf file that is also used by the original \Robject{ncdfFlowSet} object. 


\Rfunction{split} returns a \Robject{ncdfFlowList} object which is a container of multiple \Robject{ncdfFlowSet} objects.  
<<split, echo=true>>=

##splitting by a gate
qGate <- quadGate(filterId="qg", "FSC-H"=200, "SSC-H"=400)
fr<-filter(nc1,qGate)
ncList<-split(nc1,fr)
ncList
nc1[[1]]
ncList[[2]][[1]]
ncList[[1]][[1]]
@

Note that the \Robject{ncdfFlowSet} objects contained in \Robject{ncdfFlowList} by default share the 
same ncdf file as the original \Robject{ncdfFlowSet}. In order to keep its own data copy, try to set argument \Robject{isNew} to "TRUE"
in \Robject{split} method. 
<<split, echo=true>>=
ncList_new<-split(nc1,fr,isNew=TRUE)
@


\clearpage
%\bibliographystyle{plainnat} 
%\bibliography{cytoref}
\end{document}