%\VignetteIndexEntry{Importing flowJo Workspaces into R} %\VignetteKeywords{flow cytometry, single cell assay, import} %\VignettePackage{flowWorkspace} \documentclass[11pt]{article} \usepackage{Sweave} \usepackage{hyperref} \newcommand{\Rfunction}[1]{{\texttt{#1}}} \newcommand{\Robject}[1]{{\texttt{#1}}} \newcommand{\Rpackage}[1]{{\textit{#1}}} \newcommand{\Rclass}[1]{{\textit{#1}}} \title{flowWorkspace: A Package for Importing flowJo Workspaces into R} \author{Greg Finak } \begin{document} \maketitle \section{Purpose} The purpose of this package is to provide functionality to import relatively simple \textit{flowJo} workspaces into R. By this we mean, accessing the samples, groups, transformations, compensation matrices, gates, and population statistics in the \textit{flowJo} workspace, and replicating these using (primarily) \textit{flowCore} functionality. \section{Why Another flowJo Workspace Import Package?} There was a need to import \textit{flowJo} workspaces into R for comparative gating. The \textit{flowFlowJo} package did not meet our needs. Many groups have legacy data with associated flowJo XML workspace files in version 2.0 format that they would like to access using BioConductor's tools. Hopefully this package will fill that need. \section{Support} This package supports importing of \textbf{Version 2.0 XML workspaces only}. We cannot import \textbf{.jo} files directly. You will have to save them in XML workspace format, and ensure that that format is \textit{workspace version 2.0}. The package has been tested and works with files generated using flowJo version 9.1 on Mac OS X. XML generated by older versions of \textit{flowJo} on windows should work as well. We do not yet support \textit{flowJo}'s \textbf{Chimera} XML schema, though that support will be provided in the future. The package supports import of only a subset of the features present in a flowJo workspace. The package allows importing of sample and group names, gating hierarchy, compensation matrices, data transformation functions, a subset of gates, and population counts. BooleanGates are now supported by flowWorkspace. \section{Data Structures} The following section walks through opening and importing a flowJo workspace. \subsection{Loading the library} Simply call: <>= library(flowWorkspace) @ The library depends on numerous other packages, including \textit{graph, XML, Rgraphviz, flowCore, flowViz, RBGL}. \subsection{Opening a Workspace} We represent flowJo workspaces using \texttt{flowJoWorkspace} objects. We only need to know the path to, and filename of the flowJo workspace. <>= d<-system.file("extdata",package="flowWorkspaceData"); wsfile<-list.files(d,pattern="A2004Analysis.xml",full=T) @ In order to open this workspace we call: <>= ws<-openWorkspace(wsfile) summary(ws) @ We see that this a version 2.0 workspace file. It's location and filename are printed. Additionally, you are notified that the workspace file is open. This refers to the fact that the XML document is internally represented using 'C' data structures from the \Rpackage{XML} package. After importing the file, the workspace must be explicitly closed using \Rfunction{closeWorkspace()} in order to free up that memory. \subsection{Parsing the Workspace} With the workspace file open, we have not yet imported the XML document. The next step parses the XML workspace and creates R data structures to represent some of the information therein. Specifically, by calling \Rfunction{parseWorkspace()} the user will be presented with a list of \textit{groups} in the workspace file and need to choose one group to import. Why only one? Because of the way flowJo handles data transformation and compensation. Each group of samples is associated with a compensation matrix and specific data transformation. These are applied to all samples in the group. When a particular group of samples is imported, the package generates a \Rclass{GatingHierarchy} for each sample, describing the set of gates applied to the data (note: polygons, rectangles, quadrants, and ovals and boolean gates are supported). The set of GatingHierarchies for the group of samples is stored in a \Rclass{GatingSet} object. Calling \Rfunction{parseWorkspace()} is quite verbose, informing the user as each gate is created. The parsing can also be done non--interactively by specifying which group to import directly in the function call (either an index or a group name). An additional optional argument \texttt{execute=T/F} specifies whether you want to load, compensate, transform the data and compute statistics immediately after parsing the XML tree. <>= G<-parseWorkspace(ws,name=1,execute=TRUE,path=ws@path,isNcdf=FALSE,cleanup=FALSE,keep.indices=TRUE); #import the first group #Lots of output here suppressed for the vignette. @ When isNcdf flag is set TRUE,the data is stored in ncdf format on disc. <>= G @ We have generated a \Rclass{GatingSet} with $\Sexpr{length(G)}$ samples, each of which has 19 associated gates. Subsets of gating hierarchies can be accessed using the standard R subset syntax. At this point we have parsed the workspace file and generate the gating hierarchy associated with each sample imported from the file. The data have been loaded, compensated, and transformed in the workspace, since we passed \texttt{execute=TRUE} to the \Rfunction{parseWorkspace()} function. This can also be done separately via the \Rfunction{execute()} method, which takes a \Rclass{GatingHierarchy} and the \Rclass{flowJoWorkspace} that generated it, as arguments. It returns a \Rclass{GatingHierarchy} with additional data attached to each node of the hierarchy (population counts, membership indices, and a \Rclass{flowFrame}. <>= G<-lapply(G,function(x)execute(x)) @ We can plot the gating hierarchy for a given sample: <>= plot(G[[1]]) @ We can list the nodes (populations) in the gating hierarchy: <>= getNodes(G[[1]]) @ Note that the number preceding the period in the node names is just an identifier to help uniquely label populations in the gating hierarchy. It does not represent any information about population statistics. We can get a specific gate definition: <>= getGate(G[[1]],getNodes(G[[1]])[3]) @ We can extract the dimensions relating to a specific gate: <>= getDimensions(G[[1]],getNodes(G[[1]])[3]) @ We can extract vertices of a gate: <>= getBoundaries(G[[1]],getNodes(G[[1]])[3]) @ We can get the population proportion (relative to its parent) for a single population: <>= getProp(G[[1]],getNodes(G[[1]])[3]) @ Or we can retrieve the population statistics for all populations in the sample: <>= getPopStats(G[[1]]) @ We can plot the coefficients of variation between the counts derived using flowJo and flowCore for each population: <>= print(plotPopCV(G[[1]])) @ We can plot individual gates: note the scale of the transformed axes. <>= print(plotGate(G[[1]],getNodes(G[[1]])[3],lwd=2)) @ If we have metadata associated with the experiment, it can be attached to the \Robject{GatingSet}. <>= d<-data.frame(sample=factor(c("sample 1", "sample 2")),treatment=factor(c("sample","control")) ) G@metadata<-new("AnnotatedDataFrame",data=d) pData(G); @ We can retrieve the subset of data associated with a node: <>= getData(G[[1]],getNodes(G[[1]])[3]); @ Or we can retrieve the indices specifying if an event is included inside or outside a gate using: <>= getIndices(G[[1]],getNodes(G[[1]])[3]) @ The indices returned are relative to the parent population (member of parent AND member of current gate), so they reflect the true hierarchical gating structure. If we wish to do compensation or transformation manually, we can retrieve all the compensation matrices from the workspace: <>= C<-getCompensationMatrices(ws); C @ Or we can retrieve transformations: <>= T<-getTransformations(ws) names(T) names(T[[1]]) T[[1]][[1]] @ <>= A<-names(T) B<-names(T[[1]]) @ \Rfunction{getTransformations} returns a list, each element of which corresponds to a transformation applied to a group of samples. The transformation is presented as a list of functions to be applied to different dimensions of the data. Above, the transformation is applied to all samples of the group and for each sample in the group, the appropriate dimension is transformed using a channel--specific function from the list. The list of samples in a workspace can be accessed by: <>= getSamples(ws); @ And the groups can be accessed by: <>= getSampleGroups(ws) @ The \texttt{compID} column tells you which compensation matrix to apply to a group of files, and similarly, based on the name of the compensation matrix, which transformations to apply. \subsection{Converting to flowCore Objects} You may want to convert the imported workspace into \Robject{flowCore} objects, such as workflows. We provide this functionality via the \Rfunction{flowWorkspace2flowCore} function. \Rfunction{flowWorkspace2flowCore} extracts the compensation matrices,transformation functions and all the gates from GatingHierarchies generated by flowWorkspace package and converts them to the respective views and actionItems of workFlows defined by flowCore package. It takes a gatingHierarchy, flowJoWorkspace or GatingSet as the input, and returns one or multiple workflows as the result, depending on whether the gating hierarchies for each sample (including gate coordinates) are identical. <>= wfs<-flowWorkspace2flowCore(G,path=ws@path); wfs @ \Rfunction{plotWf} plots the workflow tree <>= plotWf(wfs[[1]]) @ Finally, when we are finished with the workspace, we close it: <>= closeWorkspace(ws); ws @ \subsection{Exporting to FlowJo OSX 9.2} The \Rfunction{exportAsFlowJoXML} function can be used to export a \Robject{flowCore::workFlow} as an XML workspace for FlowJo 9.2 OSX. If flowWorkspace has been used to import an existing FlowJo workspace, \Rfunction{flowWorkspace2flowCore} can be used to obtain a \Robject{workFlow} for exporting. Currently this function can export one workFlow at a time. \subsection{Additional Important Notes} \subsubsection{NetCDF Support} If you have particularly large data files (millions of events), then you will want to make use of the netCDF framework. To do so, you will have to install the netcdf4 C library (available at \url{http://www.unidata.ucar.edu/downloads/netcdf/index.jsp}), and build it with HDF5 support.You will also need the R library ncdf4. To use NetCDF, pass isNcdf=TRUE to parseWorkspace. flowWorkspace will create one netcdf file for the processed data, and additional netcdf files (one per sample) containing the event memberships for each gate in each sample. To build flowWorkspace to use netcdf, you may need to run autoconf in the top--level directory of the untarred flowWorkspace source directory, before installing with R CMD INSTALL. \subsubsection{Known Bugs} Importing flowJo transformations. We have made every effort to support the importing of flowJo's data transformations. Sometimes, however, flowWorkspace may have difficulty identifying the correct transofrmation to apply to your data. There are several things you can do: \begin{itemize} \item Visualize your data after import using plotPopCV(). This will tell you if there is a discrepancy between the flowJo counts and the flowCore counts for individual populations. Keep in mind that tare populations can differ by a few cells and still show large coefficients of variation. \item If you are having difficulty importing your data due to a transformation problem, ensure that you are using either a log transform, or flowJo's "custom biexponential transform". The latter provides an explicit mapping between transformations, compensation matrices, and flow parameters. Outside of these two cases, success or failure appears to be dependent on the version of flowJo that was used to create the original workspace. \end{itemize} \section{Troubleshooting} If this package is throwing errors when parsing your workspace, and you are certain your workspace is version 2.0, contact the package author. If you can send your workspace by email, we can test, debug, and fix the package so that it works for you. Our goal is to provide a tool that works, and that people find useful. \section{Future Improvements} We are working on support for flowJo XML workspaces exported from the Windows version of flowJo. Efforts are underway to integrate GatingSet and GatingHierarchy objects more closely with the rest of the flow infrastructure. \end{document}