--- title: "SC3 manual" author: "Vladimir Kiselev" date: "`r Sys.Date()`" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{SC3 manual} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ## Introduction Single-Cell Consensus Clustering (__SC3__) is a tool for unsupervised clustering of scRNA-seq data. __SC3__ achieves high accuracy and robustness by consistently integrating different clustering solutions through a consensus approach. An interactive graphical implementation makes __SC3__ accessible to a wide audience of users. In addition, __SC3__ also aids biological interpretation by identifying marker genes, differentially expressed genes and outlier cells. A manuscript describing __SC3__ in details is currently under review but a copy of it is available on [bioRxiv](http://biorxiv.org/content/early/2016/01/13/036558). ## Prerequisites (optional) > Note that this prerequisite is only optional. If it is not installed you can still run __SC3__, but the _GO anlaysis_ functionality will be disabled. __SC3__ imports some of the [RSelenium](https://cran.r-project.org/web/packages/RSelenium/) functionality. [RSelenium](https://cran.r-project.org/web/packages/RSelenium/) depends on a stand-alone java binary file (see [Rselenium documentation](https://cran.r-project.org/web/packages/RSelenium/vignettes/RSelenium-basics.html) for more details). You can download and install this binary file by running (the file size is about 30Mb): ```{r, eval=FALSE} RSelenium::checkForServer() ``` Note, this command has to be executed only once, before running __SC3__ for the first time. Also note that the minimum Java version requirement for [RSelenium](https://cran.r-project.org/web/packages/RSelenium/) is 1.7 (see [this post](https://github.com/ropensci/RSelenium/issues/54) for details). ## "Built-in" datasets There is one built-in dataset that is automatically loaded with __SC3__: | Dataset | Source | __N__ cells | __k__ clusters | --- | --- | --- | --- | | [treutlein](http://www.nature.com/nature/journal/v509/n7500/full/nature13173.html) | Distal lung epithelium | 80 | 5 | ## Quick start To run __SC3__ with the built-in dataset, start R and then type: ```{r, eval=FALSE} library(SC3) sc3(treutlein) ``` It should open SC3 in a browser window without providing any error. ## __SC3__ parameters By default __SC3__ clusters the input dataset in a region of __k__ from 3 to 7. This region can be changed by adjusting the __ks__ parameter of __SC3__. E.g. if you would like to check the clustering of your own __dataset__ for __k__ from 2 to 5, then you need to run the following: ```{r, eval=FALSE} sc3(dataset, ks = 2:5) ``` where __dataset__ is either an R matrix / data.frame / data.table object OR a path to your input file containing an expression matrix. For the full list of __SC3__ parameters please read the documentation by typing ?sc3. ## Running __SC3__ without interactivity If you would like to separate clustering calculations from visualisation, e.g. when your dataset is very large and you need to run it on a computing cluster, you can disable interactivity by setting __interactivity__ parameter to FALSE: ```{r, eval=FALSE} sc3(treutlein, interactivity = FALSE) ``` In this case all clustering results will be saved as _sc3.interactive.arg_ variable in your R session. You can then visualise your results by running: ```{r, eval=FALSE} sc3_interactive(sc3.interactive.arg) ``` ## Number of cells Combining multiple clustering results through the consensus approach is quite computationally expensive. Therefore running __SC3__ on datasets with 1,000 cells may take ~10-15 mins. We do not recommend to use the default options on datasets with more than 1,000 cells. In order to apply SC3 to larger datasets, we have implemented a hybrid approach that combines unsupervised and supervised methodologies. When the number of cells is large (N>1,000), SC3 selects a small subset of cells uniformly at random, and obtains clusters from this subset. Subsequently, the inferred labels are used to train a Support Vector Machine (SVM), which is employed to assign labels to the remaining cells. Training the SVM typically takes only a few minutes, thus allowing SC3 to be applied to very large datasets. Based on our results, we set the default of __SC3__ so that when N>1,000 only 20% of the cells are used and when N>5,000 only 1,000 cells are used for clustering before training the SVM. To enable the SVM training please set the _svm_ parameter to __TRUE__: ```{r, eval=FALSE} sc3(treutlein, svm = TRUE) ``` You can also control the number of training cells used to train SVM with the _svm.num.cells_ parameter: ```{r, eval=FALSE} sc3(treutlein, svm = TRUE, svm.num.cells = 25) ``` ## Input file format To run __SC3__ on an input file containing an expression matrix one need to preprocess the input file so that it looks as follows: | | cell1 | cell2 | cell3 | cell4 | cell5 --- | --- | --- | --- | --- | --- | __gene1__ | 1 | 2 | 3 | 4 | 5 | __gene2__ | 1 | 2 | 3 | 4 | 5 | __gene3__ | 1 | 2 | 3 | 4 | 5 The first row of the expression matrix (with cell labels, e.g. __cell1__, __cell2__, etc.) should contain one fewer field than all other rows. Separators should be either spaces or tabs. If separators are commas (,) then the extension of the file must be .csv. If a path to your input file is "/path/to/input/file/expression-matrix.txt", to run it: ```{r, eval=FALSE} sc3("/path/to/input/file/expression-matrix.txt", ks = 2:5) ``` ## Saving results After finding the best clustering solution __SC3__ allows a user to either export the results into an excel file or to save them to the current R session. If the former is chosen then the excel file will be written to the default _Downloads_ folder used by your browser. When saving to the R session a variable (list) __SC3.results__ containing all the results from the interactive session will be created.