---
title: "Setting up the workflow and first steps"
author: 
  - name: Giulia Pais
    affiliation: | 
     San Raffaele Telethon Institute for Gene Therapy - SR-Tiget, 
     Via Olgettina 60, 20132 Milano - Italia
    email: giuliapais1@gmail.com, calabria.andrea@hsr.it
output: 
  BiocStyle::html_document:
    self_contained: yes
    toc: true
    toc_float: true
    toc_depth: 2
    code_folding: show
date: "`r doc_date()`"
package: "`r pkg_ver('ISAnalytics')`"
vignette: >
  %\VignetteIndexEntry{workflow_start}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}  
---

```{r GenSetup, include = FALSE}
knitr::opts_chunk$set(
    collapse = TRUE,
    comment = "#>",
    crop = NULL
    ## Related to
    ## https://stat.ethz.ch/pipermail/bioc-devel/2020-April/016656.html
)
```

```{r vignetteSetup, echo=FALSE, message=FALSE, warning = FALSE}
## Bib setup
library("RefManageR")

## Write bibliography information
bib <- c(
    R = citation(),
    BiocStyle = citation("BiocStyle")[1],
    knitr = citation("knitr")[1],
    RefManageR = citation("RefManageR")[1],
    rmarkdown = citation("rmarkdown")[1],
    sessioninfo = citation("sessioninfo")[1],
    testthat = citation("testthat")[1],
    ISAnalytics = citation("ISAnalytics")[1],
    VISPA2 = BibEntry(
        bibtype = "Article",
        title = paste(
            "VISPA2: a scalable pipeline for",
            "high-throughput identification and",
            "annotation of vector integration sites"
        ),
        author = "Giulio Spinozzi, Andrea Calabria, Stefano Brasca",
        journaltitle = "BMC Bioinformatics",
        date = "2017-11-25",
        doi = "10.1186/s12859-017-1937-9"
    ),
    eulerr = citation("eulerr")[1]
)

library(ISAnalytics)
library(BiocStyle)
library(DT)
main_fig <- fs::path("../man", "figures", "dyn_vars_general.png")
```

# Introduction

ISAnalytics is an R package developed to analyze gene therapy vector integration site data, identified from next-generation sequencing reads, in clonal tracking studies.

In this vignette we will explain how to properly set up the workflow and
walk through the first steps of data import and data cleaning.

# Setting up your workflow with dynamic vars

This section demonstrates how to properly set up your workflow 
with `ISAnalytics` using the "dynamic vars" system.

From `ISAnalytics 1.5.4` onwards, a new system, here referred to as 
"dynamic vars", has been implemented to improve the flexibility of the package
by allowing multiple input formats based on user needs, rather than enforcing
hard-coded names and structures. In this way, users who do not follow the
standard naming conventions used by the package need only minimal effort
to make their inputs compliant with the package requirements.

There are 5 main categories of inputs you can customize:

* The **"mandatory IS vars"**: this set of variables is used to uniquely 
identify integration events across several functions implemented in the 
package
* The **"annotation IS vars"**: this set of variables holds the names of
the columns that contain genomic annotations
* The **"association file columns"**: this set contains information on how 
metadata is structured
* The **"VISPA2 stats specs"**: this set contains information on the format
of pool statistics files produced automatically by VISPA2
* The **"matrix files suffixes"**: this set contains all default file names
for each quantification type and it is used by automated import functions

## General approach

The general approach is based on the specification of 
predefined tags and their associated information in the form of simple
data frames with a standard structure, namely:

names                | types  | transform          | flag   | tag 
---------------------|--------|--------------------|--------|-----
`<name of the column>` | `<type>` | `<a lambda or NULL>` | `<flag>` | `<tag>`

where

* `names` contains the name of the column as a character
* `types` contains the type of the column. The type should be expressed as a 
string and must be one of the allowed types
    * `char` for character (strings)
    * `int` for integers 
    * `logi` for logical values (TRUE / FALSE)
    * `numeric` for numeric values
    * `factor` for factors
    * `date` for generic date format - note that functions that
    need to read and parse files will try to guess the format and parsing
    may fail
    * One of the accepted date/datetime formats by `r CRANpkg("lubridate")`, 
    you can use `ISAnalytics::date_formats()` to view the accepted formats
* `transform`: a purrr-style lambda that is applied immediately after import.
This is useful for performing simple transformations such as removing unwanted 
characters or rounding to a certain precision. Please note that these lambdas
must be functions that accept a vector as input and only perform a 
<u>transformation</u>, that is, they output a vector of the same length as the 
input (see the sketch after the figure below). For more complicated operations
that may require the values of other
columns, appropriate functions should be applied manually post-import.
* `flag`: as of now, it should be set to either `required` or `optional` - 
some functions internally check for the presence of required tags only, and
fail if those are missing from the input, signaling the failure to the user
* `tag`: a specific tag expressed as a string - see Section \@ref(tags)

![Dynamic variables general approach](`r main_fig`)
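
To make the `transform` contract concrete, here is a minimal sketch (not part
of the package API; `rlang` and `stringr` are assumed to be installed) of how
such a lambda behaves: it receives the whole column as a vector and returns a
vector of the same length. The expected output is shown as a comment.

```{r eval=FALSE}
## A valid `transform` lambda: strips the "chr" prefix from every element.
## rlang::as_function() converts the purrr-style formula into a callable.
strip_chr <- rlang::as_function(~ stringr::str_replace_all(.x, "chr", ""))
strip_chr(c("chr1", "chr2", "chrX"))
#> [1] "1" "2" "X"
```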

## Tags

As already mentioned, certain functions included in the package require 
the presence of specific tags (and associated column names) in the input
to work properly. 
You can always check what a tag means and where it is used by using the
function `inspect_tags()` and providing in input the tags you want to check
as a character vector.

```{r}
inspect_tags("chromosome")
```

You should make sure the tag matches the right information in your inputs
by looking at the description of the tag.
It is also possible to add entries that are not associated with any tag.
Here is an overview of the critical tags for each 
category.

### Mandatory IS vars

```{r echo=FALSE}
all_tags <- available_tags()
all_tags <- all_tags |>
    dplyr::mutate(needed_in = purrr::map_chr(
        .data$needed_in,
        ~ paste0(.x, collapse = ", ")
    ))
mand_tags <- all_tags |>
    dplyr::filter(.data$dyn_vars_tbl == "mand_vars") |>
    dplyr::select(dplyr::all_of(c("tag", "needed_in", "description")))
annot_tags <- all_tags |>
    dplyr::filter(.data$dyn_vars_tbl == "annot_vars") |>
    dplyr::select(dplyr::all_of(c("tag", "needed_in", "description")))
af_tags <- all_tags |>
    dplyr::filter(.data$dyn_vars_tbl == "af_vars") |>
    dplyr::select(dplyr::all_of(c("tag", "needed_in", "description")))
iss_tags <- all_tags |>
    dplyr::filter(.data$dyn_vars_tbl == "iss_vars") |>
    dplyr::select(dplyr::all_of(c("tag", "needed_in", "description")))
```

```{r echo=FALSE}
datatable(
    mand_tags,
    class = "stripe",
    options = list(dom = "t"),
    rownames = FALSE
)
```

The presence of all mandatory IS vars is also checked and used
in other functions - for example, when importing matrices the function
ensures that all mandatory variables declared in the lookup table
are present in the input.
Some functions may require information that needs to be specified as input;
always check the documentation if you have doubts.

### Annotation IS vars

```{r echo=FALSE}
datatable(
    annot_tags,
    class = "stripe",
    options = list(dom = "t"),
    rownames = FALSE
)
```

Although genomic annotations are not necessarily required to work with 
ISAnalytics, some operations do require annotation - if you're working with
matrices that are not annotated you can either annotate them with a tool of
your choice or skip the steps that require annotation.

### Association file columns

```{r echo=FALSE}
datatable(
    af_tags,
    class = "stripe",
    options = list(dom = "t"),
    rownames = FALSE
)
```

Some tags in this table are not associated with any function yet; they are
reserved for features that may be added in the future.

### VISPA2 stats specs

```{r echo=FALSE}
datatable(
    iss_tags,
    class = "stripe",
    options = list(dom = "t"),
    rownames = FALSE
)
```

NOTE: VISPA2 stats files usually follow standard naming conventions. 
If the pipeline launch was configured with default parameters, do not change
this lookup table.

## Customizing dynamic vars

For each category of dynamic vars there are 3 functions:

* A getter - returns the current lookup table
* A setter - allows the user to change the current lookup table
* A resetter - reverts all changes to defaults

Setters take the new variables as input, validate them and, if validation
succeeds, update the lookup table. If validation fails, an error is thrown
instead, inviting the user to review the inputs. Moreover, if some of the
critical tags for the category are missing, a warning appears with a list of
the missing ones.

Let's take a look at some examples.

On package loading, all lookup tables are set to default values. 
For example, for mandatory IS vars we have:

```{r}
mandatory_IS_vars(TRUE)
```

Let's suppose our matrices follow a different standard, and integration events
are characterized by 5 fields, like so (the example contains random data):

chrom   | position | strand | gap    | junction 
--------|----------|--------|--------|---------
"chr1"  | 342543   | "+"    | 100    | 50
...     | ...      | ...    | ...    | ...

To make this work with ISAnalytics functions, we need to compile the lookup
table like this:

```{r}
new_mand_vars <- tibble::tribble(
    ~names, ~types, ~transform, ~flag, ~tag,
    "chrom", "char", ~ stringr::str_replace_all(.x, "chr", ""), "required",
    "chromosome",
    "position", "int", NULL, "required", "locus",
    "strand", "char", NULL, "required", "is_strand",
    "gap", "int", NULL, "required", NA_character_,
    "junction", "int", NULL, "required", NA_character_
)
```

Notice that we have specified a transformation for the "chromosome" tag:
in this case we would like to have only the number of the chromosome without
the prefix "chr" - this lambda will get executed immediately after import.

To set the new variables simply do:

```{r}
set_mandatory_IS_vars(new_mand_vars)
mandatory_IS_vars(TRUE)
```

If you don't specify a critical tag, a warning message is displayed:

```{r}
new_mand_vars[1, ]$tag <- NA_character_
set_mandatory_IS_vars(new_mand_vars)
mandatory_IS_vars(TRUE)
```

If you change your mind and want to go back to defaults:

```{r}
reset_mandatory_IS_vars()
mandatory_IS_vars(TRUE)
```

The principle is the same for annotation IS vars, association file columns and
VISPA2 stats specs. Here is a summary of the functions for each:

* mandatory IS vars: `mandatory_IS_vars()`, `set_mandatory_IS_vars()`,
`reset_mandatory_IS_vars()`
* annotation IS vars: `annotation_IS_vars()`, `set_annotation_IS_vars()`,
`reset_annotation_IS_vars()`
* association file columns: `association_file_columns()`, 
`set_af_columns_def()`, `reset_af_columns_def()`
* VISPA2 stats specs: `iss_stats_specs()`, `set_iss_stats_specs()`,
`reset_iss_stats_specs()`

Matrix file suffixes work slightly differently:

```{r}
matrix_file_suffixes()
```

To change this lookup table use the function `set_matrix_file_suffixes()`:
it asks you to specify a suffix for each quantification and for
both the annotated and the non-annotated version. These suffixes are used by
the automated matrix import function when scanning the file system.

To reset all lookup tables to their default configurations you can also 
use the function `reset_dyn_vars_config()`, which reverts all changes.

## FAQs

### Do I have to do this every time the package loads?

No. If you frequently have to work with a non-standard settings profile,
you can use the functions `export_ISA_settings()` and `import_ISA_settings()`:
these functions allow the import/export of settings profiles in *.json format.

Once you have set your variables for the first time through the procedure
described above, simply call the export function and everything is saved to a
JSON file, which can then be imported at the start of the next workflow.
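
As a purely illustrative sketch (the file path is hypothetical and the exact
arguments may differ between versions, so check the function documentation):

```{r eval=FALSE}
## Save the current settings profile to disk (path is illustrative)
export_ISA_settings("~/isa_settings.json")
## ...later, in a fresh session, restore it
import_ISA_settings("~/isa_settings.json")
```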

# Reporting progress

From `ISAnalytics 1.7.4`, functions that make use of parallel workers or
process long tasks report progress via the functions offered by
[progressr](https://progressr.futureverse.org/). To enable progress bars
for all functions in ISAnalytics do

```{r eval=FALSE}
enable_progress_bars()
```

before calling other functions. 
For customizing the appearance of the progress bar please refer to `progressr`
documentation.
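
For example, a minimal customization using one of `progressr`'s built-in
handlers might look like this (the format string is just an illustration):

```{r eval=FALSE}
## Use the progress-based handler with a custom bar layout, then
## enable progress reporting for all ISAnalytics functions
progressr::handlers(progressr::handler_progress(
    format = ":spin :bar :percent :eta"
))
enable_progress_bars()
```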

# Introduction to `ISAnalytics` import functions family

In this section we explain in more detail how functions of the 
import family should be used and the most common workflows to follow.

## Designed to work with VISPA2 pipeline

The vast majority of the functions included in this package are designed to 
work in combination with the VISPA2 pipeline `r Citep(bib[["VISPA2"]])`. 
If you don't know what it is, we strongly 
recommend taking a look at these links:  

* Article: [VISPA2: Article](
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5702242/)  
* BitBucket Wiki: [VISPA2 Wiki](
https://bitbucket.org/andreacalabria/vispa2/wiki/Home)  

## File system structure generated

VISPA2 produces a standard file system structure starting from a folder you 
specify as your workbench or root. The structure always follows this schema:  

* root/
  * Optional intermediate folders
    * Projects (PROJECTID)
      * bam
      * bcmuxall
      * bed
      * iss
        * Pools (concatenatePoolIDSeqRun)
      * quality
      * quantification
        * Pools (concatenatePoolIDSeqRun)
      * report

Most of the functions implemented expect a standard file system structure
such as the one described above.

## Notation {#notation}

We call an *"integration matrix"* a tabular structure characterized by:

* k mandatory columns of genomic features that characterize a viral insertion
site in the genome, which are specified via `mandatory_IS_vars()`. By default
they're set to `chr`, `integration_locus` and `strand`
* an optional set of annotation columns, provided via `annotation_IS_vars()`. 
By default they're set to `GeneName` and `GeneStrand`
* A variable number n of sample columns containing the quantification
of the corresponding integration site

```{r echo=FALSE}
sample_sparse_matrix <- tibble::tribble(
    ~chr, ~integration_locus, ~strand, ~GeneName, ~GeneStrand,
    ~exp1, ~exp2, ~exp3,
    "1", 12324, "+", "NFATC3", "+", 4553, 5345, NA_integer_,
    "6", 657532, "+", "LOC100507487", "+", 76, 545, 5,
    "7", 657532, "+", "EDIL3", "-", NA_integer_, 56, NA_integer_,
)
print(sample_sparse_matrix, width = Inf)
```

The package uses a more compact form of these matrices, limiting the amount
of NA values and optimizing time and memory consumption. 
For more info on this take a look at: 
[Tidy data](https://r4ds.had.co.nz/tidy-data.html)

While integration matrices contain the actual data, we also need associated
sample metadata to perform the vast majority of the analyses. 
`ISAnalytics` expects the metadata to be contained in a so-called 
*"association file"*, which is a simple tabular file.

To generate a blank association file you can use the function 
`generate_blank_association_file()`. You can also view the standard 
column names with `association_file_columns()`.
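
For instance (the destination path below is illustrative):

```{r eval=FALSE}
## Writes an empty association file template to disk, ready to be
## filled in with any spreadsheet or text editor
generate_blank_association_file(path = fs::path(tempdir(), "blank_af.tsv"))
```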

## Importing metadata {#metadata}

To import metadata we use `import_association_file()`. This function is not
only responsible for reading the file into the R environment as a data frame,
but it is also capable of performing a file system alignment operation:
for each project and pool contained in the file, it scans
the file system, starting from the provided root, to check whether the
corresponding folders (contained in the appropriate column) can be found.
Remember that, to work properly, this operation expects a standard folder
structure, such as the one produced by VISPA2. This function also produces
an interactive HTML report.

```{r}
fs_path <- generate_default_folder_structure()
withr::with_options(list(ISAnalytics.reports = FALSE), code = {
    af <- import_association_file(fs_path$af, root = fs_path$root)
})
```

```{r echo=FALSE}
print(head(af))
```

### Function arguments

You can change several arguments in the function call to modify the 
behavior of the function.

* `root` (a sketch of the three modes follows this list)
  * Set it to `NULL` if you only want to import the association file without
  file system alignment. Beware that some of the automated import 
  functionalities won't work!
  * Set it to a non-empty string (a path on disk): in this case, 
  the column associated with the tag `proj_folder`
  (by default `PathToFolderProjectID`) in the file should contain 
  **relative** file paths, so if for example your root is set to "/home" and
  your project folder in the association file is set to "/PJ01", the function
  will check that the directory exists at "/home/PJ01"
  * Set it to an empty string: ideal if you want to store paths in the 
  association file as **absolute** file paths. In this case, if your project
  folder is in "/home/PJ01", you should have this path in the
  `PathToFolderProjectID` column and set `root = ""`
* `dates_format`: a string that is useful for properly parsing dates from 
tabular formats
* `separator`: the column separator used in the file. Defaults to "\\t",
other valid separators are "," (comma), ";" (semi-colon)
* `filter_for`: you can set this argument to a **named** list of filters, 
where names are column names. For example `list(ProjectID = "PJ01")` will
return only those rows whose attribute "ProjectID" equals "PJ01"
* `import_iss`: either `TRUE` or `FALSE`. If set to `TRUE`, performs
an internal call to `import_Vispa2_stats()` (see next section), and appends
the imported files to metadata
* `convert_tp`: either `TRUE` or `FALSE`. Converts the column containing
the time point expressed in days into months and years (with custom logic).
* `report_path`
  * Set it to `NULL` to avoid the production of a report
  * Set it to a folder (if it doesn't exist, it gets automatically created)
* `...`: additional named arguments to pass to `import_Vispa2_stats()` if
you chose to import VISPA2 stats

For further details view the dedicated function documentation.
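
As announced above, here is a sketch of the three `root` modes (all file
paths are hypothetical):

```{r eval=FALSE}
## 1. No file system alignment at all
af <- import_association_file("af.tsv", root = NULL)
## 2. Relative project paths in the file, anchored at the given root
af <- import_association_file("af.tsv", root = "/home")
## 3. Absolute project paths stored directly in the file
af <- import_association_file("af.tsv", root = "")
```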

*NOTE*: the function supports files in various formats, as long as the correct
separator is provided. It also accepts files in `*.xlsx` and `*.xls` formats,
but we do not recommend using these, since the report won't include a 
detailed summary of potential parsing problems.

The interactive report includes useful information such as

* General issues: parsing problems, missing columns, NA values in important 
columns etc. This allows you to immediately spot problems and correct them 
before proceeding with the analyses
* File system alignment issues: very useful to know if all data can be imported
or folders are missing
* Info on VISPA2 stats (if `import_iss` was `TRUE`)

## Importing VISPA2 stats files

VISPA2 automatically produces summary files for each pool holding
information that can be useful for other analyses downstream,
so it is recommended to import them in the first steps of the workflow.
To do that, you can use `import_VISPA2_stats`:

```{r results='hide'}
vispa_stats <- import_Vispa2_stats(
    association_file = af,
    join_with_af = FALSE,
    report_path = NULL
)
```

```{r echo=FALSE}
print(head(vispa_stats))
```

The function requires as input the imported and file-system-aligned
association file, and it will scan the `iss` folders for files that match some
known prefixes (defaults are already provided, but you can change them as you
see fit). You can either join the imported data frames with the
association file given in input, obtaining a single data frame, or keep the
stats separate; just set the parameter `join_with_af` accordingly.
At the end of the process an HTML report is produced, signaling potential
problems.

You can have this function called directly when importing the association file
by setting the `import_iss` argument of `import_association_file()` to `TRUE`,
as sketched below.
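
For example, reusing the paths from the earlier example:

```{r eval=FALSE}
## Import metadata and VISPA2 stats in a single step
## (report suppressed via report_path = NULL)
af_with_stats <- import_association_file(fs_path$af,
    root = fs_path$root,
    import_iss = TRUE,
    report_path = NULL
)
```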

## Importing a single integration matrix

If you want to import a single integration matrix you can do so by using the
`import_single_Vispa2Matrix()` function.
This function reads the file and converts it into a tidy structure: several
different formats can be read, since you can specify the column separator.

```{r message=FALSE, results='hide'}
matrix_path <- fs::path(
    fs_path$root,
    "PJ01",
    "quantification",
    "POOL01-1",
    "PJ01_POOL01-1_seqCount_matrix.no0.annotated.tsv.gz"
)
matrix <- import_single_Vispa2Matrix(matrix_path)
```

```{r echo=FALSE}
matrix
```

For details on usage and arguments view the dedicated function documentation.

## Automated integration matrices import

Integration matrices import can be automated when the association file
is imported with the file system alignment option.
`ISAnalytics` provides a function, `import_parallel_Vispa2Matrices()`,
that allows you to do just that in a fast and efficient way.

```{r}
withr::with_options(list(ISAnalytics.reports = FALSE), {
    matrices <- import_parallel_Vispa2Matrices(af,
        c("seqCount", "fragmentEstimate"),
        mode = "AUTO"
    )
})
```

## Function arguments

Let's see how the behavior of the function changes when we change arguments.

### `association_file` argument

You can supply either a data frame, imported via `import_association_file()` 
(see Section \@ref(metadata)), or a string (the path to the association file
on disk). In the first scenario, file system alignment must have been
performed, since the function scans the folders contained in the column 
`Path_quant`; in the second case you should also provide, as an additional
**named** argument (to `...`), an appropriate `root`: the function will 
internally call `import_association_file()`. If you don't have specific
needs, we recommend doing the two steps separately and providing the
association file as a data frame.

### `quantification_type` argument

For each pool there may be multiple available quantification types, that is,
different matrices containing the same samples
and the same genomic features but a different quantification.
A typical workflow uses `seqCount` and `fragmentEstimate`;
all the supported quantification types can be viewed with
`quantification_types()`.

### `matrix_type` argument

As mentioned in Section \@ref(notation), annotation columns are optional 
and may not be included in some matrices. This argument tells the function 
to look only for a specific type of matrix, either
`annotated` or `not_annotated`. 

File suffixes for matrices are specified via `matrix_file_suffixes()`.

### `workers` argument

Sets the number of parallel workers to use. The optimal value highly depends
on the hardware configuration of your machine.

### `multi_quant_matrix` argument

When importing more than one quantification at once, it can be very handy
to have all data in a single data frame rather than in several. If set to
`TRUE`, the function will internally call `comparison_matrix()` and produce a 
single data frame with a dedicated column for each quantification.
For example, for the matrices we've imported before:

```{r echo=FALSE}
print(head(matrices))
```

### `report_path` argument

Like other import functions, `import_parallel_Vispa2Matrices()` produces
an interactive report; use this argument to set the path where
the report should be saved.

### `mode` argument

Since `ISAnalytics 1.8.3` this argument can only be set to `AUTO`.

**What do you want to import?**  
In fully automated mode, the function will try to import everything that
is contained in the input association file. This means that if you need to
import only a specific set of projects/pools, you will need to filter the
association file accordingly prior to calling the function (you can easily
do that via the `filter_for` argument, as explained in Section \@ref(metadata)
and sketched below).
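
As a sketch, importing only the matrices of project "PJ01" could look like
this (reusing the example file system from before):

```{r eval=FALSE}
## Keep only metadata rows for project "PJ01", then import its matrices
af_pj01 <- import_association_file(fs_path$af,
    root = fs_path$root,
    filter_for = list(ProjectID = "PJ01")
)
matrices_pj01 <- import_parallel_Vispa2Matrices(af_pj01,
    c("seqCount", "fragmentEstimate"),
    mode = "AUTO"
)
```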

**How to deal with duplicates?**  
When scanning folders for files that match a given pattern (in our case the 
function looks for matrices that match the quantification type and the 
matrix type), it is quite possible that the same folder contains multiple files
for the same quantification. Of course this is not recommended - we suggest
moving the duplicated files to a sub-directory or removing them if they're not 
necessary - but in case this happens, you need to set two other arguments
(described in the next sub-sections) to "help" the function discriminate 
between duplicates. Please note that if such discrimination is not possible,
no files are imported.

### `patterns` argument

Providing a 
set of patterns (interpreted as regular expressions) helps the function 
choose between duplicated files, if any are found. If you're confident your
folders don't contain any duplicates, feel free to ignore this argument.

### `matching_opt` argument

This argument is relevant only if `patterns` 
isn't `NULL`. It tells the function how to match the given patterns if multiple
are supplied: `ALL` means keep only those files whose names match all the
given patterns, `ANY` means keep only those files whose names match any of the
given patterns, and `OPTIONAL` expresses a preference: try to find files that
match the patterns and, if none are found, return whatever is available.
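
For instance, a hypothetical run where duplicated matrices are disambiguated
by a substring of the file name (the pattern shown is purely illustrative):

```{r eval=FALSE}
## Prefer files whose name contains "2023-run"; fall back to whatever
## is found if no file matches
matrices <- import_parallel_Vispa2Matrices(af,
    c("seqCount", "fragmentEstimate"),
    mode = "AUTO",
    patterns = c("2023-run"),
    matching_opt = "OPTIONAL"
)
```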

### `...` argument

Additional named arguments to supply to `comparison_matrix()` and 
`import_single_Vispa2Matrix()`.

## Notes

Earlier versions of the package featured two separate functions, 
`import_parallel_Vispa2Matrices_auto()` and 
`import_parallel_Vispa2Matrices_interactive()`. Those functions are now
officially deprecated (since `ISAnalytics 1.3.3`) and will become defunct in
the next release cycle.

# Data cleaning and pre-processing

This section goes more in detail on some data cleaning and pre-processing 
operations you can perform with this package.

ISAnalytics offers several different functions for cleaning and pre-processing
your data.

* Recalibration: identifies integration events that are near to each other
and condenses them into a single event whenever appropriate -
`compute_near_integrations()`
* Outliers identification and removal: identifies samples that are considered
outliers according to user-defined logic and filters them out - 
`outlier_filter()`
* Collision removal: identifies collision events between independent samples -
`remove_collisions()`
* Filter based on cell lineage purity: identifies and removes contamination
between different cell types - `purity_filter()`
* Data and metadata aggregation: allows the union of biological samples from
single PCR replicates or other arbitrary aggregations -
`aggregate_values_by_key()`, `aggregate_metadata()`

## Removing collisions

In this section we illustrate the functions dedicated to collision removal.

### What is a collision and why should you care?

We're not going into too much detail here, but we're going to explain in a
very simple way what a "collision" is and how the function in this package
deals with them.

We say that an integration (that is, a unique combination of 
`mandatory_IS_vars()`) is a *collision* if this combination is shared
between different independent samples, where an independent sample is a unique
combination of metadata fields specified by the user. 
The reasoning behind this is that it is highly improbable to observe
the very same integration in two different independent samples,
so this phenomenon might
be an indicator of some kind of contamination in the sequencing or PCR phase;
for this reason we might want to exclude such integrations from
our analysis.
`ISAnalytics` provides a function that processes the imported data for the
removal or reassignment of these "problematic" integrations,
`remove_collisions()`.

The processing is done using the sequence count value, so the corresponding
matrix is needed for this operation.

### The logic behind the function

The `remove_collisions()` function follows several logical
steps to decide whether
an integration is a collision and, if it is, whether to re-assign it or
remove it entirely, based on different criteria.

#### Identifying the collisions

The function uses the information stored
in the association file to assess which independent samples are present and
counts the number of independent samples for each integration: those with a
count > 1 are considered collisions.

```{r echo=FALSE}
library(ISAnalytics)
ex_coll <- tibble::tribble(
    ~chr, ~integration_locus, ~strand, ~seqCount, ~CompleteAmplificationID,
    ~SubjectID, ~ProjectID,
    "1", 123454, "+", 653, "SAMPLE1", "SUBJ01", "PJ01",
    "1", 123454, "+", 456, "SAMPLE2", "SUBJ02", "PJ01"
)
knitr::kable(ex_coll, caption = paste(
    "Example of collisions: the same",
    "integration (1, 123454, +) is found",
    "in 2 different independent samples",
    "((SUBJ01, PJ01) & (SUBJ02, PJ01))"
))
```

#### Re-assign vs remove

Once the collisions are identified, the function follows 3 steps where it tries
to re-assign the combination to a single independent sample.
The criteria are:

1. Compare dates: if an absolute ordering on dates is possible, the
integration is re-assigned to the sample with the earliest date. If two
samples share the same date it's impossible to decide, so the next criterion is
tested
2. Compare replicate numbers: if a sample has the same integration in more than
one replicate, it is more likely that the integration is not an artifact. If an
absolute ordering is possible, the collision is re-assigned to the
sample whose group of replicates is largest
3. Compare the sequence count values: if the previous criterion wasn't
sufficient to make a decision, the sum of the sequence count values is
evaluated for each group of independent samples, and these cumulative values
are compared across groups. If a single group has a value n times bigger than
all the other groups, it is chosen for re-assignment. The factor n is passed
as a parameter of the function (`reads_ratio`); the default value is 10.

If none of the criteria were sufficient to make a decision, the integration
is simply removed from the matrix.

### Usage

```{r}
data("integration_matrices", package = "ISAnalytics")
data("association_file", package = "ISAnalytics")
## Multi quantification matrix
no_coll <- remove_collisions(
    x = integration_matrices,
    association_file = association_file,
    report_path = NULL
)
## Matrix list
separated <- separate_quant_matrices(integration_matrices)
no_coll_list <- remove_collisions(
    x = separated,
    association_file = association_file,
    report_path = NULL
)
## Only sequence count
no_coll_single <- remove_collisions(
    x = separated$seqCount,
    association_file = association_file,
    quant_cols = c(seqCount = "Value"),
    report_path = NULL
)
```
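
The `reads_ratio` parameter described earlier can be tuned in the same call;
for example, a stricter sketch:

```{r eval=FALSE}
## Require a group to have 20x the cumulative sequence count of the
## other groups to win the re-assignment (the default ratio is 10)
no_coll_strict <- remove_collisions(
    x = integration_matrices,
    association_file = association_file,
    reads_ratio = 20,
    report_path = NULL
)
```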

Important notes on the association file:

* Make sure your association file is properly filled out. The function
requires you to specify a date column (by default "SequencingDate"); you must
ensure this column contains neither NA values nor incorrect values.

The function accepts different inputs, namely:

* A multi-quantification matrix: this is always
the recommended approach
* A named list of matrices where names are quantification types in
`quantification_types()`
* The single sequence count matrix: this is not the recommended approach
since it requires a realignment step for other quantification matrices if 
you have them.

If the option `ISAnalytics.reports` is active, an interactive report in 
HTML format will be produced at the specified path.

### Re-align other matrices

If you've given the standalone sequence count matrix as input to
`remove_collisions()`, you can realign the other matrices by calling the
function `realign_after_collisions()`, passing as input the
processed sequence count matrix and the named list of other matrices
to realign.
**NOTE: the names in the list must be quantification types.**

```{r realign}
other_realigned <- realign_after_collisions(
    sc_matrix = no_coll_single,
    other_matrices = list(fragmentEstimate = separated$fragmentEstimate)
)
```

## Performing data and metadata aggregation

In this section we're going to explain in detail how to use functions of the 
aggregate family, namely:

1. `aggregate_metadata()`
2. `aggregate_values_by_key()`

### Aggregating metadata

We refer to the information contained in the association file as "metadata": 
sometimes it's useful to obtain collective information based on a certain 
group of variables we're interested in. The function `aggregate_metadata()` 
does just that: according to the grouping variables, meaning the names of 
the columns in the association file to perform a `group_by` operation with, 
it creates a summary. You can fully customize the summary by providing a 
"function table" that tells the function which operation should be 
applied to which column and what name to give to the output column.
A default is already supplied:

```{r echo=FALSE}
library(ISAnalytics)
print(default_meta_agg(), width = Inf)
```

You can either provide purrr-style lambdas (as in the example above),
or simply specify the name of the function and additional parameters as a 
list in a separate column. If you choose to provide your own table you
should maintain the column names for the function to work properly.
For more details take a look at the function documentation: 
`?default_meta_agg`.
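
As a sketch, a hypothetical custom table (assuming the same column layout
printed by `default_meta_agg()` and the `aggregating_functions` argument
documented in `?aggregate_metadata`) might look like this:

```{r eval=FALSE}
## Count distinct pools and keep the latest sequencing date per group;
## the column layout mirrors default_meta_agg() (check the docs to confirm)
custom_agg <- tibble::tibble(
    Column = c("PoolID", "SequencingDate"),
    Function = list(
        ~ dplyr::n_distinct(.x),
        ~ suppressWarnings(max(.x, na.rm = TRUE))
    ),
    Args = list(NULL, NULL),
    Output_colname = c("{.col}_distinct", "{.col}_latest")
)
aggregated_custom <- aggregate_metadata(
    association_file = association_file,
    aggregating_functions = custom_agg
)
```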

#### Typical workflow

1. Import the association file via `import_association_file()`. If you need 
more information on import functions, see Section \@ref(metadata).
2. Perform aggregation
2. Perform aggregation

```{r}
data("association_file", package = "ISAnalytics")
aggregated_meta <- aggregate_metadata(association_file = association_file)
```

```{r echo=FALSE}
print(aggregated_meta)
```

### Aggregation of values by key

`ISAnalytics` contains useful functions to aggregate the values contained in 
your imported matrices based on a key, that is, a single column or a
combination of columns contained in the association file that are related to
the samples.

#### Typical workflow

1. Import your association file
2. Import integration matrices via `import_parallel_Vispa2Matrices()`
3. Perform aggregation

```{r}
data("integration_matrices", package = "ISAnalytics")
data("association_file", package = "ISAnalytics")
aggreg <- aggregate_values_by_key(
    x = integration_matrices,
    association_file = association_file,
    value_cols = c("seqCount", "fragmentEstimate")
)
```

```{r echo=FALSE}
print(aggreg, width = Inf)
```

The function `aggregate_values_by_key()` can perform the aggregation both on a 
list of matrices and on a single matrix.

##### Changing parameters to obtain different results

The function has several parameters with default values that 
can be changed according to user preference.

1. **Changing the `key` value**  
You can change the value of the parameter key as you see fit. This parameter 
should contain one or multiple columns of the association file that you want 
to include in the grouping when performing the aggregation. 
The default value is set to `c("SubjectID", "CellMarker",
"Tissue", "TimePoint")`
(same default key as the `aggregate_metadata` 
function).

```{r}
agg1 <- aggregate_values_by_key(
    x = integration_matrices,
    association_file = association_file,
    key = c("SubjectID", "ProjectID"),
    value_cols = c("seqCount", "fragmentEstimate")
)
```

```{r echo=FALSE}
print(agg1, width = Inf)
```

2. **Changing the `lambda` value**  
The `lambda` parameter indicates the function(s) to be applied to the 
values for aggregation. 
`lambda` must be a named list of either functions or purrr-style lambdas:
if you would like to specify additional parameters to the function 
the second option is recommended.
The only important note on functions is that they should perform some kind of 
aggregation on numeric values: this means in practical terms they need
to accept a vector of numeric/integer values as input and produce a 
SINGLE value as output. Valid options for this purpose might be: `sum`, `mean`, 
`median`, `min`, `max` and so on.

```{r}
agg2 <- aggregate_values_by_key(
    x = integration_matrices,
    association_file = association_file,
    key = "SubjectID",
    lambda = list(mean = ~ mean(.x, na.rm = TRUE)),
    value_cols = c("seqCount", "fragmentEstimate")
)
```

```{r echo=FALSE}
print(agg2, width = Inf)
```

Note that, when specifying purrr-style lambdas (formulas), the first 
parameter needs to be set to `.x`, other parameters can be set as usual.

You can also use functions in `lambda` that produce data frames or lists.
In this case, all variables from the produced data frame will be included
in the final data frame. For example:

```{r}
agg3 <- aggregate_values_by_key(
    x = integration_matrices,
    association_file = association_file,
    key = "SubjectID",
    lambda = list(describe = ~ list(psych::describe(.x))),
    value_cols = c("seqCount", "fragmentEstimate")
)
```

```{r echo=FALSE}
print(agg3, width = Inf)
```

3. **Changing the `value_cols` value**  
The `value_cols` parameter tells the function which numeric columns 
of `x` the functions should be applied to. 
Note that every function contained in `lambda` will be applied to every
column in `value_cols`: resulting columns will be named 
`<original column name>_<function name>`.

```{r}
agg4 <- aggregate_values_by_key(
    x = integration_matrices,
    association_file = association_file,
    key = "SubjectID",
    lambda = list(sum = sum, mean = mean),
    value_cols = c("seqCount", "fragmentEstimate")
)
```

```{r echo=FALSE}
print(agg4, width = Inf)
```

4. **Changing the `group` value**  
The `group` parameter should contain all other variables to include in the 
grouping besides `key`. By default this contains 
`c("chr", "integration_locus","strand", "GeneName", "GeneStrand")`. 
You can change this grouping as you see 
fit; if you don't want to add any other variable to the key, just set it to 
`NULL`.

```{r}
agg5 <- aggregate_values_by_key(
    x = integration_matrices,
    association_file = association_file,
    key = "SubjectID",
    lambda = list(sum = sum, mean = mean),
    group = c(mandatory_IS_vars()),
    value_cols = c("seqCount", "fragmentEstimate")
)
```

```{r echo=FALSE}
print(agg5, width = Inf)
```

# Analysis use-case example: shared integration sites

An integration site is always characterized by a triple of values:
`(chr, integration_locus, strand)`, hence these attributes are always present
in integration matrices.

We can aggregate our data in different ways according to our needs, obtaining
therefore different groups. Each group has an associated set of integration
sites.

NOTE: aggregating data is not mandatory, since sharing functions in ISAnalytics
only count distinct integration sites and do not require any quantification.
The only important thing is that columns that are included in the specified key
are also included in the input matrices.


```{r}
## Aggregation by standard key
agg <- aggregate_values_by_key(integration_matrices,
    association_file,
    value_cols = c("seqCount", "fragmentEstimate")
)
agg <- agg |> dplyr::filter(TimePoint %in% c("0030", "0060"))
```

```{r echo=FALSE}
print(agg, width = Inf)
```

An integration site is *shared* between two or more groups if the same triple
is observed in all the groups considered.

## Automated sharing counts

ISAnalytics provides the function `is_sharing()` for computing automated 
sharing counts. The function has several arguments that can be tuned according
to user needs.

### SCENARIO 1: single input data frame and single grouping key

```{r}
sharing_1 <- is_sharing(agg,
    group_key = c(
        "SubjectID", "CellMarker",
        "Tissue", "TimePoint"
    ),
    n_comp = 2,
    is_count = TRUE,
    relative_is_sharing = TRUE,
    minimal = TRUE,
    include_self_comp = FALSE,
    keep_genomic_coord = TRUE
)
sharing_1
```

In this configuration we set:

* A single input data frame: `agg`
* A single grouping key by setting the argument `group_key`. In this
specific case, our groups will be identified by a unique combination of
`SubjectID`, `CellMarker`, `Tissue` and `TimePoint`
* `n_comp` represents the number of comparisons to compute: 2 means we're 
interested in knowing the sharing for PAIRS of distinct groups
* We want to keep the counts of distinct integration sites for each group
by setting `is_count` to `TRUE`
* `relative_is_sharing`: if set to `TRUE`, adds sharing expressed as a
percentage. More precisely, it adds a column `on_g1`, calculated as the 
absolute number of shared integrations divided by the cardinality of the
first group; `on_g2` is analogous but computed on the cardinality of the
second group; and finally `on_union` is computed on the cardinality 
of the union of the two groups.
* By setting the argument `minimal` to `TRUE` we tell the function to avoid
redundant comparisons: in this way only **combinations** and not permutations
are included in the output table
* `include_self_comp` adds rows where a group is compared with itself: these
rows always show 100% sharing. There are
few scenarios where this is useful, but for now we set it to `FALSE` since
we don't need it
* `keep_genomic_coord` allows us to keep the genomic coordinates of the 
shared integration sites as a separate table
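
With `keep_genomic_coord = TRUE` the coordinates are stored as a nested table;
a sketch for inspecting them (the nested column name `is_coord` is an
assumption - inspect the output of `is_sharing()` to confirm):

```{r eval=FALSE}
## Genomic coordinates of the sites shared in the first comparison
## (column name is an assumption, check the actual output)
sharing_1$is_coord[[1]]
```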

#### Changing the number of comparisons

```{r}
sharing_1_a <- is_sharing(agg,
    group_key = c(
        "SubjectID", "CellMarker",
        "Tissue", "TimePoint"
    ),
    n_comp = 3,
    is_count = TRUE,
    relative_is_sharing = TRUE,
    minimal = TRUE,
    include_self_comp = FALSE,
    keep_genomic_coord = TRUE
)
sharing_1_a
```

Changing the `n_comp` to 3 means that we want to calculate the sharing between
3 different groups. Note that the `shared` column contains the counts of 
integrations that are shared by **ALL** groups, which is equivalent to 
a set intersection.

Beware that the more comparisons are requested, the longer the
computation takes.

#### A case when it is useful to set `minimal = FALSE` 

Setting `minimal = FALSE` produces all possible permutations of the groups
and the corresponding values. In combination with `include_self_comp = TRUE`,
this is useful when we want to know the sharing between pairs of groups and
plot results as a heatmap.

```{r}
sharing_1_b <- is_sharing(agg,
    group_key = c(
        "SubjectID", "CellMarker",
        "Tissue", "TimePoint"
    ),
    n_comp = 2,
    is_count = TRUE,
    relative_is_sharing = TRUE,
    minimal = FALSE,
    include_self_comp = TRUE
)
sharing_1_b
heatmaps <- sharing_heatmap(sharing_1_b)
```

The function `sharing_heatmap()` automatically plots sharing between 2 groups.
There are several arguments to this function that allow us to obtain heatmaps
for the absolute sharing values or the relative (percentage) values.

```{r}
heatmaps$absolute
heatmaps$on_g1
heatmaps$on_union
```

Beware that calculating all permutations takes longer than 
calculating only distinct combinations; therefore, if you don't need your
results plotted, or you have more than 2 groups to compare, you should stick 
with `minimal = TRUE` and `include_self_comp = FALSE`.

### SCENARIO 2: single input data frame and multiple grouping keys

In the first scenario, groups were homogeneous, that is, they were all grouped
with the same key. In this other scenario we want to start with data contained
in just one data frame but we want to compare sets of integrations that are
grouped differently. To do this we give as input a **list of keys** through
the argument `group_keys`.

```{r}
sharing_2 <- is_sharing(agg,
    group_keys = list(
        g1 = c(
            "SubjectID", "CellMarker",
            "Tissue", "TimePoint"
        ),
        g2 = c("SubjectID", "CellMarker"),
        g3 = c("CellMarker", "Tissue")
    )
)
sharing_2
```

There are a few things to keep in mind in this case:

* The arguments `group_key` (notice the absence of plural), 
`n_comp` and `include_self_comp` are ignored: the number of comparisons is 
automatically detected by counting the provided keys and a self comparison
doesn't make sense since group labels are different
* If you provide a list of identical keys or just one key
you fall back to scenario 1

### SCENARIO 3: multiple input data frames and single grouping key

Providing multiple input data frames and the same grouping key is an effective
way to reduce the number of comparisons performed.
Let's make an example: suppose we're interested in comparing groups labelled
by a unique combination of `SubjectID`, `CellMarker`, `Tissue` and `TimePoint`,
but this time we want the first group to contain only integrations relative to
`PT001_MNC_BM_0030` and the second group to contain only integrations 
relative to `PT001_MNC_BM_0060`.

We're going to filter the original data frame in order to obtain only the
relevant data in two separate tables, and then proceed by calling the function.

```{r}
first_sample <- agg |>
    dplyr::filter(
        SubjectID == "PT001", CellMarker == "MNC", Tissue == "BM",
        TimePoint == "0030"
    )
second_sample <- agg |>
    dplyr::filter(
        SubjectID == "PT001", CellMarker == "MNC", Tissue == "BM",
        TimePoint == "0060"
    )
sharing_3 <- is_sharing(first_sample, second_sample,
    group_key = c(
        "SubjectID", "CellMarker",
        "Tissue", "TimePoint"
    ),
    is_count = TRUE,
    relative_is_sharing = TRUE,
    minimal = TRUE
)
sharing_3
```

Once again the arguments `n_comp` and `include_self_comp` are ignored:
the number of comparisons is equal to the number of data frames in input.

To handle edge cases, the output group IDs are appended with a dash
and a number (`-1`, `-2`, ...) that indicates the data frame of origin: this
is useful when group IDs are duplicated across the inputs. To understand
this better, let's look at an example:

```{r}
sharing_3_a <- is_sharing(
    first_sample, second_sample,
    group_key = c(
        "CellMarker", "Tissue"
    ),
    is_count = TRUE,
    relative_is_sharing = TRUE,
    minimal = FALSE
)
sharing_3_a
```

As you can see, the IDs of group 1 and group 2 are duplicated and without
a suffix it would be impossible to know which one came from which data frame.
In this way we know that the group "MNC_BM-1" comes from table 1 and has 54 ISs,
while "MNC_BM-2" comes from the second input table and has 114 ISs.


### SCENARIO 4: multiple input data frames and multiple grouping keys

Finally, the most general scenario is when we have multiple data frames
with multiple keys. In this case the number of data frames must be equal to
the number of provided keys, and grouping keys are applied in order
(data frame 1 is grouped with key 1, data frame 2 is grouped with key 2, and
so on).

```{r}
df1 <- agg |>
    dplyr::filter(TimePoint == "0030")
df2 <- agg |>
    dplyr::filter(TimePoint == "0060")
df3 <- agg |>
    dplyr::filter(Tissue == "BM")

keys <- list(
    g1 = c("SubjectID", "CellMarker", "Tissue"),
    g2 = c("SubjectID", "Tissue"),
    g3 = c("SubjectID", "CellMarker", "Tissue")
)

sharing_4 <- is_sharing(df1, df2, df3, group_keys = keys)
sharing_4
```

Notice that in this example the keys for g1 and g3 are the same, meaning the
labels of the groups are identical; however, you should remember that
the groups contain a **different set of integration sites**, since they come
from different data frames.

## Plotting sharing results

When we have more than 2 comparisons, it is convenient to plot them as Venn or
Euler diagrams. ISAnalytics offers a fast way to do that through the functions
`is_sharing()` and `sharing_venn()`.

```{r}
sharing_5 <- is_sharing(agg,
    group_keys = list(
        g1 = c(
            "SubjectID", "CellMarker",
            "Tissue", "TimePoint"
        ),
        g2 = c("SubjectID", "CellMarker"),
        g3 = c("CellMarker", "Tissue")
    ), table_for_venn = TRUE
)
sharing_5
```

The argument `table_for_venn = TRUE` will add a new column `truth_tbl_venn`
that contains corresponding truth tables for each row.

```{r}
sharing_plots1 <- sharing_venn(sharing_5, row_range = 1, euler = TRUE)
sharing_plots2 <- sharing_venn(sharing_5, row_range = 1, euler = FALSE)
```

Say we're interested in plotting just the first row of our sharing data 
frame: we can call the function `sharing_venn()` and set the 
`row_range` argument to the index 1. Note that this function requires the
package `r CRANpkg("eulerr")` to work. The argument `euler` indicates whether
the function should produce Euler (`TRUE`) or Venn (`FALSE`) diagrams.

Once we have obtained the lists of Euler/Venn objects, we can plot them by 
simply calling the function `plot()`:

```{r}
plot(sharing_plots1[[1]])
plot(sharing_plots2[[1]])
```

There are several options that can be set, for this please refer to 
[eulerr docs](https://jolars.github.io/eulerr/reference/plot.euler.html).

# Reproducibility

`R` session information.

```{r reproduce3, echo=FALSE}
## Session info
library("sessioninfo")
options(width = 120)
session_info()
```

# Bibliography

This vignette was generated using `r Biocpkg("BiocStyle")` `r Citep(bib[["BiocStyle"]])`
with `r CRANpkg("knitr")` `r Citep(bib[["knitr"]])` and `r CRANpkg("rmarkdown")` `r Citep(bib[["rmarkdown"]])` running behind the scenes.

Citations made with `r CRANpkg("RefManageR")` `r Citep(bib[["RefManageR"]])`.

```{r vignetteBiblio, results = "asis", echo = FALSE, warning = FALSE, message = FALSE}
## Print bibliography
PrintBibliography(bib, .opts = list(hyperlink = "to.doc", style = "html"))
```