---
title: "Getting started with the plyranges package"
author: "Stuart Lee"
package: plyranges
date: "`r Sys.Date()`"
output:
  BiocStyle::html_document:
    toc_float: true
vignette: >
  %\VignetteIndexEntry{Introduction}
  %\VignetteEncoding{UTF-8}
  %\VignetteEngine{knitr::rmarkdown}
---

# `Ranges` revisited

In Bioconductor there are two classes, `IRanges` and `GRanges`, 
that are standard data structures for representing genomics data. 
Throughout this document I refer to either of these classes as `Ranges` if an
operation can be performed on either class, otherwise I explicitly mention if
a function is appropriate for an `IRanges` or `GRanges`.

`Ranges` objects can either represent sets of integers as `IRanges`
(which have start, end and width attributes)
or represent genomic intervals (which have additional attributes, sequence name,
and strand) as `GRanges`. In addition, both types of `Ranges` can store information
about their intervals as metadata columns (for example GC content 
over a genomic interval). 

`Ranges` objects follow the tidy data principle: each row
of a `Ranges` object corresponds to an interval, while each column will represent
a variable about that interval, and generally each object will represent a
single unit of observation (like gene annotations). 

Consequently, `Ranges` objects provide a powerful representation for reasoning
about genomic data. In this vignette, you will learn more about `Ranges` objects
and how via grouping, restriction and summarisation you can perform common data
tasks. 


# Constructing `Ranges`

To construct an `IRanges` we require that there are at least two columns that represent at either a 
starting coordinate, finishing coordinate or the width of the interval. 

```{r}
suppressPackageStartupMessages(library(plyranges))
set.seed(100)
df <- data.frame(start=c(2:-1, 13:15), 
                 width=c(0:3, 2:0))

# produces IRanges
rng <- df %>% as_iranges()
rng

```

To construct a `GRanges` we require a column that represents that sequence name (
contig or chromosome id), and an optional column to represent the strandedness
of an interval. 

```{r}
# seqname is required for GRanges, metadata is automatically kept
grng <- df %>% 
  transform(seqnames = sample(c("chr1", "chr2"), 7, replace = TRUE),
         strand = sample(c("+", "-"), 7, replace = TRUE),
         gc = runif(7)) %>% 
  as_granges()

grng
```

# Arithmetic on Ranges

Sometimes you want to modify a genomic interval by altering the width of the 
interval while leaving the start, end or midpoint of the coordinates unaltered. 
This is achieved with the `mutate` verb along with `anchor_*` adverbs. 

The act of anchoring fixes either the start, end, center coordinates of the 
`Range` object, as shown in the figure and code below and anchors are used in
combination with either `mutate` or `stretch`.

```{r, echo = FALSE, out.width="400px", fig.align="center"}
knitr::include_graphics("anchors.png", dpi = 150)
```

```{r}
rng <- as_iranges(data.frame(start=c(1, 2, 3), end=c(5, 2, 8)))
grng <- as_granges(data.frame(start=c(1, 2, 3), end=c(5, 2, 8), 
                          seqnames = "seq1",
                          strand = c("+", "*", "-")))
mutate(rng, width = 10)
mutate(anchor_start(rng), width = 10)
mutate(anchor_end(rng), width = 10)
mutate(anchor_center(rng), width = 10)
mutate(anchor_3p(grng), width = 10) # leave negative strand fixed
mutate(anchor_5p(grng), width = 10) # leave positive strand fixed
```

Similarly, you can modify the width of an interval using the `stretch` verb.
Without anchoring, this function will extend the interval in either direction
by an integer amount. With anchoring, either the start, end or midpoint are
preserved. 

```{r}
rng2 <- stretch(anchor_center(rng), 10)
rng2
stretch(anchor_end(rng2), 10)
stretch(anchor_start(rng2), 10)
stretch(anchor_3p(grng), 10)
stretch(anchor_5p(grng), 10)
```

`Ranges` can be shifted left or right. If strand information is available 
we can also shift upstream or downstream. 

```{r}
shift_left(rng, 10)
shift_right(rng, 10)
shift_upstream(grng, 10)
shift_downstream(grng, 10)
```

# Grouping `Ranges`

`plyranges` introduces a new class of `Ranges` called `RangesGrouped`, 
this is a similar idea to the grouped `data.frame\tibble` in `dplyr`. 

Grouping can act on either the core components or the metadata columns of
a `Ranges` object.

It is most effective when combined with other verbs such as `mutate()`,
`summarise()`, `filter()`, `reduce_ranges()` or `disjoin_ranges()`.
```{r}
grng <- data.frame(seqnames = sample(c("chr1", "chr2"), 7, replace = TRUE),
         strand = sample(c("+", "-"), 7, replace = TRUE),
         gc = runif(7),
         start = 1:7,
         width = 10) %>%
  as_granges()

grng_by_strand <- grng %>%
  group_by(strand)

grng_by_strand
```

# Restricting `Ranges`

The verb `filter` can be used to restrict rows in the `Ranges`.
Note that grouping will cause the `filter` to act within each group 
of the data.
```{r}
grng %>% filter(gc < 0.3)
# filtering by group 
grng_by_strand %>% filter(gc == max(gc)) 
```

We also provide the convenience methods `filter_by_overlaps` and 
`filter_by_non_overlaps` for restricting by any overlapping `Ranges`.

```{r}
ir0 <- data.frame(start = c(5,10, 15,20), width = 5) %>%
  as_iranges()
ir1 <- data.frame(start = 2:6, width = 3:7) %>%
  as_iranges()
ir0
ir1
ir0 %>% filter_by_overlaps(ir1)
ir0 %>% filter_by_non_overlaps(ir1) 
```

# Summarising `Ranges`

The `summarise` function will return a `DataFrame` because the information required
to return a `Ranges` object is lost. It is often most useful to use `summarise()`
in combination with the `group_by()` family of functions. 

```{r}
ir1 <- ir1 %>%
  mutate(gc = runif(length(.)))

ir0 %>% 
  group_by_overlaps(ir1) %>%
  summarise(gc = mean(gc))
```

# Joins, or another way at looking at overlaps between `Ranges`

A join acts on two GRanges objects, a query and a subject.

```{r}
query <- data.frame(seqnames = "chr1",
               strand = c("+", "-"),
               start = c(1, 9), 
               end =  c(7, 10),
               key.a = letters[1:2]) %>% 
  as_granges()

subject <- data.frame(seqnames = "chr1",
               strand = c("-", "+"),
               start = c(2, 6), 
               end = c(4, 8),
               key.b = LETTERS[1:2]) %>% 
  as_granges()
```

```{r, echo = FALSE, fig.cap = "Query and Subject Ranges", message = FALSE}
library(ggplot2)
query_df <- as.data.frame(query)[, -6]
query_df$key <- "Query"
subject_df <- as.data.frame(subject)[, -6]
subject_df$key <- "Subject"
melted_ranges <- rbind(query_df, subject_df)
ggplot(melted_ranges, aes(xmin = start, xmax = end, ymin = 1, ymax = 3)) +
  geom_rect() + 
  facet_grid(key ~ .) +
  scale_x_continuous(breaks = seq(1, 10, by = 1)) +
  xlab("Position") +
  theme(axis.text.y = element_blank(),
        axis.ticks.y = element_blank(),
        axis.title.y = element_blank())
```

The join operator is relational in the sense that metadata 
from the query and subject ranges is retained in the joined range.  All join operators in the `plyranges` DSL generate a set of hits based on overlap or proximity of ranges and use those hits to merge the two 
datasets in different ways. 
There are four supported matching algorithms: _overlap_,
_nearest_, _precede_, and _follow_. We can further restrict the
matching by whether the query is completely _within_ the subject, and
adding the _directed_ suffix ensures that matching ranges have the
same direction (strand).


```{r echo = FALSE, out.width="600px"}
knitr::include_graphics("olaps.png")
```

The first function, `join_overlap_intersect()` will return a `Ranges` object where the start, end, and width coordinates correspond to the amount of any overlap between the left and right input `Ranges`. It also returns any metadatain the subject range if the subject overlaps the query.

```{r}
intersect_rng <- join_overlap_intersect(query, subject)
intersect_rng
```

```{r echo = FALSE, fig.cap="Intersect Join"}
intersect_df <- as.data.frame(intersect_rng)[, -c(6,7)]
intersect_df$key <- "Intersect Join"
melted_ranges <- rbind(query_df, subject_df, intersect_df)
melted_ranges$key <- factor(melted_ranges$key, 
                              levels = c("Query", "Subject", "Intersect Join"))
ggplot(melted_ranges, aes(xmin = start, xmax = end, ymin = 1, ymax = 3)) +
  geom_rect() + 
  facet_grid(key ~ .) +
  scale_x_continuous(breaks = seq(1, 10, by = 1)) +
  xlab("Position") +
  theme(axis.text.y = element_blank(),
        axis.ticks.y = element_blank(),
        axis.title.y = element_blank())
```


The `join_overlap_inner()` function will return the `Ranges` in the query that
overlap any `Ranges` in the subject. Like the `join_overlap_intersect()` function
metadata of the subject `Range` is returned if it overlaps the query. 

```{r}
inner_rng <- join_overlap_inner(query, subject)
inner_rng
```


```{r, echo = FALSE, fig.cap = "Inner Join", message = FALSE}
inner_df <- as.data.frame(inner_rng)[, -c(6,7)]
inner_df$ymin <- c(1,4)
inner_df$ymax <- c(3,6)
inner_df$key <- "Inner Join"
melted_ranges <- rbind(query_df, subject_df)
melted_ranges$ymin <- 1
melted_ranges$ymax <- 3
melted_ranges <- rbind(melted_ranges, inner_df)
melted_ranges$key <- factor(melted_ranges$key, 
                              levels = c("Query", "Subject", "Inner Join"))

ggplot(melted_ranges, aes(xmin = start, xmax = end, ymin = ymin, ymax = ymax)) +
  geom_rect() + 
  facet_grid(key ~ ., scales = "free_y") +
  scale_x_continuous(breaks = seq(1, 10, by = 1)) +
  xlab("Position") +
  theme(axis.text.y = element_blank(),
        axis.ticks.y = element_blank(),
        axis.title.y = element_blank())
```

We also provide a convenience method called `find_overlaps` that computes the same result as `join_overlap_inner()`.

```{r}
find_overlaps(query, subject)
```

The `join_overlap_left()` method will perform an outer left join. 


First any overlaps that are found will be returned similar to `join_overlap_inner()`. Then any non-overlapping ranges will be returned, with missing values on the metadata columns. 

```{r}
left_rng <- join_overlap_left(query, subject)
left_rng
```

```{r, echo = FALSE, fig.cap = "Left Join", message = FALSE}
left_df <- as.data.frame(left_rng)[, -c(6,7)]
left_df$ymin <- c(1,4, 1)
left_df$ymax <- c(3,6, 3)
left_df$key <- "Left Join"
melted_ranges <- rbind(query_df, subject_df)
melted_ranges$ymin <- 1
melted_ranges$ymax <- 3
melted_ranges <- rbind(melted_ranges, left_df)
melted_ranges$key <- factor(melted_ranges$key, 
                              levels = c("Query", "Subject", "Left Join"))

ggplot(melted_ranges, aes(xmin = start, xmax = end, ymin = ymin, ymax = ymax)) +
  geom_rect() + 
  facet_grid(key ~ ., scales = "free_y") +
  scale_x_continuous(breaks = seq(1, 10, by = 1)) +
  xlab("Position") +
  theme(axis.text.y = element_blank(),
        axis.ticks.y = element_blank(),
        axis.title.y = element_blank())
```

Compared with `filter_by_overlaps()` above, the overlap left join expands the
`Ranges` to give information about each interval on the query `Ranges` that 
overlap those on the subject `Ranges` as well as the intervals on the left that do not overlap any range on the right.

## Finding your neighbours
We also provide methods for finding nearest, preceding or following `Ranges`. Conceptually this is identical to our approach for finding
overlaps, except the semantics of the join are different.

```{r echo = FALSE, out.width="600px"}
knitr::include_graphics("neighbours.png")
```

```{r}
join_nearest(ir0, ir1)
join_follow(ir0, ir1)
join_precede(ir0, ir1) # nothing precedes returns empty `Ranges`
join_precede(ir1, ir0)
```


## Example: dealing with multimapping
This example is taken from the Bioconductor support [site](https://support.bioconductor.org/p/100046/).

We have two `Ranges` objects. The first contains single nucleotide positions
corresponding to an intensity measurement such as a ChiP-seq experiment, 
while the other contains coordinates for two genes of interest.


We want to identify which positions in the `intensities` `Ranges` overlap the
genes, where each row corresponds to a position that overlaps a single gene.

First we create the two `Ranges` objects
```{r ex1}
intensities <- data.frame(seqnames = "VI",
                          start = c(3320:3321,3330:3331,3341:3342),
                          width = 1) %>% 
  as_granges()

intensities 

genes <- data.frame(seqnames = "VI", 
                    start = c(3322, 3030),
                    end = c(3846, 3338),
                    gene_id=c("YFL064C", "YFL065C")) %>% 
  as_granges()
                    
genes
```

Now to find where the positions overlap each gene, we can perform an overlap
join. This will automatically carry over the gene_id information as well as
their coordinates (we can drop those by only selecting the gene_id). 
```{r}
olap <- join_overlap_inner(intensities, genes) %>%
  select(gene_id)
olap
```

Several positions match to both genes. We can count them using `summarise`
and grouping by the `start` position:

```{r}
olap %>% 
  group_by(start) %>%
  summarise(n = n())
```

## Grouping by overlaps

It's also possible to group by overlaps. Using this approach we can count the
number of overlaps that are greater than 0. 

```{r}
grp_by_olap <- ir0 %>% 
  group_by_overlaps(ir1)
grp_by_olap
grp_by_olap %>%
  mutate(n_overlaps = n())
```

Of course we can also add overlap counts via the `count_overlaps()` function.

```{r}
ir0 %>%
  mutate(n_overlaps = count_overlaps(., ir1))
```

# Data Import/Output

We provide convenience functions via `rtracklayer` and `GenomicAlignments` 
for reading/writing the following data formats to/from `Ranges` objects.

| `plyranges` functions | File Format |
|-----------------------|-------------|
|   `read_bam()`        | BAM         |
| `read_bed()`/`write_bed()` | BED    |
| `read_bedgraph()`/ `write_bedgraph()` | BEDGraph |
| `read_narrowpeaks()`/ `write_narrowpeaks()` | narrowPeaks |
| `read_gff()` / `write_gff()` | GFF(1-3) / GTF |
| `read_bigwig()` / `write_bigwig()` |  BigWig |
| `read_wig()` / `write_wig()` | Wig |


# Mapping to GenomicRanges/IRanges

For users already familiar with the `IRanges` and `GenomicRanges` we provide 
mappings to the  `plyranges` API. 

## Operations on range width 

 For `G`Ranges`` objects all functions ignore
any strandedness, unless the strand of the range is anchored.

| `plyranges` functions | Description | `GenomicRanges/IRanges` command |
|-----------------------|-------------|---------------------------------|
| `anchor_(start/end/center/3p/5p)` | Fix the `start/end/center/` coordinates or positive/negative strand of range. Can be used in combination with any of the following   | Available in functions that have a `fix` argument. |
| `mutate(x, width = width + modifier)` | Modify the width of a `Ranges` | `resize` |
| `stretch(x, extend)` | Extend the start and end coordinates in opposite directions by a fixed amount. | `start(x)<- start(x) - extend%/%2`, `end(x) <- end(x)  + extend%/%2` |

## Operations on range width (invariant) 

| `plyranges` functions | Description | `GenomicRanges/IRanges` command |
|-----------------------|-------------|---------------------------------|
| `shift_[left/right/downstream/upstream](x, shift)` | Shift the coordinates of the interval (left/right/downstream/upstream) by an integer amount.    | `shift_right` corresponds to `shift` |
| `flank_[left/right/downstream/upstream](x, width)` | Generates flanking regions of size width `left/right/downstream/upstream/` | corresponds to `flank` |

## Set operations (vector wise)

These are usual set-operations that act on the sets of the
`Ranges` represented in x and y. By default these operations will ignore
any strand information. The directed versions of these functions will
take into account strand.

| `plyranges` functions | Description | `GenomicRanges/IRanges` command |
|-----------------------|-------------|---------------------------------|
| `[intersect/setdiff/union/]_Ranges` | Set operations between two `Ranges`, ignoring strand. | `intersect/setdiff/union/` with `ignore.strand = FALSE` |
| `[intersect/setdiff/union/]_anchored_Ranges` | As above taking into account strandedness. | 

## Set operations (element wise)

We provide infix operators and the verbs between and span to the represent element wise range operations. These
map to the `pintersect/punion/psetdiff/pgap/punion(fill.gap = FALSE)` functions.

## Restrictions

The verb `filter` corresponds to `subset`, while `filter_by_[overlaps/non_overlaps]` corresponds to `subsetByOverlaps`.

## Aggregation

The `summarise` verb is most similar to the `aggregate` methods defined in `GenomicRanges/IRanges`. 

The `reduce_ranges/disjoin_ranges` correspond to the `reduce/disjoin` methods. 
However, the former methods allow additional summarisation. 

The `compute_coverage(x)` method corresponds to `[I/G]Ranges(coverage(x))`. 

## Overlaps

 For `GRanges` objects all functions ignore
any strandedness, unless the suffix `directed` is added to the function call

| `plyranges` function                     | Description                                                                                 | `GenomicRanges/IRanges` command  |
|------------------------------------------|---------------------------------------------------------------------------------------------|---------------------------------|
| `find_overlaps(x, y, maxgap, minoverlap)`|  Returns a `Ranges` object with any range in `y` that overlaps `x`. Appends the metadata in `y` and its genomic intervals to the returning `Ranges`. | `findOverlaps(x,y, maxgap, minoverlap, type = "any")` with expanding `x` and `y` by their hits and appending the `mcols` in `y`. | 
| `group_by_overlaps(x, y, maxgap, minoverlap)` | Returns a GroupedRanges object grouped by the query hits. |  Same as above with an additional column called `query` which contains the queryHits. |
| `count_overlaps(x, y, maxgap, minoverlap)` | Returns an integer vector (used with `mutate`) | `countOverlaps(x, y, maxgap, minoverlap, type = "any")` |
| `join_overlap_self(x, maxgap, minoverlap)` |  Returns a `Ranges` object with any range that overlaps itself. | `findOverlaps(x,x, maxgap, minoverlap, type = "any")`|
| `join_overlap_inner(x, y, maxgap, minoverlap)` | Finds the intersecting `Ranges` that overlap in `x` and `y`. Returns a `Ranges` object with the metadata from `x` and `y`.  |  `findOverlapsPairs(x,y, maxgap, minoverlap, type = "any")` + `pintersect`. |               | `join_overlap_left(x, y, maxgap, minoverlap)` | Identical to `find_overlaps`  | Identical to `find_overlaps`. |                   
| `*_within` | Adding suffix `within` will find overlaps   | Makes `type = "within"` |
| `*_includes` | inverse of within functions | - |
| `join_nearest[_left/right/up/downstream](x,y)` | Finds nearest neighbour `Ranges` between `x` and `y`. | `nearest` + reindexing to return a `Ranges` object. |
|`join_precede[_left/right/up/downstream](x,y)`  | Finds `Ranges` in `x` that precede `y` | `precedes` + reindexing to return a `Ranges` object. |
|`join_follow[_left/right/up/downstream](x,y)`  | Finds `Ranges` in `x` that follow `y`  | `precedes` + reindexing to return a `Ranges` object. |


# Appendix

```{r}
sessionInfo()
```