---
title: "Introduction to Jellyfisher"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Introduction to Jellyfisher}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

```{r setup}
library(jellyfisher)
```

**Jellyfisher** is an R package (a [htmlwidget](https://www.htmlwidgets.org/))
for visualizing tumor evolution and subclonal compositions using Jellyfish plots.
The package is based on the
[Jellyfish](https://github.com/HautaniemiLab/jellyfish) visualization tool,
bringing its functionality to R users. Jellyfisher supports both
[ClonEvol](https://github.com/hdng/clonevol) results and plain data frames,
making it compatible with various tools and workflows.

## Input data

The input data should follow specific structures for _samples_, _phylogeny_, 
subclonal _compositions_, and optional _ranks_.

The `jellyfisher` package includes an example data set
(`jellyfisher_example_tables`) based on the following publication:  
Lahtinen, A., Lavikka, K., Virtanen, A., et al. "Evolutionary states and
trajectories characterized by distinct pathways stratify patients with ovarian
high-grade serous carcinoma." _Cancer Cell_ **41**, 1103–1117.e12 (2023). DOI:
[10.1016/j.ccell.2023.04.017](https://doi.org/10.1016/j.ccell.2023.04.017).

```{r echo=F, message=F}
# Subset the data to keep the compiled vignette a bit smaller
jellyfisher_example_tables <- jellyfisher_example_tables |>
  select_patients(c("EOC69", "EOC153", "EOC495", "EOC677", "EOC809"))
```


### Samples
            
- `sample` (string): specifies the unique identifier for each sample.
- `displayName` (string, optional): allows for specifying a custom name for each sample. If the column is omitted, the `sample` column is used as the display name.
- `rank` (integer): specifies the position of each sample in the Jellyfish plot. For example, different stages of a disease can be ranked in chronological order: diagnosis (1), interval (2), and relapse (3). The zeroth rank is reserved for the root of the sample tree. Ranks can be any integer, and unused ranks are automatically excluded from the plot. If the `rank` column is
  absent, ranks are assigned based on each sample’s depth in the sample tree.
- `parent` (string): identifies the parent sample for each entry. Samples without a specified parent are treated as children of an imaginary root sample.

```{r}
head(jellyfisher_example_tables$samples, 25)
```

### Phylogeny

- `subclone` (string): specifies subclone IDs, which can be any string.
- `parent` (string): designates the parent subclone. The subclone without a parent is considered the root of the phylogeny.
- `color` (string, optional): specifies the color for the subclone. If the column is omitted, colors will be generated automatically.
- `branchLength` (number): specifies the length of the branch leading to the subclone. The length may be based on, for example, the number of unique mutations in the subclone. The branch length is shown in the Jellyfish plot's legend as a bar chart. It is also used when generating a phylogeny-aware color scheme.

```{r}
head(jellyfisher_example_tables$phylogeny, 25)
```

### Subclonal compositions

Subclonal compositions are specified in a
[tidy](https://vita.had.co.nz/papers/tidy-data.pdf) format, where each row
represents a subclone in a sample.

- `sample` (string): specifies the sample ID.
- `subclone` (string): specifies the subclone ID.
- `clonalPrevalence` (number): specifies the clonal prevalence of the subclone in the sample. The clonal prevalence is the proportion of the subclone in the sample. The clonal prevalences in a sample must sum to 1.

```{r}
head(jellyfisher_example_tables$compositions, 25)
```

### Ranks

The ranks in the example data set are used to indicate the time points when the
samples were acquired. 

- `rank` (integer): specifies the rank number. The zeroth rank is reserved for the inferred root of the sample tree. However, you are free to define a title for it.
- `title` (string): specifies the title for the rank.

```{r}
head(jellyfisher_example_tables$ranks, 6)
```

## Plotting

### Basic plotting 

The three tables are passed to the `jellyfisher` function as a named list. The
function generates an interactive Jellyfish plot based on the input data. If the
data set contains multiple patients, the Jellyfisher htmlwidget shows navigation
buttons to switch between patients.

```{r}
jellyfisher(jellyfisher_example_tables,
            width = "100%", height = 450)
```

### Plotting with custom options

```{r}
jellyfisher(jellyfisher_example_tables,
            options = list(
              sampleHeight = 70,
              sampleTakenGuide = "none",
              tentacleWidth = 3,
              showLegend = FALSE
            ),
            width = "100%", height = 400)
```

### Plotting a single patient

When plotting multiple patients, Jellyfisher shows buttons (Previous and Next)
to navigate between patients. When the data contains only one patient, these
buttons are hidden. The package also provides a `select_patients` function to
filter the data set with ease.

```{r}
jellyfisher_example_tables |>
  select_patients("EOC677") |>
  jellyfisher(width = "100%", height = 400)
```

## Adjusting the sample tree structure

The sample trees in the example data set were constructed as follows:

*"For each sample, we checked whether an earlier time point included exactly one
sample from the same anatomical location.  If such a sample existed, it was
assigned as the parent; otherwise, the inferred root was used as the parent."*

However, this mechanistic approach may not always produce credible sample trees.

### Changing parent

The *r1Bow1* (bowel) sample in the following jellyfish plot is derived
from an earlier bowel sample *p2Bow1_c*, which has no traces of the subclone 12.

```{r}
jellyfisher_example_tables |>
  select_patients("EOC809") |>
  jellyfisher(width = "100%", height = 600)
```

Using the `set_parents` function, we can adjust the parent of the *r1Bow1* sample
to be *p2Per1_cO* (peritoneum), which is a possible source of the metastasis due
to its proximity. The high prevalence of subclone 12 in this sample suggests
that it is the likely source of the metastasis in the *r1Bow1* sample.

```{r}
jellyfisher_example_tables |>
  select_patients("EOC809") |>
  set_parents(list("EOC809_r1Bow1_DNA1" = "EOC809_p2Per1_cO_DNA2")) |>
  jellyfisher(width = "100%", height = 600)
```

### Changing topology

While ranks (the columns) can indicate the time points when the samples were
acquired, they can also be used to simply show the sample's depth in the sample
tree. For instance, the following plot shows all the samples on the same rank,
indicating that they were diagnostic samples acquired at the same time.

```{r}
jellyfisher_example_tables |>
  select_patients("EOC495") |>
  jellyfisher(width = "100%", height = 650)
```

However, one can argue that the LN (lymph node) samples represent a later
development in the disease, and thus, they should be placed on a later rank.
We can remove the existing ranks, define new parent-child relationships, and
let Jellyfisher assign the ranks based on the sample tree depth.

```{r}
tables <- jellyfisher_example_tables |>
  select_patients("EOC495")

# Remove existing ranks. The ranks will be assigned automatically based
# on samples' depths in the sample tree.
tables$samples$rank <- NA

# Rank titles should be removed as well because they are no longer valid.
tables$ranks <- NULL

tables |>
  set_parents(list("EOC495_pLNL1_DNA1" = "EOC495_pLNR_DNA1",
                   "EOC495_pLNL2_DNA1" = "EOC495_pLNL1_DNA1")) |>
  jellyfisher(width = "100%", height = 500)
```

If we think that the lymph node samples represent an even later development,
we can manually assign ranks to the samples. The `set_ranks` function provides
an easy way to do this.

```{r}
tables |>
  set_parents(list("EOC495_pLNL1_DNA1" = "EOC495_pLNR_DNA1",
                   "EOC495_pLNL2_DNA1" = "EOC495_pLNL1_DNA1")) |>
  set_ranks(list("EOC495_pLNR_DNA1" = 2,
                 "EOC495_pLNL1_DNA1" = 3,
                 "EOC495_pLNL2_DNA1" = 4),
            default = 1) |>
  jellyfisher(width = "100%", height = 400)
```

```{r echo=F}
rm(tables)
```

### Adding hypothetical samples for more plausible evolutionary scenarios

This feature is mainly useful when multiple time points are present, since
subclones that appear across several branches of the sample tree may otherwise
have their LCA placed at the inferred root. While technically correct, this 
suggests that the subclone emerged before all real samples, which may sometimes
be biologically implausible.

To address this, you can insert a hypothetical **inferred sample** into the
sample tree and use it as a new parent for a subset of samples. This creates a
more plausible location for the LCA and often yields a clearer evolutionary
scenario. The same mechanism can also be used more generally to organize the
sample tree so that diverging clades are displayed in a more structured way.

In the following example, the samples in the Diagnosis and Interval time points
of patient EOC153 contain distinct clades. The original subclones were
apparently eradicated during treatment, and new subclones emerged at the
interval debulking surgery.  Without a manually inserted inferred sample, the
interval subclones’ LCA is placed at the inferred root, suggesting an unlikely 
evolutionary scenario. By adding a hypothetical intermediate sample, we can
anchor the emergence of these subclones at a more realistic position in the
sample tree.

```{r}
jellyfisher_example_tables |>
  select_patients("EOC153") |>
  add_inferred_sample(
    name = "EOC153_Inf",
    rank = 3,
    samples = c(
      "EOC153_iPer1_DNA4",
      "EOC153_iOme1_DNA4",
      "EOC153_iOvaR1_DNA1"
    )
  ) |>
  jellyfisher(width = "100%", height = 600)
```

The `add_inferred_sample()` function performs three actions:

1. Creates a new inferred sample with the given name and rank.
2. Sets it as the parent of the specified samples.
3. Updates the sample tree so that subclone LCAs can be placed at this new node.

The resulting plot shows the subclones emerging at the newly introduced sample,
between the diagnosis and relapse time points, producing a more realistic and
interpretable evolutionary scenario.

## Handling non-aberrant cells

Typically in tumor evolution studies, the focus is on the subclonal compositions
of the tumor cells. Thus, the root node in the phylogenetic tree represents the
founding clone. However, the data may also contain non-aberrant cells, i.e., the
tumor purity is less than 100%. For these cases, the `normalsInPhylogenyRoot`
option instructs Jellyfisher to treat the root node as non-aberrant cells. The
option has two consequences: (1) No tentacles are drawn between the root
subclones, and (2) the root subclone is colored as white when the
phylogeny-aware color scheme is used.

```{r}
# Subclone N at the root represents the non-aberrant cells.
# The letter N has no special meaning in Jellyfisher.
non_aberrant <- list(
  samples = data.frame(sample = c("A", "B")),
  compositions = data.frame(
    sample = c("A", "A", "A", "B", "B", "B"),
    subclone = c("N", "1", "2", "N", "1", "2"),
    clonalPrevalence = c(0.2, 0.4, 0.4, 0.3, 0.3, 0.4)
  ),
  phylogeny = data.frame(
    subclone = c("N", "1", "2"),
    parent = c(NA, "N", "1")
  )
)
```

```{r}
non_aberrant |>
  jellyfisher(options = list(
    normalsAtPhylogenyRoot = TRUE
  ),
  width = "100%", height = 350)
```

In the above plot, the root clone, which represents the non-aberrant cells, is
hidden from the inferred root sample. However, sometimes a patient may have
multiple independent clones, and in these cases the root clone is shown:

```{r}
# Change the parent of subclone 2 to N
non_aberrant$phylogeny$parent[non_aberrant$phylogeny$subclone == "2"] <- "N"

non_aberrant |>
  jellyfisher(options = list(
    normalsAtPhylogenyRoot = TRUE
  ),
  width = "100%", height = 350)
```

## Session info

```{r}
sessionInfo()
```