---
title: "mums2"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{mums2}
  %\VignetteEncoding{UTF-8}
  %\VignetteEngine{knitr::rmarkdown}
editor_options: 
  chunk_output_type: console
---

```{r, include = FALSE, warning=FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

The `{mums2}` package is designed to provide researchers with tools to analyze untargeted metabolomics data generated using tandem mass spectroscopy from microbial communities. The overall approach taken to analyze metabolomics data parallels that used to analyze microbial communities using 16S rRNA gene sequencing data. To showcase how to do this, we will demonstrate the package using a previously published dataset that analyzed the metabolome of [*Botryllus schlosseri*](https://doi.org/10.1128/msystems.00793-25).

```{r setup, message = FALSE, warning = FALSE}
library(mums2)
library(tidyverse)
library(networkD3)
```

```{r setup_dt, message = FALSE, warning = FALSE, include = FALSE}
Sys.setenv("OMP_THREAD_LIMIT" = 1)
data.table::setDTthreads(threads = 1)
```


## Process data

Before we begin to analyze the data, we have to process it into a readable `data.frame` or object that can be viewed and transformed. To load your data, we need to run two functions: `import_all_data()`, and `ms2_ms1_compare()`. `import_all_data()` creates an object reflecting MS1 data, and `ms2_ms1_compare()`assigns MS2 spectra to your MS1 data. At this point, you may need to filter out noise or transform certain parameters before you can properly analyze your data. To accommodate for those issues, we have two other functions: `filter_peak_table()`, and `change_rt_to_seconds_or_minute()`. `filter_peak_table()` allows your to remove low-quality features and `change_rt_to_seconds_or_minute()` allows you to transform your retention time to minutes or seconds. This allows you to ensure that the retention time in your MS1 data matches your MS2 data. Below will explain what each function does in more detail and illustrate how to go through the pipeline.

### Import

The `import_all_data()` function takes in two files one describing the peak data (i.e., peak_table) and the other describing the metadata (i.e., `metadata`) and converts them into an `mpactr` object. The `peak_table` argument takes a path to a file describing the peak table data. You can specify the format of the peak table data using the `format` argument: Metaboscape, Progenesis, and None. The value given to the `metadata` argument is the path to a CSV-formatted file that provides information about the samples. In order for your metadata to be valid, it needs the following columns: "injection", "sample_code", "biological_group". The values in the "injection" column should match the sample injection columns inside of the peak_table. "sample_code" is the ID of your technical replicates. Finally, "biological_group" is the ID of your biological replicate groups. If you need a deeper understanding of what format the objects given to the `peak_table` and `metadata` arguments should be in, take a look at mpactr's [getting started page](https://www.mums2.org/mpactr/articles/mpactr.html).


```{r import_data}
data <- import_all_data(
  peak_table = mums2_example("boryillus_peaktable.csv"), 
  metadata = mums2_example("boryillus_metadata.csv"), 
  format = "Progenesis")
```


#### Peak Table

Below is the expected format for a Progenesis peak table. It contains samples as columns and features as rows. The feature intensities are expected to be un-normalized. 

```{r peak_table, R.options=list(max.print=10)}
read_csv(mums2_example("boryillus_peaktable.csv"), skip = 2)
```

#### Metadata

The expected format for metadata is below. The metadata file needs to contain at minimum columns for "injection", "sample_code", and "biological_group".
```{r metadata, R.options=list(max.print=10)}
read_csv(mums2_example("boryillus_metadata.csv"))
```

### Filter

After importing the data, you can use functions from mpactR to filter the data. There are four different filters in the mpactR package: `filter_mispicked_ions()`, `filter_group()`, `filter_cv()`, and `filter_insource_ions()` (You can find more information on [mpactR's](https://www.mums2.org/mpactr/) website). Although data filtering is not required, it will help reduce noise and correct peak selection errors, which will also speed up the analysis.

```{r filter_data,  R.options=list(max.print=10)}
filtered_data <- data |>
  filter_peak_table(filter_mispicked_ions_params()) |>
  filter_peak_table(filter_cv_params(cv_threshold = 0.2)) |>
  filter_peak_table(filter_group_params(group_threshold = 0.1,
                                            "Blanks")) |>
  filter_peak_table(filter_insource_ions_params())

filtered_data
```

### Convert retention time to "rt in minutes" or "rt in seconds"

Sometimes an MS2 file will report the retention time in minutes but the MS1 file will report in seconds. This mismatch will cause incorrect peak matching between MS1 and MS2 data. The `change_rt_to_seconds_or_minute()` function allows users to synthesize data with different units. Be aware that this function modifies the `{mpactr}` object in place. Therefore, you will need to call the function again to revert the units. Below will display a vector of retention time.

```{r show_rt_of_filtered_data, R.options=list(max.print=5)}
get_peak_table(filtered_data)$rt
```

```{r convert_rt_to_minutes_or_seconds, R.options=list(max.print=5)}
filtered_data <- change_rt_to_seconds_or_minute(
  mpactr_object = filtered_data, rt_type = "seconds"
)

filtered_data
```

```{r R.options=list(max.print=5)}
filtered_data <- change_rt_to_seconds_or_minute(
  mpactr_object = filtered_data, rt_type = "minutes"
)

filtered_data
```

### Preprocess MS2 data
We can use a .mgf/.mzxml/.mzml-formatted file to match MS2 spectra to MS1 peaks. The `ms2_ms1_compare()` function reads a list of mgf files and matches them to a MS1 feature based on the mass-to-charge ratio and retention time tolerance. If there are multiple matches, this function will select the MS2 spectra with the highest intensity. Keep in mind that MS2 spectra files are very large, they can be anywhere from 1 MB to 100 GB. Therefore, depending on how big the file is, it might take a few moments for the function to complete.

`ms2_ms1_compare()` generates a list of data with two data frames (`ms1_data`, `ms2_matches`), a list (`peak_data`), and a character vector ("samples"):

* **`ms2_matches`** is a `data.frame` object that contains five columns: `mz`, `rt`, `ms1_compound_id`, `spectra_index`, and `ms2_spectrum_id`. `mz` and `rt` are reported from the MS2 file as mass-to-charge ratio and retention time, respectively. `ms1_compound_id` is the compound id that was imported from the MS1 peak_table. `spectra_index` matches the MS2 data with its MS2 spectrum. Finally, `ms2_spectrum_id` is the generated name to represent your MS2 spectra (combination of your mz and rt). This is necessary to properly compare compounds. Since compounds with similar structures are expected to have similar MS2 patterns, we can use algorithmic techniques to compute the similarity between two spectra. This allows us to generate annotations and cluster similar spectra together.
* **`ms1_data`** is a `data.frame` object containing the data created by running filtering steps from `{mpactr}`.
* **`peak_data`** is a list object that is generated from `ms2_ms1_compare()`. `peak_data` is a collection of MS2 peak list. A peak list a collection of fragment ions, they all have a value to represent their intensity and mass-charge ratio.
* **`samples`** is a character vector named listing the groups/samples contained in the `peak_table` file.

```{r preprocess_ms2_data}
matched_data <- ms2_ms1_compare(
  ms2_files = mums2_example("botryllus_v2.gnps.mgf"),
  mpactr_object = filtered_data, mz_tolerance = 5, rt_tolerance = 6)

head(get_ms2_matches(matched_data))

head(get_ms1_data(matched_data))

get_ms2_peaks_data(matched_data)[1]

get_samples(matched_data)
```

## Annotating data

Once you have preprocessed your data, we can start to generate additional information like molecular formulas and annotations. `compute_molecular_formulas()` allows us to generate molecular formulas and `annotate_ms2()` allows us to annotate our data based on reference databases. This allows us to confirm known features and annotate additional metadata to unknown features. Below will explain in further detail how these functions can be used.

### Molecular formula prediction

`{mums2}` can generate de novo molecular formula predictions using fragmentation trees. The `compute_molecular_formulas()` function uses MS2 data to generate the most explained molecular formula and return it as a result (for more information: [Fragmentation Trees](https://doi.org/10.1093/bioinformatics/btn270)). The most explained molecular formula simply means the molecular formula that is explained by the most peaks in the spectra. In order to create a fragmentation tree, we generate candidate formulas that comprise of every possible molecular formula the parent mass can create (using only CHNOPS). We then look at every mass and intensity pair inside of the spectra and compute every molecular formula. We then create a one directional graph based on all the molecular formulas using. A molecular formula will point to another if it is a sub-formula of another formula (meaning it contains at most as many atoms as the parent formula). Finally, we can compute a score for each one of the nodes using methods like Ring Double Bond equivalents, compute molecular formula score, etc. It is possible that the function is unable to compute a molecular formula. In these cases, a value of `NA` is returned. Warning messages will be printed if no molecular formula is generated or there is only one possible molecular formula. Due to the time this function will take to run, we are going to use a small dataset.

```{r  chemical_formula_prediction, R.options=list(max.print=10)}
data_small <- import_all_data(
  peak_table = mums2_example("botryllus_pt_small.csv"), 
  metadata = mums2_example("boryillus_metadata.csv"), 
  format = "None") |> 
    filter_peak_table(filter_mispicked_ions_params()) |>
    filter_peak_table(filter_cv_params(cv_threshold = 0.05)) |>
    filter_peak_table(filter_group_params(group_threshold = 0.1,
                                              "Blanks")) |>
    filter_peak_table(filter_insource_ions_params())

matched_data_small <- ms2_ms1_compare(
  ms2_files = mums2_example("botryllus_v2.gnps.mgf"),
  mpactr_object = data_small, mz_tolerance = .5, rt_tolerance = 6)


matched_data_small <- compute_molecular_formulas(
  mass_data = matched_data_small, parent_ppm = 3, number_of_threads = 2)

get_molecular_formula_preds(matched_data_small)
```

### Annotations

Beyond predicting the molecular formula, we can also use the `annotate_ms2()` function to annotate the data in the `matched_ms2_ms1` object using reference databases. A reference database can be input as msp files using the `read_msp()` function. It returns a reference database that can be used as an input for the `annotate_ms2()` function. In the `{mums2}` package, MS2 spectral similarity can be determined using either spectral entropy (for more information: [Spectral Entropy](https://doi.org/10.1038/s41592-021-01331-z)) or the [GNPS algorithm](https://doi.org/10.1038/nbt.3597). While GNPS uses a modified cosine score to compute similarity between spectra, spectral entropy takes advantage of the total intensities of the spectra. The chosen method can be used by supplying either, `gnps_param()` or `spec_entropy_params()`. Using these two methods, we can effectively generate a collection of annotations based on the reference databases. We have a small massbank database provided from [MSDial on 05/12/2026](https://systemsomicslab.github.io/compms/msdial/main.html#MSP) that is preloaded in the package and can be used to annotate data.

```{r annotations}
reference_db <- read_msp(msp_file = mums2_example("massbank_example_data.msp"))

annotations <- annotate_ms2(
  mass_data = matched_data, reference = reference_db,
  scoring_params = modified_cosine_params(0.5), ppm = 1000,
  min_score =  0.1, chemical_min_score = 0, number_of_threads = 2)

annotations[1:5,]
```

## Operational Metabolomic Units (OMUs)

Operational Metabolomic Units (OMUs) are clusters of metabolites with similar MS2 spectral patterns and can be used for numerous analyses. To properly cluster your data together, you need to generate some similarity or distance between the features of your data. This is where our `dist_ms2()` function comes in. After you generate a `data.frame` object containing the distances you can use the `cluster_data()` function to create your OMUs.

### Calculating distances

We have implemented two different distance calculations to generate distances between compounds. To generate the distances you can use the GNPS algorithm (`modified_cosine_params()`) or spectral entropy algorithm (`spec_entropy_params()`). Just like above, being able to compute the similarity between MS2 spectra is what allows us to cluster data. We can also use the similarity distances to generate a data.frame for later use.

```{r  scoring_distance, R.options=list(max.print=10)}
dist <- dist_ms2(
  data = matched_data, cutoff = 0.3, precursor_threshold = -1,
  score_params = modified_cosine_params(0.5), min_peaks = 0, number_of_threads = 2)

dist
```

### Clustering into OMUs

We can now cluster these features based on how similar their MS2 spectra are. Using the output of `dist_ms2` and the `matched_ms2_ms1` object, we are able to cluster the features to generate OMUs using the `cluster_data()` function. This function implements the iterative [OptiClust algorithm](https://journals.asm.org/doi/10.1128/mspheredirect.00073-17). This function returns a list object with five slots: `label`, `abundance`, `cluster`, `cluster_metrics`, and `iteration_metrics`:

* **label**. Represents the cutoff distance used for the cluster
* **abundance**. A `data.frame` object that indicates the absolute abundance of the clusters by sample
* **cluster**. A `data.frame` object that indicates which features clustered together by cluster ID
* **cluster_metrics**. A `data.frame` object that includes metrics for how the clusters were formed.
* **iteration_metrics**. A `data.frame` object that shows how the features were clustered at each iteration.

```{r  clustering, R.options=list(max.print=10)}
cluster_results <- cluster_data(
  distance_df = dist, ms2_match_data = matched_data, cutoff = 0.3,
  cluster_method = "opticlust")

cluster_results
```

Now that the metabolomic features have been clustered into OMUs, it is possible to use the abundance of each OMU across samples to perform ecological analyses using the `create_community_matrix_object()` function followed by the `get_community_matrix()` function.

```{r  community_object, R.options=list(max.print=10)}
community_w_omus <- create_community_matrix_object(cluster_results)
get_community_matrix(community_w_omus)
```

You also have the ability to generate a community matrix object without clustering your data into OMUs. If you are familiar with 16S rRNA gene sequencing analysis, this would be similar to ASVs...

```{r community_object_no_omus, R.options=list(max.print=10)}
community_wo_omus <- create_community_matrix_object(data = matched_data)
get_community_matrix(community_wo_omus)
```


### Annotation of OMUs

After your data has been clustered, if you wish to see which OMUs the annotated features are in, you can supply that to the annotation function using the `cluster_data` argument. Doing so will add a column named OMU to the annotations result `data.frame` object. This column will display the OMU the feature is represented in.

```{r omu_annotations}
annotations <- annotate_ms2(
  mass_data = matched_data, reference = reference_db,
  scoring_params = modified_cosine_params(0.5), ppm = 1000, min_score = 0.7,
  chemical_min_score = 0, cluster_data = cluster_results, number_of_threads = 2)

annotations[1:5,]
```


## OMU-based Ecological Analyses

Because of the variation in total ion intensity across samples, it is necessary to control for uneven sampling of ions. In ecological analyses, this is typically performed using rarefaction. Those analyses anticipate that abundances are counts or whole numbers. However, intensity values are not whole numbers, they have decimal values. Therefore, because other tools like `{vegan}` will not work with this type of data, it was necessary to adapt the traditional rarefaction approach to account for decimal values. In the analyses that follow, it is necessary to set a `threshold` value below which intensities are not included. The default value is 100. To determine the `size` parameter in the following `alpha_summary()` and `dist_shared()` functions, it is helpful to look at the distribution of total ion intensities for each sample

```{r}
get_community_matrix(community_w_omus) |>
  rowSums() |>
  sort()
```

We can see the sample with the lowest intensity is a blank with a total intensity of 303998.8. We might be tempted to use a `size` of 303998. However,
that would take a very long time to run. In our experience such a large value does not yield meaningfully more precision than using a smaller value. In this case, we would encourage one to use a value of 40000 as is illustrated in the rest of this tutorial.


### Alpha Diversity

Using your community data, you can calculate alpha diversity metrics. Alpha diversity represents richness or diversity in each sample separately. Within `{mums2}` you can quantify diversity using Simpson’s and Shannon’s diversity indices.

```{r  alpha_summary, R.options=list(max.print=15)}
alpha_summary(
  community_object = community_w_omus, size = 40000, threshold = 200,
              diversity_index = c("simpson", "shannon", "richness"),
              subsample = TRUE, number_of_threads = 2) |>
  head()
```

### Beta Diversity

Beta diversity represents the dissimilarity between different samples. Beta diversity values can be calculated using community structure metrics (i.e., Bray-Curtis, Theta-YC, Moristia-Horn) or community membership metrics (i.e., Jaccard, Hamming, or Sorenson). With the resulting distance values further analyses are possible. Below are examples use Bray-Curtis distances using unclustered metabolites or OMUs. 

```{r  dist_shared, R.options=list(max.print=10)}
bray_no_omus <- dist_shared(community_object = community_wo_omus, size = 40000,
                           threshold = 200, diversity_index = "bray", number_of_threads =  2)
bray_no_omus

bray_w_omus <- dist_shared(community_object = community_w_omus, size = 40000,
                          threshold = 200, diversity_index = "bray", number_of_threads =  2)
bray_w_omus
```

The output of `dist_shared()` is a `dist` object, which lends itself well to visualization techniques like PCoA and NMDS and hierarchical clustering.


### Visualizations

There are no functions built-in to `{mums2}` for data visualization. Instead, we encourage users to leverage the functionality of `{ggplot2}` and the rest of the `{tidyverse}`.

#### Principal Coordinate Analysis (PCoA)

PCoA is a statistical technique used to show similarities between data. PCOA allows us, in particular, to use the distances generated from our ecological analysis.

```{r pcoa, fig.width=10, fig.height=10, fig.fullwidth=TRUE}
pcoa <- cmdscale(bray_w_omus, k = 2, eig = T, add = T)

variance <- round(pcoa$eig*100/sum(pcoa$eig), 1)
colnames(pcoa$points) <- c("pcoa_1", "pcoa_2")

as_tibble(pcoa$points, rownames = "sample") |>
  inner_join(get_metadata(filtered_data), by = c("sample" = "injection")) |>
  ggplot(aes(x=pcoa_1, y = pcoa_2, color = biological_group)) +
    geom_point(size = 5.5) +
    labs(
      x = paste0("PCoA 1 (", variance[1], "%)"),
      y = paste0("PCoA 2 (", variance[2], "%)"),
      color = "Sample type") +
    theme(
      legend.text = element_text(size = 10), 
      legend.title = element_text(size = 15)
    )
```


### Hierarchical Clustering

Another visualization approach is to perform hierarchical clustering using the `stats::hclust()` function. Hierarchical clustering generates a tree of samples based on their beta diversity values to show the similarities of the samples. The results are similar to that of PCoA but it uses a tree like structure to showcase relationships. It tends to be easier to look at relationships or clusters using this type of graph.

```{r hierarchical_clustering, fig.width=10, fig.height=10, fig.fullwidth=TRUE}
old_par <- par(no.readonly = TRUE)
hclust_result <- hclust(bray_w_omus, "average")
par(cex=0.6, mar=c(5, 8, 4, 1))
plot(hclust_result)
par(old_par)
```


### Network Plot

Sometimes, it is useful to see a visual map of your annotations. We can accomplish this using a network plot. By creating a network of data, we can see which annotations matched certain features. Using `networkD3::simpleNetwork()`, we can generate a simple network plot.

```{r network_plot}
distance_df_annotations <- annotations[, c("query_ms2_id", "name", "score")]

simpleNetwork(
  distance_df_annotations, height="100px", width="100px", zoom = TRUE)
```


###  Using Group Averages

Another feature included inside of `{mums2}` is the ability to run your analysis using group averages. Some researchers conduct mass spectrometry experiments by running samples in technical/injection triplicate to account for instrument variability. Since we know all three injections are from the same vial, we can average all of these into one group. We can generate group averages by using the metadata file that was supplied when you imported your data based on "sample_code". We can accomplish this by using the `convert_to_group_averages()` function.

Below, you can see a copy of the alpha and beta diversity functions we ran above. But this time, we are using group averages.

```{r group_average}
  matched_avg <- convert_to_group_averages(
  matched_data = matched_data, mpactr_object = filtered_data)

dist_avg <- dist_ms2(
  data = matched_avg, cutoff = 0.3, precursor_thresh = 2,
  score_params = modified_cosine_params(0.5), min_peaks = 0, number_of_threads = 2)

cluster_results_avg <- cluster_data(
  distance_df = dist, ms2_match_data = matched_avg, cutoff = 0.3,
  cluster_method = "opticlust")

annotations_avg <- annotate_ms2(
  mass_data = matched_avg,
  reference = reference_db, scoring_params = modified_cosine_params(0.5), 
  ppm = 1000, min_score = 0.6, chemical_min_score = 0,
  cluster_data = cluster_results_avg, number_of_threads = 2)

community_object_avg <- create_community_matrix_object(
  data = cluster_results_avg)
head(get_community_matrix(community_object = community_object_avg), 3)
```

You can also see how the samples in the matched object have been adjusted based on the information supplied in the metadata file.
```{r samples}
# Normal
head(get_samples(matched_data))

# Group Average
head(get_samples(matched_avg))

# The items in the injection have been converted into the corresponding sample code.
get_metadata(filtered_data)[, c("injection", "sample_code")]
```

### Combined data frames
`{mums2}` also allows you to view your all of your generated data together using the `generate_a_combined_table()` function. This function will take your matched object (generated from `ms2_ms1_compare()`), your annotations object (generated from `annotate_ms2()`) and your cluster data (generated from `cluster_data()`) to generate a combined data.frame with all of the generated data. 
```{r combined_df}
# For normal data
generate_a_combined_table(
  matched_data = matched_data, annotations = annotations, 
  cluster_data = cluster_results) |> 
  head(n = 3)

# For group averaged data
generate_a_combined_table(
  matched_data = matched_avg, annotations = annotations_avg,
  cluster_data = cluster_results_avg) |>
  head(n = 3)
```


### Heat map
Another way to visualize diversity data is to use a heat map. We can easily see how correlated data is between samples with a heatmap.
```{r heat_map, fig.width=10, fig.height=10, fig.fullwidth=TRUE}
dist_shared(community_object_avg, 40000, 200, "bray", number_of_threads = 2) |> 
  as.matrix() |>
  as_tibble(rownames = "A") |>
  pivot_longer(-A, names_to = "B", values_to = "distance") |>
  ggplot(aes(x = A, y = B, fill = distance)) +
  geom_tile() +
  scale_fill_gradient(low = "#FFFFFF", high="#FF0000") +
  theme(
    axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1),
    panel.background = element_blank()
  )
```


### Box Plot
A box and whisker plot can also be used to depict the variation of the data. We can take advantage of this to look at the distributions inside of our generate beta and alpha diversity data. Below is a box plot using alpha diversity.
```{r box_plot, fig.width=10, fig.height=10, fig.fullwidth=TRUE}
metadata <- get_metadata(filtered_data)

alpha <- alpha_summary(community_object_avg, 40000, 200, "simpson", number_of_threads = 2)

inner_join(alpha, metadata, by = c("samples" = "sample_code")) |>
  ggplot(aes(x = biological_group, y = simpson)) +
  geom_boxplot(outliers = FALSE, fill = NA, color = "gray") +
  geom_jitter(width = 0.3)  + 
  scale_y_continuous(limits = c(0, NA)) +
  theme_classic()
```