---
title: "Integrating Spectra with Python's matchms package"
package: SpectriPy
output:
  BiocStyle::html_document:
    toc_float: true
vignette: >
  %\VignetteIndexEntry{Integrating Spectra with Python's matchms package}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteKeywords{Mass Spectrometry, MS, MSMS, Metabolomics, Infrastructure, Quantitative}
  %\VignettePackage{SpectriPy}
  %\VignetteEncoding{UTF-8}
  %\VignetteDepends{Spectra,BiocStyle,SpectriPy,basilisk,reticulate}
---

```{r style, echo = FALSE, results = 'asis', message = FALSE}
BiocStyle::markdown()
```

**Compiled**: `r date()`

# Introduction

The `SpectriPy` package allows integration of Python MS packages into a
`Spectra`-based MS analysis in `R`. Python functionality is wrapped into R
functions allowing a seamless integration of the functionality of Python's
`matchms` package into `R`. In addition, functions to convert between R's
`Spectra` objects and Python's `matchms` spectrum objects are available to the
advanced user or developer.

# Installation

The package requires a python environment to be available and can be installed
with the `BiocManager` R package using the command
`BiocManager::install("RforMassSpectrometry/SpectriPy")`. All required python
packages are installed automatically on demand.

# Spectra similarity calculations using `matchms`

The `SpectriPy` package provides the `compareSpectriPy` function that allows to
perform spectra similarity calculations using the scoring functions from the
`matchms` Python package. Below all currently supported scoring
functions are listed along with the *parameter* class that allows selecting and
configuring the algorithm in the `compareSpectriPy` function. Additional
functions will be added in future.

- [*CosineGreedy*](https://matchms.readthedocs.io/en/latest/api/matchms.similarity.CosineGreedy.html):
  `CosineGreedyParam`.
- [*CosineHungarian*](https://matchms.readthedocs.io/en/latest/api/matchms.similarity.CosineHungarian.html):
  `CosineHungarianParam`.
- [*ModifiedCosineParam*](https://matchms.readthedocs.io/en/latest/api/matchms.similarity.ModifiedCosine.html):
  `ModifiedCosineParam`.

We next create some simple example spectra and subsequently use the
`compareSpectriPy` function to calculate pairwise similarities between these.

```{r, message = FALSE}
library(Spectra)
library(SpectriPy)

## Create a Spectra object with two MS2 spectra for Caffeine.
caf <- DataFrame(
    msLevel = c(2L, 2L),
    name = "Caffeine",
    precursorMz = c(195.0877, 195.0877)
)
caf$intensity <- list(
    c(340.0, 416, 2580, 412),
    c(388.0, 3270, 85, 54, 10111))
caf$mz <- list(
    c(135.0432, 138.0632, 163.0375, 195.0880),
    c(110.0710, 138.0655, 138.1057, 138.1742, 195.0864))
caf <- Spectra(caf)

## Create a Spectra object with two MS2 spectra for 1-Methylhistidine
mhd <- DataFrame(
    msLevel = c(2L, 2L),
    precursorMz = c(170.0924, 170.0924),
    id = c("HMDB0000001", "HMDB0000001"),
    name = c("1-Methylhistidine", "1-Methylhistidine"))
mhd$mz <- list(
    c(109.2, 124.2, 124.5, 170.16, 170.52),
    c(83.1, 96.12, 97.14, 109.14, 124.08, 125.1, 170.16))
mhd$intensity <- list(
    c(3.407, 47.494, 3.094, 100.0, 13.240),
    c(6.685, 4.381, 3.022, 16.708, 100.0, 4.565, 40.643))
mhd <- Spectra(mhd)
```

We first calculate pairwise similarities between all spectra defined above and
those of caffeine using `Spectra`'s built-in `compareSpectra` function.

```{r}
all <- c(caf, mhd)
res_r <- compareSpectra(all, caf)
res_r
```

Thus, `compareSpectra` returned the pairwise similarity scores (by default
calculated using the normalized dot-product function) between all spectra in
`all` (rows) and all spectra in `caf` (columns). `compareSpectriPy` works
similar, with the difference that we need to specify and configure the
similarity function (from `matchms`) using a dedicated parameter object. Below
we calculate the similarity using the *CosineGreedy* function changing the
`tolerance` to a value of `0.05` (instead of the default `0.1`).

```{r}
res <- compareSpectriPy(all, caf, param = CosineGreedyParam(tolerance = 0.05))
res
```

As a result `compareSpectriPy` returns also a numeric matrix of similarities.
Note also that the first `compareSpectriPy` call takes usually a little longer
because the Python setup has to be initialized.

Next we use the *ModifiedCosine* algorithm that considers also differences
between the spectra's precursor m/z in the calculation.

```{r}
res <- compareSpectriPy(all, caf, param = ModifiedCosineParam())
res
```

Note that for this calculation all spectra precursor m/z values need to be
available, otherwise an error will be thrown. Thus, we should always ensure to
remove spectra without precursor m/z values prior to similarity scoring with
this similarity method. Below we remove the precursor m/z from one of our input
spectra and then show how the `Spectra` object could be subsetted to *valid*
spectra for this method.

```{r}
## Remove precursor m/z from the 3rd spectrum
all$precursorMz[3] <- NA

## Filter the input spectra removing those with missing precursor.
all <- all[!is.na(precursorMz(all))]

compareSpectriPy(all, caf, param = ModifiedCosineParam())
```


# Advanced use and internals

For advanced users or developers, the `rspec_to_pyspec` and `pyspec_to_rspec`
functions are available that enable conversion between MS data representations
in `R` and Python (i.e. between the `Spectra` object and the Python `matchms`
Spectrum object). To use these functions the `reticulate` package needs to be
installed along with a Python environment and the `matchms` Python package.

To illustrate their use we initialize below the Python environment that is
bundled using the `r BiocStyle::Biocpkg("basilisk")` package within `SpectriPy`.

```{r setup-basilisk}
library(SpectriPy)
library(basilisk)
cl <- basiliskStart(SpectriPy:::matchms_env)
```

We next create a simple `Spectra` object representing fragment spectra for some
small compounds.

```{r}
library(Spectra)

# create example spectra
spd <- DataFrame(
  msLevel = c(2L, 2L, 2L),
  id = c("HMDB0000001", "HMDB0000001", "HMDB0001847"),
  name = c("1-Methylhistidine", "1-Methylhistidine", "Caffeine"))

## Assign m/z and intensity values.
spd$mz <- list(
  c(109.2, 124.2, 124.5, 170.16, 170.52),
  c(83.1, 96.12, 97.14, 109.14, 124.08, 125.1, 170.16),
  c(56.0494, 69.0447, 83.0603, 109.0395, 110.0712,
    111.0551, 123.0429, 138.0662, 195.0876))
spd$intensity <- list(
  c(3.407, 47.494, 3.094, 100.0, 13.240),
  c(6.685, 4.381, 3.022, 16.708, 100.0, 4.565, 40.643),
  c(0.459, 2.585, 2.446, 0.508, 8.968, 0.524, 0.974, 100.0, 40.994))

sps <- Spectra(spd)
```

We next convert the `Spectra` to a `matchms` Spectrum object.

```{r}
pysps <- rspec_to_pyspec(sps)
pysps
```

As a result we got now a Python list with 3 Spectrum objects containing the peak
data of `sps` as well as a reduced set of the available spectra variables in
`sps`. Which spectra variables get copied to Python can be defined with the
`mapping` parameter of `rspec_to_pyspec`. The default is to convert all
variables defined by `spectraVariableMapping()`, but additional variables along
with their respective names in the Python Spectrum object can be defined
too. Below we list the pre-selected spectra variables that are converted by
default.

```{r}
spectraVariableMapping()
```

With our `Spectra` data converted to Python, we could directly call all routines
from the `matchms` package using the `reticulate` R package. Below we normalize
the intensities of the 3 spectra using the from `normalize_intensities`
functions from the `matchms.filtering` Python package. We thus need to import
first the functionality from that package and can then directly call the
function on the objects.

```{r}
library(reticulate)
filters <- import("matchms.filtering")

res <- vector("list", length(pysps))
for (i in (seq_along(pysps) - 1))
    res[[i + 1]] <- filters$normalize_intensities(pysps[i])
res <- r_to_py(res)
res
```

We can now convert the list of Python Spectrum objects back to R with
`pyspec_to_rspec`:

```{r}
sps_r <- pyspec_to_rspec(res)
#' The normalized intensities
intensity(sps_r)
#' The original intensities
intensity(sps)
```

Intensity values have now been normalized to values between 0 and 1.

Note however that, it would be much better (and likely more efficient) to
directly use a Python function on the list of Spectrum objects instead of the
mixed R and Python code used in the example above. Below we define a simple
python script that iterates over the spectra in python and performs the
normalization.

```{r}
py_script <- paste0("from matchms.filtering import normalize_intensities\n",
                    "for i in range(len(pysps)):\n",
                    "    pysps[i] = normalize_intensities(pysps[i])\n")
py$pysps <- pysps
py_run_string(py_script)

tmp <- pyspec_to_rspec(pysps)
intensity(tmp)
```

At very last we need also to stop the Python environment created by `basilisk`.

```{r}
basiliskStop(cl)
```

# Session information

```{r}
sessionInfo()
```