---
title: "Enabling integration of Python libraries and R packages for combined mass spectrometry data analysis"
package: SpectriPy
format:
  html:
    minimal: true
    theme: flatly
vignette: >
  %\VignetteIndexEntry{Enabling integration of Python libraries and R packages for combined mass spectrometry data analysis}
  %\VignetteKeywords{Mass Spectrometry, MS, MSMS, Metabolomics, Infrastructure, Quantitative}
  %\VignettePackage{SpectriPy}
  %\VignetteEncoding{UTF-8}
  %\VignetteEngine{quarto::html}
  %\VignetteDepends{Spectra,BiocStyle,SpectriPy,reticulate,MsBackendMgf,MsDataHub,mzR}
---

**Compiled**: `r date()`

**Note**: since version 0.99.6 *SpectriPy* uses the newer, recommended approach
to install and configure required Python libraries (i.e., through *reticulate*'s
`py_require()` functionality). This also affects the way a pre-defined Python
environment can be used. See section [Startup and Python configuration](https://rformassspectrometry.github.io/SpectriPy/articles/detailed-installation-configuration.html#sec-python) in the
*Detailed information on installation and configuration* vignette for updated
information.

# Introduction

Powerful software libraries for mass spectrometry (MS) data are available in
both Python and R, covering specific and sometimes distinct aspects in the
analysis of proteomics and metabolomics data. R's *reticulate* package converts
basic data types between Python and R and enables a seamless interoperability of
both programming languages. The *SpectriPy* package extends *reticulate* by
translating MS data structures between R and Python,
specifically, between R's `Spectra` objects and Python MS data structures from
the [*matchms*](https://github.com/matchms/matchms) and
[*spectrum_utils*](https://github.com/bittremieux-lab/spectrum_utils) Python
libraries. In addition, functionality from Python's *matchms* library is
directly wrapped into R functions allowing a seamless integration into R based
workflows. *SpectriPy* thus enables powerful proteomics or metabolomics analysis
workflows combining the strengths of Python and R MS libraries.

This vignette provides information on how to share and translate MS data
structures between R and Python to enable combined Python and R-based analysis
workflows. For an example use case analysis, see the *SpectriPy tutorial:
Annotation of LC-MS/MS spectra* vignette.


# System requirements

The [*reticulate*](https://rstudio.github.io/reticulate/) package enables a
seamless integration of Python with R by translating the core data structures
between the two programming languages and sharing a Python (respectively R)
runtime environment, including shared variables, that can be accessed from the
other programming language. If the *reticulate* package is not already
available, it will be installed during the installation of *SpectriPy*.
Alternatively, it can be installed using `install.packages("reticulate")`.

The *SpectriPy* package builds on *reticulate* providing in addition the
functionality to translate data structures for MS data between the two
languages. Specifically, the package translates between R's `Spectra` objects
(from the `r BiocStyle::Biocpkg("Spectra")` package) and Python's
`matchms.Spectrum` and `spectrum_utils.spectrum.MsmsSpectrum` objects from the
[*matchms*](https://github.com/matchms/matchms) and
[*spectrum_utils*](https://github.com/bittremieux-lab/spectrum_utils) Python
libraries, respectively.


# Installation

If the *BiocManager* package is not already available, please install it with
`install.packages("BiocManager")`. As a system dependency, the package requires
Python (version >= 3.12) to be available. During package installation,
*SpectriPy* will by default install all required Python libraries
automatically. See section [Startup and Python
configuration](https://rformassspectrometry.github.io/SpectriPy/articles/detailed-installation-configuration.html#sec-python)
in the *Detailed information on installation and configuration* vignette for
information on manual library installation or usage of a pre-defined or system
Python environment. See section [Fixing package installation or loading
problems](https://rformassspectrometry.github.io/SpectriPy/articles/detailed-installation-configuration.html#sec-fix)
in the *Detailed information on installation and configuration* vignette for
some hints on how to solve package installation or loading problems.

To install the package use the code below:

```{r}
#| eval: false
install.packages("BiocManager")
BiocManager::install("SpectriPy")
```


# Translating data structures between R and Python

This section describes how MS data is being converted between R and Python to
enable analyses combining both programming/analysis languages and accessing the
same shared MS data. The first two sections describe how MS data can be
converted between R and Python. See also section @sec-backend for a more
elegant approach and improved integration with *Spectra*-based
workflows in R.

## Library loading and system setup

Below we load all required packages. By loading *SpectriPy*, the package will
evaluate if the required Python libraries (i.e., *matchms* version >= 0.1,
*spectrum_utils* version >= 0.3.2 and *numpy* version >= 2.2) are available. If
they are not available, *SpectriPy* will install them using functionality from
the *reticulate* R package. See section [Startup and Python
configuration](https://rformassspectrometry.github.io/SpectriPy/articles/detailed-installation-configuration.html#sec-python)
in the *Detailed information on installation and configuration* vignette for
more configuration options of *SpectriPy*.

The *reticulate* package will be loaded by *SpectriPy*, ensuring that the
R/Python integration provided by that package is available as well. To better
discriminate between R and Python code chunks, we add the comment `#' R
session:` or `#' Python session:` to label the R and Python code chunks,
respectively.

```{r}
#| label: libraries
#| message: false
#' R session:

library(Spectra)
library(SpectriPy)
```


## Converting MS data from R to Python

In this section we show how MS data can be converted from R to Python. Below we
first define the name and path of a data file in mzML format (provided by the `r
BiocStyle::Biocpkg("MsDataHub")` package) and load that data as an R `Spectra`
object. We thus first ensure that the package *MsDataHub* is installed. If the
line below results in an error you need to first install *MsDataHub* using
`BiocManager::install("MsDataHub")`.

```{r}
stopifnot(require("MsDataHub"))
```

Next, we define the file path to the test file and load the data as a `Spectra`
object in R.

```{r}
#| label: load-data
#| message: false
#' R session:

#' Loading the data from a mzML file as a `Spectra` object
library(MsDataHub)
fl <- MsDataHub::PestMix1_DDA.mzML()
mzml_r <- Spectra(fl)
```

We next restrict the data to MS level 2 and remove spectra with fewer than 3
(fragment) peaks.

```{r}
#| label: filter MS level
#' R session:

#' Restrict to MS level 2 spectra
mzml_r <- filterMsLevel(mzml_r, 2)
mzml_r <- mzml_r[lengths(mzml_r) >= 3]
mzml_r
```

This `Spectra` object can now be converted to equivalent data structures in
Python using the `rspec_to_pyspec()` function:

```{r}
#| label: convert spectra to matchms
#' R session:

#' Convert the R Spectra to a list of Python matchms.Spectrum objects
tmp <- rspec_to_pyspec(mzml_r)
```

The `tmp` variable is now a Python list of `matchms.Spectrum` objects:

```{r}
#' R session:

#' Class of the converted variable
class(tmp)

#' First element (Python objects are indexed starting at 0)
class(tmp[0])
```

Note that this Python data structure is now stored within the R session.
Therefore, we have two full copies of the data in memory, i.e., an R `Spectra`
object and a Python list of `matchms.Spectrum` objects (see Section @sec-backend for an
alternative to avoid multiple data copies). We can access the variable from the
associated Python environment through *reticulate*'s special `r` attribute using
`r.<variable name in R>`.

```{python}
#' Python session:

#' Access the Python data structure stored in the R session
r.tmp
```

While the variable can be accessed this way, it is preferable, also for
performance reasons, to store the Python data directly in the
Python environment. Similar to the special `r` attribute, *reticulate* provides
the `py` variable in R that allows access to the main Python environment
associated with the R session. Attributes can be assigned to the Python
environment with the `py_set_attr()` function. We repeat our data conversion
operation but assign the result to a Python attribute `"mzml_py"`:

```{r}
#| label: assign variable to Python
#' R session:

#' Assign the data to a variable in the Python environment
py_set_attr(py, "mzml_py", rspec_to_pyspec(mzml_r))
names(py)
```

We can now access this data directly from within Python:

```{python}
#| label: data types of variables
#' Python session:

#' Data type of the variable:
type(mzml_py)

#' The length of the list:
len(mzml_py)

#' Data type of the first element:
type(mzml_py[0])

#' Intensities of the first spectrum:
mzml_py[0].peaks.intensities
```

This allows us to analyze and process the MS data directly in Python. As an
example we load the *matchms.filtering* library and scale the intensity
values of each spectrum with the `normalize_intensities()` function such that
the largest intensity of each spectrum is 1.

```{python}
#| label: normalize intensities with matchms
#' Python session:

import matchms.filtering as mms_filt

#' Iterate over the Spectrum list and scale the intensities
for i in range(len(mzml_py)):
    mzml_py[i] = mms_filt.normalize_intensities(mzml_py[i])

#' Intensities for the first spectrum
mzml_py[0].peaks.intensities
```
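
To make the effect of this scaling concrete, the following plain-Python sketch
applies the same idea to made-up intensity values, without using *matchms*. It
assumes that normalization means dividing every intensity by the largest one in
the spectrum (scaling to unit height). The helper `scale_to_unit_height()` is a
hypothetical name used only for illustration.

```python
def scale_to_unit_height(intensities):
    """Divide all intensities by the largest one (hypothetical helper)."""
    top = max(intensities)
    if top <= 0:
        raise ValueError("need at least one positive intensity")
    return [x / top for x in intensities]


# Made-up intensity values for a single toy spectrum
raw = [50.0, 200.0, 100.0]
scaled = scale_to_unit_height(raw)
print(scaled)  # [0.25, 1.0, 0.5]
```

Relative peak heights are preserved by this operation; only the scale changes.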

We can also access the changed data from R. The `py_get_attr()` function can be
used to retrieve the variable with the changed Python data. Through the
*reticulate* package it is then also possible to call attributes and Python
functions directly from R. To extract the intensities of the first spectrum we
can use the same code as above, just replacing `.` with `$`. Note also that,
since the variable returned by `py_get_attr()` is a Python object, we need to
use index `0` to access the first element.

```{r}
#| label: inspect changed intensities in R
#' R session:

#' Access the intensities of the first spectrum
py_get_attr(py, "mzml_py")[0]$peaks$intensities
```

Note that with the `rspec_to_pyspec()` function, we created a copy of the
original data in Python. We have thus now two variables, the `mzml_r` variable
in R with the original, unchanged, intensity values, and the `mzml_py` attribute
in Python with the scaled intensity values. To get the processing results back
to R we need to convert the data from Python to R. This can be done with the
`pyspec_to_rspec()` function which translates Python MS data structures into an
R `Spectra` object (see the following @sec-py-to-r section for more
information). A more elegant way is to use the `MsBackendPy` backend for
R's `Spectra` objects (see Section @sec-backend).

With *reticulate* we can use the `py` special variable in R to access attributes
defined in the associated Python environment. Similarly, *reticulate* defines an
attribute `r` in the Python session that allows access to variables in R. When
variables are accessed this way, they are automatically converted to the
corresponding data types in the other programming language if an `r_to_py()`
method (or `py_to_r()` method) is implemented for them. Such methods are
defined for the basic data types, so, when accessing for example the `fl`
R variable from Python, it is converted from the R `character` data type to the
equivalent Python `str` data type:

```{python}
#' Python session:

#' Access the `fl` variable from the R session:
r.fl

type(r.fl)
```

*SpectriPy* implements an `r_to_py()` method for `Spectra` objects, so, when
`Spectra` objects are accessed from Python, they are also automatically
translated to a Python list of `matchms.Spectrum` objects:

```{python}
#' Python session:

#' Access the `Spectra` object with the original data from R; the data
#' gets directly translated on-the-fly
r.mzml_r

#' Access the intensities of the first spectrum
r.mzml_r[0].peaks.intensities
```

While this automatic conversion is convenient in some cases, the manual
translation with `rspec_to_pyspec()` and `pyspec_to_rspec()` is preferred, as it
allows configuring the handling of the spectra variables (i.e., metadata) and
avoids potentially repeated translation of the data.
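
The cost of repeated automatic translation can be illustrated with a small,
self-contained Python sketch. The `CountingProxy` class below is a hypothetical
stand-in and not part of *reticulate* or *SpectriPy*: it only mimics an accessor
that converts a variable anew on every access and counts how often that happens.

```python
class CountingProxy:
    """Toy accessor that 'translates' a variable on every access."""

    def __init__(self, variables):
        self._variables = variables
        self.conversions = 0

    def __getattr__(self, name):
        # Only called for names that are not regular attributes
        if name not in self._variables:
            raise AttributeError(name)
        self.conversions += 1
        # "Translation" is simulated by building a fresh list
        return list(self._variables[name])


r_proxy = CountingProxy({"spectra": (10.0, 20.0, 30.0)})

first = r_proxy.spectra[0]    # one conversion
last = r_proxy.spectra[-1]    # a second conversion of the same data
print(r_proxy.conversions)    # 2

converted = r_proxy.spectra   # convert once ...
first, last = converted[0], converted[-1]  # ... and reuse the result
print(r_proxy.conversions)    # 3
```

Converting once and keeping the result, which is what the manual translation
amounts to, keeps the number of conversions independent of how often the data
is used afterwards.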


## Converting MS data from Python to R {#sec-py-to-r}

To show conversion of MS data from Python to R, we import a test data file in
MGF format using the Python *matchms* library. This MGF file is provided within
the *SpectriPy* package so we first define its file name and path in R.

```{r}
#| label: define MGF file
#' R session:

f_mgf <- system.file("extdata", "mgf", "test.mgf", package = "SpectriPy")
```

We next load the required Python library and import the data in Python. The
file name, stored in a variable defined in the R session, can be accessed from
Python through `r.<variable name>`.

```{python}
#| label: load MGF in Python
#| warning: false
#' Python session

import matchms
from matchms.importing import load_from_mgf

mgf_py = list(load_from_mgf(r.f_mgf))
mgf_py
```

The MS data from the MGF is now loaded in Python as a list of
`matchms.Spectrum` objects. We can also directly access this variable from the R
session through `py$<variable name in Python>`. Below we access the first
spectrum in that list:

```{r}
#' R session:

#' Access the first spectrum
py$mgf_py[[1]]
```

The data is thus provided in a *matchms.Spectrum* object. We extract the *m/z*
and intensity peak matrix from that spectrum using the built-in functionality
from the *reticulate* package that allows to call Python functions from R or
translate between basic data types. As an example we below get the `peaks`
attribute from the first spectrum and convert that to a Python `numpy`
array. This array can then be translated to an R `matrix` with the `py_to_r()`
function (see also the [*Calling Python from
R*](https://rstudio.github.io/reticulate/articles/calling_python.html) vignette
from the *reticulate* package for examples on accessing data from Python):

```{r}
#' R session:

#' Extract the peaks matrix from the first spectrum
py_to_r(py$mgf_py[[1]]$peaks$to_numpy)
```

While such functionality can thus be used to extract the MS data from Python,
for MS data analysis in R it is more convenient to transform the full MS data to
an R `Spectra` object using *SpectriPy*'s `pyspec_to_rspec()` function. Note
that below we use the `py_get_attr()` function to get the Python variable with
the MS data instead of `py$mgf_py`. Accessing the Python attribute through
`py$mgf_py` immediately translates the Python list into an R `list` through the
default `py_to_r()` function and `pyspec_to_rspec()` thus iterates over the R
`list` (in R). With `py_get_attr()` the attribute is accessed in its
native Python data type and `pyspec_to_rspec()` hence iterates over the data in
Python, which has a minimal performance advantage.

```{r}
#' R session:

#' Convert and copy the data to R
mgf_r <- pyspec_to_rspec(py_get_attr(py, "mgf_py"))
mgf_r

#' Extract intensities
mgf_r$intensity
```

The full MS data is thus now available as a `Spectra` object. Note however that
with `pyspec_to_rspec()` the complete MS data gets **copied**. We have now
two (detached) variables containing the same MS data, one in R and one in
Python.


## Using a dedicated MS data *backend* for MS data in Python {#sec-backend}

An alternative, and more elegant, approach that avoids keeping multiple copies
of spectral objects (i.e., `Spectra`, `matchms.Spectrum`) in memory is to use a
dedicated data *backend* for the `Spectra` object. This allows direct access to
MS data in Python from R without having an additional copy in R. R's
`Spectra` object separates by design the functionality to analyze MS data from
the code to *represent* or retrieve the MS data, which is provided by dedicated
data *backends*. These allow, for example, importing MS data from different file
formats, or storing and accessing the data with different storage modes
(see also [this
tutorial](https://jorainer.github.io/SpectraTutorials/articles/Spectra-backends.html)
for more information). The *SpectriPy* package defines such a backend, the
`MsBackendPy`, that allows direct access to MS data stored in Python (in their
respective data formats).

As an example we create below a `Spectra` object for the MS data (in Python)
previously imported from the MGF file (i.e. `"mgf_py"`) and using the
`MsBackendPy` as the data `source` (i.e., backend).

```{r}
#| label: initialize MsBackendPy for Python variable
#' R session:

#' Create a Spectra object with a MsBackendPy backend for the
#' attribute "mgf_py"
mgf <- Spectra("mgf_py", source = MsBackendPy())
mgf
```

In contrast to using the `pyspec_to_rspec()` function, no data was copied or
converted by this call. The `Spectra` object (through its backend) keeps only
a *reference* to the original data attribute in Python, but no data. Data
is retrieved and translated on-the-fly each time it is requested from the
`Spectra` object. Thus, calling e.g., `msLevel()` or `intensity()` causes the
backend to iterate over the MS data in Python, extract the respective
information, translate it and return it to the user.

```{r}
#| label: access MS level and intensity through MsBackendPy
#' R session:

#' Extract MS level
msLevel(mgf)

#' Extract intensity values
intensity(mgf)
```

While the performance is a little lower, compared to translating all data to R
using the `pyspec_to_rspec()` function, this approach is much more memory
efficient, because no additional copies of MS data are generated and only the MS
data currently required for a certain analysis task is loaded and translated at
a time.

Also, because data is always retrieved on-the-fly from Python, any changes to
the MS data attribute in Python are also immediately reflected in the respective
`Spectra` object. To illustrate this we below scale the intensities of the mass
peaks in Python:

```{python}
#| label: normalize intensities of MGF in Python
#' Python session:

#' Scale intensities
for i in range(len(mgf_py)):
    mgf_py[i] = mms_filt.normalize_intensities(mgf_py[i])

```

The intensities retrieved from the `Spectra` object are now also scaled.

```{r}
#| label: get intensities after modifying in Python
#' R session:

#' Get intensities after scaling the intensities in Python:
intensity(mgf)
```
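
The difference between a backend that keeps a reference and a one-off copy can
be sketched in plain Python. The names `session` and `ReferenceBackend` are
hypothetical and only mimic the behavior described above; they do not reflect
how `MsBackendPy` is actually implemented.

```python
import copy

# A dict standing in for the variables of a Python session (hypothetical)
session = {"spectra": [[10.0, 40.0], [5.0, 20.0]]}


class ReferenceBackend:
    """Holds only the *name* of a session variable, never the data."""

    def __init__(self, namespace, var_name):
        self.namespace = namespace
        self.var_name = var_name

    def intensity(self):
        # The data is looked up again on every call
        return [list(s) for s in self.namespace[self.var_name]]


by_copy = copy.deepcopy(session["spectra"])    # analogous to a full conversion
by_ref = ReferenceBackend(session, "spectra")  # analogous to a reference

# Change the data "in the session": scale each spectrum to unit height
session["spectra"] = [[x / max(s) for x in s] for s in session["spectra"]]

print(by_copy)             # [[10.0, 40.0], [5.0, 20.0]]
print(by_ref.intensity())  # [[0.25, 1.0], [0.25, 1.0]]
```

The copy still holds the values from the time of the conversion, while the
reference reports whatever the session variable currently contains.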

While we created a `Spectra` object from an existing data object in Python, it
is similarly possible to change the backend of a `Spectra` object to a
`MsBackendPy`. This follows the general concept of `Spectra` objects to support
changing between data representations, i.e., backends, at any time. The
first data set in this vignette was imported from an mzML file:

```{r}
#' R session:

mzml_r
```

This `Spectra` object uses the `MsBackendMzR` backend as data representation. We
can change the backend of `Spectra` objects using the `setBackend()`
function. Thus, below we change the backend to `MsBackendPy`, specifying also
the name of the variable in Python that should be used to store the data.

```{r}
#' R session:

mzml_r <- setBackend(mzml_r, backend = MsBackendPy(),
                     pythonVariableName = "mzml_p")
mzml_r
```

This converted all the MS data of the `mzml_r` `Spectra` object to Python
objects and assigned these to the Python variable with the name `"mzml_p"`. The
resulting `Spectra` object uses now a `MsBackendPy` backend which keeps only the
reference to the Python variable but no data. The size of the `Spectra` object
is thus very small.

```{r}
#' R session:

print(object.size(mzml_r), units = "MB")
```

And accessing the data will retrieve and convert the data from the referenced
Python objects:

```{r}
#' R session:

mz(mzml_r)
```

And all data can also be accessed in Python:

```{python}
#' Python session:

len(mzml_p)

mzml_p[0].peaks.mz
```

In addition to the MS data (i.e., the *m/z* and intensity values),
`setBackend()` will also convert core general spectra variables and store them
in the Python MS data structure as spectrum related metadata. By default, only a
limited set of spectra variables, those with a direct counterpart in
Python, are translated. These are for the *matchms* library:

```{r}
#' R session:

spectraVariableMapping("matchms")
```

with the names being the spectra variable names in R and the values the names of
the respective metadata fields in Python. The *spectrum_utils* library supports
only a few fixed variables:

```{r}
#' R session:

spectraVariableMapping("spectrum_utils")
```

The spectra variables that have been transferred to Python with the
`setBackend()` call above are:

```{r}
#' R session:

spectraVariableMapping(mzml_r)
```

Thus, only some of the variables from the original `Spectra` object are now
available after changing the backend to `MsBackendPy`. However, since *matchms*
supports additional, arbitrary, metadata fields, it is also possible to convert
and transfer other spectra variables with `MsBackendPy`. As an example, we
import data from an MGF file below.

```{r}
#' R session:

library(MsBackendMgf)
sps <- Spectra(system.file("extdata", "mgf", "test.mgf",
                           package = "SpectriPy"), source = MsBackendMgf())
spectraVariables(sps)
```

There are thus also spectra variables, such as `"SMILES"` and `"INCHI"` that are
not part of the default `spectraVariableMapping("matchms")` and would thus not
be converted/transferred by default when changing to a `MsBackendPy`. We can
however use the `spectraVariableMapping` parameter with `setBackend()` to define
which spectra variables should be transferred. The parameter takes a named
character vector with names being the names of the spectra variables to transfer and
values the names that should be used for the metadata in Python. We use the
`spectraVariableMapping()` function to append the mapping for the spectra
variables `"SMILES"` and `"INCHI"` to the default one.

```{r}
#' R session:

mp <- spectraVariableMapping("matchms", c(SMILES = "smiles", INCHI = "inchi"))
mp
```

We pass this character vector to the `setBackend()` call.

```{r}
#' R session:

sps <- setBackend(sps, backend = MsBackendPy(),
                  pythonVariableName = "sps_p",
                  spectraVariableMapping = mp)
spectraVariables(sps)
```

The spectra variables `"INCHI"` and `"SMILES"` have thus also been stored in
Python:

```{r}
#' R session:

sps$SMILES |> head()
```

See also the next section for more information on the mapping between
`Spectra`'s spectra variables and *matchms* metadata.

### Replacing data and ensuring data consistency

The `MsBackendPy` has full *read/write* support, i.e., it allows to add new
spectra variables or change existing spectra and/or peaks variables through the
available replacement methods `spectraData()<-`, `peaksData()<-`,
`intensity()<-`, `mz()<-` and `$<-`. These operations directly change the MS
data in the associated Python variable/data structure and special care is
advised if multiple copies of a `Spectra` object pointing to the **same** Python
variable exist. To avoid inadvertently changing data in other copies of a
`Spectra` object present in R it is possible to (temporarily) enable a
*copy-on-replace* strategy that copies (clones) the MS data of the MS data
structure in Python to another variable before replacing the values.

In the example below we first load example data from an MGF file and change
the backend to a `MsBackendPy` object hence translating the full MS data to
Python.

```{r}
#' R session:

fl <- system.file("extdata", "mgf", "test.mgf", package = "SpectriPy")
s_mgf <- Spectra(fl, source = MsBackendMgf())
s_mgf <- setBackend(s_mgf, MsBackendPy(), pythonVariableName = "mgf_data")
s_mgf
```

The data is now stored in a Python variable with the name `"mgf_data"`. We next
enable the *copy-on-replace* option of *SpectriPy* and make a subset of the
data, still keeping the original `Spectra` object `s_mgf`.

```{r}
#' R session:

#' Enable copy-on-replace for MsBackendPy
pyspec_copy_on_replace(TRUE)

#' Make a subset of the original data and assign that to a different Spectra
s_mgf_sub <- s_mgf[1:10]
```

*SpectriPy* uses a *delayed* subset strategy keeping the index to the
individual spectra in the `MsBackendPy`. Both `Spectra` objects however still
reference the same original data in Python:

```{r}
#' R session:

#' Get the name of the associated Python variable
s_mgf@backend@py_var
s_mgf_sub@backend@py_var

#' The index to the spectrum objects in the Python list
s_mgf@backend@i
s_mgf_sub@backend@i
```

Next we assign/replace the retention times of the `s_mgf_sub` `Spectra` object.

```{r}
#' R session:

s_mgf_sub$rtime <- 1:10 + 0.1
```

After this replacement operation, the `s_mgf_sub` `Spectra` object points to a
**different** Python variable:

```{r}
#' R session:

s_mgf_sub@backend@py_var
```

The *copy-on-replace* strategy of *SpectriPy* copies (clones) the MS data
associated to the `MsBackendPy` to a new variable before replacing or updating
any of its data. Thus, the *original* `Spectra` object `s_mgf` points to a
Python variable with the MS data in its original state:

```{r}
#' R session:

s_mgf@backend@py_var
```

Note that this strategy is only needed when copies (or subsets) of a `Spectra`
object are created in R and data replacements are performed on only one of them,
or if different operations are applied to each. If no longer needed, i.e., if
each `Spectra` object present in R points to its own Python variable,
*copy-on-replace* should be disabled again to avoid multiple unnecessary copies
of the data.

```{r}
#' R session:

pyspec_copy_on_replace(FALSE)
```

**To summarize**: use `pyspec_copy_on_replace(TRUE)` if you have multiple
`Spectra` objects pointing to the same Python variable **before** you apply one
of the replacement operations `$<-`, `spectraData<-`, `mz<-`, `intensity<-`,
`peaksData<-`, or `applyProcessing()`. Use `pyspec_copy_on_replace(FALSE)`
**after** the operation was performed.
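
The logic behind delayed subsetting and *copy-on-replace* can be sketched in
plain Python. The `Handle` class, the `session` dict and the naming scheme for
the cloned variable are all hypothetical; in particular, this sketch clones only
the subset a handle exposes, which is an assumption and not necessarily what
*SpectriPy* does internally.

```python
import copy
import itertools

# A dict standing in for the variables of a Python session (hypothetical)
session = {"mgf_data": [{"rtime": 1.0}, {"rtime": 2.0}, {"rtime": 3.0}]}
_counter = itertools.count(1)


class Handle:
    """Toy handle: a variable name plus the indices it exposes."""

    def __init__(self, var_name, index):
        self.var_name = var_name
        self.index = list(index)

    def subset(self, positions):
        # Delayed subset: only the index changes, the data stays shared
        return Handle(self.var_name, [self.index[p] for p in positions])

    def rtime(self):
        return [session[self.var_name][i]["rtime"] for i in self.index]

    def set_rtime(self, values, copy_on_replace):
        if copy_on_replace:
            # Clone the exposed data into a new variable before replacing
            new_name = f"{self.var_name}_copy{next(_counter)}"
            session[new_name] = [copy.deepcopy(session[self.var_name][i])
                                 for i in self.index]
            self.var_name = new_name
            self.index = list(range(len(self.index)))
        for i, value in zip(self.index, values):
            session[self.var_name][i]["rtime"] = value


full = Handle("mgf_data", range(3))
sub = full.subset([0, 1])
sub.set_rtime([1.1, 2.1], copy_on_replace=True)
print(full.rtime())  # [1.0, 2.0, 3.0]
print(sub.rtime())   # [1.1, 2.1]
print(sub.var_name)  # mgf_data_copy1

# Without the copy, the replacement is visible through both handles
alias = full.subset([0])
alias.set_rtime([9.9], copy_on_replace=False)
print(full.rtime())  # [9.9, 2.0, 3.0]
```

The last call shows the situation the option is meant to prevent: a replacement
through one handle silently changes the data seen through another.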


## Conversion of spectra variables

Conversion of the MS peaks data (i.e. the *m/z* and intensity values) is always
performed by the `rspec_to_pyspec()` and `pyspec_to_rspec()` functions. But next
to the peaks data, additional information is also available for individual
spectra. In R/*Spectra* these variables are called *spectra variables* while in
*matchms* they are stored as a *metadata* attribute of a `matchms.Spectrum`
object. The *SpectriPy* package defines a core set of spectra variables
that are by default converted by the `rspec_to_pyspec()` and `pyspec_to_rspec()`
functions. These default variables can be accessed using the
`defaultSpectraVariableMapping()` function:

```{r}
#' R session:

#' List the *default* spectra variable mapping in R and python, respectively
defaultSpectraVariableMapping()
```

These variables, if present in the respective data object, are transferred (and
renamed) to the MS data structure of the other programming language. The names
of this character vector represent the name of the spectra variable in the
`Spectra` object, the elements (values) the names of the respective metadata
keys in Python's `matchms.Spectrum` class. This default mapping thus transfers
for example the `precursorMz()` spectra variable of R to the `"precursor_mz"`
spectrum metadata in Python (and *vice versa*). Note that any spectra variable
or metadata **not** being part of such a `mapping` will be ignored and hence not
converted.
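
As a rough illustration of these mapping semantics, the following plain-Python
sketch renames the metadata keys that are part of a mapping and drops all
others. The function name and the metadata values are made up for this example;
only the pairs `precursorMz`/`precursor_mz` and `SMILES`/`smiles` are taken from
this vignette.

```python
def metadata_to_spectra_variables(metadata, mapping):
    """Rename mapped metadata keys to spectra variable names; drop the rest.

    `mapping` has spectra variable names as keys and metadata keys as values.
    """
    return {r_name: metadata[py_name]
            for r_name, py_name in mapping.items()
            if py_name in metadata}


mapping = {"precursorMz": "precursor_mz", "SMILES": "smiles"}

# Toy metadata of a single spectrum (values are invented)
metadata = {"precursor_mz": 47.05, "smiles": "CCO", "compound_name": "ethanol"}

converted = metadata_to_spectra_variables(metadata, mapping)
print(converted)  # {'precursorMz': 47.05, 'SMILES': 'CCO'}
```

The `"compound_name"` entry is not part of the mapping and is therefore not
present in the result.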

Below we inspect the available metadata in the `matchms.Spectrum` objects that
were imported from the MGF file.

```{python}
#' Python session:

#' Available metadata for the first spectrum
mgf_py[0].metadata.keys()
```

Several additional metadata variables, such as `"smiles"`, `"inchi"` or
`"compound_name"`, not part of the default variables, are available. To
transfer these as well, we create below a custom mapping that extends the
default mapping with these 3 variables:

```{r}
#' R session:

#' Add mapping for additional spectra variables to the default mapping in R and
#' python, respectively
map <- c(defaultSpectraVariableMapping(), smiles = "smiles",
         inchi = "inchi", name = "compound_name")
map
```

Such custom mapping can be passed with the parameter `mapping` to the
`pyspec_to_rspec()` (and also the `rspec_to_pyspec()`) function which will then
convert the full data to R.

```{r}
#| label: convert Python to R with spectra variables
#' R session:

#' Convert the Python MS data structures to an R `Spectra`
mgf_r <- pyspec_to_rspec(py_get_attr(py, "mgf_py"), mapping = map)
spectraVariables(mgf_r)
```

The respective metadata values have thus been added as new spectra variables to
our `Spectra` object and can also be extracted:

```{r}
#| label: access spectra variables from Python in R
#' R session:

#' Show the first values for the spectra variable "name"
mgf_r$name |>
    head()

#' Show the first values for the spectra variable "inchi"
mgf_r$inchi |>
    head()
```

When using a `MsBackendPy`, all metadata attributes from the `matchms.Spectrum`
objects can be accessed and extracted with the `spectraData()` function. The
`spectraVariables()` function lists all available columns:

```{r}
#' R session:

#' List available spectra variables
spectraVariables(mgf)
```

We thus have already access to e.g. the `"inchi"` variable:

```{r}
#' R session:

#' Get the first entries from the inchi variable
mgf$inchi |>
    head()
```

A `spectraVariableMapping()` can however be used to rename variables. Below we
add for example a mapping of the `matchms.Spectrum` metadata attribute
`"pepmassint"` to the (core) variable `"precursorIntensity"` using the
`MsBackendPy`:

```{r}
#| label: add additional mapping
#' R session:

#' Add mapping for additional spectra variables to the `MsBackendPy`
m <- defaultSpectraVariableMapping()
m["precursorIntensity"] <- "pepmassint"
spectraVariableMapping(mgf) <- m

#' List available spectra variables
spectraVariables(mgf)

#' Get spectra data for two variables
spectraData(mgf, c("precursorMz", "precursorIntensity"))
```

The `r_to_py()` methods do not support additional parameters. Thus, in order to
use a similar mapping also with the `r_to_py()` method for `Spectra`, the
*global* spectra variable mapping needs to be changed. See the help of the
`setSpectraVariableMapping()` function for more details.


## Combined MS data analysis

With the functionality to translate between R and Python MS data structures, the
*SpectriPy* package thus enables MS data analyses that combine functionality
provided by both R and Python libraries. Some of the functionality of the
*matchms* Python library is directly wrapped into R functions, simplifying
its use and inclusion in R-based workflows (see the help of the
`compareSpectraPy()` and `filterSpectriPy()` functions). Combined analyses with
code chunks in both programming languages have however the advantage of using
the respective functionality as provided by the original package/library.

As a simple use case we calculate below spectra similarities using two different
similarity scores between spectra from the mzML and the MGF files.

In R we can use the `compareSpectra()` function that by default calculates the
normalized dot product similarity between the compared spectra. The MS data from
the mzML file was processed in Python. To use this data we first create a
`Spectra` object with a `MsBackendPy` backend. For the MS data from the MGF file
we re-use the `mgf` variable, which is also a `Spectra` with a `MsBackendPy`
backend referencing the MS data imported and processed in Python.

```{r}
#| label: compareSpectra with mzML and MGF
#' R session:

#' Create a `Spectra` for the scaled MS data in Python
mzml <- Spectra("mzml_py", source = MsBackendPy())

#' Calculate the pairwise similarity between all spectra
sim <- compareSpectra(mzml, mgf, tolerance = 0.1)
dim(sim)
```

We can calculate the *Cosine Hungarian* similarity score in Python using the
functionality from the *matchms* library. Here we use the *original* attributes
available in Python, i.e. `mzml_py` and `mgf_py`.

```{python}
#| label: Cosine Hungarian similarity in Python
#' Python session:

import matchms.similarity as mms_similarity

#' Calculate similarity scores
scores = matchms.calculate_scores(
    mzml_py, mgf_py, mms_similarity.CosineHungarian(tolerance = 0.1))

#' Extract similarity scores
sim = scores.to_array()["CosineHungarian_score"]
```

Alternatively, we can also directly use any `Spectra` object from R as the
data is converted automatically when accessed from Python. We can for
example use `r.mzml` instead of `mzml_py` in the call above, which converts
the `Spectra` object `mzml` in R to Python before calculating the similarities.

We can also directly compare the scores calculated using the two different
algorithms.

```{r}
#| label: compare dot product vs Cosine Hungarian
#' R session:

#' Plot the similarity scores against each other
plot(sim, py$sim, pch = 21, col = "#000000ce", bg = "#00000060",
     xlab = "Dot product", ylab = "Cosine Hungarian")
grid()
```
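
Both scores are cosine-type similarities computed on peaks that are matched
between two spectra within an *m/z* tolerance. The plain-Python sketch below
illustrates this general idea with a simplified *greedy* peak matching. It is
neither the implementation used by `compareSpectra()` nor the one from
*matchms* (the *Cosine Hungarian* score, as its name suggests, solves the peak
assignment optimally rather than greedily), and the function name and toy
peaks are made up for this example.

```python
import math


def greedy_cosine(mz_a, int_a, mz_b, int_b, tolerance=0.1):
    """Cosine-like score from greedily matched peaks (simplified sketch)."""
    # All candidate peak pairs within tolerance, largest product first
    pairs = sorted(
        ((ia * ib, i, j)
         for i, (ma, ia) in enumerate(zip(mz_a, int_a))
         for j, (mb, ib) in enumerate(zip(mz_b, int_b))
         if abs(ma - mb) <= tolerance),
        reverse=True)
    used_a, used_b, total = set(), set(), 0.0
    for product, i, j in pairs:
        # Each peak can be matched at most once
        if i not in used_a and j not in used_b:
            used_a.add(i)
            used_b.add(j)
            total += product
    norm = (math.sqrt(sum(x * x for x in int_a)) *
            math.sqrt(sum(x * x for x in int_b)))
    return total / norm if norm > 0 else 0.0


# Two toy spectra sharing one peak within the tolerance
partial = greedy_cosine([100.0, 200.0], [3.0, 4.0],
                        [100.05, 300.0], [3.0, 4.0])
print(partial)

# Spectra without any matching peak
print(greedy_cosine([100.0], [1.0], [300.0], [1.0]))  # 0.0
```

Differences between scores such as those in the plot above can arise from how
peaks are matched and from how intensities are weighted before the comparison.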


## Summary

By translating between R and Python MS data structures, the *SpectriPy* package
enables data analyses that combine functionalities from both programming
languages. In terms of (memory) efficiency, usage of *SpectriPy*'s `MsBackendPy`
backend for `Spectra` objects has clear advantages over the repeated translation
and copying of the MS data. See also section @sec-comments for general comments.


# Appendix

## General comments {#sec-comments}

- Be careful accessing Python attributes using `py$<attribute name>`: base
  Python data types will be automatically converted to the equivalent R data
  type. For MS data, it might be better to get the attributes using the
  `py_get_attr()` function.

- Since MS data can be large, it is suggested the user converts MS data mostly
  manually, and only if/when needed, using `rspec_to_pyspec()` and
  `pyspec_to_rspec()` - or, ideally, use a `MsBackendPy` backend.

- The `rspec_to_pyspec()` and `pyspec_to_rspec()` functions **copy** the data
  while transferring. Thus, there will eventually be two (detached) copies of
  the same data in Python and R.

- The `MsBackendPy` backend allows to directly interface MS data from
  Python. Data will be converted on-the-fly, so no additional copies of the data
  exist.

- Be aware that, since the `MsBackendPy` backend does not contain any MS data
  but simply interfaces the MS data in Python, any changes to this data in
  Python affect also the `Spectra` object using that backend.

- Be careful when re-assigning a `Spectra` object that uses a `MsBackendPy` to
  a new variable (e.g. `b <- a`, where `a` is such a `Spectra` object). Both
  variables will point to the **same** Python variable and changing data/values
  in one of the variables will affect/change the data in the other variable
  (e.g. changing the MS level of all spectra in `b` using `b$msLevel <- 3` will
  change the MS levels in the shared Python objects, hence, `a$msLevel` will
  then also be `3` for all).


# Session information

```{r}
#' R session:

sessionInfo()
```
