Integrating Spectra with Python’s matchms package

Compiled: Thu Nov 14 23:27:45 2024

Introduction

The SpectriPy package allows integration of Python MS packages into a Spectra-based MS analysis in R. Python functionality is wrapped into R functions allowing a seamless integration of the functionality of Python’s matchms package into R. In addition, functions to convert between R’s Spectra objects and Python’s matchms spectrum objects are available to the advanced user or developer enabling to create custom functions or workflows on Spectra objects in Python and executing them in R using the reticulate R package.

Installation

The package requires a python environment to be available and can be installed with the BiocManager R package using the command BiocManager::install("RforMassSpectrometry/SpectriPy"). All required python libraries are installed automatically on demand. Note that first installation or first invocation of specific functions might take long because of this installation process.

Spectra similarity calculations using matchms

The SpectriPy package provides the compareSpectriPy() function that allows to perform spectra similarity calculations using the scoring functions from the matchms Python package. Below all currently supported scoring functions are listed along with the parameter class that allows selecting and configuring the algorithm in the compareSpectriPy() function. Additional functions will be added in future.

We next create some simple example spectra and subsequently use the compareSpectriPy() function to calculate pairwise similarities between these.

library(Spectra)
## Warning: multiple methods tables found for 'union'
## Warning: multiple methods tables found for 'intersect'
## Warning: multiple methods tables found for 'setdiff'
## Warning: multiple methods tables found for 'setequal'
## Warning: multiple methods tables found for 'union'
## Warning: multiple methods tables found for 'intersect'
## Warning: multiple methods tables found for 'setdiff'
library(SpectriPy)

## Create a Spectra object with two MS2 spectra for Caffeine.
caf <- DataFrame(
    msLevel = c(2L, 2L),
    name = "Caffeine",
    precursorMz = c(195.0877, 195.0877)
)
caf$intensity <- list(
    c(340.0, 416, 2580, 412),
    c(388.0, 3270, 85, 54, 10111))
caf$mz <- list(
    c(135.0432, 138.0632, 163.0375, 195.0880),
    c(110.0710, 138.0655, 138.1057, 138.1742, 195.0864))
caf <- Spectra(caf)

## Create a Spectra object with two MS2 spectra for 1-Methylhistidine
mhd <- DataFrame(
    msLevel = c(2L, 2L),
    precursorMz = c(170.0924, 170.0924),
    id = c("HMDB0000001", "HMDB0000001"),
    name = c("1-Methylhistidine", "1-Methylhistidine"))
mhd$mz <- list(
    c(109.2, 124.2, 124.5, 170.16, 170.52),
    c(83.1, 96.12, 97.14, 109.14, 124.08, 125.1, 170.16))
mhd$intensity <- list(
    c(3.407, 47.494, 3.094, 100.0, 13.240),
    c(6.685, 4.381, 3.022, 16.708, 100.0, 4.565, 40.643))
mhd <- Spectra(mhd)

We first calculate pairwise similarities between all spectra defined above and those of caffeine using Spectra’s built-in compareSpectra() function.

all <- c(caf, mhd)
res_r <- compareSpectra(all, caf)
res_r
##           1         2
## 1 1.0000000 0.1973448
## 2 0.1973448 1.0000000
## 3 0.0000000 0.0000000
## 4 0.0000000 0.0000000

Thus, compareSpectra() returned the pairwise similarity scores (by default calculated using the normalized dot-product function) between all spectra in all (rows) and all spectra in caf (columns). compareSpectriPy() works similar, with the difference that we need to specify and configure the similarity function (from matchms) using a dedicated parameter object. Below we calculate the similarity using the CosineGreedy function changing the tolerance to a value of 0.05 (instead of the default 0.1).

res <- compareSpectriPy(all, caf, param = CosineGreedyParam(tolerance = 0.05))
## + /github/home/.cache/R/basilisk/1.19.0/0/bin/conda create --yes --prefix /github/home/.cache/R/basilisk/1.19.0/SpectriPy/0.2.0/matchms_env 'python=3.10.14' --quiet -c bioconda -c conda-forge --override-channels
## + /github/home/.cache/R/basilisk/1.19.0/0/bin/conda install --yes --prefix /github/home/.cache/R/basilisk/1.19.0/SpectriPy/0.2.0/matchms_env 'python=3.10.14' -c bioconda -c conda-forge --override-channels
## + /github/home/.cache/R/basilisk/1.19.0/0/bin/conda install --yes --prefix /github/home/.cache/R/basilisk/1.19.0/SpectriPy/0.2.0/matchms_env -c bioconda -c conda-forge 'python=3.10.14' 'matchms=0.28.2' --override-channels
res
##           [,1]      [,2]
## [1,] 1.0000000 0.1948181
## [2,] 0.1948181 1.0000000
## [3,] 0.0000000 0.0000000
## [4,] 0.0000000 0.0000000

As a result compareSpectriPy returns also a numeric matrix of similarities. Note also that the first compareSpectriPy call takes usually a little longer because the Python setup has to be initialized.

Next we use the ModifiedCosine algorithm that considers also differences between the spectra’s precursor m/z in the calculation.

res <- compareSpectriPy(all, caf, param = ModifiedCosineParam())
res
##            [,1]      [,2]
## [1,] 1.00000000 0.1948181
## [2,] 0.19481813 1.0000000
## [3,] 0.13841831 0.8520549
## [4,] 0.05724816 0.3523997

Note that for this calculation all spectra precursor m/z values need to be available, otherwise an error will be thrown. Thus, we should always ensure to remove spectra without precursor m/z values prior to similarity scoring with this similarity method. Below we remove the precursor m/z from one of our input spectra and then show how the Spectra object could be subsetted to valid spectra for this method.

## Remove precursor m/z from the 3rd spectrum
all$precursorMz[3] <- NA

## Filter the input spectra removing those with missing precursor.
all <- all[!is.na(precursorMz(all))]

compareSpectriPy(all, caf, param = ModifiedCosineParam())
##            [,1]      [,2]
## [1,] 1.00000000 0.1948181
## [2,] 0.19481813 1.0000000
## [3,] 0.05724816 0.3523997

Advanced use and internals

For advanced users or developers, the rspec_to_pyspec and pyspec_to_rspec functions are available that enable conversion between MS data representations in R and Python (i.e. between the Spectra object and the Python matchms Spectrum object). To use these functions the reticulate package needs to be installed along with a Python environment and the matchms Python package.

To illustrate their use we initialize below the Python environment that is bundled using the basilisk package within SpectriPy.

library(SpectriPy)
library(basilisk)
## Loading required package: reticulate
cl <- basiliskStart(SpectriPy:::matchms_env)

We next create a simple Spectra object representing fragment spectra for some small compounds.

library(Spectra)

# create example spectra
spd <- DataFrame(
  msLevel = c(2L, 2L, 2L),
  id = c("HMDB0000001", "HMDB0000001", "HMDB0001847"),
  name = c("1-Methylhistidine", "1-Methylhistidine", "Caffeine"))

## Assign m/z and intensity values.
spd$mz <- list(
  c(109.2, 124.2, 124.5, 170.16, 170.52),
  c(83.1, 96.12, 97.14, 109.14, 124.08, 125.1, 170.16),
  c(56.0494, 69.0447, 83.0603, 109.0395, 110.0712,
    111.0551, 123.0429, 138.0662, 195.0876))
spd$intensity <- list(
  c(3.407, 47.494, 3.094, 100.0, 13.240),
  c(6.685, 4.381, 3.022, 16.708, 100.0, 4.565, 40.643),
  c(0.459, 2.585, 2.446, 0.508, 8.968, 0.524, 0.974, 100.0, 40.994))

sps <- Spectra(spd)

We next convert the Spectra to a matchms Spectrum object.

pysps <- rspec_to_pyspec(sps)
pysps
## [Spectrum(precursor m/z=nan, 5 fragments between 109.2 and 170.5), Spectrum(precursor m/z=nan, 7 fragments between 83.1 and 170.2), Spectrum(precursor m/z=nan, 9 fragments between 56.0 and 195.1)]

As a result we got now a Python list with 3 Spectrum objects containing the peak data of sps as well as a reduced set of the available spectra variables in sps. Which spectra variables get copied to Python can be defined with the mapping parameter of rspec_to_pyspec. The default is to convert all variables defined by spectraVariableMapping(), but additional variables along with their respective names in the Python Spectrum object can be defined too. Below we list the pre-selected spectra variables that are converted by default.

spectraVariableMapping()
##                  precursorMz           precursorIntensity 
##               "precursor_mz"        "precursor_intensity" 
##              precursorCharge                        rtime 
##                     "charge"             "retention_time" 
##              collisionEnergy      isolationWindowTargetMz 
##           "collision_energy" "isolation_window_target_mz" 
##                      msLevel 
##                   "ms_level"

With our Spectra data converted to Python, we could directly call all routines from the matchms package using the reticulate R package. Below we normalize the intensities of the 3 spectra using the from normalize_intensities functions from the matchms.filtering Python package. We thus need to import first the functionality from that package and can then directly call the function on the objects.

library(reticulate)
filters <- import("matchms.filtering")

res <- vector("list", length(pysps))
for (i in (seq_along(pysps) - 1))
    res[[i + 1]] <- filters$normalize_intensities(pysps[i])
res <- r_to_py(res)
res
## [Spectrum(precursor m/z=nan, 5 fragments between 109.2 and 170.5), Spectrum(precursor m/z=nan, 7 fragments between 83.1 and 170.2), Spectrum(precursor m/z=nan, 9 fragments between 56.0 and 195.1)]

We can now convert the list of Python Spectrum objects back to R with pyspec_to_rspec:

sps_r <- pyspec_to_rspec(res)
#' The normalized intensities
intensity(sps_r)
## NumericList of length 3
## [[1]] 0.03407 0.47494 0.03094 1 0.1324
## [[2]] 0.06685 0.04381 0.03022 0.16708 1 0.04565 0.40643
## [[3]] 0.00459 0.02585 0.02446 0.00508 0.08968 0.00524 0.00974 1 0.40994
#' The original intensities
intensity(sps)
## NumericList of length 3
## [[1]] 3.407 47.494 3.094 100 13.24
## [[2]] 6.685 4.381 3.022 16.708 100 4.565 40.643
## [[3]] 0.459 2.585 2.446 0.508 8.968 0.524 0.974 100 40.994

Intensity values have now been normalized to values between 0 and 1.

Note however that, it would be much better (and likely more efficient) to directly use a Python function on the list of Spectrum objects instead of the mixed R and Python code used in the example above. Below we define a simple python script that iterates over the spectra in python and performs the normalization.

py_script <- paste0("from matchms.filtering import normalize_intensities\n",
                    "for i in range(len(pysps)):\n",
                    "    pysps[i] = normalize_intensities(pysps[i])\n")
py$pysps <- pysps
py_run_string(py_script)

tmp <- pyspec_to_rspec(pysps)
intensity(tmp)
## NumericList of length 3
## [[1]] 0.03407 0.47494 0.03094 1 0.1324
## [[2]] 0.06685 0.04381 0.03022 0.16708 1 0.04565 0.40643
## [[3]] 0.00459 0.02585 0.02446 0.00508 0.08968 0.00524 0.00974 1 0.40994

At very last we need also to stop the Python environment created by basilisk.

basiliskStop(cl)

Session information

sessionInfo()
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: Etc/UTC
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
## [1] basilisk_1.19.0     reticulate_1.39.0   SpectriPy_0.2.0    
## [4] Spectra_1.17.0      BiocParallel_1.41.0 S4Vectors_0.45.1   
## [7] BiocGenerics_0.53.2 generics_0.1.3      BiocStyle_2.35.0   
## 
## loaded via a namespace (and not attached):
##  [1] Matrix_1.7-1           jsonlite_1.8.9         compiler_4.4.2        
##  [4] BiocManager_1.30.25    filelock_1.0.3         Rcpp_1.0.13-1         
##  [7] parallel_4.4.2         cluster_2.1.6          jquerylib_0.1.4       
## [10] png_0.1-8              IRanges_2.41.0         yaml_2.3.10           
## [13] fastmap_1.2.0          lattice_0.22-6         R6_2.5.1              
## [16] ProtGenerics_1.39.0    knitr_1.49             MASS_7.3-61           
## [19] maketools_1.3.1        bslib_0.8.0            rlang_1.1.4           
## [22] dir.expiry_1.15.0      cachem_1.1.0           xfun_0.49             
## [25] fs_1.6.5               MsCoreUtils_1.19.0     sass_0.4.9            
## [28] sys_3.4.3              cli_3.6.3              withr_3.0.2           
## [31] grid_4.4.2             digest_0.6.37          MetaboCoreUtils_1.15.0
## [34] lifecycle_1.0.4        clue_0.3-66            evaluate_1.0.1        
## [37] codetools_0.2-20       buildtools_1.0.0       rmarkdown_2.29        
## [40] basilisk.utils_1.19.0  tools_4.4.2            htmltools_0.5.8.1