--- title: "Integrating Spectra with Python's matchms package" package: SpectriPy output: BiocStyle::html_document: toc_float: true vignette: > %\VignetteIndexEntry{Integrating Spectra with Python's matchms package} %\VignetteEngine{knitr::rmarkdown} %\VignetteKeywords{Mass Spectrometry, MS, MSMS, Metabolomics, Infrastructure, Quantitative} %\VignettePackage{SpectriPy} %\VignetteEncoding{UTF-8} %\VignetteDepends{Spectra,BiocStyle,SpectriPy,basilisk,reticulate} --- ```{r style, echo = FALSE, results = 'asis', message = FALSE} BiocStyle::markdown() ``` **Compiled**: `r date()` # Introduction The `SpectriPy` package allows integration of Python MS packages into a `Spectra`-based MS analysis in `R`. Python functionality is wrapped into R functions allowing a seamless integration of the functionality of Python's `matchms` package into `R`. In addition, functions to convert between R's `Spectra` objects and Python's `matchms` spectrum objects are available to the advanced user or developer. # Installation The package requires a python environment to be available and can be installed with the `BiocManager` R package using the command `BiocManager::install("RforMassSpectrometry/SpectriPy")`. All required python packages are installed automatically on demand. # Spectra similarity calculations using `matchms` The `SpectriPy` package provides the `compareSpectriPy` function that allows to perform spectra similarity calculations using the scoring functions from the `matchms` Python package. Below all currently supported scoring functions are listed along with the *parameter* class that allows selecting and configuring the algorithm in the `compareSpectriPy` function. Additional functions will be added in future. - [*CosineGreedy*](https://matchms.readthedocs.io/en/latest/api/matchms.similarity.CosineGreedy.html): `CosineGreedyParam`. - [*CosineHungarian*](https://matchms.readthedocs.io/en/latest/api/matchms.similarity.CosineHungarian.html): `CosineHungarianParam`. - [*ModifiedCosineParam*](https://matchms.readthedocs.io/en/latest/api/matchms.similarity.ModifiedCosine.html): `ModifiedCosineParam`. We next create some simple example spectra and subsequently use the `compareSpectriPy` function to calculate pairwise similarities between these. ```{r, message = FALSE} library(Spectra) library(SpectriPy) ## Create a Spectra object with two MS2 spectra for Caffeine. caf <- DataFrame( msLevel = c(2L, 2L), name = "Caffeine", precursorMz = c(195.0877, 195.0877) ) caf$intensity <- list( c(340.0, 416, 2580, 412), c(388.0, 3270, 85, 54, 10111)) caf$mz <- list( c(135.0432, 138.0632, 163.0375, 195.0880), c(110.0710, 138.0655, 138.1057, 138.1742, 195.0864)) caf <- Spectra(caf) ## Create a Spectra object with two MS2 spectra for 1-Methylhistidine mhd <- DataFrame( msLevel = c(2L, 2L), precursorMz = c(170.0924, 170.0924), id = c("HMDB0000001", "HMDB0000001"), name = c("1-Methylhistidine", "1-Methylhistidine")) mhd$mz <- list( c(109.2, 124.2, 124.5, 170.16, 170.52), c(83.1, 96.12, 97.14, 109.14, 124.08, 125.1, 170.16)) mhd$intensity <- list( c(3.407, 47.494, 3.094, 100.0, 13.240), c(6.685, 4.381, 3.022, 16.708, 100.0, 4.565, 40.643)) mhd <- Spectra(mhd) ``` We first calculate pairwise similarities between all spectra defined above and those of caffeine using `Spectra`'s built-in `compareSpectra` function. ```{r} all <- c(caf, mhd) res_r <- compareSpectra(all, caf) res_r ``` Thus, `compareSpectra` returned the pairwise similarity scores (by default calculated using the normalized dot-product function) between all spectra in `all` (rows) and all spectra in `caf` (columns). `compareSpectriPy` works similar, with the difference that we need to specify and configure the similarity function (from `matchms`) using a dedicated parameter object. Below we calculate the similarity using the *CosineGreedy* function changing the `tolerance` to a value of `0.05` (instead of the default `0.1`). ```{r} res <- compareSpectriPy(all, caf, param = CosineGreedyParam(tolerance = 0.05)) res ``` As a result `compareSpectriPy` returns also a numeric matrix of similarities. Note also that the first `compareSpectriPy` call takes usually a little longer because the Python setup has to be initialized. Next we use the *ModifiedCosine* algorithm that considers also differences between the spectra's precursor m/z in the calculation. ```{r} res <- compareSpectriPy(all, caf, param = ModifiedCosineParam()) res ``` Note that for this calculation all spectra precursor m/z values need to be available, otherwise an error will be thrown. Thus, we should always ensure to remove spectra without precursor m/z values prior to similarity scoring with this similarity method. Below we remove the precursor m/z from one of our input spectra and then show how the `Spectra` object could be subsetted to *valid* spectra for this method. ```{r} ## Remove precursor m/z from the 3rd spectrum all$precursorMz[3] <- NA ## Filter the input spectra removing those with missing precursor. all <- all[!is.na(precursorMz(all))] compareSpectriPy(all, caf, param = ModifiedCosineParam()) ``` # Advanced use and internals For advanced users or developers, the `rspec_to_pyspec` and `pyspec_to_rspec` functions are available that enable conversion between MS data representations in `R` and Python (i.e. between the `Spectra` object and the Python `matchms` Spectrum object). To use these functions the `reticulate` package needs to be installed along with a Python environment and the `matchms` Python package. To illustrate their use we initialize below the Python environment that is bundled using the `r BiocStyle::Biocpkg("basilisk")` package within `SpectriPy`. ```{r setup-basilisk} library(SpectriPy) library(basilisk) cl <- basiliskStart(SpectriPy:::matchms_env) ``` We next create a simple `Spectra` object representing fragment spectra for some small compounds. ```{r} library(Spectra) # create example spectra spd <- DataFrame( msLevel = c(2L, 2L, 2L), id = c("HMDB0000001", "HMDB0000001", "HMDB0001847"), name = c("1-Methylhistidine", "1-Methylhistidine", "Caffeine")) ## Assign m/z and intensity values. spd$mz <- list( c(109.2, 124.2, 124.5, 170.16, 170.52), c(83.1, 96.12, 97.14, 109.14, 124.08, 125.1, 170.16), c(56.0494, 69.0447, 83.0603, 109.0395, 110.0712, 111.0551, 123.0429, 138.0662, 195.0876)) spd$intensity <- list( c(3.407, 47.494, 3.094, 100.0, 13.240), c(6.685, 4.381, 3.022, 16.708, 100.0, 4.565, 40.643), c(0.459, 2.585, 2.446, 0.508, 8.968, 0.524, 0.974, 100.0, 40.994)) sps <- Spectra(spd) ``` We next convert the `Spectra` to a `matchms` Spectrum object. ```{r} pysps <- rspec_to_pyspec(sps) pysps ``` As a result we got now a Python list with 3 Spectrum objects containing the peak data of `sps` as well as a reduced set of the available spectra variables in `sps`. Which spectra variables get copied to Python can be defined with the `mapping` parameter of `rspec_to_pyspec`. The default is to convert all variables defined by `spectraVariableMapping()`, but additional variables along with their respective names in the Python Spectrum object can be defined too. Below we list the pre-selected spectra variables that are converted by default. ```{r} spectraVariableMapping() ``` With our `Spectra` data converted to Python, we could directly call all routines from the `matchms` package using the `reticulate` R package. Below we normalize the intensities of the 3 spectra using the from `normalize_intensities` functions from the `matchms.filtering` Python package. We thus need to import first the functionality from that package and can then directly call the function on the objects. ```{r} library(reticulate) filters <- import("matchms.filtering") res <- vector("list", length(pysps)) for (i in (seq_along(pysps) - 1)) res[[i + 1]] <- filters$normalize_intensities(pysps[i]) res <- r_to_py(res) res ``` We can now convert the list of Python Spectrum objects back to R with `pyspec_to_rspec`: ```{r} sps_r <- pyspec_to_rspec(res) #' The normalized intensities intensity(sps_r) #' The original intensities intensity(sps) ``` Intensity values have now been normalized to values between 0 and 1. Note however that, it would be much better (and likely more efficient) to directly use a Python function on the list of Spectrum objects instead of the mixed R and Python code used in the example above. Below we define a simple python script that iterates over the spectra in python and performs the normalization. ```{r} py_script <- paste0("from matchms.filtering import normalize_intensities\n", "for i in range(len(pysps)):\n", " pysps[i] = normalize_intensities(pysps[i])\n") py$pysps <- pysps py_run_string(py_script) tmp <- pyspec_to_rspec(pysps) intensity(tmp) ``` At very last we need also to stop the Python environment created by `basilisk`. ```{r} basiliskStop(cl) ``` # Session information ```{r} sessionInfo() ```