--- title: "Predict formula and structure of chromatographic peaks from an XcmsExperiment object Sirius through the RuSirius package." output: BiocStyle::html_document: toc_float: true vignette: > %\VignetteIndexEntry{Predict formula and structure of chromatographic peaks from an XcmsExperiment object Sirius through the RuSirius package.} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} %\VignettePackage{RuSirius} %\VignetteDepends{Spectra, MsExperiment, xcms, RSirius, MetaboAnnotation, RuSirius, dplyr} --- ``` r library(Spectra) library(MsExperiment) library(xcms) #> Warning in fun(libname, pkgname): mzR has been built against a different Rcpp version (1.1.0) #> than is installed on your system (1.1.1). This might lead to errors #> when loading mzR. If you encounter such issues, please send a report, #> including the output of sessionInfo() to the Bioc support forum at #> https://support.bioconductor.org/. For details see also #> https://github.com/sneumann/mzR/wiki/mzR-Rcpp-compiler-linker-issue. library(RSirius) library(MetaboAnnotation) library(RuSirius) library(dplyr) library(MsDataHub) ``` ## Introduction **Note**: this vignette is [**pre-computed**](https://ropensci.org/blog/2019/12/08/precompute-vignettes/). See the session info for information on packages used and the date the vignette was rendered. The vignette requires a running [Sirius](https://bio.informatik.uni-jena.de/software/sirius/) instance and a Sirius account for structure database searches. To reproduce this analysis, you will need to log in to your own Sirius account. This vignette demonstrates a basic workflow for importing detected *chromatographic peaks* from an `XcmsExperiment` object into *Sirius*. It then runs Sirius's main tools: formula identification, structure database search, compound class prediction, spectral library matching, *de novo* structure prediction, and finally retrieves the results. This is a foundational example and does not cover all the possible parameters for each Sirius tool. For detailed parameter information, consult the `run()` function documentation. More information can be found in the [Sirius documentation online](https://v6.docs.sirius-ms.io/). While this vignette focuses on chromatographic peaks detected with *xcms*, a similar workflow applies to features (grouped chromatographic peaks). The vignette for features is pending the availability of public data, but the steps for data preparation differ only slightly and can be adapted without issue. **IMPORTANT:** This is a work in progress. Feedback is highly valued, especially regarding enhancements or additions that could simplify your workflow. Your input as a user is essential. ## Preprocessing Here, we apply pre-optimized parameters for processing the example dataset. ``` r dda_file <- MsDataHub::PestMix1_DDA.mzML() dda_data <- readMsExperiment(dda_file) spectra(dda_data) <- setBackend(spectra(dda_data), MsBackendMemory()) dda_data <- filterRt(dda_data, rt = c(230, 610)) dda_data |> spectra() |> msLevel() |> table() #> #> 1 2 #> 1389 2214 prec_int <- estimatePrecursorIntensity(spectra(dda_data)) cwp <- CentWaveParam(snthresh = 5, noise = 100, ppm = 10, peakwidth = c(3, 30)) dda_data <- findChromPeaks(dda_data, param = cwp, msLevel = 1L) ``` ## MS1 and MS2 Extraction MS2 and MS1 spectra corresponding to the identified chromatographic peaks are extracted. Each feature imported in Sirius can only have one MS1 spectrum. However MS2 can also be loaded by itself wiht no MS1 if wanted. ``` r # Extract MS1 and MS2 spectra linked to chromatographic peaks ms2 <- chromPeakSpectra(dda_data, expandMz = 0.01, expandRt = 3, msLevel = 2L) low_int <- function(x, ...) x > max(x, na.rm = TRUE) * 0.05 ms2 <- filterIntensity(ms2, intensity = low_int) ms2 <- ms2[lengths(ms2) > 1] ms1 <- chromPeakSpectra(dda_data, expandMz = 0.01, expandRt = 3, msLevel = 1L, method = "closest_rt") ms1 <- applyProcessing(ms1) ms2 <- applyProcessing(ms2) ms1_ms2 <- concatenateSpectra(ms1, ms2) ``` ## Open Sirius and project set up The Sirius application is initialized via the API, requiring only a project ID. If the project exists, it is opened; otherwise, a new project is created. The `srs` object acts as the connection to Sirius and holds project details. Properly shut down the connection with `shutdown(srs)` after completing your work. This `srs` variable is needed for any task that necessitate to communicate with the application. You can learn more about this object class by running `?Sirius` in the console. Parameter `port` used below allows to configure the *Sirius* application to use a particular port, but generally the function can be used without specifying a port. ``` r # Initialize Sirius connection srs <- Sirius(projectId = "test_lcmsms", port = 9999) #> Error in `Sirius()`: #> ! unused argument (port = 9999) srs #> Error: #> ! object 'srs' not found ``` If the user wants to open and perform computations on a different project they can use the utility function below: ``` r srs <- openProject(srs, projectId = "test_lcmsms2", path = getwd()) #> Error in `openProject()`: #> ! The connection to the Sirius instance is not valid. srs #> Error: #> ! object 'srs' not found ``` You can find all the utility functions of this package by running `?utils` in the console. **NOTE** if you have any idea of utility functions that could be implemented do not hesitate to ask. ## Data import Preprocessed `xcms` data is imported into Sirius, and a summary `data.frame` is returned with feature information. This information can also be retrieved using the utility function `featuresInfo()`. ``` r ## load back our original project srs <- openProject(srs, "test_lcmsms") #> Error in `openProject()`: #> ! The connection to the Sirius instance is not valid. ## Import data into Sirius srs <- import(sirius = srs, spectra = ms1_ms2, ms_column_name = "chrom_peak_id", deleteExistingFeatures = TRUE) #> Error: #> ! object 'srs' not found ## See information about the features featuresInfo(srs) |> head() #> Error in `h()`: #> ! error in evaluating the argument 'x' in selecting a method for function 'head': object 'srs' not found ``` Notes: - It could also be discussed that this `data.frame` could be stored direction into the `srs` object - When running `import()` i automatically create a mapping data.frame between the *xcms* feature ID and the *Sirius* feature ID. It is stored in the `srs` object, the `featureMap` slot. This can be used in the future so the user never need to interact with the *Sirius* IDs. Below is an example of how to extract features ID, the utility function `featuresId()` quickly extract all available ID either `sirius` or `xcms`. ``` r fts_id <- featuresId(srs, type = "sirius") #> Error: #> ! object 'srs' not found ``` ## Searchable database !!! Note: The database code does not work at the moment, please load them through the GUI until it's fixed. !!! Whether it is for structure prediction or spectral library matching, users can upload their custom databases into Sirius. In this vignette, we demonstrate how to test spectral library matching by creating and loading a custom database into Sirius. This process can also be completed easily via the Sirius graphical user interface (GUI). If you prefer an interactive approach, you can use the `openGUI(srs)` command to open the Sirius app and manage your database directly. In this example, we download the MassBank library from GNPS, which needs to be loaded into Sirius to generate a `.sirius.db` file. Below we will download in our current directory but you can precise where you want to save it using `location =` parameter. ``` r ## Download the MassBank EU library download.file("https://external.gnps2.org/gnpslibrary/MASSBANKEU.mgf", destfile = "MASSBANKEU.mgf") createDb(srs, databaseId = "massbankeuCustom", files = "MASSBANKEU.mgf") ``` NOTE: THis takes quite a while, will change to a smaller database later. Once the database is created and loaded, you can verify its successful import by running the following command: ``` r listDbs(srs) ``` Find more on how to handle databases in Sirius by typing `?siriusDBs` in the console. ## Submit job to Sirius - For structure DB search Annotation and prediction begin after data import. The `run()` function accepts parameters for each Sirius tool, such as formula identification, structure database search, and compound class prediction. Parameters can also specify adducts or custom databases. Detailed documentation for these parameters is available in the `run()` function's help file. The `wait` parameter ensures the function waits for job completion before proceeding. If set to `FALSE`, the job ID is returned, and the user must check the status using `jobInfo()`. ``` r ## Start computation job_id <- run(srs, fallbackAdducts = c("[M + H]+", "[M + Na]+", "[M + K]+"), formulaIdParams = formulaIdParam(numberOfCandidates = 5, instrument = "QTOF", numberOfCandidatesPerIonization = 2, massAccuracyMS2ppm = 10, filterByIsotopePattern = FALSE, isotopeMs2Settings = c("SCORE"), performDeNovoBelowMz = 600, minPeaksToInjectSpecLibMatch = 3), predictParams = predictParam(), structureDbSearchParams = structureDbSearchParam( structureSearchDbs = c("BIO") ), recompute = TRUE, wait = TRUE ) #> Error: #> ! object 'srs' not found srs #> Error: #> ! object 'srs' not found ``` ``` r ## Get more info for the job jobInfo(srs, job_id) |> cat() #> Error: #> ! object 'job_id' not found ``` ## Retrieve Results To obtain a summary of all results, including the top formulas, structures, and compound class predictions, use the following code. This summary table provides a quick overview to evaluate whether the results align with expectations. However, we recommend not relying on this table as-is for detailed analysis. Instead, use the functions described later in this vignette to explore the results in greater depth. An important aspect of the summary table is the confidence-related columns, which provide insight into the reliability of the predictions. ``` r summarytb <- summary(srs, result.type = "structure") #> Error: #> ! object 'srs' not found head(summarytb) #> Error in `h()`: #> ! error in evaluating the argument 'x' in selecting a method for function 'head': object 'summarytb' not found ``` ## Formula identification results: For detailed results, the results() function can be used with the `result.type` parameter set to `"formulaId"`, `"structureDb"`, `"compoundClass"`, or `"deNovo"`. Note that all results are linked to a predicted formula. The parameters `topFormula` and `topStructure` allow users to specify how many formulas or structures should be included in the output. The results can be returned either as a list or a data.frame, based on the return.type parameter. Note: Suggestions for renaming the `results()` function or feedback on this implementation are welcome. We aim to adapt based on user needs. ``` r results(srs, return.type = "data.frame", result.type = "formulaId", topFormula = 5) #> Error: #> ! object 'srs' not found ``` ## Structure DBs search results The following example shows the top two structure annotations for the top five formulas of each feature. This can provide an insightful view into the structural predictions. ``` r finalstructredb <- results(srs, return.type = "data.frame", result.type = "structureDb", topFormula = 5, topStructure = 2) #> Error: #> ! object 'srs' not found head(finalstructredb) #> Error in `h()`: #> ! error in evaluating the argument 'x' in selecting a method for function 'head': object 'finalstructredb' not found ``` For a more visual exploration of the results, you can open the Sirius GUI with the commands below: ``` r openGUI(srs) closeGUI(srs) ``` ## Compound class prediction results To retrieve compound class predictions, use the following code. Below is an example showing all compound annotations with confidence scores above 50% for the top two formulas of each feature. ``` r finalcomp <- results(srs, return.type = "data.frame", result.type = "compoundClass", topFormula = 2) #> Error: #> ! object 'srs' not found head(finalcomp) #> Error in `h()`: #> ! error in evaluating the argument 'x' in selecting a method for function 'head': object 'finalcomp' not found ``` ## Spectral library matching results The following code gives you a summary of the best matches: ``` r summaryspectra <- summary(srs, result.type = "spectralDbMatch") #> Error: #> ! object 'srs' not found head(summaryspectra) #> Error in `h()`: #> ! error in evaluating the argument 'x' in selecting a method for function 'head': object 'summaryspectra' not found ``` For detailed results, use the following code: ``` r full_spectral <- results(srs, return.type = "data.frame", result.type = "spectralDbMatch", topSpectralMatches = 2) #> Error: #> ! object 'srs' not found head(full_spectral) #> Error in `h()`: #> ! error in evaluating the argument 'x' in selecting a method for function 'head': object 'full_spectral' not found ``` ## Fragmentation tree results Below we show how to get the fragmentation tree for the top2 formula of some feautres. This is quite inefficient at the moment so limit it to a little number of feature. I will improve it. ``` r resulttree <- results(srs, features = featuresId(srs)[1:5], return.type = "list", result.type = "fragTree", topFormula = 4, ) #> Error: #> ! object 'srs' not found head(resulttree) #> Error in `h()`: #> ! error in evaluating the argument 'x' in selecting a method for function 'head': object 'resulttree' not found ``` ## Submit job to Sirius - For De Novo structure annotation. *De novo* structure annotation is computationally intensive and recommended only for specific features. ``` r # Determine features that do not have/have poor structure prediction fts_denovo <-summarytb$alignedFeatureId[which( summarytb$confidenceApproxMatch < 0.3 | summarytb$confidenceApproxMatch %in% c("NA", "-Infinity"))] #> Error: #> ! object 'summarytb' not found ``` ``` r # Compute with zodiac and denovo job_id <- run(srs, msNovelistParams = deNovoStructureParam(numberOfCandidateToPredict = 5), alignedFeaturesIds = fts_denovo, recompute = FALSE, wait = TRUE ) #> Error: #> ! object 'fts_denovo' not found ## Get info for the job jobInfo(srs, job_id) |> cat() #> Error: #> ! object 'job_id' not found ``` ## Retrieve results ``` r summraryDeNovo <- summary(srs, result.type = "deNovo") #> Error: #> ! object 'srs' not found head(summraryDeNovo) #> Error in `h()`: #> ! error in evaluating the argument 'x' in selecting a method for function 'head': object 'summraryDeNovo' not found ``` Below is the full results. ``` r full_de_novo <- results(srs, return.type = "data.frame", result.type = "deNovo", topFormula = 5) #> Error: #> ! object 'srs' not found head(full_de_novo) #> Error in `h()`: #> ! error in evaluating the argument 'x' in selecting a method for function 'head': object 'full_de_novo' not found ``` ## Setting a Known Molecular Formula When you already know (or suspect) the molecular formula and adduct for a feature, you can restrict Sirius formula identification to that candidate. This is useful for targeted analyses where you want to compute the fragmentation tree for a known compound rather than performing an open search. Use the `candidateFormulas` parameter of `formulaIdParam()` together with `enforceAdducts` to lock in both the formula and adduct: ``` r # Select a feature with MS2 data (e.g. Carbaryl, [M+H]+ at m/z ~220.10) fi <- featuresInfo(srs) #> Error: #> ! object 'srs' not found target_ft <- fi[fi[, "hasMsMs"] == TRUE, "alignedFeatureId"][[1]] #> Error: #> ! object 'fi' not found # Run formula identification restricted to a single known formula job_id <- run(srs, alignedFeaturesIds = target_ft, enforceAdducts = "[M+H]+", formulaIdParams = formulaIdParam( candidateFormulas = "C12H13NO3" ), recompute = TRUE, wait = TRUE) #> Error in `formulaIdParam()`: #> ! unused argument (candidateFormulas = "C12H13NO3") ``` You can also supply multiple candidate formulas if you want to compare a small set of possibilities: ``` r job_id <- run(srs, alignedFeaturesIds = target_ft, enforceAdducts = "[M+H]+", formulaIdParams = formulaIdParam( candidateFormulas = c("C12H13NO3", "C10H12N2O") ), recompute = TRUE, wait = TRUE) #> Error in `formulaIdParam()`: #> ! unused argument (candidateFormulas = c("C12H13NO3", "C10H12N2O")) ``` After the job completes, retrieve the formula results and fragmentation tree as usual: ``` r # Formula results - should only contain the specified candidate(s) results(srs, features = target_ft, result.type = "formulaId", return.type = "data.frame") #> Error: #> ! object 'target_ft' not found # Fragmentation tree for the known formula results(srs, features = target_ft, result.type = "fragTree", topFormula = 1, return.type = "list") #> Error: #> ! object 'target_ft' not found ``` ## CleanUp ``` r # Close the Sirius session shutdown(srs) #> Warning in value[[3L]](cond): Could not retrieve open projects: object 'srs' not found #> Warning in doTryCatch(return(expr), name, parentenv, handler): restarting interrupted #> promise evaluation ``` # Session information The R code was run on: ``` r date() #> [1] "Mon Mar 23 11:26:36 2026" ``` Information on the R session: ``` r sessionInfo() #> R version 4.5.2 (2025-10-31 ucrt) #> Platform: x86_64-w64-mingw32/x64 #> Running under: Windows 11 x64 (build 26100) #> #> Matrix products: default #> LAPACK version 3.12.1 #> #> locale: #> [1] LC_COLLATE=English_United States.utf8 LC_CTYPE=English_United States.utf8 #> [3] LC_MONETARY=English_United States.utf8 LC_NUMERIC=C #> [5] LC_TIME=English_United States.utf8 #> #> time zone: Europe/Rome #> tzcode source: internal #> #> attached base packages: #> [1] stats4 stats graphics grDevices utils datasets methods base #> #> other attached packages: #> [1] MsDataHub_1.10.0 dplyr_1.2.0 RuSirius_0.2.0 #> [4] jsonlite_2.0.0 MetaboAnnotation_1.14.0 RSirius_6.3.3 #> [7] xcms_4.8.0 MsExperiment_1.12.0 ProtGenerics_1.42.0 #> [10] Spectra_1.20.1 BiocParallel_1.44.0 S4Vectors_0.48.0 #> [13] BiocGenerics_0.56.0 generics_0.1.4 #> #> loaded via a namespace (and not attached): #> [1] RColorBrewer_1.1-3 MultiAssayExperiment_1.36.1 magrittr_2.0.4 #> [4] farver_2.1.2 MALDIquant_1.22.3 fs_1.6.6 #> [7] vctrs_0.7.1 memoise_2.0.1 RCurl_1.98-1.17 #> [10] base64enc_0.1-6 htmltools_0.5.9 S4Arrays_1.10.1 #> [13] BiocBaseUtils_1.12.0 progress_1.2.3 curl_7.0.0 #> [16] AnnotationHub_4.0.0 SparseArray_1.10.8 mzID_1.48.0 #> [19] htmlwidgets_1.6.4 plyr_1.8.9 httr2_1.2.2 #> [22] impute_1.84.0 cachem_1.1.0 igraph_2.2.1 #> [25] lifecycle_1.0.5 iterators_1.0.14 pkgconfig_2.0.3 #> [28] Matrix_1.7-4 R6_2.6.1 fastmap_1.2.0 #> [31] MatrixGenerics_1.22.0 clue_0.3-66 digest_0.6.39 #> [34] pcaMethods_2.2.0 rsvg_2.7.0 AnnotationDbi_1.72.0 #> [37] ExperimentHub_3.0.0 GenomicRanges_1.62.1 RSQLite_2.4.5 #> [40] filelock_1.0.3 httr_1.4.7 abind_1.4-8 #> [43] compiler_4.5.2 withr_3.0.2 bit64_4.6.0-1 #> [46] doParallel_1.0.17 S7_0.2.1 DBI_1.2.3 #> [49] MASS_7.3-65 ChemmineR_3.62.0 rappdirs_0.3.4 #> [52] DelayedArray_0.36.0 rjson_0.2.23 mzR_2.44.0 #> [55] tools_4.5.2 PSMatch_1.14.0 otel_0.2.0 #> [58] CompoundDb_1.14.2 glue_1.8.0 QFeatures_1.20.0 #> [61] grid_4.5.2 cluster_2.1.8.1 reshape2_1.4.5 #> [64] snow_0.4-4 gtable_0.3.6 preprocessCore_1.72.0 #> [67] tidyr_1.3.2 data.table_1.18.2.1 hms_1.1.4 #> [70] MetaboCoreUtils_1.19.2 xml2_1.5.2 XVector_0.50.0 #> [73] BiocVersion_3.22.0 foreach_1.5.2 pillar_1.11.1 #> [76] stringr_1.6.0 limma_3.66.0 BiocFileCache_3.0.0 #> [79] lattice_0.22-7 bit_4.6.0 tidyselect_1.2.1 #> [82] Biostrings_2.78.0 knitr_1.51 gridExtra_2.3 #> [85] IRanges_2.44.0 Seqinfo_1.0.0 SummarizedExperiment_1.40.0 #> [88] xfun_0.56 Biobase_2.70.0 statmod_1.5.1 #> [91] MSnbase_2.36.0 matrixStats_1.5.0 DT_0.34.0 #> [94] stringi_1.8.7 yaml_2.3.12 lazyeval_0.2.2 #> [97] evaluate_1.0.5 codetools_0.2-20 MsCoreUtils_1.22.1 #> [100] tibble_3.3.1 BiocManager_1.30.27 cli_3.6.5 #> [103] affyio_1.80.0 Rcpp_1.1.1 MassSpecWavelet_1.76.0 #> [106] dbplyr_2.5.1 png_0.1-8 XML_3.99-0.20 #> [109] parallel_4.5.2 ggplot2_4.0.2 blob_1.3.0 #> [112] prettyunits_1.2.0 AnnotationFilter_1.34.0 bitops_1.0-9 #> [115] MsFeatures_1.18.0 scales_1.4.0 affy_1.88.0 #> [118] ncdf4_1.24 purrr_1.2.1 crayon_1.5.3 #> [121] rlang_1.1.7 KEGGREST_1.50.0 vsn_3.78.1 ```