---
title: "Predict formula and structure of chromatographic peaks from an XcmsExperiment object Sirius through the RuSirius package."
output:
    BiocStyle::html_document:
        toc_float: true
vignette: >
    %\VignetteIndexEntry{Predict formula and structure of chromatographic peaks from an XcmsExperiment object Sirius through the RuSirius package.}
    %\VignetteEngine{knitr::rmarkdown}
    %\VignetteEncoding{UTF-8}
    %\VignettePackage{RuSirius}
    %\VignetteDepends{Spectra, MsExperiment, xcms, RSirius, MetaboAnnotation, RuSirius, dplyr}
---

<!--
# Pre-render with (in the vignettes folder)
knitr::knit("Chromatographic_peak_annotation_public_dataset.Rmd.orig", output = "Chromatographic_peak_annotation_public_dataset.Rmd")
-->


``` r
library(Spectra)
library(MsExperiment)
library(xcms)
#> Warning in fun(libname, pkgname): mzR has been built against a different Rcpp version (1.1.0)
#> than is installed on your system (1.1.1). This might lead to errors
#> when loading mzR. If you encounter such issues, please send a report,
#> including the output of sessionInfo() to the Bioc support forum at 
#> https://support.bioconductor.org/. For details see also
#> https://github.com/sneumann/mzR/wiki/mzR-Rcpp-compiler-linker-issue.
library(RSirius)
library(MetaboAnnotation)
library(RuSirius)
library(dplyr)
library(MsDataHub)
```

## Introduction

**Note**: this vignette is
[**pre-computed**](https://ropensci.org/blog/2019/12/08/precompute-vignettes/).
See the session info for information on packages used and the date the vignette
was rendered. The vignette requires a running
[Sirius](https://bio.informatik.uni-jena.de/software/sirius/) instance and
a Sirius account for structure database searches. To reproduce this analysis,
you will need to log in to your own Sirius account.

This vignette demonstrates a basic workflow for importing detected
*chromatographic peaks* from an `XcmsExperiment` object into *Sirius*. It then
runs Sirius's main tools: formula identification, structure database search,
compound class prediction, spectral library matching, *de novo* structure
prediction, and finally retrieves the results.

This is a foundational example and does not cover all the possible parameters
for each Sirius tool. For detailed parameter information, consult the `run()`
function documentation. More information can be found in the [Sirius
documentation online](https://v6.docs.sirius-ms.io/).

While this vignette focuses on chromatographic peaks detected with *xcms*, a
similar workflow applies to features (grouped chromatographic peaks). The
vignette for features is pending the availability of public data, but the steps
for data preparation differ only slightly and can be adapted without issue.

**IMPORTANT:** This is a work in progress. Feedback is highly valued, especially
regarding enhancements or additions that could simplify your workflow. Your
input as a user is essential.

## Preprocessing

Here, we apply pre-optimized parameters for processing the example dataset.


``` r
dda_file <- MsDataHub::PestMix1_DDA.mzML()
dda_data <- readMsExperiment(dda_file)
spectra(dda_data) <- setBackend(spectra(dda_data), MsBackendMemory())
dda_data <- filterRt(dda_data, rt = c(230, 610))

dda_data |>
    spectra() |>
    msLevel() |>
    table()
#> 
#>    1    2 
#> 1389 2214

prec_int <- estimatePrecursorIntensity(spectra(dda_data))
cwp <- CentWaveParam(snthresh = 5, noise = 100, ppm = 10,
                     peakwidth = c(3, 30))
dda_data <- findChromPeaks(dda_data, param = cwp, msLevel = 1L)
```

## MS1 and MS2 Extraction

MS2 and MS1 spectra corresponding to the identified chromatographic peaks are
extracted. Each feature imported in Sirius can only have one MS1 spectrum.
However MS2 can also be loaded by itself wiht no MS1 if wanted.


``` r
# Extract MS1 and MS2 spectra linked to chromatographic peaks
ms2 <- chromPeakSpectra(dda_data,
                        expandMz = 0.01, expandRt = 3,
                        msLevel = 2L)
low_int <- function(x, ...) x > max(x, na.rm = TRUE) * 0.05
ms2 <- filterIntensity(ms2, intensity = low_int)
ms2 <- ms2[lengths(ms2) > 1]

ms1 <- chromPeakSpectra(dda_data,
                        expandMz = 0.01, expandRt = 3,
                        msLevel = 1L, method = "closest_rt")

ms1 <- applyProcessing(ms1)
ms2 <- applyProcessing(ms2)
ms1_ms2 <- concatenateSpectra(ms1, ms2)
```

## Open Sirius and project set up

The Sirius application is initialized via the API, requiring only a project ID.
If the project exists, it is opened; otherwise, a new project is created. The
`srs` object acts as the connection to Sirius and holds project details.
Properly shut down the connection with `shutdown(srs)` after completing your
work.

This `srs` variable is needed for any task that necessitate to communicate with
the application. You can learn more about this object class by running `?Sirius`
in the console. Parameter `port` used below allows to configure the *Sirius*
application to use a particular port, but generally the function can be used
without specifying a port.


``` r
# Initialize Sirius connection
srs <- Sirius(projectId = "test_lcmsms", port = 9999)
#> Error in `Sirius()`:
#> ! unused argument (port = 9999)
srs
#> Error:
#> ! object 'srs' not found
```

If the user wants to open and perform computations on a different project they
can use the utility function below:


``` r
srs <- openProject(srs, projectId = "test_lcmsms2", path = getwd())
#> Error in `openProject()`:
#> ! The connection to the Sirius instance is not valid.
srs
#> Error:
#> ! object 'srs' not found
```

You can find all the utility functions of this package by running `?utils` in
the console.

**NOTE** if you have any idea of utility functions that could be implemented do
not hesitate to ask.

## Data import

Preprocessed `xcms` data is imported into Sirius, and a summary `data.frame` is
returned with feature information. This information can also be retrieved using
the utility function `featuresInfo()`.


``` r
## load back our original project
srs <- openProject(srs, "test_lcmsms")
#> Error in `openProject()`:
#> ! The connection to the Sirius instance is not valid.

## Import data into Sirius
srs <- import(sirius = srs,
              spectra = ms1_ms2,
              ms_column_name = "chrom_peak_id",
              deleteExistingFeatures = TRUE)
#> Error:
#> ! object 'srs' not found

## See information about the features
featuresInfo(srs) |> head()
#> Error in `h()`:
#> ! error in evaluating the argument 'x' in selecting a method for function 'head': object 'srs' not found
```

Notes:

-   It could also be discussed that this `data.frame` could be stored direction
    into the `srs` object
-   When running `import()` i automatically create a mapping data.frame between
    the *xcms* feature ID and the *Sirius* feature ID. It is stored in the `srs`
    object, the `featureMap` slot. This can be used in the future so the user
    never need to interact with the *Sirius* IDs.

Below is an example of how to extract features ID, the utility function
`featuresId()` quickly extract all available ID either `sirius` or `xcms`.


``` r
fts_id <- featuresId(srs, type = "sirius")
#> Error:
#> ! object 'srs' not found
```

## Searchable database

!!! Note: The database code does not work at the moment, please load them through the GUI until it's fixed. !!!

Whether it is for structure prediction or spectral library matching, users can
upload their custom databases into Sirius. In this vignette, we demonstrate how
to test spectral library matching by creating and loading a custom database into
Sirius. This process can also be completed easily via the Sirius graphical user
interface (GUI). If you prefer an interactive approach, you can use the
`openGUI(srs)` command to open the Sirius app and manage your database directly.

In this example, we download the MassBank library from GNPS, which needs to be
loaded into Sirius to generate a `.sirius.db` file. Below we will download in
our current directory but you can precise where you want to save it using
`location =` parameter.


``` r
## Download the MassBank EU library
download.file("https://external.gnps2.org/gnpslibrary/MASSBANKEU.mgf",
              destfile = "MASSBANKEU.mgf")
createDb(srs, databaseId = "massbankeuCustom", files = "MASSBANKEU.mgf")
```

NOTE: THis takes quite a while, will change to a smaller database later. Once
the database is created and loaded, you can verify its successful import by
running the following command:


``` r
listDbs(srs)
```

Find more on how to handle databases in Sirius by typing `?siriusDBs` in the
console.

## Submit job to Sirius - For structure DB search

Annotation and prediction begin after data import. The `run()` function accepts
parameters for each Sirius tool, such as formula identification, structure
database search, and compound class prediction. Parameters can also specify
adducts or custom databases. Detailed documentation for these parameters is
available in the `run()` function's help file.

The `wait` parameter ensures the function waits for job completion before
proceeding. If set to `FALSE`, the job ID is returned, and the user must check
the status using `jobInfo()`.


``` r
## Start computation
job_id <- run(srs,
              fallbackAdducts = c("[M + H]+", "[M + Na]+", "[M + K]+"),
              formulaIdParams = formulaIdParam(numberOfCandidates = 5,
                                               instrument = "QTOF",
                                numberOfCandidatesPerIonization = 2,
                                massAccuracyMS2ppm = 10,
                                filterByIsotopePattern = FALSE,
                                isotopeMs2Settings = c("SCORE"),
                                performDeNovoBelowMz = 600,
                                minPeaksToInjectSpecLibMatch = 3),
              predictParams = predictParam(),
              structureDbSearchParams = structureDbSearchParam(
                  structureSearchDbs = c("BIO")
              ),
              recompute = TRUE,
              wait = TRUE
)
#> Error:
#> ! object 'srs' not found

srs
#> Error:
#> ! object 'srs' not found
```


``` r
## Get more info for the job
jobInfo(srs, job_id) |> cat()
#> Error:
#> ! object 'job_id' not found
```

## Retrieve Results

To obtain a summary of all results, including the top formulas, structures, and
compound class predictions, use the following code. This summary table provides
a quick overview to evaluate whether the results align with expectations.
However, we recommend not relying on this table as-is for detailed analysis.
Instead, use the functions described later in this vignette to explore the
results in greater depth.

An important aspect of the summary table is the confidence-related columns,
which provide insight into the reliability of the predictions.


``` r
summarytb <- summary(srs, result.type = "structure")
#> Error:
#> ! object 'srs' not found
head(summarytb)
#> Error in `h()`:
#> ! error in evaluating the argument 'x' in selecting a method for function 'head': object 'summarytb' not found
```

## Formula identification results:

For detailed results, the results() function can be used with the `result.type`
parameter set to `"formulaId"`, `"structureDb"`, `"compoundClass"`, or
`"deNovo"`. Note that all results are linked to a predicted formula.

The parameters `topFormula` and `topStructure` allow users to specify how many
formulas or structures should be included in the output. The results can be
returned either as a list or a data.frame, based on the return.type parameter.

Note: Suggestions for renaming the `results()` function or feedback on this
implementation are welcome. We aim to adapt based on user needs.


``` r
results(srs,
       return.type = "data.frame",
       result.type = "formulaId",
       topFormula = 5)
#> Error:
#> ! object 'srs' not found
```

## Structure DBs search results

The following example shows the top two structure annotations for the top five
formulas of each feature. This can provide an insightful view into the
structural predictions.


``` r
finalstructredb <- results(srs,
                           return.type = "data.frame",
                           result.type = "structureDb",
                           topFormula = 5,
                           topStructure = 2)
#> Error:
#> ! object 'srs' not found

head(finalstructredb)
#> Error in `h()`:
#> ! error in evaluating the argument 'x' in selecting a method for function 'head': object 'finalstructredb' not found
```

For a more visual exploration of the results, you can open the Sirius GUI with
the commands below:


``` r
openGUI(srs)
closeGUI(srs)
```

## Compound class prediction results

To retrieve compound class predictions, use the following code. Below is an
example showing all compound annotations with confidence scores above 50% for
the top two formulas of each feature.


``` r
finalcomp <- results(srs,
                     return.type = "data.frame",
                     result.type = "compoundClass",
                     topFormula = 2)
#> Error:
#> ! object 'srs' not found
head(finalcomp)
#> Error in `h()`:
#> ! error in evaluating the argument 'x' in selecting a method for function 'head': object 'finalcomp' not found
```

## Spectral library matching results

The following code gives you a summary of the best matches:


``` r
summaryspectra <- summary(srs, result.type = "spectralDbMatch")
#> Error:
#> ! object 'srs' not found
head(summaryspectra)
#> Error in `h()`:
#> ! error in evaluating the argument 'x' in selecting a method for function 'head': object 'summaryspectra' not found
```

For detailed results, use the following code:


``` r
full_spectral <- results(srs,
                         return.type = "data.frame",
                         result.type = "spectralDbMatch",
                         topSpectralMatches = 2)
#> Error:
#> ! object 'srs' not found
head(full_spectral)
#> Error in `h()`:
#> ! error in evaluating the argument 'x' in selecting a method for function 'head': object 'full_spectral' not found
```

## Fragmentation tree results

Below we show how to get the fragmentation tree for the top2 formula of some
feautres. This is quite inefficient at the moment so limit it to a little number
of feature. I will improve it.


``` r
resulttree <- results(srs,
                      features = featuresId(srs)[1:5],
                      return.type = "list",
                      result.type = "fragTree",
                      topFormula = 4,
                     )
#> Error:
#> ! object 'srs' not found

head(resulttree)
#> Error in `h()`:
#> ! error in evaluating the argument 'x' in selecting a method for function 'head': object 'resulttree' not found
```

## Submit job to Sirius - For De Novo structure annotation.

*De novo* structure annotation is computationally intensive and recommended only
for specific features.


``` r
# Determine features that do not have/have poor structure prediction
fts_denovo <-summarytb$alignedFeatureId[which(
    summarytb$confidenceApproxMatch < 0.3 |
        summarytb$confidenceApproxMatch %in% c("NA", "-Infinity"))]
#> Error:
#> ! object 'summarytb' not found
```


``` r
# Compute with zodiac and denovo
job_id <- run(srs,
    msNovelistParams = deNovoStructureParam(numberOfCandidateToPredict = 5),
    alignedFeaturesIds = fts_denovo,
    recompute = FALSE,
    wait = TRUE
)
#> Error:
#> ! object 'fts_denovo' not found

## Get info for the job
jobInfo(srs, job_id) |> cat()
#> Error:
#> ! object 'job_id' not found
```

## Retrieve results


``` r
summraryDeNovo <- summary(srs, result.type = "deNovo")
#> Error:
#> ! object 'srs' not found
head(summraryDeNovo)
#> Error in `h()`:
#> ! error in evaluating the argument 'x' in selecting a method for function 'head': object 'summraryDeNovo' not found
```

Below is the full results.


``` r
full_de_novo <- results(srs,
                        return.type = "data.frame",
                        result.type = "deNovo",
                        topFormula = 5)
#> Error:
#> ! object 'srs' not found

head(full_de_novo)
#> Error in `h()`:
#> ! error in evaluating the argument 'x' in selecting a method for function 'head': object 'full_de_novo' not found
```

## Setting a Known Molecular Formula

When you already know (or suspect) the molecular formula and adduct for a
feature, you can restrict Sirius formula identification to that candidate. This
is useful for targeted analyses where you want to compute the fragmentation tree
for a known compound rather than performing an open search.

Use the `candidateFormulas` parameter of `formulaIdParam()` together with
`enforceAdducts` to lock in both the formula and adduct:


``` r
# Select a feature with MS2 data (e.g. Carbaryl, [M+H]+ at m/z ~220.10)
fi <- featuresInfo(srs)
#> Error:
#> ! object 'srs' not found
target_ft <- fi[fi[, "hasMsMs"] == TRUE, "alignedFeatureId"][[1]]
#> Error:
#> ! object 'fi' not found

# Run formula identification restricted to a single known formula
job_id <- run(srs,
              alignedFeaturesIds = target_ft,
              enforceAdducts = "[M+H]+",
              formulaIdParams = formulaIdParam(
                  candidateFormulas = "C12H13NO3"
              ),
              recompute = TRUE,
              wait = TRUE)
#> Error in `formulaIdParam()`:
#> ! unused argument (candidateFormulas = "C12H13NO3")
```

You can also supply multiple candidate formulas if you want to compare a small
set of possibilities:


``` r
job_id <- run(srs,
              alignedFeaturesIds = target_ft,
              enforceAdducts = "[M+H]+",
              formulaIdParams = formulaIdParam(
                  candidateFormulas = c("C12H13NO3", "C10H12N2O")
              ),
              recompute = TRUE,
              wait = TRUE)
#> Error in `formulaIdParam()`:
#> ! unused argument (candidateFormulas = c("C12H13NO3", "C10H12N2O"))
```

After the job completes, retrieve the formula results and fragmentation tree as
usual:


``` r
# Formula results - should only contain the specified candidate(s)
results(srs,
        features = target_ft,
        result.type = "formulaId",
        return.type = "data.frame")
#> Error:
#> ! object 'target_ft' not found

# Fragmentation tree for the known formula
results(srs,
        features = target_ft,
        result.type = "fragTree",
        topFormula = 1,
        return.type = "list")
#> Error:
#> ! object 'target_ft' not found
```

## CleanUp


``` r
# Close the Sirius session
shutdown(srs)
#> Warning in value[[3L]](cond): Could not retrieve open projects: object 'srs' not found
#> Warning in doTryCatch(return(expr), name, parentenv, handler): restarting interrupted
#> promise evaluation
```

# Session information

The R code was run on:


``` r
date()
#> [1] "Mon Mar 23 11:26:36 2026"
```

Information on the R session:


``` r
sessionInfo()
#> R version 4.5.2 (2025-10-31 ucrt)
#> Platform: x86_64-w64-mingw32/x64
#> Running under: Windows 11 x64 (build 26100)
#> 
#> Matrix products: default
#>   LAPACK version 3.12.1
#> 
#> locale:
#> [1] LC_COLLATE=English_United States.utf8  LC_CTYPE=English_United States.utf8   
#> [3] LC_MONETARY=English_United States.utf8 LC_NUMERIC=C                          
#> [5] LC_TIME=English_United States.utf8    
#> 
#> time zone: Europe/Rome
#> tzcode source: internal
#> 
#> attached base packages:
#> [1] stats4    stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#>  [1] MsDataHub_1.10.0        dplyr_1.2.0             RuSirius_0.2.0         
#>  [4] jsonlite_2.0.0          MetaboAnnotation_1.14.0 RSirius_6.3.3          
#>  [7] xcms_4.8.0              MsExperiment_1.12.0     ProtGenerics_1.42.0    
#> [10] Spectra_1.20.1          BiocParallel_1.44.0     S4Vectors_0.48.0       
#> [13] BiocGenerics_0.56.0     generics_0.1.4         
#> 
#> loaded via a namespace (and not attached):
#>   [1] RColorBrewer_1.1-3          MultiAssayExperiment_1.36.1 magrittr_2.0.4             
#>   [4] farver_2.1.2                MALDIquant_1.22.3           fs_1.6.6                   
#>   [7] vctrs_0.7.1                 memoise_2.0.1               RCurl_1.98-1.17            
#>  [10] base64enc_0.1-6             htmltools_0.5.9             S4Arrays_1.10.1            
#>  [13] BiocBaseUtils_1.12.0        progress_1.2.3              curl_7.0.0                 
#>  [16] AnnotationHub_4.0.0         SparseArray_1.10.8          mzID_1.48.0                
#>  [19] htmlwidgets_1.6.4           plyr_1.8.9                  httr2_1.2.2                
#>  [22] impute_1.84.0               cachem_1.1.0                igraph_2.2.1               
#>  [25] lifecycle_1.0.5             iterators_1.0.14            pkgconfig_2.0.3            
#>  [28] Matrix_1.7-4                R6_2.6.1                    fastmap_1.2.0              
#>  [31] MatrixGenerics_1.22.0       clue_0.3-66                 digest_0.6.39              
#>  [34] pcaMethods_2.2.0            rsvg_2.7.0                  AnnotationDbi_1.72.0       
#>  [37] ExperimentHub_3.0.0         GenomicRanges_1.62.1        RSQLite_2.4.5              
#>  [40] filelock_1.0.3              httr_1.4.7                  abind_1.4-8                
#>  [43] compiler_4.5.2              withr_3.0.2                 bit64_4.6.0-1              
#>  [46] doParallel_1.0.17           S7_0.2.1                    DBI_1.2.3                  
#>  [49] MASS_7.3-65                 ChemmineR_3.62.0            rappdirs_0.3.4             
#>  [52] DelayedArray_0.36.0         rjson_0.2.23                mzR_2.44.0                 
#>  [55] tools_4.5.2                 PSMatch_1.14.0              otel_0.2.0                 
#>  [58] CompoundDb_1.14.2           glue_1.8.0                  QFeatures_1.20.0           
#>  [61] grid_4.5.2                  cluster_2.1.8.1             reshape2_1.4.5             
#>  [64] snow_0.4-4                  gtable_0.3.6                preprocessCore_1.72.0      
#>  [67] tidyr_1.3.2                 data.table_1.18.2.1         hms_1.1.4                  
#>  [70] MetaboCoreUtils_1.19.2      xml2_1.5.2                  XVector_0.50.0             
#>  [73] BiocVersion_3.22.0          foreach_1.5.2               pillar_1.11.1              
#>  [76] stringr_1.6.0               limma_3.66.0                BiocFileCache_3.0.0        
#>  [79] lattice_0.22-7              bit_4.6.0                   tidyselect_1.2.1           
#>  [82] Biostrings_2.78.0           knitr_1.51                  gridExtra_2.3              
#>  [85] IRanges_2.44.0              Seqinfo_1.0.0               SummarizedExperiment_1.40.0
#>  [88] xfun_0.56                   Biobase_2.70.0              statmod_1.5.1              
#>  [91] MSnbase_2.36.0              matrixStats_1.5.0           DT_0.34.0                  
#>  [94] stringi_1.8.7               yaml_2.3.12                 lazyeval_0.2.2             
#>  [97] evaluate_1.0.5              codetools_0.2-20            MsCoreUtils_1.22.1         
#> [100] tibble_3.3.1                BiocManager_1.30.27         cli_3.6.5                  
#> [103] affyio_1.80.0               Rcpp_1.1.1                  MassSpecWavelet_1.76.0     
#> [106] dbplyr_2.5.1                png_0.1-8                   XML_3.99-0.20              
#> [109] parallel_4.5.2              ggplot2_4.0.2               blob_1.3.0                 
#> [112] prettyunits_1.2.0           AnnotationFilter_1.34.0     bitops_1.0-9               
#> [115] MsFeatures_1.18.0           scales_1.4.0                affy_1.88.0                
#> [118] ncdf4_1.24                  purrr_1.2.1                 crayon_1.5.3               
#> [121] rlang_1.1.7                 KEGGREST_1.50.0             vsn_3.78.1
```