---
title: "Using Custom Databases in Sirius"
output:
  BiocStyle::html_document:
    toc_float: true
vignette: >
  %\VignetteIndexEntry{Using Custom Databases in Sirius}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
  %\VignettePackage{RuSirius}
  %\VignetteDepends{RSirius, RuSirius, MsDataHub, Spectra}
---

``` r
library(RSirius)
library(RuSirius)
library(MsDataHub)
library(Spectra)
```

## Introduction

**Note**: this vignette is [**pre-computed**](https://ropensci.org/blog/2019/12/08/precompute-vignettes/). See the session info for information on packages used and the date the vignette was rendered. The vignette requires a running [Sirius](https://bio.informatik.uni-jena.de/software/sirius/) instance. To reproduce this analysis, you will need Sirius 6.3 installed and running.

Sirius can search against custom databases in addition to the built-in databases (BIO, PubChem, etc.). This is useful when you have:

- A list of suspect compounds specific to your study
- A custom spectral library (e.g., from MassBank)
- Target compounds you want to prioritize in the search

This vignette demonstrates how to create and use custom databases, and shows the impact on structure identification results.

## Managing Databases

### Listing Available Databases

``` r
srs <- Sirius(port = 9999)
#> Error in `Sirius()`:
#> ! unused argument (port = 9999)

# List all searchable databases
dbs <- listDbs(srs)
#> Error in `listDbs()`:
#> ! The connection to the Sirius instance is not valid.
dbs[, c("databaseId", "displayName")]
#> Error:
#> ! object 'dbs' not found
```

### Database Information

``` r
# Get details about a specific database
infoDb(srs, databaseId = "BIO")
#> Error in `infoDb()`:
#> ! The connection to the Sirius instance is not valid.
```

## Creating a Custom Database

Custom databases can be created from files containing compound information. Supported formats include `.tsv`, `.csv`, or `.mgf` files with structure information.

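As an illustration of what a compound file with structure information might look like, the sketch below writes a two-compound suspect list to a tab-separated file using base R only. The column names (`name`, `smiles`, `formula`), the presence of a header row, and the file name are assumptions made for this example, not a documented Sirius requirement; check the Sirius documentation for the exact layout your version expects.

``` r
# Hypothetical suspect list. Column names and header row are assumptions
# for illustration, not a layout confirmed by Sirius or RuSirius.
suspects <- data.frame(
    name = c("Caffeine", "Atrazine"),
    smiles = c("CN1C=NC2=C1C(=O)N(C(=O)N2C)C",
               "CCNC1=NC(=NC(=N1)Cl)NC(C)C"),
    formula = c("C8H10N4O2", "C8H14ClN5"),
    stringsAsFactors = FALSE
)

# Write a tab-separated file without row names or quoting
suspects_file <- file.path(tempdir(), "suspects.tsv")
write.table(suspects, file = suspects_file, sep = "\t",
            quote = FALSE, row.names = FALSE)

# Read the file back to confirm the layout before importing it
check <- read.delim(suspects_file, stringsAsFactors = FALSE)
```

Reading the file back is a cheap way to catch quoting or delimiter problems before handing the path to `createDb()`.
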
### From a Compound List (TSV/CSV)

The file should contain columns for compound name, SMILES (or InChI), and optionally the molecular formula.

``` r
# Create database from a TSV file
createDb(srs,
         databaseId = "my_suspects",
         files = "path/to/suspects.tsv",
         location = getwd())

# Verify it was created
listDbs(srs)
```

### From a Spectral Library (MGF)

Spectral libraries in MGF format can also be imported. An example MGF file is included in the package:

``` r
# Path to example MassBank MGF file
mgf_file <- system.file("vignettes", "MASSBANKEU.mgf", package = "RuSirius")

createDb(srs,
         databaseId = "massbank_custom",
         files = mgf_file,
         location = getwd())
#> Error in `createDb()`:
#> ! The connection to the Sirius instance is not valid.
```

## Comparing Results: Default vs Custom Database

Let's demonstrate how using a custom database affects structure identification.

### Setup: Import Sample Data

``` r
# Load example data
dda_file <- MsDataHub::PestMix1_DDA.mzML()
sp <- Spectra(dda_file)
sp <- setBackend(sp, MsBackendMemory())
sp <- filterEmptySpectra(sp)

# Group spectra
idxs <- fragmentGroupIndex(sp)
sp$Msn_idx <- idxs

# Create project and import
srs <- Sirius(projectId = "db_comparison", path = getwd(), port = 9999)
#> Error in `Sirius()`:
#> ! unused argument (port = 9999)
sp_subset <- sp[sp$Msn_idx %in% c(421, 707)]
srs <- import(srs, spectra = sp_subset, ms_column_name = "Msn_idx")
#> Error:
#> ! object 'srs' not found
```

### Run with Default Database (BIO)

``` r
# Run structure search with BIO database only
run(srs,
    formulaIdParams = formulaIdParam(numberOfCandidates = 5),
    predictParams = predictParam(),
    structureDbSearchParams = structureDbSearchParam(
        structureSearchDbs = c("BIO")
    ),
    recompute = TRUE,
    wait = TRUE)
#> Error:
#> ! object 'srs' not found

# Get results
results_bio <- summary(srs, result.type = "structure")
#> Error:
#> ! object 'srs' not found
results_bio[, c("alignedFeatureId", "molecularFormula",
                "structureName", "confidenceExactMatch")]
#> Error:
#> ! object 'results_bio' not found
```

### Run with Custom Database Added

``` r
# Now include custom database in search
run(srs,
    formulaIdParams = formulaIdParam(numberOfCandidates = 5),
    predictParams = predictParam(),
    structureDbSearchParams = structureDbSearchParam(
        structureSearchDbs = c("BIO", "massbank_custom")
    ),
    recompute = TRUE,
    wait = TRUE)
#> Error:
#> ! object 'srs' not found

# Get results with custom DB
results_custom <- summary(srs, result.type = "structure")
#> Error:
#> ! object 'srs' not found
results_custom[, c("alignedFeatureId", "molecularFormula",
                   "structureName", "confidenceExactMatch")]
#> Error:
#> ! object 'results_custom' not found
```

### Compare Results

``` r
# Compare confidence scores
comparison <- merge(
    results_bio[, c("alignedFeatureId", "confidenceExactMatch")],
    results_custom[, c("alignedFeatureId", "confidenceExactMatch")],
    by = "alignedFeatureId",
    suffixes = c("_bio", "_custom")
)
#> Error in `h()`:
#> ! error in evaluating the argument 'x' in selecting a method for function 'merge': object 'results_bio' not found
comparison
#> Error:
#> ! object 'comparison' not found
```

Including relevant custom databases can improve identification confidence when your compounds are well-represented in the custom database.

## Removing a Database

``` r
# Remove a custom database when no longer needed
removeDb(srs, databaseId = "massbank_custom")
#> Error in `removeDb()`:
#> ! The connection to the Sirius instance is not valid.

# Verify removal
listDbs(srs)
#> Error in `listDbs()`:
#> ! The connection to the Sirius instance is not valid.
```

## Best Practices

1. **Targeted databases**: Create focused databases with compounds relevant to your study rather than very large generic databases.
2. **Quality over quantity**: Ensure your custom database has accurate structure information (SMILES/InChI).
3. **Combine strategically**: Use custom databases alongside BIO for best coverage - BIO for general metabolites, custom for your specific targets.
4. **Spectral libraries**: When available, spectral libraries (MGF) provide additional matching power through spectral similarity.

## Clean Up

``` r
shutdown(srs)
#> Warning in value[[3L]](cond): Could not retrieve open projects: object 'srs' not found
#> Warning in doTryCatch(return(expr), name, parentenv, handler): restarting interrupted
#> promise evaluation
```

# Session information

The R code was run on:

``` r
date()
#> [1] "Mon Mar 23 11:26:54 2026"
```

Information on the R session:

``` r
sessionInfo()
#> R version 4.5.2 (2025-10-31 ucrt)
#> Platform: x86_64-w64-mingw32/x64
#> Running under: Windows 11 x64 (build 26100)
#>
#> Matrix products: default
#>   LAPACK version 3.12.1
#>
#> locale:
#> [1] LC_COLLATE=English_United States.utf8  LC_CTYPE=English_United States.utf8
#> [3] LC_MONETARY=English_United States.utf8 LC_NUMERIC=C
#> [5] LC_TIME=English_United States.utf8
#>
#> time zone: Europe/Rome
#> tzcode source: internal
#>
#> attached base packages:
#> [1] stats4    stats     graphics  grDevices utils     datasets  methods   base
#>
#> other attached packages:
#>  [1] MsDataHub_1.10.0        dplyr_1.2.0             RuSirius_0.2.0
#>  [4] jsonlite_2.0.0          MetaboAnnotation_1.14.0 RSirius_6.3.3
#>  [7] xcms_4.8.0              MsExperiment_1.12.0     ProtGenerics_1.42.0
#> [10] Spectra_1.20.1          BiocParallel_1.44.0     S4Vectors_0.48.0
#> [13] BiocGenerics_0.56.0     generics_0.1.4
#>
#> loaded via a namespace (and not attached):
#>   [1] RColorBrewer_1.1-3     MultiAssayExperiment_1.36.1 magrittr_2.0.4
#>   [4] farver_2.1.2           MALDIquant_1.22.3           fs_1.6.6
#>   [7] vctrs_0.7.1            memoise_2.0.1               RCurl_1.98-1.17
#>  [10] base64enc_0.1-6        htmltools_0.5.9             S4Arrays_1.10.1
#>  [13] BiocBaseUtils_1.12.0   progress_1.2.3              curl_7.0.0
#>  [16] AnnotationHub_4.0.0    SparseArray_1.10.8          mzID_1.48.0
#>  [19] htmlwidgets_1.6.4      plyr_1.8.9                  httr2_1.2.2
#>  [22] impute_1.84.0          cachem_1.1.0                igraph_2.2.1
#>  [25] lifecycle_1.0.5        iterators_1.0.14            pkgconfig_2.0.3
#>  [28] Matrix_1.7-4           R6_2.6.1                    fastmap_1.2.0
#>  [31] MatrixGenerics_1.22.0  clue_0.3-66                 digest_0.6.39
#>  [34] pcaMethods_2.2.0       rsvg_2.7.0                  AnnotationDbi_1.72.0
#>  [37] ExperimentHub_3.0.0    GenomicRanges_1.62.1        RSQLite_2.4.5
#>  [40] filelock_1.0.3         httr_1.4.7                  abind_1.4-8
#>  [43] compiler_4.5.2         withr_3.0.2                 bit64_4.6.0-1
#>  [46] doParallel_1.0.17      S7_0.2.1                    DBI_1.2.3
#>  [49] MASS_7.3-65            ChemmineR_3.62.0            rappdirs_0.3.4
#>  [52] DelayedArray_0.36.0    rjson_0.2.23                mzR_2.44.0
#>  [55] tools_4.5.2            PSMatch_1.14.0              otel_0.2.0
#>  [58] CompoundDb_1.14.2      glue_1.8.0                  QFeatures_1.20.0
#>  [61] grid_4.5.2             cluster_2.1.8.1             reshape2_1.4.5
#>  [64] snow_0.4-4             gtable_0.3.6                preprocessCore_1.72.0
#>  [67] tidyr_1.3.2            data.table_1.18.2.1         hms_1.1.4
#>  [70] MetaboCoreUtils_1.19.2 xml2_1.5.2                  XVector_0.50.0
#>  [73] BiocVersion_3.22.0     foreach_1.5.2               pillar_1.11.1
#>  [76] stringr_1.6.0          limma_3.66.0                BiocFileCache_3.0.0
#>  [79] lattice_0.22-7         bit_4.6.0                   tidyselect_1.2.1
#>  [82] Biostrings_2.78.0      knitr_1.51                  gridExtra_2.3
#>  [85] IRanges_2.44.0         Seqinfo_1.0.0               SummarizedExperiment_1.40.0
#>  [88] xfun_0.56              Biobase_2.70.0              statmod_1.5.1
#>  [91] MSnbase_2.36.0         matrixStats_1.5.0           DT_0.34.0
#>  [94] stringi_1.8.7          yaml_2.3.12                 lazyeval_0.2.2
#>  [97] evaluate_1.0.5         codetools_0.2-20            MsCoreUtils_1.22.1
#> [100] tibble_3.3.1           BiocManager_1.30.27         cli_3.6.5
#> [103] affyio_1.80.0          Rcpp_1.1.1                  MassSpecWavelet_1.76.0
#> [106] dbplyr_2.5.1           png_0.1-8                   XML_3.99-0.20
#> [109] parallel_4.5.2         ggplot2_4.0.2               blob_1.3.0
#> [112] prettyunits_1.2.0      AnnotationFilter_1.34.0     bitops_1.0-9
#> [115] MsFeatures_1.18.0      scales_1.4.0                affy_1.88.0
#> [118] ncdf4_1.24             purrr_1.2.1                 crayon_1.5.3
#> [121] rlang_1.1.7            KEGGREST_1.50.0             vsn_3.78.1
```