Introduction

Note: this vignette is pre-computed; see the session information at the end for the package versions used and the date it was rendered. Reproducing the analysis requires Sirius 6.3 to be installed and running.

Sirius can search against custom databases in addition to the built-in databases (BIO, PubChem, etc.). This is useful when you have:

  • A list of suspect compounds specific to your study
  • A custom spectral library (e.g., from MassBank)
  • Target compounds you want to prioritize in the search

This vignette demonstrates how to create and use custom databases, and shows the impact on structure identification results.

Managing Databases

Listing Available Databases

library(RuSirius)

# Connect to Sirius. The `port` argument is not accepted by `Sirius()` in the
# RuSirius version listed in the session information, so it is omitted here.
srs <- Sirius()

# List all searchable databases
dbs <- listDbs(srs)
dbs[, c("databaseId", "displayName")]

Database Information

# Get details about a specific database
infoDb(srs, databaseId = "BIO")

Creating a Custom Database

Custom databases can be created from files containing compound information. Supported formats are .tsv, .csv, and .mgf files with structure information.

From a Compound List (TSV/CSV)

The file should contain columns for compound name, SMILES (or InChI), and optionally the molecular formula.
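As a minimal sketch of such a file, the base R code below writes a two-compound suspect list to a temporary TSV file. The column names (`name`, `smiles`, `formula`) and the two example compounds are assumptions chosen for illustration; check the Sirius documentation for the exact header names your version expects.

```r
# Hypothetical suspect list; the column names are an assumption and may need
# to be adapted to what your Sirius version expects.
suspects <- data.frame(
    name = c("Caffeine", "Atrazine"),
    smiles = c("CN1C=NC2=C1C(=O)N(C(=O)N2C)C",
               "CCNC1=NC(=NC(=N1)Cl)NC(C)C"),
    formula = c("C8H10N4O2", "C8H14ClN5"),
    stringsAsFactors = FALSE
)

# Write a tab-separated file without row names or quoting
suspects_file <- file.path(tempdir(), "suspects.tsv")
write.table(suspects, file = suspects_file, sep = "\t",
            row.names = FALSE, quote = FALSE)

# Read the file back to confirm the layout survived the round trip
check <- read.delim(suspects_file, stringsAsFactors = FALSE)
```

The resulting `suspects_file` path could then be passed as the `files` argument of `createDb()`.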

# Create database from a TSV file
createDb(srs,
         databaseId = "my_suspects",
         files = "path/to/suspects.tsv",
         location = getwd())

# Verify it was created
listDbs(srs)

From a Spectral Library (MGF)

Spectral libraries in MGF format can also be imported. An example MGF file is included in the package:

# Path to example MassBank MGF file
mgf_file <- system.file("vignettes", "MASSBANKEU.mgf", package = "RuSirius")

createDb(srs,
         databaseId = "massbank_custom",
         files = mgf_file,
         location = getwd())
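Because `system.file()` returns an empty string when a file is not found in the installed package, it can help to validate the path before handing it to `createDb()`. The helper below is a hypothetical sketch in base R, not part of RuSirius, and the temporary file only stands in for a real MGF file.

```r
# Hypothetical helper (not part of RuSirius): fail early on a missing file
check_db_file <- function(path) {
    # system.file() returns "" when the requested file does not exist
    if (length(path) != 1L || !nzchar(path) || !file.exists(path)) {
        stop("Database input file not found: '", path, "'", call. = FALSE)
    }
    normalizePath(path)
}

# Usage with a temporary stand-in file
tmp <- tempfile(fileext = ".mgf")
file.create(tmp)
ok <- check_db_file(tmp)

# An empty path, as returned by a failed system.file() lookup, raises an error
missing_msg <- tryCatch(check_db_file(""),
                        error = function(e) conditionMessage(e))
```

In the example above, wrapping `mgf_file` in `check_db_file()` would surface a missing example file before any call to Sirius is made.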

Comparing Results: Default vs Custom Database

Let’s demonstrate how using a custom database affects structure identification.

Setup: Import Sample Data

library(Spectra)

# Load example data
dda_file <- MsDataHub::PestMix1_DDA.mzML()
sp <- Spectra(dda_file)
sp <- setBackend(sp, MsBackendMemory())
sp <- filterEmptySpectra(sp)

# Group spectra
idxs <- fragmentGroupIndex(sp)
sp$Msn_idx <- idxs

# Create project and import
# (`port` is omitted: `Sirius()` rejects it as an unused argument in the
# RuSirius version listed in the session information)
srs <- Sirius(projectId = "db_comparison", path = getwd())
sp_subset <- sp[sp$Msn_idx %in% c(421, 707)]
srs <- import(srs, spectra = sp_subset, ms_column_name = "Msn_idx")

Run with Default Database (BIO)

# Run structure search with BIO database only
run(srs,
    formulaIdParams = formulaIdParam(numberOfCandidates = 5),
    predictParams = predictParam(),
    structureDbSearchParams = structureDbSearchParam(
        structureSearchDbs = c("BIO")
    ),
    recompute = TRUE,
    wait = TRUE)

# Get results
results_bio <- summary(srs, result.type = "structure")
results_bio[, c("alignedFeatureId", "molecularFormula",
                "structureName", "confidenceExactMatch")]

Run with Custom Database Added

# Now include custom database in search
run(srs,
    formulaIdParams = formulaIdParam(numberOfCandidates = 5),
    predictParams = predictParam(),
    structureDbSearchParams = structureDbSearchParam(
        structureSearchDbs = c("BIO", "massbank_custom")
    ),
    recompute = TRUE,
    wait = TRUE)

# Get results with custom DB
results_custom <- summary(srs, result.type = "structure")
results_custom[, c("alignedFeatureId", "molecularFormula",
                   "structureName", "confidenceExactMatch")]

Compare Results

# Compare confidence scores
comparison <- merge(
    results_bio[, c("alignedFeatureId", "confidenceExactMatch")],
    results_custom[, c("alignedFeatureId", "confidenceExactMatch")],
    by = "alignedFeatureId",
    suffixes = c("_bio", "_custom")
)
comparison

Including relevant custom databases can improve identification confidence when your compounds are well-represented in the custom database.
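To make the comparison easier to read, the change in confidence per feature can be computed from the two result tables. The sketch below uses placeholder data frames in place of the `summary()` output; the feature IDs and scores are invented for illustration and are not results from this vignette. It also uses `all = TRUE`, an addition to the merge shown above, so that features scored in only one run are kept.

```r
# Placeholder tables standing in for summary() output; the values are
# invented for illustration and are NOT results from this vignette.
results_bio <- data.frame(
    alignedFeatureId = c("f1", "f2"),
    confidenceExactMatch = c(0.40, NA)
)
results_custom <- data.frame(
    alignedFeatureId = c("f1", "f2"),
    confidenceExactMatch = c(0.75, 0.60)
)

comparison <- merge(results_bio, results_custom,
                    by = "alignedFeatureId",
                    suffixes = c("_bio", "_custom"),
                    all = TRUE)  # keep features present in only one run

# Positive values mean higher confidence with the custom database included
comparison$delta <- comparison$confidenceExactMatch_custom -
    comparison$confidenceExactMatch_bio

# Features that only received a confidence score in the custom run
gained <- comparison$alignedFeatureId[
    is.na(comparison$confidenceExactMatch_bio) &
        !is.na(comparison$confidenceExactMatch_custom)]
```

With real results, a consistently positive `delta` would support the claim that the custom database helps, while values near zero suggest that BIO already covers the compounds.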

Removing a Database

# Remove a custom database when no longer needed
removeDb(srs, databaseId = "massbank_custom")

# Verify removal
listDbs(srs)

Best Practices

  1. Targeted databases: Create focused databases with compounds relevant to your study rather than very large generic databases.

  2. Quality over quantity: Ensure your custom database has accurate structure information (SMILES/InChI).

  3. Combine strategically: Use custom databases alongside BIO for the best coverage, with BIO covering general metabolites and the custom database covering your specific targets.

  4. Spectral libraries: When available, spectral libraries (MGF) provide additional matching power through spectral similarity.

Clean Up

shutdown(srs)

Session information

The R code was run on:

date()
#> [1] "Mon Mar 23 11:26:54 2026"

Information on the R session:

sessionInfo()
#> R version 4.5.2 (2025-10-31 ucrt)
#> Platform: x86_64-w64-mingw32/x64
#> Running under: Windows 11 x64 (build 26100)
#> 
#> Matrix products: default
#>   LAPACK version 3.12.1
#> 
#> locale:
#> [1] LC_COLLATE=English_United States.utf8  LC_CTYPE=English_United States.utf8   
#> [3] LC_MONETARY=English_United States.utf8 LC_NUMERIC=C                          
#> [5] LC_TIME=English_United States.utf8    
#> 
#> time zone: Europe/Rome
#> tzcode source: internal
#> 
#> attached base packages:
#> [1] stats4    stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#>  [1] MsDataHub_1.10.0        dplyr_1.2.0             RuSirius_0.2.0         
#>  [4] jsonlite_2.0.0          MetaboAnnotation_1.14.0 RSirius_6.3.3          
#>  [7] xcms_4.8.0              MsExperiment_1.12.0     ProtGenerics_1.42.0    
#> [10] Spectra_1.20.1          BiocParallel_1.44.0     S4Vectors_0.48.0       
#> [13] BiocGenerics_0.56.0     generics_0.1.4         
#> 
#> loaded via a namespace (and not attached):
#>   [1] RColorBrewer_1.1-3          MultiAssayExperiment_1.36.1 magrittr_2.0.4             
#>   [4] farver_2.1.2                MALDIquant_1.22.3           fs_1.6.6                   
#>   [7] vctrs_0.7.1                 memoise_2.0.1               RCurl_1.98-1.17            
#>  [10] base64enc_0.1-6             htmltools_0.5.9             S4Arrays_1.10.1            
#>  [13] BiocBaseUtils_1.12.0        progress_1.2.3              curl_7.0.0                 
#>  [16] AnnotationHub_4.0.0         SparseArray_1.10.8          mzID_1.48.0                
#>  [19] htmlwidgets_1.6.4           plyr_1.8.9                  httr2_1.2.2                
#>  [22] impute_1.84.0               cachem_1.1.0                igraph_2.2.1               
#>  [25] lifecycle_1.0.5             iterators_1.0.14            pkgconfig_2.0.3            
#>  [28] Matrix_1.7-4                R6_2.6.1                    fastmap_1.2.0              
#>  [31] MatrixGenerics_1.22.0       clue_0.3-66                 digest_0.6.39              
#>  [34] pcaMethods_2.2.0            rsvg_2.7.0                  AnnotationDbi_1.72.0       
#>  [37] ExperimentHub_3.0.0         GenomicRanges_1.62.1        RSQLite_2.4.5              
#>  [40] filelock_1.0.3              httr_1.4.7                  abind_1.4-8                
#>  [43] compiler_4.5.2              withr_3.0.2                 bit64_4.6.0-1              
#>  [46] doParallel_1.0.17           S7_0.2.1                    DBI_1.2.3                  
#>  [49] MASS_7.3-65                 ChemmineR_3.62.0            rappdirs_0.3.4             
#>  [52] DelayedArray_0.36.0         rjson_0.2.23                mzR_2.44.0                 
#>  [55] tools_4.5.2                 PSMatch_1.14.0              otel_0.2.0                 
#>  [58] CompoundDb_1.14.2           glue_1.8.0                  QFeatures_1.20.0           
#>  [61] grid_4.5.2                  cluster_2.1.8.1             reshape2_1.4.5             
#>  [64] snow_0.4-4                  gtable_0.3.6                preprocessCore_1.72.0      
#>  [67] tidyr_1.3.2                 data.table_1.18.2.1         hms_1.1.4                  
#>  [70] MetaboCoreUtils_1.19.2      xml2_1.5.2                  XVector_0.50.0             
#>  [73] BiocVersion_3.22.0          foreach_1.5.2               pillar_1.11.1              
#>  [76] stringr_1.6.0               limma_3.66.0                BiocFileCache_3.0.0        
#>  [79] lattice_0.22-7              bit_4.6.0                   tidyselect_1.2.1           
#>  [82] Biostrings_2.78.0           knitr_1.51                  gridExtra_2.3              
#>  [85] IRanges_2.44.0              Seqinfo_1.0.0               SummarizedExperiment_1.40.0
#>  [88] xfun_0.56                   Biobase_2.70.0              statmod_1.5.1              
#>  [91] MSnbase_2.36.0              matrixStats_1.5.0           DT_0.34.0                  
#>  [94] stringi_1.8.7               yaml_2.3.12                 lazyeval_0.2.2             
#>  [97] evaluate_1.0.5              codetools_0.2-20            MsCoreUtils_1.22.1         
#> [100] tibble_3.3.1                BiocManager_1.30.27         cli_3.6.5                  
#> [103] affyio_1.80.0               Rcpp_1.1.1                  MassSpecWavelet_1.76.0     
#> [106] dbplyr_2.5.1                png_0.1-8                   XML_3.99-0.20              
#> [109] parallel_4.5.2              ggplot2_4.0.2               blob_1.3.0                 
#> [112] prettyunits_1.2.0           AnnotationFilter_1.34.0     bitops_1.0-9               
#> [115] MsFeatures_1.18.0           scales_1.4.0                affy_1.88.0                
#> [118] ncdf4_1.24                  purrr_1.2.1                 crayon_1.5.3               
#> [121] rlang_1.1.7                 KEGGREST_1.50.0             vsn_3.78.1