Predict formula and structure of chromatographic peaks from an XcmsExperiment object Sirius through the RuSirius package. • RuSirius

library(Spectra)
library(MsExperiment)
library(xcms)
#> Warning in fun(libname, pkgname): mzR has been built against a different Rcpp version (1.1.0)
#> than is installed on your system (1.1.1). This might lead to errors
#> when loading mzR. If you encounter such issues, please send a report,
#> including the output of sessionInfo() to the Bioc support forum at 
#> https://support.bioconductor.org/. For details see also
#> https://github.com/sneumann/mzR/wiki/mzR-Rcpp-compiler-linker-issue.
library(RSirius)
library(MetaboAnnotation)
library(RuSirius)
library(dplyr)
library(MsDataHub)

Introduction

Note: this vignette is pre-computed. See the session info for information on packages used and the date the vignette was rendered. The vignette requires a running Sirius instance and a Sirius account for structure database searches. To reproduce this analysis, you will need to log in to your own Sirius account.

This vignette demonstrates a basic workflow for importing detected chromatographic peaks from an XcmsExperiment object into Sirius. It then runs Sirius’s main tools: formula identification, structure database search, compound class prediction, spectral library matching, de novo structure prediction, and finally retrieves the results.

This is a foundational example and does not cover all the possible parameters for each Sirius tool. For detailed parameter information, consult the run() function documentation. More information can be found in the Sirius documentation online.

While this vignette focuses on chromatographic peaks detected with xcms, a similar workflow applies to features (grouped chromatographic peaks). The vignette for features is pending the availability of public data, but the steps for data preparation differ only slightly and can be adapted without issue.

IMPORTANT: This is a work in progress. Feedback is highly valued, especially regarding enhancements or additions that could simplify your workflow. Your input as a user is essential.

Preprocessing

Here, we apply pre-optimized parameters for processing the example dataset.

dda_file <- MsDataHub::PestMix1_DDA.mzML()
dda_data <- readMsExperiment(dda_file)
spectra(dda_data) <- setBackend(spectra(dda_data), MsBackendMemory())
dda_data <- filterRt(dda_data, rt = c(230, 610))

dda_data |>
    spectra() |>
    msLevel() |>
    table()
#> 
#>    1    2 
#> 1389 2214

prec_int <- estimatePrecursorIntensity(spectra(dda_data))
cwp <- CentWaveParam(snthresh = 5, noise = 100, ppm = 10,
                     peakwidth = c(3, 30))
dda_data <- findChromPeaks(dda_data, param = cwp, msLevel = 1L)

MS1 and MS2 Extraction

MS2 and MS1 spectra corresponding to the identified chromatographic peaks are extracted. Each feature imported in Sirius can only have one MS1 spectrum. However MS2 can also be loaded by itself wiht no MS1 if wanted.

# Extract MS1 and MS2 spectra linked to chromatographic peaks
ms2 <- chromPeakSpectra(dda_data,
                        expandMz = 0.01, expandRt = 3,
                        msLevel = 2L)
low_int <- function(x, ...) x > max(x, na.rm = TRUE) * 0.05
ms2 <- filterIntensity(ms2, intensity = low_int)
ms2 <- ms2[lengths(ms2) > 1]

ms1 <- chromPeakSpectra(dda_data,
                        expandMz = 0.01, expandRt = 3,
                        msLevel = 1L, method = "closest_rt")

ms1 <- applyProcessing(ms1)
ms2 <- applyProcessing(ms2)
ms1_ms2 <- concatenateSpectra(ms1, ms2)

Open Sirius and project set up

The Sirius application is initialized via the API, requiring only a project ID. If the project exists, it is opened; otherwise, a new project is created. The srs object acts as the connection to Sirius and holds project details. Properly shut down the connection with shutdown(srs) after completing your work.

This srs variable is needed for any task that necessitate to communicate with the application. You can learn more about this object class by running ?Sirius in the console. Parameter port used below allows to configure the Sirius application to use a particular port, but generally the function can be used without specifying a port.

# Initialize Sirius connection
srs <- Sirius(projectId = "test_lcmsms", port = 9999)
#> Error in `Sirius()`:
#> ! unused argument (port = 9999)
srs
#> Error:
#> ! object 'srs' not found

If the user wants to open and perform computations on a different project they can use the utility function below:

srs <- openProject(srs, projectId = "test_lcmsms2", path = getwd())
#> Error in `openProject()`:
#> ! The connection to the Sirius instance is not valid.
srs
#> Error:
#> ! object 'srs' not found

You can find all the utility functions of this package by running ?utils in the console.

NOTE if you have any idea of utility functions that could be implemented do not hesitate to ask.

Data import

Preprocessed xcms data is imported into Sirius, and a summary data.frame is returned with feature information. This information can also be retrieved using the utility function featuresInfo().

## load back our original project
srs <- openProject(srs, "test_lcmsms")
#> Error in `openProject()`:
#> ! The connection to the Sirius instance is not valid.

## Import data into Sirius
srs <- import(sirius = srs,
              spectra = ms1_ms2,
              ms_column_name = "chrom_peak_id",
              deleteExistingFeatures = TRUE)
#> Error:
#> ! object 'srs' not found

## See information about the features
featuresInfo(srs) |> head()
#> Error in `h()`:
#> ! error in evaluating the argument 'x' in selecting a method for function 'head': object 'srs' not found

Notes:

It could also be discussed that this data.frame could be stored direction into the srs object
When running import() i automatically create a mapping data.frame between the xcms feature ID and the Sirius feature ID. It is stored in the srs object, the featureMap slot. This can be used in the future so the user never need to interact with the Sirius IDs.

Below is an example of how to extract features ID, the utility function featuresId() quickly extract all available ID either sirius or xcms.

fts_id <- featuresId(srs, type = "sirius")
#> Error:
#> ! object 'srs' not found

Searchable database

!!! Note: The database code does not work at the moment, please load them through the GUI until it’s fixed. !!!

Whether it is for structure prediction or spectral library matching, users can upload their custom databases into Sirius. In this vignette, we demonstrate how to test spectral library matching by creating and loading a custom database into Sirius. This process can also be completed easily via the Sirius graphical user interface (GUI). If you prefer an interactive approach, you can use the openGUI(srs) command to open the Sirius app and manage your database directly.

In this example, we download the MassBank library from GNPS, which needs to be loaded into Sirius to generate a .sirius.db file. Below we will download in our current directory but you can precise where you want to save it using location = parameter.

## Download the MassBank EU library
download.file("https://external.gnps2.org/gnpslibrary/MASSBANKEU.mgf",
              destfile = "MASSBANKEU.mgf")
createDb(srs, databaseId = "massbankeuCustom", files = "MASSBANKEU.mgf")

NOTE: THis takes quite a while, will change to a smaller database later. Once the database is created and loaded, you can verify its successful import by running the following command:

listDbs(srs)

Find more on how to handle databases in Sirius by typing ?siriusDBs in the console.

Submit job to Sirius - For structure DB search

Annotation and prediction begin after data import. The run() function accepts parameters for each Sirius tool, such as formula identification, structure database search, and compound class prediction. Parameters can also specify adducts or custom databases. Detailed documentation for these parameters is available in the run() function’s help file.

The wait parameter ensures the function waits for job completion before proceeding. If set to FALSE, the job ID is returned, and the user must check the status using jobInfo().

## Start computation
job_id <- run(srs,
              fallbackAdducts = c("[M + H]+", "[M + Na]+", "[M + K]+"),
              formulaIdParams = formulaIdParam(numberOfCandidates = 5,
                                               instrument = "QTOF",
                                numberOfCandidatesPerIonization = 2,
                                massAccuracyMS2ppm = 10,
                                filterByIsotopePattern = FALSE,
                                isotopeMs2Settings = c("SCORE"),
                                performDeNovoBelowMz = 600,
                                minPeaksToInjectSpecLibMatch = 3),
              predictParams = predictParam(),
              structureDbSearchParams = structureDbSearchParam(
                  structureSearchDbs = c("BIO")
              ),
              recompute = TRUE,
              wait = TRUE
)
#> Error:
#> ! object 'srs' not found

srs
#> Error:
#> ! object 'srs' not found

## Get more info for the job
jobInfo(srs, job_id) |> cat()
#> Error:
#> ! object 'job_id' not found

Retrieve Results

To obtain a summary of all results, including the top formulas, structures, and compound class predictions, use the following code. This summary table provides a quick overview to evaluate whether the results align with expectations. However, we recommend not relying on this table as-is for detailed analysis. Instead, use the functions described later in this vignette to explore the results in greater depth.

An important aspect of the summary table is the confidence-related columns, which provide insight into the reliability of the predictions.

summarytb <- summary(srs, result.type = "structure")
#> Error:
#> ! object 'srs' not found
head(summarytb)
#> Error in `h()`:
#> ! error in evaluating the argument 'x' in selecting a method for function 'head': object 'summarytb' not found

Formula identification results:

For detailed results, the results() function can be used with the result.type parameter set to "formulaId", "structureDb", "compoundClass", or "deNovo". Note that all results are linked to a predicted formula.

The parameters topFormula and topStructure allow users to specify how many formulas or structures should be included in the output. The results can be returned either as a list or a data.frame, based on the return.type parameter.

Note: Suggestions for renaming the results() function or feedback on this implementation are welcome. We aim to adapt based on user needs.

results(srs,
       return.type = "data.frame",
       result.type = "formulaId",
       topFormula = 5)
#> Error:
#> ! object 'srs' not found

Structure DBs search results

The following example shows the top two structure annotations for the top five formulas of each feature. This can provide an insightful view into the structural predictions.

finalstructredb <- results(srs,
                           return.type = "data.frame",
                           result.type = "structureDb",
                           topFormula = 5,
                           topStructure = 2)
#> Error:
#> ! object 'srs' not found

head(finalstructredb)
#> Error in `h()`:
#> ! error in evaluating the argument 'x' in selecting a method for function 'head': object 'finalstructredb' not found

For a more visual exploration of the results, you can open the Sirius GUI with the commands below:

openGUI(srs)
closeGUI(srs)

Compound class prediction results

To retrieve compound class predictions, use the following code. Below is an example showing all compound annotations with confidence scores above 50% for the top two formulas of each feature.

finalcomp <- results(srs,
                     return.type = "data.frame",
                     result.type = "compoundClass",
                     topFormula = 2)
#> Error:
#> ! object 'srs' not found
head(finalcomp)
#> Error in `h()`:
#> ! error in evaluating the argument 'x' in selecting a method for function 'head': object 'finalcomp' not found

Spectral library matching results

The following code gives you a summary of the best matches:

summaryspectra <- summary(srs, result.type = "spectralDbMatch")
#> Error:
#> ! object 'srs' not found
head(summaryspectra)
#> Error in `h()`:
#> ! error in evaluating the argument 'x' in selecting a method for function 'head': object 'summaryspectra' not found

For detailed results, use the following code:

full_spectral <- results(srs,
                         return.type = "data.frame",
                         result.type = "spectralDbMatch",
                         topSpectralMatches = 2)
#> Error:
#> ! object 'srs' not found
head(full_spectral)
#> Error in `h()`:
#> ! error in evaluating the argument 'x' in selecting a method for function 'head': object 'full_spectral' not found

Fragmentation tree results

Below we show how to get the fragmentation tree for the top2 formula of some feautres. This is quite inefficient at the moment so limit it to a little number of feature. I will improve it.

resulttree <- results(srs,
                      features = featuresId(srs)[1:5],
                      return.type = "list",
                      result.type = "fragTree",
                      topFormula = 4,
                     )
#> Error:
#> ! object 'srs' not found

head(resulttree)
#> Error in `h()`:
#> ! error in evaluating the argument 'x' in selecting a method for function 'head': object 'resulttree' not found

Submit job to Sirius - For De Novo structure annotation.

De novo structure annotation is computationally intensive and recommended only for specific features.

# Determine features that do not have/have poor structure prediction
fts_denovo <-summarytb$alignedFeatureId[which(
    summarytb$confidenceApproxMatch < 0.3 |
        summarytb$confidenceApproxMatch %in% c("NA", "-Infinity"))]
#> Error:
#> ! object 'summarytb' not found

# Compute with zodiac and denovo
job_id <- run(srs,
    msNovelistParams = deNovoStructureParam(numberOfCandidateToPredict = 5),
    alignedFeaturesIds = fts_denovo,
    recompute = FALSE,
    wait = TRUE
)
#> Error:
#> ! object 'fts_denovo' not found

## Get info for the job
jobInfo(srs, job_id) |> cat()
#> Error:
#> ! object 'job_id' not found

Retrieve results

summraryDeNovo <- summary(srs, result.type = "deNovo")
#> Error:
#> ! object 'srs' not found
head(summraryDeNovo)
#> Error in `h()`:
#> ! error in evaluating the argument 'x' in selecting a method for function 'head': object 'summraryDeNovo' not found

Below is the full results.

full_de_novo <- results(srs,
                        return.type = "data.frame",
                        result.type = "deNovo",
                        topFormula = 5)
#> Error:
#> ! object 'srs' not found

head(full_de_novo)
#> Error in `h()`:
#> ! error in evaluating the argument 'x' in selecting a method for function 'head': object 'full_de_novo' not found

Setting a Known Molecular Formula

When you already know (or suspect) the molecular formula and adduct for a feature, you can restrict Sirius formula identification to that candidate. This is useful for targeted analyses where you want to compute the fragmentation tree for a known compound rather than performing an open search.

Use the candidateFormulas parameter of formulaIdParam() together with enforceAdducts to lock in both the formula and adduct:

# Select a feature with MS2 data (e.g. Carbaryl, [M+H]+ at m/z ~220.10)
fi <- featuresInfo(srs)
#> Error:
#> ! object 'srs' not found
target_ft <- fi[fi[, "hasMsMs"] == TRUE, "alignedFeatureId"][[1]]
#> Error:
#> ! object 'fi' not found

# Run formula identification restricted to a single known formula
job_id <- run(srs,
              alignedFeaturesIds = target_ft,
              enforceAdducts = "[M+H]+",
              formulaIdParams = formulaIdParam(
                  candidateFormulas = "C12H13NO3"
              ),
              recompute = TRUE,
              wait = TRUE)
#> Error in `formulaIdParam()`:
#> ! unused argument (candidateFormulas = "C12H13NO3")

You can also supply multiple candidate formulas if you want to compare a small set of possibilities:

job_id <- run(srs,
              alignedFeaturesIds = target_ft,
              enforceAdducts = "[M+H]+",
              formulaIdParams = formulaIdParam(
                  candidateFormulas = c("C12H13NO3", "C10H12N2O")
              ),
              recompute = TRUE,
              wait = TRUE)
#> Error in `formulaIdParam()`:
#> ! unused argument (candidateFormulas = c("C12H13NO3", "C10H12N2O"))

After the job completes, retrieve the formula results and fragmentation tree as usual:

# Formula results - should only contain the specified candidate(s)
results(srs,
        features = target_ft,
        result.type = "formulaId",
        return.type = "data.frame")
#> Error:
#> ! object 'target_ft' not found

# Fragmentation tree for the known formula
results(srs,
        features = target_ft,
        result.type = "fragTree",
        topFormula = 1,
        return.type = "list")
#> Error:
#> ! object 'target_ft' not found

CleanUp

# Close the Sirius session
shutdown(srs)
#> Warning in value[[3L]](cond): Could not retrieve open projects: object 'srs' not found
#> Warning in doTryCatch(return(expr), name, parentenv, handler): restarting interrupted
#> promise evaluation

Session information

The R code was run on:

date()
#> [1] "Mon Mar 23 11:26:36 2026"

Information on the R session:

sessionInfo()
#> R version 4.5.2 (2025-10-31 ucrt)
#> Platform: x86_64-w64-mingw32/x64
#> Running under: Windows 11 x64 (build 26100)
#> 
#> Matrix products: default
#>   LAPACK version 3.12.1
#> 
#> locale:
#> [1] LC_COLLATE=English_United States.utf8  LC_CTYPE=English_United States.utf8   
#> [3] LC_MONETARY=English_United States.utf8 LC_NUMERIC=C                          
#> [5] LC_TIME=English_United States.utf8    
#> 
#> time zone: Europe/Rome
#> tzcode source: internal
#> 
#> attached base packages:
#> [1] stats4    stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#>  [1] MsDataHub_1.10.0        dplyr_1.2.0             RuSirius_0.2.0         
#>  [4] jsonlite_2.0.0          MetaboAnnotation_1.14.0 RSirius_6.3.3          
#>  [7] xcms_4.8.0              MsExperiment_1.12.0     ProtGenerics_1.42.0    
#> [10] Spectra_1.20.1          BiocParallel_1.44.0     S4Vectors_0.48.0       
#> [13] BiocGenerics_0.56.0     generics_0.1.4         
#> 
#> loaded via a namespace (and not attached):
#>   [1] RColorBrewer_1.1-3          MultiAssayExperiment_1.36.1 magrittr_2.0.4             
#>   [4] farver_2.1.2                MALDIquant_1.22.3           fs_1.6.6                   
#>   [7] vctrs_0.7.1                 memoise_2.0.1               RCurl_1.98-1.17            
#>  [10] base64enc_0.1-6             htmltools_0.5.9             S4Arrays_1.10.1            
#>  [13] BiocBaseUtils_1.12.0        progress_1.2.3              curl_7.0.0                 
#>  [16] AnnotationHub_4.0.0         SparseArray_1.10.8          mzID_1.48.0                
#>  [19] htmlwidgets_1.6.4           plyr_1.8.9                  httr2_1.2.2                
#>  [22] impute_1.84.0               cachem_1.1.0                igraph_2.2.1               
#>  [25] lifecycle_1.0.5             iterators_1.0.14            pkgconfig_2.0.3            
#>  [28] Matrix_1.7-4                R6_2.6.1                    fastmap_1.2.0              
#>  [31] MatrixGenerics_1.22.0       clue_0.3-66                 digest_0.6.39              
#>  [34] pcaMethods_2.2.0            rsvg_2.7.0                  AnnotationDbi_1.72.0       
#>  [37] ExperimentHub_3.0.0         GenomicRanges_1.62.1        RSQLite_2.4.5              
#>  [40] filelock_1.0.3              httr_1.4.7                  abind_1.4-8                
#>  [43] compiler_4.5.2              withr_3.0.2                 bit64_4.6.0-1              
#>  [46] doParallel_1.0.17           S7_0.2.1                    DBI_1.2.3                  
#>  [49] MASS_7.3-65                 ChemmineR_3.62.0            rappdirs_0.3.4             
#>  [52] DelayedArray_0.36.0         rjson_0.2.23                mzR_2.44.0                 
#>  [55] tools_4.5.2                 PSMatch_1.14.0              otel_0.2.0                 
#>  [58] CompoundDb_1.14.2           glue_1.8.0                  QFeatures_1.20.0           
#>  [61] grid_4.5.2                  cluster_2.1.8.1             reshape2_1.4.5             
#>  [64] snow_0.4-4                  gtable_0.3.6                preprocessCore_1.72.0      
#>  [67] tidyr_1.3.2                 data.table_1.18.2.1         hms_1.1.4                  
#>  [70] MetaboCoreUtils_1.19.2      xml2_1.5.2                  XVector_0.50.0             
#>  [73] BiocVersion_3.22.0          foreach_1.5.2               pillar_1.11.1              
#>  [76] stringr_1.6.0               limma_3.66.0                BiocFileCache_3.0.0        
#>  [79] lattice_0.22-7              bit_4.6.0                   tidyselect_1.2.1           
#>  [82] Biostrings_2.78.0           knitr_1.51                  gridExtra_2.3              
#>  [85] IRanges_2.44.0              Seqinfo_1.0.0               SummarizedExperiment_1.40.0
#>  [88] xfun_0.56                   Biobase_2.70.0              statmod_1.5.1              
#>  [91] MSnbase_2.36.0              matrixStats_1.5.0           DT_0.34.0                  
#>  [94] stringi_1.8.7               yaml_2.3.12                 lazyeval_0.2.2             
#>  [97] evaluate_1.0.5              codetools_0.2-20            MsCoreUtils_1.22.1         
#> [100] tibble_3.3.1                BiocManager_1.30.27         cli_3.6.5                  
#> [103] affyio_1.80.0               Rcpp_1.1.1                  MassSpecWavelet_1.76.0     
#> [106] dbplyr_2.5.1                png_0.1-8                   XML_3.99-0.20              
#> [109] parallel_4.5.2              ggplot2_4.0.2               blob_1.3.0                 
#> [112] prettyunits_1.2.0           AnnotationFilter_1.34.0     bitops_1.0-9               
#> [115] MsFeatures_1.18.0           scales_1.4.0                affy_1.88.0                
#> [118] ncdf4_1.24                  purrr_1.2.1                 crayon_1.5.3               
#> [121] rlang_1.1.7                 KEGGREST_1.50.0             vsn_3.78.1