Predict formula and structure of chromatographic peaks from an XcmsExperiment object Sirius through the RuSirius package.
Source:vignettes/Chromatographic_peak_annotation_public_dataset.Rmd
Chromatographic_peak_annotation_public_dataset.Rmd
library(Spectra)
library(MsExperiment)
library(xcms)
#> Warning in fun(libname, pkgname): mzR has been built against a different Rcpp version (1.1.0)
#> than is installed on your system (1.1.1). This might lead to errors
#> when loading mzR. If you encounter such issues, please send a report,
#> including the output of sessionInfo() to the Bioc support forum at
#> https://support.bioconductor.org/. For details see also
#> https://github.com/sneumann/mzR/wiki/mzR-Rcpp-compiler-linker-issue.
library(RSirius)
library(MetaboAnnotation)
library(RuSirius)
library(dplyr)
library(MsDataHub)Introduction
Note: this vignette is pre-computed. See the session info for information on packages used and the date the vignette was rendered. The vignette requires a running Sirius instance and a Sirius account for structure database searches. To reproduce this analysis, you will need to log in to your own Sirius account.
This vignette demonstrates a basic workflow for importing detected
chromatographic peaks from an XcmsExperiment
object into Sirius. It then runs Sirius’s main tools: formula
identification, structure database search, compound class prediction,
spectral library matching, de novo structure prediction, and
finally retrieves the results.
This is a foundational example and does not cover all the possible
parameters for each Sirius tool. For detailed parameter information,
consult the run() function documentation. More information
can be found in the Sirius
documentation online.
While this vignette focuses on chromatographic peaks detected with xcms, a similar workflow applies to features (grouped chromatographic peaks). The vignette for features is pending the availability of public data, but the steps for data preparation differ only slightly and can be adapted without issue.
IMPORTANT: This is a work in progress. Feedback is highly valued, especially regarding enhancements or additions that could simplify your workflow. Your input as a user is essential.
Preprocessing
Here, we apply pre-optimized parameters for processing the example dataset.
dda_file <- MsDataHub::PestMix1_DDA.mzML()
dda_data <- readMsExperiment(dda_file)
spectra(dda_data) <- setBackend(spectra(dda_data), MsBackendMemory())
dda_data <- filterRt(dda_data, rt = c(230, 610))
dda_data |>
spectra() |>
msLevel() |>
table()
#>
#> 1 2
#> 1389 2214
prec_int <- estimatePrecursorIntensity(spectra(dda_data))
cwp <- CentWaveParam(snthresh = 5, noise = 100, ppm = 10,
peakwidth = c(3, 30))
dda_data <- findChromPeaks(dda_data, param = cwp, msLevel = 1L)MS1 and MS2 Extraction
MS2 and MS1 spectra corresponding to the identified chromatographic peaks are extracted. Each feature imported in Sirius can only have one MS1 spectrum. However MS2 can also be loaded by itself wiht no MS1 if wanted.
# Extract MS1 and MS2 spectra linked to chromatographic peaks
ms2 <- chromPeakSpectra(dda_data,
expandMz = 0.01, expandRt = 3,
msLevel = 2L)
low_int <- function(x, ...) x > max(x, na.rm = TRUE) * 0.05
ms2 <- filterIntensity(ms2, intensity = low_int)
ms2 <- ms2[lengths(ms2) > 1]
ms1 <- chromPeakSpectra(dda_data,
expandMz = 0.01, expandRt = 3,
msLevel = 1L, method = "closest_rt")
ms1 <- applyProcessing(ms1)
ms2 <- applyProcessing(ms2)
ms1_ms2 <- concatenateSpectra(ms1, ms2)Open Sirius and project set up
The Sirius application is initialized via the API, requiring only a
project ID. If the project exists, it is opened; otherwise, a new
project is created. The srs object acts as the connection
to Sirius and holds project details. Properly shut down the connection
with shutdown(srs) after completing your work.
This srs variable is needed for any task that
necessitate to communicate with the application. You can learn more
about this object class by running ?Sirius in the console.
Parameter port used below allows to configure the
Sirius application to use a particular port, but generally the
function can be used without specifying a port.
# Initialize Sirius connection
srs <- Sirius(projectId = "test_lcmsms", port = 9999)
#> Error in `Sirius()`:
#> ! unused argument (port = 9999)
srs
#> Error:
#> ! object 'srs' not foundIf the user wants to open and perform computations on a different project they can use the utility function below:
srs <- openProject(srs, projectId = "test_lcmsms2", path = getwd())
#> Error in `openProject()`:
#> ! The connection to the Sirius instance is not valid.
srs
#> Error:
#> ! object 'srs' not foundYou can find all the utility functions of this package by running
?utils in the console.
NOTE if you have any idea of utility functions that could be implemented do not hesitate to ask.
Data import
Preprocessed xcms data is imported into Sirius, and a
summary data.frame is returned with feature information.
This information can also be retrieved using the utility function
featuresInfo().
## load back our original project
srs <- openProject(srs, "test_lcmsms")
#> Error in `openProject()`:
#> ! The connection to the Sirius instance is not valid.
## Import data into Sirius
srs <- import(sirius = srs,
spectra = ms1_ms2,
ms_column_name = "chrom_peak_id",
deleteExistingFeatures = TRUE)
#> Error:
#> ! object 'srs' not found
## See information about the features
featuresInfo(srs) |> head()
#> Error in `h()`:
#> ! error in evaluating the argument 'x' in selecting a method for function 'head': object 'srs' not foundNotes:
- It could also be discussed that this
data.framecould be stored direction into thesrsobject - When running
import()i automatically create a mapping data.frame between the xcms feature ID and the Sirius feature ID. It is stored in thesrsobject, thefeatureMapslot. This can be used in the future so the user never need to interact with the Sirius IDs.
Below is an example of how to extract features ID, the utility
function featuresId() quickly extract all available ID
either sirius or xcms.
fts_id <- featuresId(srs, type = "sirius")
#> Error:
#> ! object 'srs' not foundSearchable database
!!! Note: The database code does not work at the moment, please load them through the GUI until it’s fixed. !!!
Whether it is for structure prediction or spectral library matching,
users can upload their custom databases into Sirius. In this vignette,
we demonstrate how to test spectral library matching by creating and
loading a custom database into Sirius. This process can also be
completed easily via the Sirius graphical user interface (GUI). If you
prefer an interactive approach, you can use the
openGUI(srs) command to open the Sirius app and manage your
database directly.
In this example, we download the MassBank library from GNPS, which
needs to be loaded into Sirius to generate a .sirius.db
file. Below we will download in our current directory but you can
precise where you want to save it using location =
parameter.
## Download the MassBank EU library
download.file("https://external.gnps2.org/gnpslibrary/MASSBANKEU.mgf",
destfile = "MASSBANKEU.mgf")
createDb(srs, databaseId = "massbankeuCustom", files = "MASSBANKEU.mgf")NOTE: THis takes quite a while, will change to a smaller database later. Once the database is created and loaded, you can verify its successful import by running the following command:
listDbs(srs)Find more on how to handle databases in Sirius by typing
?siriusDBs in the console.
Submit job to Sirius - For structure DB search
Annotation and prediction begin after data import. The
run() function accepts parameters for each Sirius tool,
such as formula identification, structure database search, and compound
class prediction. Parameters can also specify adducts or custom
databases. Detailed documentation for these parameters is available in
the run() function’s help file.
The wait parameter ensures the function waits for job
completion before proceeding. If set to FALSE, the job ID
is returned, and the user must check the status using
jobInfo().
## Start computation
job_id <- run(srs,
fallbackAdducts = c("[M + H]+", "[M + Na]+", "[M + K]+"),
formulaIdParams = formulaIdParam(numberOfCandidates = 5,
instrument = "QTOF",
numberOfCandidatesPerIonization = 2,
massAccuracyMS2ppm = 10,
filterByIsotopePattern = FALSE,
isotopeMs2Settings = c("SCORE"),
performDeNovoBelowMz = 600,
minPeaksToInjectSpecLibMatch = 3),
predictParams = predictParam(),
structureDbSearchParams = structureDbSearchParam(
structureSearchDbs = c("BIO")
),
recompute = TRUE,
wait = TRUE
)
#> Error:
#> ! object 'srs' not found
srs
#> Error:
#> ! object 'srs' not foundRetrieve Results
To obtain a summary of all results, including the top formulas, structures, and compound class predictions, use the following code. This summary table provides a quick overview to evaluate whether the results align with expectations. However, we recommend not relying on this table as-is for detailed analysis. Instead, use the functions described later in this vignette to explore the results in greater depth.
An important aspect of the summary table is the confidence-related columns, which provide insight into the reliability of the predictions.
Formula identification results:
For detailed results, the results() function can be used with the
result.type parameter set to "formulaId",
"structureDb", "compoundClass", or
"deNovo". Note that all results are linked to a predicted
formula.
The parameters topFormula and topStructure
allow users to specify how many formulas or structures should be
included in the output. The results can be returned either as a list or
a data.frame, based on the return.type parameter.
Note: Suggestions for renaming the results() function or
feedback on this implementation are welcome. We aim to adapt based on
user needs.
results(srs,
return.type = "data.frame",
result.type = "formulaId",
topFormula = 5)
#> Error:
#> ! object 'srs' not foundStructure DBs search results
The following example shows the top two structure annotations for the top five formulas of each feature. This can provide an insightful view into the structural predictions.
finalstructredb <- results(srs,
return.type = "data.frame",
result.type = "structureDb",
topFormula = 5,
topStructure = 2)
#> Error:
#> ! object 'srs' not found
head(finalstructredb)
#> Error in `h()`:
#> ! error in evaluating the argument 'x' in selecting a method for function 'head': object 'finalstructredb' not foundFor a more visual exploration of the results, you can open the Sirius GUI with the commands below:
Compound class prediction results
To retrieve compound class predictions, use the following code. Below is an example showing all compound annotations with confidence scores above 50% for the top two formulas of each feature.
Spectral library matching results
The following code gives you a summary of the best matches:
summaryspectra <- summary(srs, result.type = "spectralDbMatch")
#> Error:
#> ! object 'srs' not found
head(summaryspectra)
#> Error in `h()`:
#> ! error in evaluating the argument 'x' in selecting a method for function 'head': object 'summaryspectra' not foundFor detailed results, use the following code:
full_spectral <- results(srs,
return.type = "data.frame",
result.type = "spectralDbMatch",
topSpectralMatches = 2)
#> Error:
#> ! object 'srs' not found
head(full_spectral)
#> Error in `h()`:
#> ! error in evaluating the argument 'x' in selecting a method for function 'head': object 'full_spectral' not foundFragmentation tree results
Below we show how to get the fragmentation tree for the top2 formula of some feautres. This is quite inefficient at the moment so limit it to a little number of feature. I will improve it.
resulttree <- results(srs,
features = featuresId(srs)[1:5],
return.type = "list",
result.type = "fragTree",
topFormula = 4,
)
#> Error:
#> ! object 'srs' not found
head(resulttree)
#> Error in `h()`:
#> ! error in evaluating the argument 'x' in selecting a method for function 'head': object 'resulttree' not foundSubmit job to Sirius - For De Novo structure annotation.
De novo structure annotation is computationally intensive and recommended only for specific features.
# Determine features that do not have/have poor structure prediction
fts_denovo <-summarytb$alignedFeatureId[which(
summarytb$confidenceApproxMatch < 0.3 |
summarytb$confidenceApproxMatch %in% c("NA", "-Infinity"))]
#> Error:
#> ! object 'summarytb' not found
# Compute with zodiac and denovo
job_id <- run(srs,
msNovelistParams = deNovoStructureParam(numberOfCandidateToPredict = 5),
alignedFeaturesIds = fts_denovo,
recompute = FALSE,
wait = TRUE
)
#> Error:
#> ! object 'fts_denovo' not found
## Get info for the job
jobInfo(srs, job_id) |> cat()
#> Error:
#> ! object 'job_id' not foundRetrieve results
summraryDeNovo <- summary(srs, result.type = "deNovo")
#> Error:
#> ! object 'srs' not found
head(summraryDeNovo)
#> Error in `h()`:
#> ! error in evaluating the argument 'x' in selecting a method for function 'head': object 'summraryDeNovo' not foundBelow is the full results.
Setting a Known Molecular Formula
When you already know (or suspect) the molecular formula and adduct for a feature, you can restrict Sirius formula identification to that candidate. This is useful for targeted analyses where you want to compute the fragmentation tree for a known compound rather than performing an open search.
Use the candidateFormulas parameter of
formulaIdParam() together with enforceAdducts
to lock in both the formula and adduct:
# Select a feature with MS2 data (e.g. Carbaryl, [M+H]+ at m/z ~220.10)
fi <- featuresInfo(srs)
#> Error:
#> ! object 'srs' not found
target_ft <- fi[fi[, "hasMsMs"] == TRUE, "alignedFeatureId"][[1]]
#> Error:
#> ! object 'fi' not found
# Run formula identification restricted to a single known formula
job_id <- run(srs,
alignedFeaturesIds = target_ft,
enforceAdducts = "[M+H]+",
formulaIdParams = formulaIdParam(
candidateFormulas = "C12H13NO3"
),
recompute = TRUE,
wait = TRUE)
#> Error in `formulaIdParam()`:
#> ! unused argument (candidateFormulas = "C12H13NO3")You can also supply multiple candidate formulas if you want to compare a small set of possibilities:
job_id <- run(srs,
alignedFeaturesIds = target_ft,
enforceAdducts = "[M+H]+",
formulaIdParams = formulaIdParam(
candidateFormulas = c("C12H13NO3", "C10H12N2O")
),
recompute = TRUE,
wait = TRUE)
#> Error in `formulaIdParam()`:
#> ! unused argument (candidateFormulas = c("C12H13NO3", "C10H12N2O"))After the job completes, retrieve the formula results and fragmentation tree as usual:
# Formula results - should only contain the specified candidate(s)
results(srs,
features = target_ft,
result.type = "formulaId",
return.type = "data.frame")
#> Error:
#> ! object 'target_ft' not found
# Fragmentation tree for the known formula
results(srs,
features = target_ft,
result.type = "fragTree",
topFormula = 1,
return.type = "list")
#> Error:
#> ! object 'target_ft' not foundCleanUp
# Close the Sirius session
shutdown(srs)
#> Warning in value[[3L]](cond): Could not retrieve open projects: object 'srs' not found
#> Warning in doTryCatch(return(expr), name, parentenv, handler): restarting interrupted
#> promise evaluationSession information
The R code was run on:
date()
#> [1] "Mon Mar 23 11:26:36 2026"Information on the R session:
sessionInfo()
#> R version 4.5.2 (2025-10-31 ucrt)
#> Platform: x86_64-w64-mingw32/x64
#> Running under: Windows 11 x64 (build 26100)
#>
#> Matrix products: default
#> LAPACK version 3.12.1
#>
#> locale:
#> [1] LC_COLLATE=English_United States.utf8 LC_CTYPE=English_United States.utf8
#> [3] LC_MONETARY=English_United States.utf8 LC_NUMERIC=C
#> [5] LC_TIME=English_United States.utf8
#>
#> time zone: Europe/Rome
#> tzcode source: internal
#>
#> attached base packages:
#> [1] stats4 stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] MsDataHub_1.10.0 dplyr_1.2.0 RuSirius_0.2.0
#> [4] jsonlite_2.0.0 MetaboAnnotation_1.14.0 RSirius_6.3.3
#> [7] xcms_4.8.0 MsExperiment_1.12.0 ProtGenerics_1.42.0
#> [10] Spectra_1.20.1 BiocParallel_1.44.0 S4Vectors_0.48.0
#> [13] BiocGenerics_0.56.0 generics_0.1.4
#>
#> loaded via a namespace (and not attached):
#> [1] RColorBrewer_1.1-3 MultiAssayExperiment_1.36.1 magrittr_2.0.4
#> [4] farver_2.1.2 MALDIquant_1.22.3 fs_1.6.6
#> [7] vctrs_0.7.1 memoise_2.0.1 RCurl_1.98-1.17
#> [10] base64enc_0.1-6 htmltools_0.5.9 S4Arrays_1.10.1
#> [13] BiocBaseUtils_1.12.0 progress_1.2.3 curl_7.0.0
#> [16] AnnotationHub_4.0.0 SparseArray_1.10.8 mzID_1.48.0
#> [19] htmlwidgets_1.6.4 plyr_1.8.9 httr2_1.2.2
#> [22] impute_1.84.0 cachem_1.1.0 igraph_2.2.1
#> [25] lifecycle_1.0.5 iterators_1.0.14 pkgconfig_2.0.3
#> [28] Matrix_1.7-4 R6_2.6.1 fastmap_1.2.0
#> [31] MatrixGenerics_1.22.0 clue_0.3-66 digest_0.6.39
#> [34] pcaMethods_2.2.0 rsvg_2.7.0 AnnotationDbi_1.72.0
#> [37] ExperimentHub_3.0.0 GenomicRanges_1.62.1 RSQLite_2.4.5
#> [40] filelock_1.0.3 httr_1.4.7 abind_1.4-8
#> [43] compiler_4.5.2 withr_3.0.2 bit64_4.6.0-1
#> [46] doParallel_1.0.17 S7_0.2.1 DBI_1.2.3
#> [49] MASS_7.3-65 ChemmineR_3.62.0 rappdirs_0.3.4
#> [52] DelayedArray_0.36.0 rjson_0.2.23 mzR_2.44.0
#> [55] tools_4.5.2 PSMatch_1.14.0 otel_0.2.0
#> [58] CompoundDb_1.14.2 glue_1.8.0 QFeatures_1.20.0
#> [61] grid_4.5.2 cluster_2.1.8.1 reshape2_1.4.5
#> [64] snow_0.4-4 gtable_0.3.6 preprocessCore_1.72.0
#> [67] tidyr_1.3.2 data.table_1.18.2.1 hms_1.1.4
#> [70] MetaboCoreUtils_1.19.2 xml2_1.5.2 XVector_0.50.0
#> [73] BiocVersion_3.22.0 foreach_1.5.2 pillar_1.11.1
#> [76] stringr_1.6.0 limma_3.66.0 BiocFileCache_3.0.0
#> [79] lattice_0.22-7 bit_4.6.0 tidyselect_1.2.1
#> [82] Biostrings_2.78.0 knitr_1.51 gridExtra_2.3
#> [85] IRanges_2.44.0 Seqinfo_1.0.0 SummarizedExperiment_1.40.0
#> [88] xfun_0.56 Biobase_2.70.0 statmod_1.5.1
#> [91] MSnbase_2.36.0 matrixStats_1.5.0 DT_0.34.0
#> [94] stringi_1.8.7 yaml_2.3.12 lazyeval_0.2.2
#> [97] evaluate_1.0.5 codetools_0.2-20 MsCoreUtils_1.22.1
#> [100] tibble_3.3.1 BiocManager_1.30.27 cli_3.6.5
#> [103] affyio_1.80.0 Rcpp_1.1.1 MassSpecWavelet_1.76.0
#> [106] dbplyr_2.5.1 png_0.1-8 XML_3.99-0.20
#> [109] parallel_4.5.2 ggplot2_4.0.2 blob_1.3.0
#> [112] prettyunits_1.2.0 AnnotationFilter_1.34.0 bitops_1.0-9
#> [115] MsFeatures_1.18.0 scales_1.4.0 affy_1.88.0
#> [118] ncdf4_1.24 purrr_1.2.1 crayon_1.5.3
#> [121] rlang_1.1.7 KEGGREST_1.50.0 vsn_3.78.1