LC-MS/MS Data Annotation using R and Python • Metabonaut

Introduction

The SpectriPy R package enables powerful mass spectrometry (MS) data analysis workflows combining the strengths of Python and R MS libraries. General concepts and examples of the package are available in the package’s main vignette.

As an example for a combined R/Python workflow, we annotate the LC-MS/MS spectra from the main end-to-end vignette using Python’s matchms library.

The MS2 processing methods which will be demonstrated on the used spectral reference library consists of the default filtering functionality from matchms (Huber et al. 2020) (default_filters(), add_parent_mass() and normalize_intensities()).

The MS2 spectral similarity algorithm which is demonstrated here is the ModifiedCosine() from matchms (Huber et al. 2020).

The spectral reference library used for the annotation of the unknown features in this tutorial originates from a small in-house reference library (provided in MGF format) available in the SpectriPy package.

The analysis in this document is performed using R and Python code chunks. A comment #’ R session: or #’ Python session: is used for easier distinction of these.

Load SpectriPy

Load the required R SpectriPy package. If you already have a Python environment opened, please restart your Integrated Development Environment and run this as a first command, to load the required package reticulate and setup the conda environment ‘r-spectripy’ correctly. Please see the Detailed information on installation and configuration document for other options.

#' R session:

library(SpectriPy)

Load query MS2 data

The LC-MS/MS query data used in this tutorial, are derived from the Metabonaut resource (Louail and Rainer 2025). Introduction and a thorough description of the preprocessing steps performed are described in A Complete End-to-End Workflow for untargeted LC-MS/MS Metabolomics Data Analysis in R.

First, we load the Spectra object with the MS2 spectra of the unknown features found to be significant in the “Differential abundance analysis”, see section MS2-based annotation. This Spectra object is shared as part of the Metabonaut package.

#' R session:

#' R MS package
library(Spectra)

#' Load the MS2 spectra of significant features
load(system.file("extdata", "spectra_significant_fts.RData",
                 package = "Metabonaut"))
ms2_ctr_fts

MSn data (Spectra) with 315 spectra in a MsBackendMemory backend:
      msLevel     rtime scanIndex
    <integer> <numeric> <integer>
1           2   147.357      2043
2           2   148.587      2061
3           2   149.817      2079
4           2   152.297      2115
5           2   147.376      2041
...       ...       ...       ...
311         2   178.082      2481
312         2   179.322      2499
313         2   180.572      2517
314         2   181.822      2535
315         2   183.072      2553
 ... 39 more variables/columns.
Processing:
 Filter: select retention time [10..240] on MS level(s) 1 2 [Tue Mar 18 11:56:42 2025]
 Filter: select MS level(s) 2 [Tue Mar 18 11:56:50 2025]
 Remove peaks based on their intensities and a user-provided function in spectra of MS level(s) 2. [Tue Mar 18 11:56:50 2025]
 ...19 more processings. Use 'processingLog' to list all.

#' Print the available metadata, stored in the Spectra object
spectraVariables(ms2_ctr_fts)

 [1] "msLevel"                    "rtime"
 [3] "acquisitionNum"             "scanIndex"
 [5] "dataStorage"                "dataOrigin"
 [7] "centroided"                 "smoothed"
 [9] "polarity"                   "precScanNum"
[11] "precursorMz"                "precursorIntensity"
[13] "precursorCharge"            "collisionEnergy"
[15] "isolationWindowLowerMz"     "isolationWindowTargetMz"
[17] "isolationWindowUpperMz"     "peaksCount"
[19] "totIonCurrent"              "basePeakMZ"
[21] "basePeakIntensity"          "ionisationEnergy"
[23] "lowMZ"                      "highMZ"
[25] "mergedScan"                 "mergedResultScanNum"
[27] "mergedResultStartScanNum"   "mergedResultEndScanNum"
[29] "injectionTime"              "filterString"
[31] "spectrumId"                 "ionMobilityDriftTime"
[33] "scanWindowLowerLimit"       "scanWindowUpperLimit"
[35] "electronBeamEnergy"         "mtbls_id"
[37] "mtbls_assay_name"           "derived_spectral_data_file"
[39] "collision_energy"           "feature_id"

#' Print the feature_id of the first spectrum
ms2_ctr_fts$feature_id[1]

[1] "FT0371"

Filter query data

To ensure this Spectra object only contains MS2 data, we filter to only MS2 spectra with more than 2 fragment peaks per spectrum using some classical filtering functions from the Spectra package.

#' R session:

#' Filter MS2 level data
ms2_ctr_fts <- filterMsLevel(ms2_ctr_fts, 2L)

#' filter minimum 3 fragment peaks
ms2_ctr_fts <- ms2_ctr_fts[lengths(ms2_ctr_fts) >= 3]
ms2_ctr_fts

MSn data (Spectra) with 291 spectra in a MsBackendMemory backend:
      msLevel     rtime scanIndex
    <integer> <numeric> <integer>
1           2   147.357      2043
2           2   148.587      2061
3           2   152.297      2115
4           2   147.376      2041
5           2   148.616      2059
...       ...       ...       ...
287         2   178.082      2481
288         2   179.322      2499
289         2   180.572      2517
290         2   181.822      2535
291         2   183.072      2553
 ... 39 more variables/columns.
Processing:
 Filter: select retention time [10..240] on MS level(s) 1 2 [Tue Mar 18 11:56:42 2025]
 Filter: select MS level(s) 2 [Tue Mar 18 11:56:50 2025]
 Remove peaks based on their intensities and a user-provided function in spectra of MS level(s) 2. [Tue Mar 18 11:56:50 2025]
 ...20 more processings. Use 'processingLog' to list all.

Load reference MS2 data

As the in-house spectral library, we import a small test data file in MGF format. This file is part of the SpectriPy package and we below define its path on the local computer.

#' R session:

#' Define a variable with the path and file name of the MGF file
mgf_file <- system.file("extdata", "mgf", "test.mgf", package = "SpectriPy")

The loading of this file is then performed using the Python matchms library. The variable with the MGF file name can be accessed in the associated Python session using r.mgf_file. The loaded object is a Python list of matchms.Spectrum objects.

#' Python session:

from matchms.importing import load_from_mgf

#' Read spectra from an MGF formatted file, as Spectrum object
mgf_py = list(load_from_mgf(r.mgf_file))

#' Nr of spectra
len(mgf_py)

#' Access the first spectrum
mgf_py[0]

Spectrum(precursor m/z=259.06, 3 fragments between 213.1 and 259.1)

Note that we can also access the first spectrum from an R session, by starting the command with py$.

#' R session:

#' Access the first spectrum
py$mgf_py[[1]]

Spectrum(precursor m/z=259.06, 3 fragments between 213.1 and 259.1)

Translate query MS data to Python

First, we check if the R Spectra object containing the query MS2 data can be accessed in Python using the r. prefix.

#' Python session:

#' check if the r Spectra object can be accessed in python using the 'r.'
#' prefix. Print the first spectrum
r.ms2_ctr_fts[1]

Spectrum(precursor m/z=138.05, 4 fragments between 73.1 and 92.1)

#' Show which metadata is available in the first spectrum
r.ms2_ctr_fts[0].metadata.keys()

dict_keys(['precursor_mz', 'precursor_intensity', 'charge', 'retention_time', 'collision_energy', 'isolation_window_target_mz', 'ms_level'])

Second, we translate the Spectra object ms2_ctr_fts to respective Python matchms.Spectrum objects using the rspec_to_pyspec() function. With the py_set_attr() function we assign the variable directly to an attribute in the Python session (which avoids repeated cross-programming language references).

#' R session:

#' Add mapping for additional spectra variables to the default mapping in R and
#' python, respectively
map = c(defaultSpectraVariableMapping(),
        feature_id = 'feature_id')

#' Convert to py Spectrum
py_set_attr(py, "ms2_ctr_fts_py", rspec_to_pyspec(ms2_ctr_fts, mapping = map))

We can now access the converted Spectra object in Python.

#' Python session:

#' Access the first converted spectrum
ms2_ctr_fts_py[0]

Spectrum(precursor m/z=138.05, 4 fragments between 73.1 and 92.1)

Filter the reference library

Before we run the spectral comparisons of our query data to the MGF reference library, we first apply some MS2 processing from matchms. Default filtering from matchms is performed to standardize ion mode, correct charge and more. See the matchms filtering documentation. Also, we keep only reference spectra with a precursor m/z as the similarity algorithm we use later requires spectra to have a precursor m/z.

#' Python session:

from matchms.filtering import default_filters, normalize_intensities, add_parent_mass

#' Apply filters to clean and enhance each spectrum
clean_mgf_py = []
for spectrum in mgf_py:
    #' Apply default filter to standardize ion mode, correct charge and more.
    #' Default filter is fully explained at
    #' https://matchms.readthedocs.io/en/latest/api/matchms.filtering.html
    spectrum = default_filters(spectrum)
    #' For missing precursor_mz field: check if there is “pepmass”” entry instead
    spectrum = add_parent_mass(spectrum)
    #' Scale peak intensities to maximum of 1
    spectrum = normalize_intensities(spectrum)
    #' Only add spectra that have a precursor m/z
    if "precursor_mz" in spectrum.metadata:
        clean_mgf_py.append(spectrum)

#' Nr of spectra
len(clean_mgf_py)

Calculating spectra similarities using the Modified Cosine algorithm from matchms

We calculate the pairwise spectral similarity between the query spectra and the reference library spectra using Python’s matchms library.

Here, we use the spectral similarity algorithm ModifiedCosine from matchms, from source matchms. This algorithm can easily be exchanged for another spectral similarity calculation from matchms (see here).

#' Python session:

from matchms import calculate_scores
from matchms.similarity import ModifiedCosine

#' Calculate Cosine similarity scores between all spectra
#' For other similarity score methods see
#' https://matchms.readthedocs.io/en/latest/api/matchms.similarity.html
similarity_score = ModifiedCosine(tolerance = 0.1)
scores = calculate_scores(references = clean_mgf_py,
                          queries = ms2_ctr_fts_py,
                          similarity_function = similarity_score)
scores

<78x291x2 stacked sparse array containing scores for ('ModifiedCosine_score', 'ModifiedCosine_matches') with 12376 stored elements in COOrdinate format>

We next rearrange the spectra similarity results to make a data frame containing the best matched compound name (derived from the reference library) per queried spectrum.

First, we extract and transpose the scores as a python array. Each row of the array will contain the similarity scores of one spectrum from our query spectra ms2_ctr_fts_py against the cleaned reference library clean_mgf_py.

#' Python session:

#' Convert to array and transpose
sim_matchms = scores.to_array()["ModifiedCosine_score"]
sim_matchms = sim_matchms.T

#' Contains 1 row for each spectrum in query
sim_matchms.shape

(291, 78)

Next, we create a data frame with per queried spectrum from our unknown variables, the compound name of the higest matching spectra from the reference library and the corresponding similarity score.

#' Python session:

import numpy as np
import pandas as pd

#' Prepare results list
results = []
for i in range(sim_matchms.shape[0]):
    #' row is the query, keep nr in the results instead of replacing by eg id
    name_row = ms2_ctr_fts_py[i].get('feature_id')
    row_values = sim_matchms[i].copy()
    #' match with higest col nr from the references
    max_col = np.argmax(row_values)  #' Find column index of max value
    max_value = row_values[max_col]  #' Get max value
    #' replace the nr of refererences with the name
    name_max_col = clean_mgf_py[max_col].get('compound_name')
    results.append({"query": i + 1,
                    "query_feature_id": name_row,
                    "reference": max_col,
                    "reference_compound_name": name_max_col,
                    "ModifiedCosine_score": max_value})


#' Convert to DataFrame
df = pd.DataFrame(results)

#' Print the first 5 rows of the unfiltered DataFrame
df.head()

   query query_feature_id  ...  reference_compound_name ModifiedCosine_score
0      1           FT0371  ...         Benzyl succinate             0.554071
1      2           FT0371  ...         Benzyl succinate             0.557885
2      3           FT0371  ...      L-(+)-Ergothioneine             0.103269
3      4           FT0371  ...         Benzyl succinate             0.464949
4      5           FT0371  ...         Benzyl succinate             0.508411

[5 rows x 5 columns]

[!] Caution: As the higest score is taken as criteria for the annotation, a lot of caution is needed evaluation the trueness of the match. A low score is not reliable, as the similarity algorithm will calculate a score for each pair of spectra. Therefore, a match will always be found. In addition, if your unknown compound is absent in the reference library, it will match wrongly to another compound that is present in the database.

Below we filter the results retaining only matches with a similarity value above 0.6. Above this value, the potential annotations need to be validated using e.g. rerunning samples in the presence of commercial standards.

#' Python session:

#' Keep only rows where score >= 0.6
df_filtered = df[df["ModifiedCosine_score"] >= 0.6]

#' R session:

library(pander)
#' Print the filtered DataFrame
pandoc.table(py$df_filtered, style = "rmarkdown", split.table = Inf)

	query	query_feature_id	reference	reference_compound_name	ModifiedCosine_score
14	15	FT0371	52	Nordihydroguaiaretic acid	0.6667
33	34	FT0565	35	Ethambutol	0.9623
34	35	FT0565	35	Ethambutol	0.8809
37	38	FT0565	35	Ethambutol	0.9412
38	39	FT0565	35	Ethambutol	0.8681
40	41	FT0565	35	Ethambutol	0.935
41	42	FT0565	35	Ethambutol	0.7319
44	45	FT0565	35	Ethambutol	0.89
48	49	FT0565	35	Ethambutol	0.6822
51	52	FT0565	38	Benzyl succinate	0.7218
52	53	FT0565	38	Benzyl succinate	0.7516
53	54	FT0565	47	Bisphenol AF	0.6977
56	57	FT0565	39	Isethionate	0.6666
57	58	FT0565	38	Benzyl succinate	0.7443
67	68	FT0732	72	Prednisolone_Tebutate	0.7048
68	69	FT0732	72	Prednisolone_Tebutate	0.7729
70	71	FT0732	72	Prednisolone_Tebutate	0.7103
71	72	FT0732	72	Prednisolone_Tebutate	0.6087
73	74	FT0732	72	Prednisolone_Tebutate	0.6523
74	75	FT0732	72	Prednisolone_Tebutate	0.8334
87	88	FT0732	19	Isoproturon-didemethyl	0.6235
89	90	FT0732	72	Prednisolone_Tebutate	0.6793
97	98	FT0732	9	N-Methyl-2-pyrrolidone	0.7953
103	104	FT0732	72	Prednisolone_Tebutate	0.6041
104	105	FT0732	72	Prednisolone_Tebutate	0.6836
109	110	FT0732	72	Prednisolone_Tebutate	0.6295
111	112	FT0732	11	3-(4-Isopropylphenyl)isobutyraldehyde	0.6005
112	113	FT0732	16	Caffeine	0.6181
113	114	FT0732	11	3-(4-Isopropylphenyl)isobutyraldehyde	0.6843
125	126	FT0732	11	3-(4-Isopropylphenyl)isobutyraldehyde	0.6863
126	127	FT0732	11	3-(4-Isopropylphenyl)isobutyraldehyde	0.7172
127	128	FT0732	11	3-(4-Isopropylphenyl)isobutyraldehyde	0.7157
128	129	FT0732	11	3-(4-Isopropylphenyl)isobutyraldehyde	0.721
129	130	FT0732	11	3-(4-Isopropylphenyl)isobutyraldehyde	0.6315
134	135	FT0732	72	Prednisolone_Tebutate	0.7612
137	138	FT0732	11	3-(4-Isopropylphenyl)isobutyraldehyde	0.6478
139	140	FT0732	11	3-(4-Isopropylphenyl)isobutyraldehyde	0.6301
140	141	FT0732	11	3-(4-Isopropylphenyl)isobutyraldehyde	0.7013
141	142	FT0732	11	3-(4-Isopropylphenyl)isobutyraldehyde	0.7017
142	143	FT0732	11	3-(4-Isopropylphenyl)isobutyraldehyde	0.7018
143	144	FT0732	11	3-(4-Isopropylphenyl)isobutyraldehyde	0.7863
186	187	FT0732	9	N-Methyl-2-pyrrolidone	0.636
220	221	FT0845	11	3-(4-Isopropylphenyl)isobutyraldehyde	0.737
221	222	FT0845	11	3-(4-Isopropylphenyl)isobutyraldehyde	0.733
222	223	FT0845	11	3-(4-Isopropylphenyl)isobutyraldehyde	0.7256
223	224	FT0845	11	3-(4-Isopropylphenyl)isobutyraldehyde	0.7305
227	228	FT0845	11	3-(4-Isopropylphenyl)isobutyraldehyde	0.6283
228	229	FT0845	11	3-(4-Isopropylphenyl)isobutyraldehyde	0.7359
229	230	FT0845	11	3-(4-Isopropylphenyl)isobutyraldehyde	0.7296
230	231	FT0845	11	3-(4-Isopropylphenyl)isobutyraldehyde	0.73
239	240	FT0845	11	3-(4-Isopropylphenyl)isobutyraldehyde	0.7352
240	241	FT0845	11	3-(4-Isopropylphenyl)isobutyraldehyde	0.7351
247	248	FT0845	11	3-(4-Isopropylphenyl)isobutyraldehyde	0.7391
248	249	FT0845	11	3-(4-Isopropylphenyl)isobutyraldehyde	0.7332

To visually inspect how good the query and reference spectra match, we refer to the Metabonaut resource on how to generate the mirror plots and perform precursor m/z filtering (e.g. maximum 1 Da difference).

Session information

#' R session:

sessionInfo()

R Under development (unstable) (2025-03-08 r87910)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 24.04.1 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

time zone: Etc/UTC
tzcode source: system (glibc)

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods
[8] base

other attached packages:
[1] pander_0.6.6        Spectra_1.17.10     BiocParallel_1.41.5
[4] S4Vectors_0.45.4    BiocGenerics_0.53.6 generics_0.1.3
[7] SpectriPy_0.5.3     reticulate_1.42.0

loaded via a namespace (and not attached):
 [1] cli_3.6.4              knitr_1.50             rlang_1.1.5
 [4] xfun_0.52              ProtGenerics_1.39.2    png_0.1-8
 [7] jsonlite_2.0.0         clue_0.3-66            htmltools_0.5.8.1
[10] rmarkdown_2.29         grid_4.5.0             evaluate_1.0.3
[13] MASS_7.3-65            fastmap_1.2.0          yaml_2.3.10
[16] IRanges_2.41.3         MsCoreUtils_1.19.2     cluster_2.1.8.1
[19] compiler_4.5.0         codetools_0.2-20       fs_1.6.5
[22] Rcpp_1.0.14            MetaboCoreUtils_1.15.0 lattice_0.22-7
[25] digest_0.6.37          parallel_4.5.0         Matrix_1.7-3
[28] tools_4.5.0

References

Huber, Florian, Stefan Verhoeven, Christiaan Meijer, Hanno Spreeuw, Efraín Castilla, Cunliang Geng, Justin Van Der Hooft, et al. 2020. “Matchms - Processing and Similarity Evaluation of Mass Spectrometry Data.” Journal of Open Source Software 5 (52): 2411. https://doi.org/10.21105/joss.02411.

Louail, Philippine, and Johannes Rainer. 2025. “Rformassspectrometry/Metabonaut: Metabonaut Version 1.0.0.” Zenodo. https://doi.org/10.5281/ZENODO.15062930.