
LC-MS/MS Data Annotation using R and Python
2025-04-10
Source:vignettes/SpectriPy_tutorial_metabonaut.qmd
Introduction
The SpectriPy R package enables powerful mass spectrometry (MS) data analysis workflows combining the strengths of Python and R MS libraries. General concepts and examples of the package are available in the package’s main vignette.
As an example for a combined R/Python workflow, we annotate the LC-MS/MS spectra from the main end-to-end vignette using Python’s matchms library.
The MS2 processing methods which will be demonstrated on the used spectral reference library consists of the default filtering functionality from matchms (Huber et al. 2020) (default_filters()
, add_parent_mass()
and normalize_intensities()
).
The MS2 spectral similarity algorithm which is demonstrated here is the ModifiedCosine()
from matchms (Huber et al. 2020).
The spectral reference library used for the annotation of the unknown features in this tutorial originates from a small in-house reference library (provided in MGF format) available in the SpectriPy package.
The analysis in this document is performed using R and Python code chunks. A comment #’ R session: or #’ Python session: is used for easier distinction of these.
Load SpectriPy
Load the required R SpectriPy package. If you already have a Python environment opened, please restart your Integrated Development Environment and run this as a first command, to load the required package reticulate and setup the conda environment ‘r-spectripy’ correctly. Please see the Detailed information on installation and configuration document for other options.
Load query MS2 data
The LC-MS/MS query data used in this tutorial, are derived from the Metabonaut resource (Louail and Rainer 2025). Introduction and a thorough description of the preprocessing steps performed are described in A Complete End-to-End Workflow for untargeted LC-MS/MS Metabolomics Data Analysis in R.
First, we load the Spectra
object with the MS2 spectra of the unknown features found to be significant in the “Differential abundance analysis”, see section MS2-based annotation. This Spectra
object is shared as part of the Metabonaut package.
#' R session:
#' R MS package
library(Spectra)
#' Load the MS2 spectra of significant features
load(system.file("extdata", "spectra_significant_fts.RData",
package = "Metabonaut"))
ms2_ctr_fts
MSn data (Spectra) with 315 spectra in a MsBackendMemory backend:
msLevel rtime scanIndex
<integer> <numeric> <integer>
1 2 147.357 2043
2 2 148.587 2061
3 2 149.817 2079
4 2 152.297 2115
5 2 147.376 2041
... ... ... ...
311 2 178.082 2481
312 2 179.322 2499
313 2 180.572 2517
314 2 181.822 2535
315 2 183.072 2553
... 39 more variables/columns.
Processing:
Filter: select retention time [10..240] on MS level(s) 1 2 [Tue Mar 18 11:56:42 2025]
Filter: select MS level(s) 2 [Tue Mar 18 11:56:50 2025]
Remove peaks based on their intensities and a user-provided function in spectra of MS level(s) 2. [Tue Mar 18 11:56:50 2025]
...19 more processings. Use 'processingLog' to list all.
#' Print the available metadata, stored in the Spectra object
spectraVariables(ms2_ctr_fts)
[1] "msLevel" "rtime"
[3] "acquisitionNum" "scanIndex"
[5] "dataStorage" "dataOrigin"
[7] "centroided" "smoothed"
[9] "polarity" "precScanNum"
[11] "precursorMz" "precursorIntensity"
[13] "precursorCharge" "collisionEnergy"
[15] "isolationWindowLowerMz" "isolationWindowTargetMz"
[17] "isolationWindowUpperMz" "peaksCount"
[19] "totIonCurrent" "basePeakMZ"
[21] "basePeakIntensity" "ionisationEnergy"
[23] "lowMZ" "highMZ"
[25] "mergedScan" "mergedResultScanNum"
[27] "mergedResultStartScanNum" "mergedResultEndScanNum"
[29] "injectionTime" "filterString"
[31] "spectrumId" "ionMobilityDriftTime"
[33] "scanWindowLowerLimit" "scanWindowUpperLimit"
[35] "electronBeamEnergy" "mtbls_id"
[37] "mtbls_assay_name" "derived_spectral_data_file"
[39] "collision_energy" "feature_id"
#' Print the feature_id of the first spectrum
ms2_ctr_fts$feature_id[1]
[1] "FT0371"
Filter query data
To ensure this Spectra
object only contains MS2 data, we filter to only MS2 spectra with more than 2 fragment peaks per spectrum using some classical filtering functions from the Spectra package.
#' R session:
#' Filter MS2 level data
ms2_ctr_fts <- filterMsLevel(ms2_ctr_fts, 2L)
#' filter minimum 3 fragment peaks
ms2_ctr_fts <- ms2_ctr_fts[lengths(ms2_ctr_fts) >= 3]
ms2_ctr_fts
MSn data (Spectra) with 291 spectra in a MsBackendMemory backend:
msLevel rtime scanIndex
<integer> <numeric> <integer>
1 2 147.357 2043
2 2 148.587 2061
3 2 152.297 2115
4 2 147.376 2041
5 2 148.616 2059
... ... ... ...
287 2 178.082 2481
288 2 179.322 2499
289 2 180.572 2517
290 2 181.822 2535
291 2 183.072 2553
... 39 more variables/columns.
Processing:
Filter: select retention time [10..240] on MS level(s) 1 2 [Tue Mar 18 11:56:42 2025]
Filter: select MS level(s) 2 [Tue Mar 18 11:56:50 2025]
Remove peaks based on their intensities and a user-provided function in spectra of MS level(s) 2. [Tue Mar 18 11:56:50 2025]
...20 more processings. Use 'processingLog' to list all.
Load reference MS2 data
As the in-house spectral library, we import a small test data file in MGF format. This file is part of the SpectriPy package and we below define its path on the local computer.
#' R session:
#' Define a variable with the path and file name of the MGF file
mgf_file <- system.file("extdata", "mgf", "test.mgf", package = "SpectriPy")
The loading of this file is then performed using the Python matchms library. The variable with the MGF file name can be accessed in the associated Python session using r.mgf_file
. The loaded object is a Python list of matchms.Spectrum
objects.
#' Python session:
from matchms.importing import load_from_mgf
#' Read spectra from an MGF formatted file, as Spectrum object
= list(load_from_mgf(r.mgf_file))
mgf_py
#' Nr of spectra
len(mgf_py)
100
#' Access the first spectrum
0] mgf_py[
Spectrum(precursor m/z=259.06, 3 fragments between 213.1 and 259.1)
Note that we can also access the first spectrum from an R session, by starting the command with py$
.
#' R session:
#' Access the first spectrum
py$mgf_py[[1]]
Spectrum(precursor m/z=259.06, 3 fragments between 213.1 and 259.1)
Translate query MS data to Python
First, we check if the R Spectra
object containing the query MS2 data can be accessed in Python using the r.
prefix.
#' Python session:
#' check if the r Spectra object can be accessed in python using the 'r.'
#' prefix. Print the first spectrum
1] r.ms2_ctr_fts[
Spectrum(precursor m/z=138.05, 4 fragments between 73.1 and 92.1)
#' Show which metadata is available in the first spectrum
0].metadata.keys() r.ms2_ctr_fts[
dict_keys(['precursor_mz', 'precursor_intensity', 'charge', 'retention_time', 'collision_energy', 'isolation_window_target_mz', 'ms_level'])
Second, we translate the Spectra
object ms2_ctr_fts
to respective Python matchms.Spectrum
objects using the rspec_to_pyspec()
function. With the py_set_attr()
function we assign the variable directly to an attribute in the Python session (which avoids repeated cross-programming language references).
#' R session:
#' Add mapping for additional spectra variables to the default mapping in R and
#' python, respectively
map = c(defaultSpectraVariableMapping(),
feature_id = 'feature_id')
#' Convert to py Spectrum
py_set_attr(py, "ms2_ctr_fts_py", rspec_to_pyspec(ms2_ctr_fts, mapping = map))
We can now access the converted Spectra
object in Python.
#' Python session:
#' Access the first converted spectrum
0] ms2_ctr_fts_py[
Spectrum(precursor m/z=138.05, 4 fragments between 73.1 and 92.1)
Filter the reference library
Before we run the spectral comparisons of our query data to the MGF reference library, we first apply some MS2 processing from matchms. Default filtering from matchms is performed to standardize ion mode, correct charge and more. See the matchms filtering documentation. Also, we keep only reference spectra with a precursor m/z as the similarity algorithm we use later requires spectra to have a precursor m/z.
#' Python session:
from matchms.filtering import default_filters, normalize_intensities, add_parent_mass
#' Apply filters to clean and enhance each spectrum
= []
clean_mgf_py for spectrum in mgf_py:
#' Apply default filter to standardize ion mode, correct charge and more.
#' Default filter is fully explained at
#' https://matchms.readthedocs.io/en/latest/api/matchms.filtering.html
= default_filters(spectrum)
spectrum #' For missing precursor_mz field: check if there is “pepmass”” entry instead
= add_parent_mass(spectrum)
spectrum #' Scale peak intensities to maximum of 1
= normalize_intensities(spectrum)
spectrum #' Only add spectra that have a precursor m/z
if "precursor_mz" in spectrum.metadata:
clean_mgf_py.append(spectrum)
#' Nr of spectra
len(clean_mgf_py)
78
Calculating spectra similarities using the Modified Cosine algorithm from matchms
We calculate the pairwise spectral similarity between the query spectra and the reference library spectra using Python’s matchms library.
Here, we use the spectral similarity algorithm ModifiedCosine from matchms, from source matchms. This algorithm can easily be exchanged for another spectral similarity calculation from matchms (see here).
#' Python session:
from matchms import calculate_scores
from matchms.similarity import ModifiedCosine
#' Calculate Cosine similarity scores between all spectra
#' For other similarity score methods see
#' https://matchms.readthedocs.io/en/latest/api/matchms.similarity.html
= ModifiedCosine(tolerance = 0.1)
similarity_score = calculate_scores(references = clean_mgf_py,
scores = ms2_ctr_fts_py,
queries = similarity_score)
similarity_function scores
<78x291x2 stacked sparse array containing scores for ('ModifiedCosine_score', 'ModifiedCosine_matches') with 12376 stored elements in COOrdinate format>
We next rearrange the spectra similarity results to make a data frame containing the best matched compound name (derived from the reference library) per queried spectrum.
First, we extract and transpose the scores as a python array. Each row of the array will contain the similarity scores of one spectrum from our query spectra ms2_ctr_fts_py
against the cleaned reference library clean_mgf_py
.
#' Python session:
#' Convert to array and transpose
= scores.to_array()["ModifiedCosine_score"]
sim_matchms = sim_matchms.T
sim_matchms
#' Contains 1 row for each spectrum in query
sim_matchms.shape
(291, 78)
Next, we create a data frame with per queried spectrum from our unknown variables, the compound name of the higest matching spectra from the reference library and the corresponding similarity score.
#' Python session:
import numpy as np
import pandas as pd
#' Prepare results list
= []
results for i in range(sim_matchms.shape[0]):
#' row is the query, keep nr in the results instead of replacing by eg id
= ms2_ctr_fts_py[i].get('feature_id')
name_row = sim_matchms[i].copy()
row_values #' match with higest col nr from the references
= np.argmax(row_values) #' Find column index of max value
max_col = row_values[max_col] #' Get max value
max_value #' replace the nr of refererences with the name
= clean_mgf_py[max_col].get('compound_name')
name_max_col "query": i + 1,
results.append({"query_feature_id": name_row,
"reference": max_col,
"reference_compound_name": name_max_col,
"ModifiedCosine_score": max_value})
#' Convert to DataFrame
= pd.DataFrame(results)
df
#' Print the first 5 rows of the unfiltered DataFrame
df.head()
query query_feature_id ... reference_compound_name ModifiedCosine_score
0 1 FT0371 ... Benzyl succinate 0.554071
1 2 FT0371 ... Benzyl succinate 0.557885
2 3 FT0371 ... L-(+)-Ergothioneine 0.103269
3 4 FT0371 ... Benzyl succinate 0.464949
4 5 FT0371 ... Benzyl succinate 0.508411
[5 rows x 5 columns]
[!] Caution: As the higest score is taken as criteria for the annotation, a lot of caution is needed evaluation the trueness of the match. A low score is not reliable, as the similarity algorithm will calculate a score for each pair of spectra. Therefore, a match will always be found. In addition, if your unknown compound is absent in the reference library, it will match wrongly to another compound that is present in the database.
Below we filter the results retaining only matches with a similarity value above 0.6. Above this value, the potential annotations need to be validated using e.g. rerunning samples in the presence of commercial standards.
#' Python session:
#' Keep only rows where score >= 0.6
= df[df["ModifiedCosine_score"] >= 0.6] df_filtered
#' R session:
library(pander)
#' Print the filtered DataFrame
pandoc.table(py$df_filtered, style = "rmarkdown", split.table = Inf)
query | query_feature_id | reference | reference_compound_name | ModifiedCosine_score | |
---|---|---|---|---|---|
14 | 15 | FT0371 | 52 | Nordihydroguaiaretic acid | 0.6667 |
33 | 34 | FT0565 | 35 | Ethambutol | 0.9623 |
34 | 35 | FT0565 | 35 | Ethambutol | 0.8809 |
37 | 38 | FT0565 | 35 | Ethambutol | 0.9412 |
38 | 39 | FT0565 | 35 | Ethambutol | 0.8681 |
40 | 41 | FT0565 | 35 | Ethambutol | 0.935 |
41 | 42 | FT0565 | 35 | Ethambutol | 0.7319 |
44 | 45 | FT0565 | 35 | Ethambutol | 0.89 |
48 | 49 | FT0565 | 35 | Ethambutol | 0.6822 |
51 | 52 | FT0565 | 38 | Benzyl succinate | 0.7218 |
52 | 53 | FT0565 | 38 | Benzyl succinate | 0.7516 |
53 | 54 | FT0565 | 47 | Bisphenol AF | 0.6977 |
56 | 57 | FT0565 | 39 | Isethionate | 0.6666 |
57 | 58 | FT0565 | 38 | Benzyl succinate | 0.7443 |
67 | 68 | FT0732 | 72 | Prednisolone_Tebutate | 0.7048 |
68 | 69 | FT0732 | 72 | Prednisolone_Tebutate | 0.7729 |
70 | 71 | FT0732 | 72 | Prednisolone_Tebutate | 0.7103 |
71 | 72 | FT0732 | 72 | Prednisolone_Tebutate | 0.6087 |
73 | 74 | FT0732 | 72 | Prednisolone_Tebutate | 0.6523 |
74 | 75 | FT0732 | 72 | Prednisolone_Tebutate | 0.8334 |
87 | 88 | FT0732 | 19 | Isoproturon-didemethyl | 0.6235 |
89 | 90 | FT0732 | 72 | Prednisolone_Tebutate | 0.6793 |
97 | 98 | FT0732 | 9 | N-Methyl-2-pyrrolidone | 0.7953 |
103 | 104 | FT0732 | 72 | Prednisolone_Tebutate | 0.6041 |
104 | 105 | FT0732 | 72 | Prednisolone_Tebutate | 0.6836 |
109 | 110 | FT0732 | 72 | Prednisolone_Tebutate | 0.6295 |
111 | 112 | FT0732 | 11 | 3-(4-Isopropylphenyl)isobutyraldehyde | 0.6005 |
112 | 113 | FT0732 | 16 | Caffeine | 0.6181 |
113 | 114 | FT0732 | 11 | 3-(4-Isopropylphenyl)isobutyraldehyde | 0.6843 |
125 | 126 | FT0732 | 11 | 3-(4-Isopropylphenyl)isobutyraldehyde | 0.6863 |
126 | 127 | FT0732 | 11 | 3-(4-Isopropylphenyl)isobutyraldehyde | 0.7172 |
127 | 128 | FT0732 | 11 | 3-(4-Isopropylphenyl)isobutyraldehyde | 0.7157 |
128 | 129 | FT0732 | 11 | 3-(4-Isopropylphenyl)isobutyraldehyde | 0.721 |
129 | 130 | FT0732 | 11 | 3-(4-Isopropylphenyl)isobutyraldehyde | 0.6315 |
134 | 135 | FT0732 | 72 | Prednisolone_Tebutate | 0.7612 |
137 | 138 | FT0732 | 11 | 3-(4-Isopropylphenyl)isobutyraldehyde | 0.6478 |
139 | 140 | FT0732 | 11 | 3-(4-Isopropylphenyl)isobutyraldehyde | 0.6301 |
140 | 141 | FT0732 | 11 | 3-(4-Isopropylphenyl)isobutyraldehyde | 0.7013 |
141 | 142 | FT0732 | 11 | 3-(4-Isopropylphenyl)isobutyraldehyde | 0.7017 |
142 | 143 | FT0732 | 11 | 3-(4-Isopropylphenyl)isobutyraldehyde | 0.7018 |
143 | 144 | FT0732 | 11 | 3-(4-Isopropylphenyl)isobutyraldehyde | 0.7863 |
186 | 187 | FT0732 | 9 | N-Methyl-2-pyrrolidone | 0.636 |
220 | 221 | FT0845 | 11 | 3-(4-Isopropylphenyl)isobutyraldehyde | 0.737 |
221 | 222 | FT0845 | 11 | 3-(4-Isopropylphenyl)isobutyraldehyde | 0.733 |
222 | 223 | FT0845 | 11 | 3-(4-Isopropylphenyl)isobutyraldehyde | 0.7256 |
223 | 224 | FT0845 | 11 | 3-(4-Isopropylphenyl)isobutyraldehyde | 0.7305 |
227 | 228 | FT0845 | 11 | 3-(4-Isopropylphenyl)isobutyraldehyde | 0.6283 |
228 | 229 | FT0845 | 11 | 3-(4-Isopropylphenyl)isobutyraldehyde | 0.7359 |
229 | 230 | FT0845 | 11 | 3-(4-Isopropylphenyl)isobutyraldehyde | 0.7296 |
230 | 231 | FT0845 | 11 | 3-(4-Isopropylphenyl)isobutyraldehyde | 0.73 |
239 | 240 | FT0845 | 11 | 3-(4-Isopropylphenyl)isobutyraldehyde | 0.7352 |
240 | 241 | FT0845 | 11 | 3-(4-Isopropylphenyl)isobutyraldehyde | 0.7351 |
247 | 248 | FT0845 | 11 | 3-(4-Isopropylphenyl)isobutyraldehyde | 0.7391 |
248 | 249 | FT0845 | 11 | 3-(4-Isopropylphenyl)isobutyraldehyde | 0.7332 |
To visually inspect how good the query and reference spectra match, we refer to the Metabonaut resource on how to generate the mirror plots and perform precursor m/z filtering (e.g. maximum 1 Da difference).
Session information
#' R session:
sessionInfo()
R Under development (unstable) (2025-03-08 r87910)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 24.04.1 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
time zone: Etc/UTC
tzcode source: system (glibc)
attached base packages:
[1] stats4 stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] pander_0.6.6 Spectra_1.17.10 BiocParallel_1.41.5
[4] S4Vectors_0.45.4 BiocGenerics_0.53.6 generics_0.1.3
[7] SpectriPy_0.5.3 reticulate_1.42.0
loaded via a namespace (and not attached):
[1] cli_3.6.4 knitr_1.50 rlang_1.1.5
[4] xfun_0.52 ProtGenerics_1.39.2 png_0.1-8
[7] jsonlite_2.0.0 clue_0.3-66 htmltools_0.5.8.1
[10] rmarkdown_2.29 grid_4.5.0 evaluate_1.0.3
[13] MASS_7.3-65 fastmap_1.2.0 yaml_2.3.10
[16] IRanges_2.41.3 MsCoreUtils_1.19.2 cluster_2.1.8.1
[19] compiler_4.5.0 codetools_0.2-20 fs_1.6.5
[22] Rcpp_1.0.14 MetaboCoreUtils_1.15.0 lattice_0.22-7
[25] digest_0.6.37 parallel_4.5.0 Matrix_1.7-3
[28] tools_4.5.0