Compiled: Thu Feb 6 16:06:26 2025
Introduction
The SpectriPy package allows integration of Python MS
packages into a Spectra-based MS analysis in R. Python
functionality is wrapped into R functions allowing a seamless
integration of the functionality of Python’s matchms package
into R. In addition, functions to convert between R’s
Spectra
objects and Python’s matchms spectrum
objects are available to the advanced user or developer enabling to
create custom functions or workflows on Spectra
objects in
Python and executing them in R using the reticulate R
package.
Installation
If the BiocManager package is not already available, please
install it with install.packages("BiocManager")
To install the development version of SpectriPy from GitHub, please install the remotes package with `install.packages(“remotes”).
The package requires a python environment to be available and can be
installed with the BiocManager R package using the command
BiocManager::install("RforMassSpectrometry/SpectriPy")
.
This will install the latest version of the package from GitHub. All
required python libraries are installed automatically on demand. Note
that first installation or first invocation of specific functions might
take long because of this installation process.
Spectra similarity calculations using matchms
The SpectriPy package provides the
compareSpectriPy()
function that allows to perform spectra
similarity calculations using the scoring functions from the
matchms Python package. Below all currently supported scoring
functions are listed along with the parameter class that allows
selecting and configuring the algorithm in the
compareSpectriPy()
function. Additional functions will be
added in future.
-
CosineGreedy:
CosineGreedyParam
. -
CosineHungarian:
CosineHungarianParam
. -
ModifiedCosineParam:
ModifiedCosineParam
.
We next create some simple example spectra and subsequently use the
compareSpectriPy()
function to calculate pairwise
similarities between these.
library(Spectra)
library(SpectriPy)
## Create a Spectra object with two MS2 spectra for Caffeine.
caf <- DataFrame(
msLevel = c(2L, 2L),
name = "Caffeine",
precursorMz = c(195.0877, 195.0877)
)
caf$intensity <- list(
c(340.0, 416, 2580, 412),
c(388.0, 3270, 85, 54, 10111))
caf$mz <- list(
c(135.0432, 138.0632, 163.0375, 195.0880),
c(110.0710, 138.0655, 138.1057, 138.1742, 195.0864))
caf <- Spectra(caf)
## Create a Spectra object with two MS2 spectra for 1-Methylhistidine
mhd <- DataFrame(
msLevel = c(2L, 2L),
precursorMz = c(170.0924, 170.0924),
id = c("HMDB0000001", "HMDB0000001"),
name = c("1-Methylhistidine", "1-Methylhistidine"))
mhd$mz <- list(
c(109.2, 124.2, 124.5, 170.16, 170.52),
c(83.1, 96.12, 97.14, 109.14, 124.08, 125.1, 170.16))
mhd$intensity <- list(
c(3.407, 47.494, 3.094, 100.0, 13.240),
c(6.685, 4.381, 3.022, 16.708, 100.0, 4.565, 40.643))
mhd <- Spectra(mhd)
We first calculate pairwise similarities between all spectra defined
above and those of caffeine using Spectra’s built-in
compareSpectra()
function.
all <- c(caf, mhd)
res_r <- compareSpectra(all, caf)
res_r
## 1 2
## 1 1.0000000 0.1973448
## 2 0.1973448 1.0000000
## 3 0.0000000 0.0000000
## 4 0.0000000 0.0000000
Thus, compareSpectra()
returned the pairwise similarity
scores (by default calculated using the normalized dot-product function)
between all spectra in all
(rows) and all spectra in
caf
(columns). compareSpectriPy()
works
similar, with the difference that we need to specify and configure the
similarity function (from matchms) using a dedicated parameter
object. Below we calculate the similarity using the
CosineGreedy function changing the tolerance
to a
value of 0.05
(instead of the default
0.1
).
Executing the following step will automatically download and install conda mini forge to manage Python package dependencies.
res <- compareSpectriPy(all, caf, param = CosineGreedyParam(tolerance = 0.05))
res
## [,1] [,2]
## [1,] 1.0000000 0.1948181
## [2,] 0.1948181 1.0000000
## [3,] 0.0000000 0.0000000
## [4,] 0.0000000 0.0000000
As a result compareSpectriPy()
returns also a numeric
matrix of similarities. Note also that the first
compareSpectriPy()
call takes usually a little longer
because the Python setup has to be initialized.
Next we use the ModifiedCosine algorithm that considers also differences between the spectra’s precursor m/z in the calculation.
res <- compareSpectriPy(all, caf, param = ModifiedCosineParam())
res
## [,1] [,2]
## [1,] 1.00000000 0.1948181
## [2,] 0.19481813 1.0000000
## [3,] 0.13841831 0.8520549
## [4,] 0.05724816 0.3523997
A third example is the CosineHungarian algorithm, also a
similarity function from matchms. This algorithm calculates the
cosine similarity score as with CosineGreedyParam
,
but using the Hungarian algorithm to find the best matching peaks
between the compared spectra. The algorithm can be configured with
parameters tolerance
, mzPower
and
intensityPower
(see parameter description for more
details).
res <- compareSpectriPy(all, caf, param = CosineHungarianParam())
res
## [,1] [,2]
## [1,] 1.0000000 0.1948181
## [2,] 0.1948181 1.0000000
## [3,] 0.0000000 0.0000000
## [4,] 0.0000000 0.0000000
A fourth example is the NeutralLossesCosine algorithm, also
a similarity function from matchms. The neutral losses cosine
score aims at quantifying the similarity between two mass spectra. The
score is calculated by finding best possible matches between peaks of
two spectra. Two peaks are considered a potential match if their
m/z ratios lie within the given tolerance
once a
mass-shift is applied. The mass shift is the difference in
precursor-m/z between the two spectra.
res <- compareSpectriPy(all, caf, param = NeutralLossesCosineParam())
res
## [,1] [,2]
## [1,] 1.00000000 0.04853991
## [2,] 0.04853991 1.00000000
## [3,] 0.00000000 0.00000000
## [4,] 0.00000000 0.00000000
Note that for the abovementioned similarity calculations all spectra
precursor m/z values need to be available, otherwise an error
will be thrown. Thus, we should always ensure to remove spectra without
precursor m/z values prior to similarity scoring with this
similarity method. Below we remove the precursor m/z from one
of our input spectra and then show how the Spectra
object
could be subsetted to valid spectra for this method.
## Remove precursor m/z from the 3rd spectrum
all$precursorMz[3] <- NA
## Filter the input spectra removing those with missing precursor.
all <- all[!is.na(precursorMz(all))]
compareSpectriPy(all, caf, param = ModifiedCosineParam())
## [,1] [,2]
## [1,] 1.00000000 0.1948181
## [2,] 0.19481813 1.0000000
## [3,] 0.05724816 0.3523997
Advanced use and internals
For advanced users or developers, the rspec_to_pyspec()
and pyspec_to_rspec()
functions are available that enable
conversion between MS data representations in R and Python (i.e. between
the Spectra::Spectra
object and the Python
matchms.Spectrum
object). To use these functions the
reticulate R package needs to be installed along with a Python
environment and the matchms Python package.
To illustrate their use we initialize below the Python environment that is bundled using the basilisk package within SpectriPy.
## Loading required package: reticulate
cl <- basiliskStart(SpectriPy:::matchms_env)
We next create a simple Spectra
object representing
fragment spectra for some small compounds.
library(Spectra)
# create example spectra
spd <- DataFrame(
msLevel = c(2L, 2L, 2L),
id = c("HMDB0000001", "HMDB0000001", "HMDB0001847"),
name = c("1-Methylhistidine", "1-Methylhistidine", "Caffeine"))
## Assign m/z and intensity values.
spd$mz <- list(
c(109.2, 124.2, 124.5, 170.16, 170.52),
c(83.1, 96.12, 97.14, 109.14, 124.08, 125.1, 170.16),
c(56.0494, 69.0447, 83.0603, 109.0395, 110.0712,
111.0551, 123.0429, 138.0662, 195.0876))
spd$intensity <- list(
c(3.407, 47.494, 3.094, 100.0, 13.240),
c(6.685, 4.381, 3.022, 16.708, 100.0, 4.565, 40.643),
c(0.459, 2.585, 2.446, 0.508, 8.968, 0.524, 0.974, 100.0, 40.994))
sps <- Spectra(spd)
We next convert the Spectra
to a
matchms.Spectrum
object.
pysps <- rspec_to_pyspec(sps)
pysps
## [Spectrum(precursor m/z=nan, 5 fragments between 109.2 and 170.5), Spectrum(precursor m/z=nan, 7 fragments between 83.1 and 170.2), Spectrum(precursor m/z=nan, 9 fragments between 56.0 and 195.1)]
As a result we got now a Python list with 3 Spectrum objects
containing the peak data of sps
as well as a reduced set of
the available spectra variables in sps
. Which spectra
variables get copied to Python can be defined with the
mapping
parameter of rspec_to_pyspec
. The
default is to convert all variables defined by
spectraVariableMapping()
, but additional variables along
with their respective names in the Python Spectrum object can be defined
too. Below we list the pre-selected spectra variables that are converted
by default.
## precursorMz precursorIntensity
## "precursor_mz" "precursor_intensity"
## precursorCharge rtime
## "charge" "retention_time"
## collisionEnergy isolationWindowTargetMz
## "collision_energy" "isolation_window_target_mz"
## msLevel
## "ms_level"
With our Spectra
data converted to Python, we could
directly call all routines from the matchms
package using
the reticulate R package. Below we normalize the intensities of
the 3 spectra using the normalize_intensities()
function
from the matchms.filtering library. We thus need to first
import the functionality from that package and can then call the
function directly on the (Python) objects.
library(reticulate)
filters <- import("matchms.filtering")
res <- vector("list", length(pysps))
for (i in (seq_along(pysps) - 1))
res[[i + 1]] <- filters$normalize_intensities(pysps[i])
res <- r_to_py(res)
res
## [Spectrum(precursor m/z=nan, 5 fragments between 109.2 and 170.5), Spectrum(precursor m/z=nan, 7 fragments between 83.1 and 170.2), Spectrum(precursor m/z=nan, 9 fragments between 56.0 and 195.1)]
We can now convert the list of Python matchms.Spectrum
objects back to R with pyspec_to_rspec()
:
sps_r <- pyspec_to_rspec(res)
#' The normalized intensities
intensity(sps_r)
## NumericList of length 3
## [[1]] 0.03407 0.47494 0.03094 1 0.1324
## [[2]] 0.06685 0.04381 0.03022 0.16708 1 0.04565 0.40643
## [[3]] 0.00459 0.02585 0.02446 0.00508 0.08968 0.00524 0.00974 1 0.40994
#' The original intensities
intensity(sps)
## NumericList of length 3
## [[1]] 3.407 47.494 3.094 100 13.24
## [[2]] 6.685 4.381 3.022 16.708 100 4.565 40.643
## [[3]] 0.459 2.585 2.446 0.508 8.968 0.524 0.974 100 40.994
Intensity values have now been normalized to values between 0 and 1.
Note however that, it would be much better (and likely more efficient) to directly use a Python function on the list of Spectrum objects instead of the mixed R and Python code used in the example above. Below we define a simple python script that iterates over the spectra in python and performs the normalization.
py_script <- paste0("from matchms.filtering import normalize_intensities\n",
"for i in range(len(pysps)):\n",
" pysps[i] = normalize_intensities(pysps[i])\n")
py$pysps <- pysps
py_run_string(py_script)
tmp <- pyspec_to_rspec(pysps)
intensity(tmp)
## NumericList of length 3
## [[1]] 0.03407 0.47494 0.03094 1 0.1324
## [[2]] 0.06685 0.04381 0.03022 0.16708 1 0.04565 0.40643
## [[3]] 0.00459 0.02585 0.02446 0.00508 0.08968 0.00524 0.00974 1 0.40994
At very last we need also to stop the Python environment enabled by basilisk.
basiliskStop(cl)
Session information
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] basilisk_1.18.0 reticulate_1.40.0 SpectriPy_0.2.1
## [4] Spectra_1.16.0 BiocParallel_1.40.0 S4Vectors_0.44.0
## [7] BiocGenerics_0.52.0 BiocStyle_2.34.0
##
## loaded via a namespace (and not attached):
## [1] Matrix_1.7-2 jsonlite_1.8.9 compiler_4.4.2
## [4] BiocManager_1.30.25 filelock_1.0.3 Rcpp_1.0.14
## [7] parallel_4.4.2 cluster_2.1.8 jquerylib_0.1.4
## [10] png_0.1-8 systemfonts_1.2.1 IRanges_2.40.1
## [13] textshaping_1.0.0 yaml_2.3.10 fastmap_1.2.0
## [16] lattice_0.22-6 R6_2.5.1 ProtGenerics_1.38.0
## [19] knitr_1.49 htmlwidgets_1.6.4 MASS_7.3-64
## [22] bookdown_0.42 desc_1.4.3 bslib_0.9.0
## [25] rlang_1.1.5 dir.expiry_1.14.0 cachem_1.1.0
## [28] xfun_0.50 fs_1.6.5 MsCoreUtils_1.18.0
## [31] sass_0.4.9 cli_3.6.3 withr_3.0.2
## [34] pkgdown_2.1.1.9000 grid_4.4.2 digest_0.6.37
## [37] MetaboCoreUtils_1.14.0 lifecycle_1.0.4 clue_0.3-66
## [40] evaluate_1.0.3 codetools_0.2-20 ragg_1.3.3
## [43] rmarkdown_2.29 basilisk.utils_1.18.0 tools_4.4.2
## [46] htmltools_0.5.8.1