Compiled: Fri Jan 17 08:59:18 2025
Introduction
The SpectriPy package allows integration of Python MS
packages into a Spectra-based MS analysis in R. Python
functionality is wrapped into R functions allowing a seamless
integration of the functionality of Python’s matchms package
into R. In addition, functions to convert between R’s
Spectra
objects and Python’s matchms spectrum
objects are available to the advanced user or developer enabling to
create custom functions or workflows on Spectra
objects in
Python and executing them in R using the reticulate R
package.
Installation
The package requires a python environment to be available and can be
installed with the BiocManager R package using the command
BiocManager::install("RforMassSpectrometry/SpectriPy")
. All
required python libraries are installed automatically on demand. Note
that first installation or first invocation of specific functions might
take long because of this installation process.
Spectra similarity calculations using matchms
The SpectriPy package provides the
compareSpectriPy()
function that allows to perform spectra
similarity calculations using the scoring functions from the
matchms Python package. Below all currently supported scoring
functions are listed along with the parameter class that allows
selecting and configuring the algorithm in the
compareSpectriPy()
function. Additional functions will be
added in future.
-
CosineGreedy:
CosineGreedyParam
. -
CosineHungarian:
CosineHungarianParam
. -
ModifiedCosineParam:
ModifiedCosineParam
.
We next create some simple example spectra and subsequently use the
compareSpectriPy()
function to calculate pairwise
similarities between these.
library(Spectra)
library(SpectriPy)
## Create a Spectra object with two MS2 spectra for Caffeine.
caf <- DataFrame(
msLevel = c(2L, 2L),
name = "Caffeine",
precursorMz = c(195.0877, 195.0877)
)
caf$intensity <- list(
c(340.0, 416, 2580, 412),
c(388.0, 3270, 85, 54, 10111))
caf$mz <- list(
c(135.0432, 138.0632, 163.0375, 195.0880),
c(110.0710, 138.0655, 138.1057, 138.1742, 195.0864))
caf <- Spectra(caf)
## Create a Spectra object with two MS2 spectra for 1-Methylhistidine
mhd <- DataFrame(
msLevel = c(2L, 2L),
precursorMz = c(170.0924, 170.0924),
id = c("HMDB0000001", "HMDB0000001"),
name = c("1-Methylhistidine", "1-Methylhistidine"))
mhd$mz <- list(
c(109.2, 124.2, 124.5, 170.16, 170.52),
c(83.1, 96.12, 97.14, 109.14, 124.08, 125.1, 170.16))
mhd$intensity <- list(
c(3.407, 47.494, 3.094, 100.0, 13.240),
c(6.685, 4.381, 3.022, 16.708, 100.0, 4.565, 40.643))
mhd <- Spectra(mhd)
We first calculate pairwise similarities between all spectra defined
above and those of caffeine using Spectra’s built-in
compareSpectra()
function.
all <- c(caf, mhd)
res_r <- compareSpectra(all, caf)
res_r
## 1 2
## 1 1.0000000 0.1973448
## 2 0.1973448 1.0000000
## 3 0.0000000 0.0000000
## 4 0.0000000 0.0000000
Thus, compareSpectra()
returned the pairwise similarity
scores (by default calculated using the normalized dot-product function)
between all spectra in all
(rows) and all spectra in
caf
(columns). compareSpectriPy()
works
similar, with the difference that we need to specify and configure the
similarity function (from matchms) using a dedicated parameter
object. Below we calculate the similarity using the
CosineGreedy function changing the tolerance
to a
value of 0.05
(instead of the default
0.1
).
res <- compareSpectriPy(all, caf, param = CosineGreedyParam(tolerance = 0.05))
res
## [,1] [,2]
## [1,] 1.0000000 0.1948181
## [2,] 0.1948181 1.0000000
## [3,] 0.0000000 0.0000000
## [4,] 0.0000000 0.0000000
As a result compareSpectriPy
returns also a numeric
matrix of similarities. Note also that the first
compareSpectriPy
call takes usually a little longer because
the Python setup has to be initialized.
Next we use the ModifiedCosine algorithm that considers also differences between the spectra’s precursor m/z in the calculation.
res <- compareSpectriPy(all, caf, param = ModifiedCosineParam())
res
## [,1] [,2]
## [1,] 1.00000000 0.1948181
## [2,] 0.19481813 1.0000000
## [3,] 0.13841831 0.8520549
## [4,] 0.05724816 0.3523997
Note that for this calculation all spectra precursor m/z values need
to be available, otherwise an error will be thrown. Thus, we should
always ensure to remove spectra without precursor m/z values prior to
similarity scoring with this similarity method. Below we remove the
precursor m/z from one of our input spectra and then show how the
Spectra
object could be subsetted to valid spectra
for this method.
## Remove precursor m/z from the 3rd spectrum
all$precursorMz[3] <- NA
## Filter the input spectra removing those with missing precursor.
all <- all[!is.na(precursorMz(all))]
compareSpectriPy(all, caf, param = ModifiedCosineParam())
## [,1] [,2]
## [1,] 1.00000000 0.1948181
## [2,] 0.19481813 1.0000000
## [3,] 0.05724816 0.3523997
Advanced use and internals
For advanced users or developers, the rspec_to_pyspec
and pyspec_to_rspec
functions are available that enable
conversion between MS data representations in R
and Python
(i.e. between the Spectra
object and the Python
matchms
Spectrum object). To use these functions the
reticulate
package needs to be installed along with a
Python environment and the matchms
Python package.
To illustrate their use we initialize below the Python environment
that is bundled using the basilisk
package within SpectriPy
.
## Loading required package: reticulate
cl <- basiliskStart(SpectriPy:::matchms_env)
We next create a simple Spectra
object representing
fragment spectra for some small compounds.
library(Spectra)
# create example spectra
spd <- DataFrame(
msLevel = c(2L, 2L, 2L),
id = c("HMDB0000001", "HMDB0000001", "HMDB0001847"),
name = c("1-Methylhistidine", "1-Methylhistidine", "Caffeine"))
## Assign m/z and intensity values.
spd$mz <- list(
c(109.2, 124.2, 124.5, 170.16, 170.52),
c(83.1, 96.12, 97.14, 109.14, 124.08, 125.1, 170.16),
c(56.0494, 69.0447, 83.0603, 109.0395, 110.0712,
111.0551, 123.0429, 138.0662, 195.0876))
spd$intensity <- list(
c(3.407, 47.494, 3.094, 100.0, 13.240),
c(6.685, 4.381, 3.022, 16.708, 100.0, 4.565, 40.643),
c(0.459, 2.585, 2.446, 0.508, 8.968, 0.524, 0.974, 100.0, 40.994))
sps <- Spectra(spd)
We next convert the Spectra
to a matchms
Spectrum object.
pysps <- rspec_to_pyspec(sps)
pysps
## [Spectrum(precursor m/z=nan, 5 fragments between 109.2 and 170.5), Spectrum(precursor m/z=nan, 7 fragments between 83.1 and 170.2), Spectrum(precursor m/z=nan, 9 fragments between 56.0 and 195.1)]
As a result we got now a Python list with 3 Spectrum objects
containing the peak data of sps
as well as a reduced set of
the available spectra variables in sps
. Which spectra
variables get copied to Python can be defined with the
mapping
parameter of rspec_to_pyspec
. The
default is to convert all variables defined by
spectraVariableMapping()
, but additional variables along
with their respective names in the Python Spectrum object can be defined
too. Below we list the pre-selected spectra variables that are converted
by default.
## precursorMz precursorIntensity
## "precursor_mz" "precursor_intensity"
## precursorCharge rtime
## "charge" "retention_time"
## collisionEnergy isolationWindowTargetMz
## "collision_energy" "isolation_window_target_mz"
## msLevel
## "ms_level"
With our Spectra
data converted to Python, we could
directly call all routines from the matchms
package using
the reticulate
R package. Below we normalize the
intensities of the 3 spectra using the from
normalize_intensities
functions from the
matchms.filtering
Python package. We thus need to import
first the functionality from that package and can then directly call the
function on the objects.
library(reticulate)
filters <- import("matchms.filtering")
res <- vector("list", length(pysps))
for (i in (seq_along(pysps) - 1))
res[[i + 1]] <- filters$normalize_intensities(pysps[i])
res <- r_to_py(res)
res
## [Spectrum(precursor m/z=nan, 5 fragments between 109.2 and 170.5), Spectrum(precursor m/z=nan, 7 fragments between 83.1 and 170.2), Spectrum(precursor m/z=nan, 9 fragments between 56.0 and 195.1)]
We can now convert the list of Python Spectrum objects back to R with
pyspec_to_rspec
:
sps_r <- pyspec_to_rspec(res)
#' The normalized intensities
intensity(sps_r)
## NumericList of length 3
## [[1]] 0.03407 0.47494 0.03094 1 0.1324
## [[2]] 0.06685 0.04381 0.03022 0.16708 1 0.04565 0.40643
## [[3]] 0.00459 0.02585 0.02446 0.00508 0.08968 0.00524 0.00974 1 0.40994
#' The original intensities
intensity(sps)
## NumericList of length 3
## [[1]] 3.407 47.494 3.094 100 13.24
## [[2]] 6.685 4.381 3.022 16.708 100 4.565 40.643
## [[3]] 0.459 2.585 2.446 0.508 8.968 0.524 0.974 100 40.994
Intensity values have now been normalized to values between 0 and 1.
Note however that, it would be much better (and likely more efficient) to directly use a Python function on the list of Spectrum objects instead of the mixed R and Python code used in the example above. Below we define a simple python script that iterates over the spectra in python and performs the normalization.
py_script <- paste0("from matchms.filtering import normalize_intensities\n",
"for i in range(len(pysps)):\n",
" pysps[i] = normalize_intensities(pysps[i])\n")
py$pysps <- pysps
py_run_string(py_script)
tmp <- pyspec_to_rspec(pysps)
intensity(tmp)
## NumericList of length 3
## [[1]] 0.03407 0.47494 0.03094 1 0.1324
## [[2]] 0.06685 0.04381 0.03022 0.16708 1 0.04565 0.40643
## [[3]] 0.00459 0.02585 0.02446 0.00508 0.08968 0.00524 0.00974 1 0.40994
At very last we need also to stop the Python environment created by
basilisk
.
basiliskStop(cl)
Session information
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] basilisk_1.18.0 reticulate_1.40.0 SpectriPy_0.2.0
## [4] Spectra_1.16.0 BiocParallel_1.40.0 S4Vectors_0.44.0
## [7] BiocGenerics_0.52.0 BiocStyle_2.34.0
##
## loaded via a namespace (and not attached):
## [1] Matrix_1.7-1 jsonlite_1.8.9 compiler_4.4.2
## [4] BiocManager_1.30.25 filelock_1.0.3 Rcpp_1.0.14
## [7] parallel_4.4.2 cluster_2.1.8 jquerylib_0.1.4
## [10] png_0.1-8 systemfonts_1.1.0 IRanges_2.40.1
## [13] textshaping_0.4.1 yaml_2.3.10 fastmap_1.2.0
## [16] lattice_0.22-6 R6_2.5.1 ProtGenerics_1.38.0
## [19] knitr_1.49 htmlwidgets_1.6.4 MASS_7.3-64
## [22] bookdown_0.42 desc_1.4.3 bslib_0.8.0
## [25] rlang_1.1.4 dir.expiry_1.14.0 cachem_1.1.0
## [28] xfun_0.50 fs_1.6.5 MsCoreUtils_1.18.0
## [31] sass_0.4.9 cli_3.6.3 withr_3.0.2
## [34] pkgdown_2.1.1.9000 grid_4.4.2 digest_0.6.37
## [37] MetaboCoreUtils_1.14.0 lifecycle_1.0.4 clue_0.3-66
## [40] evaluate_1.0.3 codetools_0.2-20 ragg_1.3.3
## [43] rmarkdown_2.29 basilisk.utils_1.18.0 tools_4.4.2
## [46] htmltools_0.5.8.1