Skip to contents

Compiled: Thu Feb 6 16:06:26 2025

Introduction

The SpectriPy package allows integration of Python MS packages into a Spectra-based MS analysis in R. Python functionality is wrapped into R functions allowing a seamless integration of the functionality of Python’s matchms package into R. In addition, functions to convert between R’s Spectra objects and Python’s matchms spectrum objects are available to the advanced user or developer enabling to create custom functions or workflows on Spectra objects in Python and executing them in R using the reticulate R package.

Installation

If the BiocManager package is not already available, please install it with install.packages("BiocManager")

To install the development version of SpectriPy from GitHub, please install the remotes package with `install.packages(“remotes”).

The package requires a python environment to be available and can be installed with the BiocManager R package using the command BiocManager::install("RforMassSpectrometry/SpectriPy"). This will install the latest version of the package from GitHub. All required python libraries are installed automatically on demand. Note that first installation or first invocation of specific functions might take long because of this installation process.

Spectra similarity calculations using matchms

The SpectriPy package provides the compareSpectriPy() function that allows to perform spectra similarity calculations using the scoring functions from the matchms Python package. Below all currently supported scoring functions are listed along with the parameter class that allows selecting and configuring the algorithm in the compareSpectriPy() function. Additional functions will be added in future.

We next create some simple example spectra and subsequently use the compareSpectriPy() function to calculate pairwise similarities between these.

library(Spectra)
library(SpectriPy)

## Create a Spectra object with two MS2 spectra for Caffeine.
caf <- DataFrame(
    msLevel = c(2L, 2L),
    name = "Caffeine",
    precursorMz = c(195.0877, 195.0877)
)
caf$intensity <- list(
    c(340.0, 416, 2580, 412),
    c(388.0, 3270, 85, 54, 10111))
caf$mz <- list(
    c(135.0432, 138.0632, 163.0375, 195.0880),
    c(110.0710, 138.0655, 138.1057, 138.1742, 195.0864))
caf <- Spectra(caf)

## Create a Spectra object with two MS2 spectra for 1-Methylhistidine
mhd <- DataFrame(
    msLevel = c(2L, 2L),
    precursorMz = c(170.0924, 170.0924),
    id = c("HMDB0000001", "HMDB0000001"),
    name = c("1-Methylhistidine", "1-Methylhistidine"))
mhd$mz <- list(
    c(109.2, 124.2, 124.5, 170.16, 170.52),
    c(83.1, 96.12, 97.14, 109.14, 124.08, 125.1, 170.16))
mhd$intensity <- list(
    c(3.407, 47.494, 3.094, 100.0, 13.240),
    c(6.685, 4.381, 3.022, 16.708, 100.0, 4.565, 40.643))
mhd <- Spectra(mhd)

We first calculate pairwise similarities between all spectra defined above and those of caffeine using Spectra’s built-in compareSpectra() function.

all <- c(caf, mhd)
res_r <- compareSpectra(all, caf)
res_r
##           1         2
## 1 1.0000000 0.1973448
## 2 0.1973448 1.0000000
## 3 0.0000000 0.0000000
## 4 0.0000000 0.0000000

Thus, compareSpectra() returned the pairwise similarity scores (by default calculated using the normalized dot-product function) between all spectra in all (rows) and all spectra in caf (columns). compareSpectriPy() works similar, with the difference that we need to specify and configure the similarity function (from matchms) using a dedicated parameter object. Below we calculate the similarity using the CosineGreedy function changing the tolerance to a value of 0.05 (instead of the default 0.1).

Executing the following step will automatically download and install conda mini forge to manage Python package dependencies.

res <- compareSpectriPy(all, caf, param = CosineGreedyParam(tolerance = 0.05))
res
##           [,1]      [,2]
## [1,] 1.0000000 0.1948181
## [2,] 0.1948181 1.0000000
## [3,] 0.0000000 0.0000000
## [4,] 0.0000000 0.0000000

As a result compareSpectriPy() returns also a numeric matrix of similarities. Note also that the first compareSpectriPy() call takes usually a little longer because the Python setup has to be initialized.

Next we use the ModifiedCosine algorithm that considers also differences between the spectra’s precursor m/z in the calculation.

res <- compareSpectriPy(all, caf, param = ModifiedCosineParam())
res
##            [,1]      [,2]
## [1,] 1.00000000 0.1948181
## [2,] 0.19481813 1.0000000
## [3,] 0.13841831 0.8520549
## [4,] 0.05724816 0.3523997

A third example is the CosineHungarian algorithm, also a similarity function from matchms. This algorithm calculates the cosine similarity score as with CosineGreedyParam, but using the Hungarian algorithm to find the best matching peaks between the compared spectra. The algorithm can be configured with parameters tolerance, mzPower and intensityPower (see parameter description for more details).

res <- compareSpectriPy(all, caf, param = CosineHungarianParam())
res
##           [,1]      [,2]
## [1,] 1.0000000 0.1948181
## [2,] 0.1948181 1.0000000
## [3,] 0.0000000 0.0000000
## [4,] 0.0000000 0.0000000

A fourth example is the NeutralLossesCosine algorithm, also a similarity function from matchms. The neutral losses cosine score aims at quantifying the similarity between two mass spectra. The score is calculated by finding best possible matches between peaks of two spectra. Two peaks are considered a potential match if their m/z ratios lie within the given tolerance once a mass-shift is applied. The mass shift is the difference in precursor-m/z between the two spectra.

res <- compareSpectriPy(all, caf, param = NeutralLossesCosineParam())
res
##            [,1]       [,2]
## [1,] 1.00000000 0.04853991
## [2,] 0.04853991 1.00000000
## [3,] 0.00000000 0.00000000
## [4,] 0.00000000 0.00000000

Note that for the abovementioned similarity calculations all spectra precursor m/z values need to be available, otherwise an error will be thrown. Thus, we should always ensure to remove spectra without precursor m/z values prior to similarity scoring with this similarity method. Below we remove the precursor m/z from one of our input spectra and then show how the Spectra object could be subsetted to valid spectra for this method.

## Remove precursor m/z from the 3rd spectrum
all$precursorMz[3] <- NA

## Filter the input spectra removing those with missing precursor.
all <- all[!is.na(precursorMz(all))]

compareSpectriPy(all, caf, param = ModifiedCosineParam())
##            [,1]      [,2]
## [1,] 1.00000000 0.1948181
## [2,] 0.19481813 1.0000000
## [3,] 0.05724816 0.3523997

Advanced use and internals

For advanced users or developers, the rspec_to_pyspec() and pyspec_to_rspec() functions are available that enable conversion between MS data representations in R and Python (i.e. between the Spectra::Spectra object and the Python matchms.Spectrum object). To use these functions the reticulate R package needs to be installed along with a Python environment and the matchms Python package.

To illustrate their use we initialize below the Python environment that is bundled using the basilisk package within SpectriPy.

## Loading required package: reticulate
cl <- basiliskStart(SpectriPy:::matchms_env)

We next create a simple Spectra object representing fragment spectra for some small compounds.

library(Spectra)

# create example spectra
spd <- DataFrame(
  msLevel = c(2L, 2L, 2L),
  id = c("HMDB0000001", "HMDB0000001", "HMDB0001847"),
  name = c("1-Methylhistidine", "1-Methylhistidine", "Caffeine"))

## Assign m/z and intensity values.
spd$mz <- list(
  c(109.2, 124.2, 124.5, 170.16, 170.52),
  c(83.1, 96.12, 97.14, 109.14, 124.08, 125.1, 170.16),
  c(56.0494, 69.0447, 83.0603, 109.0395, 110.0712,
    111.0551, 123.0429, 138.0662, 195.0876))
spd$intensity <- list(
  c(3.407, 47.494, 3.094, 100.0, 13.240),
  c(6.685, 4.381, 3.022, 16.708, 100.0, 4.565, 40.643),
  c(0.459, 2.585, 2.446, 0.508, 8.968, 0.524, 0.974, 100.0, 40.994))

sps <- Spectra(spd)

We next convert the Spectra to a matchms.Spectrum object.

pysps <- rspec_to_pyspec(sps)
pysps
## [Spectrum(precursor m/z=nan, 5 fragments between 109.2 and 170.5), Spectrum(precursor m/z=nan, 7 fragments between 83.1 and 170.2), Spectrum(precursor m/z=nan, 9 fragments between 56.0 and 195.1)]

As a result we got now a Python list with 3 Spectrum objects containing the peak data of sps as well as a reduced set of the available spectra variables in sps. Which spectra variables get copied to Python can be defined with the mapping parameter of rspec_to_pyspec. The default is to convert all variables defined by spectraVariableMapping(), but additional variables along with their respective names in the Python Spectrum object can be defined too. Below we list the pre-selected spectra variables that are converted by default.

##                  precursorMz           precursorIntensity 
##               "precursor_mz"        "precursor_intensity" 
##              precursorCharge                        rtime 
##                     "charge"             "retention_time" 
##              collisionEnergy      isolationWindowTargetMz 
##           "collision_energy" "isolation_window_target_mz" 
##                      msLevel 
##                   "ms_level"

With our Spectra data converted to Python, we could directly call all routines from the matchms package using the reticulate R package. Below we normalize the intensities of the 3 spectra using the normalize_intensities() function from the matchms.filtering library. We thus need to first import the functionality from that package and can then call the function directly on the (Python) objects.

library(reticulate)
filters <- import("matchms.filtering")

res <- vector("list", length(pysps))
for (i in (seq_along(pysps) - 1))
    res[[i + 1]] <- filters$normalize_intensities(pysps[i])
res <- r_to_py(res)
res
## [Spectrum(precursor m/z=nan, 5 fragments between 109.2 and 170.5), Spectrum(precursor m/z=nan, 7 fragments between 83.1 and 170.2), Spectrum(precursor m/z=nan, 9 fragments between 56.0 and 195.1)]

We can now convert the list of Python matchms.Spectrum objects back to R with pyspec_to_rspec():

sps_r <- pyspec_to_rspec(res)
#' The normalized intensities
intensity(sps_r)
## NumericList of length 3
## [[1]] 0.03407 0.47494 0.03094 1 0.1324
## [[2]] 0.06685 0.04381 0.03022 0.16708 1 0.04565 0.40643
## [[3]] 0.00459 0.02585 0.02446 0.00508 0.08968 0.00524 0.00974 1 0.40994
#' The original intensities
intensity(sps)
## NumericList of length 3
## [[1]] 3.407 47.494 3.094 100 13.24
## [[2]] 6.685 4.381 3.022 16.708 100 4.565 40.643
## [[3]] 0.459 2.585 2.446 0.508 8.968 0.524 0.974 100 40.994

Intensity values have now been normalized to values between 0 and 1.

Note however that, it would be much better (and likely more efficient) to directly use a Python function on the list of Spectrum objects instead of the mixed R and Python code used in the example above. Below we define a simple python script that iterates over the spectra in python and performs the normalization.

py_script <- paste0("from matchms.filtering import normalize_intensities\n",
                    "for i in range(len(pysps)):\n",
                    "    pysps[i] = normalize_intensities(pysps[i])\n")
py$pysps <- pysps
py_run_string(py_script)

tmp <- pyspec_to_rspec(pysps)
intensity(tmp)
## NumericList of length 3
## [[1]] 0.03407 0.47494 0.03094 1 0.1324
## [[2]] 0.06685 0.04381 0.03022 0.16708 1 0.04565 0.40643
## [[3]] 0.00459 0.02585 0.02446 0.00508 0.08968 0.00524 0.00974 1 0.40994

At very last we need also to stop the Python environment enabled by basilisk.

Session information

## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: UTC
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
## [1] basilisk_1.18.0     reticulate_1.40.0   SpectriPy_0.2.1    
## [4] Spectra_1.16.0      BiocParallel_1.40.0 S4Vectors_0.44.0   
## [7] BiocGenerics_0.52.0 BiocStyle_2.34.0   
## 
## loaded via a namespace (and not attached):
##  [1] Matrix_1.7-2           jsonlite_1.8.9         compiler_4.4.2        
##  [4] BiocManager_1.30.25    filelock_1.0.3         Rcpp_1.0.14           
##  [7] parallel_4.4.2         cluster_2.1.8          jquerylib_0.1.4       
## [10] png_0.1-8              systemfonts_1.2.1      IRanges_2.40.1        
## [13] textshaping_1.0.0      yaml_2.3.10            fastmap_1.2.0         
## [16] lattice_0.22-6         R6_2.5.1               ProtGenerics_1.38.0   
## [19] knitr_1.49             htmlwidgets_1.6.4      MASS_7.3-64           
## [22] bookdown_0.42          desc_1.4.3             bslib_0.9.0           
## [25] rlang_1.1.5            dir.expiry_1.14.0      cachem_1.1.0          
## [28] xfun_0.50              fs_1.6.5               MsCoreUtils_1.18.0    
## [31] sass_0.4.9             cli_3.6.3              withr_3.0.2           
## [34] pkgdown_2.1.1.9000     grid_4.4.2             digest_0.6.37         
## [37] MetaboCoreUtils_1.14.0 lifecycle_1.0.4        clue_0.3-66           
## [40] evaluate_1.0.3         codetools_0.2-20       ragg_1.3.3            
## [43] rmarkdown_2.29         basilisk.utils_1.18.0  tools_4.4.2           
## [46] htmltools_0.5.8.1