A MS data backend for MS data stored in Python

The MsBackendPy allows to access MS data stored as matchms.Spectrum or spectrum_utils.spectrum.MsmsSpectrum objects from the matchms respectively spectrum_utils Python library directly from R. The MS data (peaks data or spectra variables) are translated on-the-fly when accessed. Thus, the MsBackendPy allows a seamless integration of Python MS data structures into Spectra::Spectra() based analysis workflows.

The MsBackendPy object is considered read-only, i.e. it does not provide functionality to replace the peaks data from R. However, it is possible to directly change the data in the referenced Python variable.

Usage

# S4 method for class 'MsBackendPy'
backendInitialize(
  object,
  pythonVariableName = character(),
  spectraVariableMapping = defaultSpectraVariableMapping(),
  pythonLibrary = c("matchms", "spectrum_utils"),
  ...,
  data
)

# S4 method for class 'MsBackendPy'
length(x)

# S4 method for class 'MsBackendPy'
spectraVariables(object)

# S4 method for class 'MsBackendPy'
spectraData(object, columns = spectraVariables(object), drop = FALSE)

# S4 method for class 'MsBackendPy'
peaksData(object, columns = c("mz", "intensity"), drop = FALSE)

# S4 method for class 'MsBackendPy'
x$name

# S4 method for class 'MsBackendPy'
spectraVariableMapping(object) <- value

# S4 method for class 'Spectra'
spectraVariableMapping(object) <- value

# S4 method for class 'MsBackendPy'
spectraVariableMapping(object, value)

reindex(object)

Arguments

object: A MsBackendPy object.
pythonVariableName: For backendInitialize(): character(1) with the name of the variable/Python attribute that contains the list of matchms.Spectrum objects with the MS data.
spectraVariableMapping: For backendInitialize(): named character with the mapping between spectra variable names and (matchms.Spectrum) metadata names. See defaultSpectraVariableMapping() for more information and details.
pythonLibrary: For backendInitialize(): character(1) specifying the Python library used to represent the MS data in Python. Can be either pythonLibrary = "matchms" (the default) or pythonLibrary = "spectrum_utils".
...: Additional parameters.
data: For backendInitialize(): DataFrame with the full MS data (peaks data and spectra data) such as extracted with the Spectra::spectraData() method on another MsBackend instance. Importantly, the DataFrame must have columns "mz" and "intensity" with the full MS data.
x: A MsBackendPy object
columns: For spectraData(): character with the names of columns (spectra variables) to retrieve. Defaults to spectraVariables(object). For peaksData(): character with the names of the peaks variables to retrieve.
drop: For spectraData() and peaksData(): logical(1) whether, when a single column is requested, the data should be returned as a vector instead of a data.frame or matrix.
name: For $: character(1) with the name of the variable to retrieve.
value: Replacement value(s).

Value

See description of individual functions for their return values.

Details

The MsBackendPy keeps only a reference to the MS data in Python (i.e. the name of the variable in Python) as well as an index pointing to the individual spectra in Python but no other data. Any data requested from the MsBackendPy is accessed and translated on-the-fly from the Python variable. The MsBackendPy is thus an interface to the MS data, but not a data container. All changes to the MS data in the Python variable (performed e.g. in Python) immediately affect any MsBackendPy instances pointing to this variable.

Special care must be taken if the MS data structure in Python is subset or its order is changed (e.g. by another process). In that case it might be needed to re-index the backend using the reindex() function: object <- reindex(object). This will update (replace) the index to the individual spectra in Python which is stored within the backend.

Note

As mentioned in the details section the MS data is completely stored in Python and the backend only references to this data through the name of the variable in Python. Thus, each time MS data is requested from the backend, it is retrieved in its current state. If for example data was transformed or metadata added or removed in the Python object, it immediately affects the Spectra/backend.

`MsBackendPy` methods

The MsBackendPy supports all methods defined by the Spectra::MsBackend() interface for access to MS data. Details on the invidual functions can also be found in the main documentation in the Spectra package (i.e. for Spectra::MsBackend()). Here we provide information for functions with specific properties of the backend.

backendInitialize(): initializes the backend with information from the referenced Python variable (attribute). The name of this attribute, ideally stored in the associated Python session, is expected to be provided with the pythonVariableName parameter. This variable is expected to contain or represent the full MS data. Alternatively, by providing the full MS data with the data parameter, this MS data will be first converted to the respective Python MS data representation and the MsBackendPy object will be initialized as described before. This allows to initialize the Python-based MS backend with data initially available in R. The optional spectraVariableMapping parameter allows to provide additional, or alternative, mapping of Spectra's spectra variables to metadata in the matchms.Spectrum objects. See defaultSpectraVariableMapping() (the default) for more information. Parameter pythonLibrary must be used to specify the Python library representing the MS data in Python. It can be either pythonLibrary = "matchms" (the default) or pythonLibrary = "spectrum_utils". The function returns an initialized instance of MsBackendPy.
peaksData(): extracts the peaks data matrices from the backend. Python code is applied to the data structure in Python to extract the m/z and intensity values as a list of (numpy) arrays. These are then translated into an R list of two-column numeric matrices. Because Python does not allow to name columns of an array, an additional loop in R is required to set the column names to "mz" and "intensity".
spectraData(): extracts the spectra data from the backend. Which spectra variables are translated and retrieved from the Python objects depends on the backend's spectraVariableMapping(). All metadata names defined are retrieved and added to the returned DataFrame (with eventually missing core spectra variables filled with NA).
spectraVariables(): retrieves available spectra variables, which include the names of all metadata attributes in the matchms.Spectrum objects and the core spectra variables Spectra::coreSpectraVariables().
spectraVariableMapping(): get the currently defined mapping for spectraVariables() of the backend.
spectraVariableMapping<-: replaces the spectraVariableMapping of the backend (see setSpectraVariableMapping() for details and description of the expected format).

Additional helper and utility functions

reindex(): update the internal index to match 1:length(object). This function is useful if the original data referenced by the backend was subset or re-ordered by a different process (or a function in Python).

Author

Johannes Rainer and the EuBIC hackathon team

Examples


## Loading an example MGF file provided by the SpectriPy package.
## As an alternative, the data could also be imported directly in Python
## using:
## import matchms
## from matchms.importing import load_from_mgf
## s_p = list(load_from_mgf(r.fl))
library(Spectra)
#> Loading required package: S4Vectors
#> Loading required package: stats4
#> Loading required package: BiocGenerics
#> Loading required package: generics
#> 
#> Attaching package: ‘generics’
#> The following objects are masked from ‘package:base’:
#> 
#>     as.difftime, as.factor, as.ordered, intersect, is.element, setdiff,
#>     setequal, union
#> 
#> Attaching package: ‘BiocGenerics’
#> The following objects are masked from ‘package:stats’:
#> 
#>     IQR, mad, sd, var, xtabs
#> The following objects are masked from ‘package:base’:
#> 
#>     Filter, Find, Map, Position, Reduce, anyDuplicated, aperm, append,
#>     as.data.frame, basename, cbind, colnames, dirname, do.call,
#>     duplicated, eval, evalq, get, grep, grepl, is.unsorted, lapply,
#>     mapply, match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,
#>     rank, rbind, rownames, sapply, saveRDS, table, tapply, unique,
#>     unsplit, which.max, which.min
#> 
#> Attaching package: ‘S4Vectors’
#> The following object is masked from ‘package:utils’:
#> 
#>     findMatches
#> The following objects are masked from ‘package:base’:
#> 
#>     I, expand.grid, unname
#> Loading required package: BiocParallel
library(MsBackendMgf)

fl <- system.file("extdata", "mgf", "test.mgf", package = "SpectriPy")
s <- Spectra(fl, source = MsBackendMgf())
#> Start data import from 1 files ... 
#> done
s
#> MSn data (Spectra) with 100 spectra in a MsBackendMgf backend:
#>       msLevel     rtime scanIndex
#>     <integer> <numeric> <integer>
#> 1           2        NA        NA
#> 2           2        NA        NA
#> 3           2        NA        NA
#> 4           2        NA        NA
#> 5           2        NA        NA
#> ...       ...       ...       ...
#> 96          2        NA        NA
#> 97          2        NA        NA
#> 98          2        NA        NA
#> 99          2        NA        NA
#> 100        NA        NA        NA
#>  ... 20 more variables/columns.

## Translating the MS data to Python and assigning it to a variable
## named "s_p" in the (*reticulate*'s) `py` Python environment. Assigning
## the variable to the Python environment has performance advantages, as
## any Python code applied to the MS data does not require any data
## conversions.
py_set_attr(py, "s_p", rspec_to_pyspec(s))

## Create a `MsBackendPy` representing an interface to the data in the
## "s_p" variable in Python:
be <- backendInitialize(MsBackendPy(), "s_p")
be
#> MsBackendPy with 100 spectra
#> Data stored in the "s_p" variable in Python

## An easier way to change the data representation of a `Spectra` object
## from R to Python is to use the `Spectra`'s `setBackend()` method
## selecting a `MsBackendPy` as the target backend representation:
s_2 <- setBackend(s, MsBackendPy(), pythonVariableName = "s_p2")
s_2
#> MSn data (Spectra) with 100 spectra in a MsBackendPy backend:
#> Data stored in the "s_p2" variable in Python
#> Processing:
#>  Switch backend from MsBackendMgf to MsBackendPy [Wed Jul  9 08:42:16 2025] 

## This moved the data from R to Python, storing it in a Python variable
## with the name `s_p2`. The resulting `s_2` is thus a `Spectra` object
## with all MS data however stored in Python.

## Alternatively, by passing the full MS data with parameter `data`, the
## data is first converted to Python and the backend is initialized with
## that data. The `setBackend()` call from above internally uses this
## code to convert the data.
be <- backendInitialize(MsBackendPy(), "s_p3",
    data = spectraData(s, c(spectraVariables(s), "mz", "intensity")))

## Create a Spectra object which this backend:
s_2 <- Spectra(be)
s_2
#> MSn data (Spectra) with 100 spectra in a MsBackendPy backend:
#> Data stored in the "s_p3" variable in Python

## Available spectra variables: these include, next to the *core* spectra
## variables, also the names of all metadata stored in the `matchms.Spectrum`
## objects.
spectraVariables(s_2)
#>  [1] "msLevel"                 "rtime"                  
#>  [3] "acquisitionNum"          "scanIndex"              
#>  [5] "dataStorage"             "dataOrigin"             
#>  [7] "centroided"              "smoothed"               
#>  [9] "polarity"                "precScanNum"            
#> [11] "precursorMz"             "precursorIntensity"     
#> [13] "precursorCharge"         "collisionEnergy"        
#> [15] "isolationWindowLowerMz"  "isolationWindowTargetMz"
#> [17] "isolationWindowUpperMz" 

## Get the full peaks data:
peaksData(s_2)
#> List of length 100

## Get the peaks from the first spectrum
peaksData(s_2)[[1L]]
#>            mz  intensity
#> [1,] 213.0546   754969.6
#> [2,] 241.0495  1058878.8
#> [3,] 259.0601 20211204.0

## Get the full spectra data:
spectraData(s_2)
#> DataFrame with 100 rows and 17 columns
#>       msLevel     rtime acquisitionNum scanIndex dataStorage  dataOrigin
#>     <integer> <numeric>      <integer> <integer> <character> <character>
#> 1           2        NA             NA        NA        s_p3          NA
#> 2           2        NA             NA        NA        s_p3          NA
#> 3           2        NA             NA        NA        s_p3          NA
#> 4           2        NA             NA        NA        s_p3          NA
#> 5           2        NA             NA        NA        s_p3          NA
#> ...       ...       ...            ...       ...         ...         ...
#> 96          2        NA             NA        NA        s_p3          NA
#> 97          2        NA             NA        NA        s_p3          NA
#> 98          2        NA             NA        NA        s_p3          NA
#> 99          2        NA             NA        NA        s_p3          NA
#> 100        NA        NA             NA        NA        s_p3          NA
#>     centroided  smoothed  polarity precScanNum precursorMz precursorIntensity
#>      <logical> <logical> <integer>   <integer>   <numeric>          <numeric>
#> 1           NA        NA        NA          NA     259.060                 NA
#> 2           NA        NA        NA          NA     837.532                 NA
#> 3           NA        NA        NA          NA     250.180                 NA
#> 4           NA        NA        NA          NA     250.180                 NA
#> 5           NA        NA        NA          NA     250.180                 NA
#> ...        ...       ...       ...         ...         ...                ...
#> 96          NA        NA        NA          NA     301.144                 NA
#> 97          NA        NA        NA          NA     407.135                 NA
#> 98          NA        NA        NA          NA     304.192                 NA
#> 99          NA        NA        NA          NA     361.202                 NA
#> 100         NA        NA        NA          NA          NA                 NA
#>     precursorCharge collisionEnergy isolationWindowLowerMz
#>           <integer>       <numeric>              <numeric>
#> 1                 1              NA                     NA
#> 2                 1              NA                     NA
#> 3                 1              NA                     NA
#> 4                 1              NA                     NA
#> 5                 1              NA                     NA
#> ...             ...             ...                    ...
#> 96                1              NA                     NA
#> 97                1              NA                     NA
#> 98                1              NA                     NA
#> 99                1              NA                     NA
#> 100               1              NA                     NA
#>     isolationWindowTargetMz isolationWindowUpperMz
#>                   <numeric>              <numeric>
#> 1                        NA                     NA
#> 2                        NA                     NA
#> 3                        NA                     NA
#> 4                        NA                     NA
#> 5                        NA                     NA
#> ...                     ...                    ...
#> 96                       NA                     NA
#> 97                       NA                     NA
#> 98                       NA                     NA
#> 99                       NA                     NA
#> 100                      NA                     NA

## Get the m/z values
mz(s_2)
#> NumericList of length 100
#> [[1]] 213.0546 241.0495 259.0601
#> [[2]] 116.0712 116.1074 127.0759 142.1231 ... 398.2568 679.4443 680.4436
#> [[3]] 232.1689 250.18 251.1831 252.1852
#> [[4]] 145.0638 187.1102 232.1687 233.1721 250.1795 251.1824 252.1862
#> [[5]] 232.1689 250.18 251.1831 252.1852
#> [[6]] 145.0638 187.1102 232.1687 233.1721 250.1795 251.1824 252.1862
#> [[7]] 232.1689 250.18 251.1831 252.1852
#> [[8]] 145.0638 187.1102 232.1687 233.1721 250.1795 251.1824 252.1862
#> [[9]] 135.0432 138.0632 163.0375 195.088
#> [[10]] 55.0537 56.0487 58.0281 58.0644 ... 71.0493 72.0792 82.0652 100.0763
#> ...
#> <90 more elements>

## Plot the first spectrum
plotSpectra(s_2[1L])



########
## Using the spectrum_utils Python library

## Below we convert the data to a list of `MsmsSpectrum` object from the
## spectrum_utils library.
py_set_attr(py, "su_p", rspec_to_pyspec(s,
    spectraVariableMapping("spectrum_utils"), "spectrum_utils"))

## Create a MsBackendPy representing this data. Importantly, we need to
## specify the Python library using the `pythonLibrary` parameter and
## ideally also set the `spectraVariableMapping` to the one specific for
## that library.
be <- backendInitialize(MsBackendPy(), "su_p",
    spectraVariableMapping = spectraVariableMapping("spectrum_utils"),
    pythonLibrary = "spectrum_utils")
be
#> MsBackendPy with 100 spectra
#> Data stored in the "su_p" variable in Python

## Get the peaks data for the first 3 spectra
peaksData(be[1:3])
#> [[1]]
#>            mz  intensity
#> [1,] 213.0546   754969.6
#> [2,] 241.0495  1058878.8
#> [3,] 259.0601 20211204.0
#> 
#> [[2]]
#>             mz intensity
#>  [1,] 116.0712     15660
#>  [2,] 116.1074     28164
#>  [3,] 127.0759      8616
#>  [4,] 142.1231      4516
#>  [5,] 158.1184    855092
#>  [6,] 159.1218     68248
#>  [7,] 380.2469      6148
#>  [8,] 398.2568      8996
#>  [9,] 679.4443     10596
#> [10,] 680.4436      5176
#> 
#> [[3]]
#>            mz intensity
#> [1,] 232.1689     20176
#> [2,] 250.1800   2073248
#> [3,] 251.1831    282228
#> [4,] 252.1852     21640
#> 

## Get the full spectraData
spectraData(be)
#> Warning: NAs introduced by coercion
#> DataFrame with 100 rows and 19 columns
#>       msLevel     rtime acquisitionNum scanIndex                          mz
#>     <integer> <numeric>      <integer> <integer>               <NumericList>
#> 1          NA        NA             NA        NA     213.055,241.049,259.060
#> 2          NA        NA             NA        NA 116.071,116.107,127.076,...
#> 3          NA        NA             NA        NA 232.169,250.180,251.183,...
#> 4          NA        NA             NA        NA 145.064,187.110,232.169,...
#> 5          NA        NA             NA        NA 232.169,250.180,251.183,...
#> ...       ...       ...            ...       ...                         ...
#> 96         NA        NA             NA        NA 108.023,109.028,122.037,...
#> 97         NA        NA             NA        NA 277.099,289.147,291.149,...
#> 98         NA        NA             NA        NA 168.139,168.177,168.230,...
#> 99         NA        NA             NA        NA 106.765,116.930,125.058,...
#> 100        NA        NA             NA        NA             100,101,122,...
#>                       intensity dataStorage  dataOrigin centroided  smoothed
#>                   <NumericList> <character> <character>  <logical> <logical>
#> 1      754970, 1058879,20211204        su_p          NA         NA        NA
#> 2         15660,28164, 8616,...        su_p          NA         NA        NA
#> 3     20176,2073248, 282228,...        su_p          NA         NA        NA
#> 4           864, 2548,10716,...        su_p          NA         NA        NA
#> 5     20176,2073248, 282228,...        su_p          NA         NA        NA
#> ...                         ...         ...         ...        ...       ...
#> 96            224, 295,1801,...        su_p          NA         NA        NA
#> 97                 21,26,27,...        su_p          NA         NA        NA
#> 98        24842,  882,  182,...        su_p          NA         NA        NA
#> 99               39, 51,108,...        su_p          NA         NA        NA
#> 100              47,153, 39,...        su_p          NA         NA        NA
#>      polarity precScanNum precursorMz precursorIntensity precursorCharge
#>     <integer>   <integer>   <numeric>          <numeric>       <integer>
#> 1          NA          NA     259.060                 NA               1
#> 2          NA          NA     837.532                 NA               1
#> 3          NA          NA     250.180                 NA               1
#> 4          NA          NA     250.180                 NA               1
#> 5          NA          NA     250.180                 NA               1
#> ...       ...         ...         ...                ...             ...
#> 96         NA          NA     301.144                 NA               1
#> 97         NA          NA     407.135                 NA               1
#> 98         NA          NA     304.192                 NA               1
#> 99         NA          NA     361.202                 NA               1
#> 100        NA          NA          NA                 NA               1
#>     collisionEnergy isolationWindowLowerMz isolationWindowTargetMz
#>           <numeric>              <numeric>               <numeric>
#> 1                NA                     NA                      NA
#> 2                NA                     NA                      NA
#> 3                NA                     NA                      NA
#> 4                NA                     NA                      NA
#> 5                NA                     NA                      NA
#> ...             ...                    ...                     ...
#> 96               NA                     NA                      NA
#> 97               NA                     NA                      NA
#> 98               NA                     NA                      NA
#> 99               NA                     NA                      NA
#> 100              NA                     NA                      NA
#>     isolationWindowUpperMz
#>                  <numeric>
#> 1                       NA
#> 2                       NA
#> 3                       NA
#> 4                       NA
#> 5                       NA
#> ...                    ...
#> 96                      NA
#> 97                      NA
#> 98                      NA
#> 99                      NA
#> 100                     NA

## Extract the precursor m/z
be$precursorMz
#>   [1] 259.0595 837.5318 250.1802 250.1802 250.1802 250.1802 250.1802 250.1802
#>   [9] 195.0877 100.0757 154.0777 190.1352 175.1190 287.0574  83.0604  83.0604
#>  [17] 195.0877       NA       NA 780.5500 770.5700 179.1179 195.0877 195.0877
#>  [25] 195.0877 425.1871 285.1809 224.0728 202.0854 161.0220 296.1412 239.1503
#>  [33] 990.9768 239.0673 163.0314 303.1339       NA       NA       NA       NA
#>  [41]       NA       NA       NA       NA       NA       NA       NA       NA
#>  [49]       NA       NA 296.1467 205.1911 317.2122 216.0000 207.0000 125.0000
#>  [57] 164.0000 256.0000 138.0000 256.0000 436.9460 866.6100 254.2478 335.0512
#>  [65] 257.1285 239.0815 239.0815 239.0815 301.1445 301.1445 302.3054 175.1190
#>  [73] 175.1190       NA 497.3272 175.1189 175.1189 230.0958 350.1598       NA
#>  [81]       NA 611.1612 273.1386 385.2122 449.1078 447.0933 591.1719 339.2067
#>  [89] 878.5001       NA 102.0600 465.4100 459.2748       NA 301.1445 301.1445
#>  [97] 407.1347 304.1918 361.2020       NA