Package: MsIO
Authors: Johannes Rainer [aut, cre] (https://orcid.org/0000-0002-6977-7147), Philippine Louail [aut] (https://orcid.org/0009-0007-5429-6846), Laurent Gatto [ctb] (https://orcid.org/0000-0002-1520-2268)
Compiled: Thu Sep 5 12:47:18 2024

Introduction

Data objects in R can be serialized to disk in R’s binary RData format using the base R save() function and re-imported using the load() function (or, for single objects, in Rds format using saveRDS() and readRDS()). These R-specific binary formats can, however, not be read by other programming languages, which prevents the exchange of R data objects between software or programming languages. The MsIO package provides functionality to export and import mass spectrometry (MS) data objects in various storage formats, aiming to facilitate data exchange between software. Among other formats, this includes storage of data objects using Bioconductor’s alabaster.base package.

For export or import of MS data objects, the saveMsObject() and readMsObject() functions can be used. For saveMsObject(), the first parameter is the MS data object that should be stored; for readMsObject(), it defines the type of MS object that should be restored (returned). The second parameter, param, defines and configures the storage format of the MS data. The currently supported formats and the respective parameter objects are:

  • PlainTextParam: storage of data in plain text file format.
  • AlabasterParam: storage of MS data using Bioconductor’s alabaster.base framework, which is based on files in HDF5 and JSON format.

These storage formats are described in more detail in the following sections.

An example use of these functions and parameters: saveMsObject(x, param = PlainTextParam(storage_path)) stores an MS data object assigned to a variable x to a directory storage_path using the plain text file format. To restore the data (assuming x was an instance of the MsExperiment class): readMsObject(MsExperiment(), param = PlainTextParam(storage_path)).

Installation

The package can be installed with the BiocManager package. To install BiocManager use install.packages("BiocManager") and, after that, BiocManager::install("RforMassSpectrometry/MsIO") to install this package.

For import or export of MS data objects, installation of additional Bioconductor packages might be needed:

  • Spectra (with BiocManager::install("Spectra")) for import or export of Spectra or MsBackendMzR objects.
  • MsExperiment (with BiocManager::install("MsExperiment")) for import or export of MsExperiment objects.
  • xcms (with BiocManager::install("xcms")) for import or export of XcmsExperiment objects (result objects of xcms-based preprocessing).
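
The installation steps above can be combined into a short script. This is a minimal sketch: the requireNamespace() guard is a convenience added here and not part of the original instructions, and the optional packages only need to be installed for the object types that should be imported or exported.

```r
## Install BiocManager only if it is not yet available.
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

## Install MsIO from its GitHub repository.
BiocManager::install("RforMassSpectrometry/MsIO")

## Optional: packages defining the MS data objects to import or export.
BiocManager::install(c("Spectra", "MsExperiment", "xcms"))
```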

Plain text file format

Storage of MS data objects in plain text format aims to support an easy exchange of data, and in particular analysis results, with external software, such as MS-DIAL or mzmine3. In most cases, the data is stored as tab-delimited text files, which simplifies the use of the data and results across multiple programming languages, or their import into spreadsheet applications. MS data objects stored in plain text format can also be fully re-imported into R, thus providing an alternative, and more flexible, object serialization approach to R’s internal Rds/RData format.

Below we create an MS data object (MsExperiment) representing the data from two raw MS data files and assign sample annotation information to these data files.

library(MsIO)
library(MsExperiment)

fls <- dir(system.file("TripleTOF-SWATH", package = "msdata"),
           full.names = TRUE)
mse <- readMsExperiment(
    fls,
    sampleData = data.frame(name = c("Pestmix1 DDA", "Pestmix SWATH"),
                            mode = c("DDA", "SWATH")))
mse
## Object of class MsExperiment 
##  Spectra: MS1 (5626) MS2 (10975) 
##  Experiment data: 2 sample(s)
##  Sample data links:
##   - spectra: 2 sample(s) to 16601 element(s).

We can export this data object to plain text files using MsIO’s saveMsObject() function in combination with the PlainTextParam parameter object. The directory in which the data should be stored can be defined with the path parameter of PlainTextParam. With the call below we store the MS data object to a temporary directory.

d <- file.path(tempdir(), "ms_experiment_export")
saveMsObject(mse, PlainTextParam(path = d))

The data was exported to a set of text files that we list below:

dir(d)
## [1] "ms_backend_data.txt"                        
## [2] "ms_experiment_link_mcols.txt"               
## [3] "ms_experiment_sample_data_links_spectra.txt"
## [4] "ms_experiment_sample_data.txt"              
## [5] "spectra_processing_queue.json"              
## [6] "spectra_slots.txt"
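
Because the export consists of ordinary text files, their content can be inspected with base R or any text editor. The sketch below prints the first lines of the sample annotation file as raw text. It assumes the export directory d created above and makes no assumption about the exact column layout, which is documented in ?PlainTextParam.

```r
## Path to the exported sample annotation file (file name taken from the
## directory listing above).
f <- file.path(d, "ms_experiment_sample_data.txt")

## Show the first lines as raw text, without assuming a particular layout.
readLines(f, n = 3)
```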

Each text file contains information about one particular slot of the MS data object. See the ?PlainTextParam help for a description of the files and their respective formats. We can restore the MS data object using the readMsObject() function, specifying the type of object we want to restore (and which was stored to the respective directory) with the first parameter of the function and the data storage format with the second. In our example we use MsExperiment() as first parameter and PlainTextParam as second. The MS data of our MsExperiment data object was represented by a Spectra object, so to import the data we additionally need to load the Spectra package.
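
A minimal sketch of this import, assuming the export directory d from above. The variable name mse_in matches the one used in the remainder of this document.

```r
library(Spectra)

## Restore the MsExperiment from the plain text export in directory d.
mse_in <- readMsObject(MsExperiment(), PlainTextParam(d))
mse_in
```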

## Object of class MsExperiment 
##  Spectra: MS1 (5626) MS2 (10975) 
##  Experiment data: 2 sample(s)
##  Sample data links:
##   - spectra: 2 sample(s) to 16601 element(s).

Note that at present MsIO does not support storage of the full MS data (i.e. the individual mass peaks’ m/z and intensity values) to plain text files. It does, however, support storage of on-disk data objects/representations (such as the MsBackendMzR object) in plain text format. The Spectra object that is used to represent the MS data of our example MsExperiment object uses a MsBackendMzR backend and thus we were able to export and import its data. Due to its on-disk data mode, this type of backend retrieves the MS data on-the-fly from the original data files and hence we only need to store the MS metadata and the location of the original data files. Thus, we also have full access to the MS data with the restored MS data object:

spectra(mse_in) |>
    head() |>
    intensity()
## NumericList of length 6
## [[1]] 0.0307632219046354 0.163443520665169 ... 0.507792055606842
## [[2]] 0.124385602772236 0.306980639696121 ... 0.752154946327209
## [[3]] 0.140656530857086 0.194816112518311 ... 0.455461025238037
## [[4]] 0.0389336571097374 0.357547700405121 ... 0.478326231241226
## [[5]] 0.124386593699455 0.054143700748682 ... 0.251276850700378
## [[6]] 0.0940475389361382 0.247442871332169 ... 0.10762557387352

However, if the location of the original MS data files has changed (e.g. because the files or the stored object were moved to a different location or file system), the new location of these files needs to be specified with the parameter spectraPath (e.g. readMsObject(MsExperiment(), PlainTextParam(d), spectraPath = <path to new location>)).
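
The sketch below simulates such a relocation by copying the example mzML files to a new folder and passing that folder with spectraPath. The directory name moved_mzML and the variable names new_path and mse_moved are hypothetical and chosen only for illustration. The sketch also assumes that spectraPath expects the directory containing the data files, and that fls and d are defined as above.

```r
## Hypothetical new location for the raw data files (simulating a move).
new_path <- file.path(tempdir(), "moved_mzML")
dir.create(new_path, showWarnings = FALSE)
file.copy(fls, new_path)

## Restore the object, pointing the backend to the relocated data files.
mse_moved <- readMsObject(MsExperiment(), PlainTextParam(d),
                          spectraPath = new_path)
```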

Generally, saveMsObject() stores the MS data objects in a modular way, i.e. the content of each component or slot is exported to its own data file. The storage directory of our example MsExperiment thus contains multiple data files:

dir(d)
## [1] "ms_backend_data.txt"                        
## [2] "ms_experiment_link_mcols.txt"               
## [3] "ms_experiment_sample_data_links_spectra.txt"
## [4] "ms_experiment_sample_data.txt"              
## [5] "spectra_processing_queue.json"              
## [6] "spectra_slots.txt"

This modularity also allows loading only parts of the original data. For example, we can load only the Spectra object representing the MS experiment’s MS data.
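
A sketch of this partial import, assuming the Spectra package is loaded and d is the export directory from above:

```r
## Restore only the Spectra object from the plain text export.
s <- readMsObject(Spectra(), PlainTextParam(d))
s
```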

## MSn data (Spectra) with 16601 spectra in a MsBackendMzR backend:
##         msLevel     rtime scanIndex
##       <integer> <numeric> <integer>
## 1             1     0.231         1
## 2             1     0.351         2
## 3             1     0.471         3
## 4             1     0.591         4
## 5             1     0.711         5
## ...         ...       ...       ...
## 16597         2   899.527      8995
## 16598         2   899.624      8996
## 16599         2   899.721      8997
## 16600         2   899.818      8998
## 16601         2   899.915      8999
##  ... 27 more variables/columns.
## 
## file(s):
## PestMix1_DDA.mzML
## PestMix1_SWATH.mzML

Or even only the MsBackendMzR that is used by the Spectra object to represent the MS data.
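
A sketch of the corresponding call, again assuming the export directory d from above. The MsBackendMzR() constructor is provided by the Spectra package, and the variable name be is chosen here for illustration.

```r
## Restore only the backend object from the plain text export.
be <- readMsObject(MsBackendMzR(), PlainTextParam(d))
be
```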

## MsBackendMzR with 16601 spectra
##         msLevel     rtime scanIndex
##       <integer> <numeric> <integer>
## 1             1     0.231         1
## 2             1     0.351         2
## 3             1     0.471         3
## 4             1     0.591         4
## 5             1     0.711         5
## ...         ...       ...       ...
## 16597         2   899.527      8995
## 16598         2   899.624      8996
## 16599         2   899.721      8997
## 16600         2   899.818      8998
## 16601         2   899.915      8999
##  ... 27 more variables/columns.
## 
## file(s):
## PestMix1_DDA.mzML
## PestMix1_SWATH.mzML

alabaster-based formats

The alabaster framework and the related Bioconductor package alabaster.base implement methods to save a variety of R/Bioconductor objects to on-disk representations based on standard file formats like HDF5 and JSON. This ensures that Bioconductor objects can be easily read from other languages like Python and JavaScript. With AlabasterParam, MsIO supports export of MS data objects into these storage formats. Below we export our example MsExperiment to a storage directory using the alabaster format.

d <- file.path(tempdir(), "ms_experiment_export_alabaster")
saveMsObject(mse, AlabasterParam(path = d))

The contents of the storage folder are listed below:

dir(d, recursive = TRUE)
##  [1] "experiment_files/OBJECT"                       
##  [2] "experiment_files/x/list_contents.json.gz"      
##  [3] "experiment_files/x/OBJECT"                     
##  [4] "metadata/list_contents.json.gz"                
##  [5] "metadata/OBJECT"                               
##  [6] "OBJECT"                                        
##  [7] "other_data/list_contents.json.gz"              
##  [8] "other_data/OBJECT"                             
##  [9] "sample_data_links_mcols/basic_columns.h5"      
## [10] "sample_data_links_mcols/OBJECT"                
## [11] "sample_data_links/list_contents.json.gz"       
## [12] "sample_data_links/OBJECT"                      
## [13] "sample_data_links/other_contents/0/array.h5"   
## [14] "sample_data_links/other_contents/0/OBJECT"     
## [15] "sample_data/basic_columns.h5"                  
## [16] "sample_data/OBJECT"                            
## [17] "spectra/backend/OBJECT"                        
## [18] "spectra/backend/peaks_variables/contents.h5"   
## [19] "spectra/backend/peaks_variables/OBJECT"        
## [20] "spectra/backend/spectra_data/basic_columns.h5" 
## [21] "spectra/backend/spectra_data/OBJECT"           
## [22] "spectra/metadata/list_contents.json.gz"        
## [23] "spectra/metadata/OBJECT"                       
## [24] "spectra/OBJECT"                                
## [25] "spectra/processing_chunk_size/contents.h5"     
## [26] "spectra/processing_chunk_size/OBJECT"          
## [27] "spectra/processing_queue_variables/contents.h5"
## [28] "spectra/processing_queue_variables/OBJECT"     
## [29] "spectra/processing/contents.h5"                
## [30] "spectra/processing/OBJECT"                     
## [31] "spectra/spectra_processing_queue.json"

In contrast to the plain text format described in the previous section, which stores all data files in a single directory, the alabaster export is structured hierarchically into sub-folders by the MS data object’s slots/components.
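
Each folder in this hierarchy contains a file named OBJECT, as seen in the listing above. To our understanding of the alabaster framework, this file holds a small JSON description of the type of object stored in the folder; this is an assumption and not something the listing itself shows. The sketch below simply prints the raw content of the top-level OBJECT file and assumes d is the alabaster export directory from above.

```r
## Print the raw content of the top-level OBJECT file of the export.
obj_file <- file.path(d, "OBJECT")
readLines(obj_file, warn = FALSE)
```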

To restore the object we use the readMsObject() function with an AlabasterParam parameter object to define the data storage format that was used.

mse_in <- readMsObject(MsExperiment(), AlabasterParam(d))
mse_in
## Object of class MsExperiment 
##  Spectra: MS1 (5626) MS2 (10975) 
##  Experiment data: 2 sample(s)
##  Sample data links:
##   - spectra: 2 sample(s) to 16601 element(s).

Also for this format, we can load parts of the data separately. We can load the MS data as a Spectra object from the respective subfolder of the data storage directory:

s <- readMsObject(Spectra(), AlabasterParam(file.path(d, "spectra")))
s
## MSn data (Spectra) with 16601 spectra in a MsBackendMzR backend:
##         msLevel     rtime scanIndex
##       <integer> <numeric> <integer>
## 1             1     0.231         1
## 2             1     0.351         2
## 3             1     0.471         3
## 4             1     0.591         4
## 5             1     0.711         5
## ...         ...       ...       ...
## 16597         2   899.527      8995
## 16598         2   899.624      8996
## 16599         2   899.721      8997
## 16600         2   899.818      8998
## 16601         2   899.915      8999
##  ... 33 more variables/columns.
## 
## file(s):
## PestMix1_DDA.mzML
## PestMix1_SWATH.mzML

The import/export functionality is fully compatible with Bioconductor’s alabaster framework and hence also allows reading the whole data, or parts of it, directly using alabaster’s readObject() method. The full MsExperiment is restored by importing the full directory (i.e. by providing the path to the directory containing the full export with the function’s path parameter).

mse_in <- readObject(path = d)
mse_in
## Object of class MsExperiment 
##  Spectra: MS1 (5626) MS2 (10975) 
##  Experiment data: 2 sample(s)
##  Sample data links:
##   - spectra: 2 sample(s) to 16601 element(s).

Alternatively, by providing a path to one of the MS object’s components, it is possible to read only specific parts of the data. Below we read the sample annotation information as a DataFrame from the sample_data subfolder:

readObject(path = file.path(d, "sample_data"))
## DataFrame with 2 rows and 3 columns
##            name        mode spectraOrigin
##     <character> <character>   <character>
## 1 Pestmix1 D...         DDA /__w/_temp...
## 2 Pestmix SW...       SWATH /__w/_temp...

Session information

## R version 4.4.1 (2024-06-14)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 22.04.4 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so;  LAPACK version 3.10.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: UTC
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
## [1] Spectra_1.15.7      BiocParallel_1.39.0 S4Vectors_0.43.2   
## [4] BiocGenerics_0.51.1 MsExperiment_1.7.0  ProtGenerics_1.37.1
## [7] MsIO_0.0.4          BiocStyle_2.33.1   
## 
## loaded via a namespace (and not attached):
##  [1] tidyselect_1.2.1            dplyr_1.1.4                
##  [3] fastmap_1.2.0               lazyeval_0.2.2             
##  [5] digest_0.6.37               lifecycle_1.0.4            
##  [7] cluster_2.1.6               alabaster.matrix_1.5.5     
##  [9] alabaster.base_1.5.6        magrittr_2.0.3             
## [11] compiler_4.4.1              rlang_1.1.4                
## [13] sass_0.4.9                  tools_4.4.1                
## [15] igraph_2.0.3                utf8_1.2.4                 
## [17] yaml_2.3.10                 knitr_1.48                 
## [19] S4Arrays_1.5.7              htmlwidgets_1.6.4          
## [21] DelayedArray_0.31.11        plyr_1.8.9                 
## [23] abind_1.4-5                 HDF5Array_1.33.6           
## [25] purrr_1.0.2                 desc_1.4.3                 
## [27] grid_4.4.1                  fansi_1.0.6                
## [29] Rhdf5lib_1.27.0             MASS_7.3-61                
## [31] MultiAssayExperiment_1.31.5 SummarizedExperiment_1.35.1
## [33] cli_3.6.3                   mzR_2.39.0                 
## [35] rmarkdown_2.28              crayon_1.5.3               
## [37] ragg_1.3.2                  generics_0.1.3             
## [39] httr_1.4.7                  reshape2_1.4.4             
## [41] ncdf4_1.23                  DBI_1.2.3                  
## [43] cachem_1.1.0                rhdf5_2.49.0               
## [45] stringr_1.5.1               zlibbioc_1.51.1            
## [47] parallel_4.4.1              AnnotationFilter_1.29.0    
## [49] BiocManager_1.30.25         XVector_0.45.0             
## [51] alabaster.schemas_1.5.0     matrixStats_1.4.0          
## [53] vctrs_0.6.5                 Matrix_1.7-0               
## [55] jsonlite_1.8.8              bookdown_0.40              
## [57] IRanges_2.39.2              clue_0.3-65                
## [59] systemfonts_1.1.0           jquerylib_0.1.4            
## [61] tidyr_1.3.1                 glue_1.7.0                 
## [63] pkgdown_2.1.0.9000          codetools_0.2-20           
## [65] QFeatures_1.15.2            stringi_1.8.4              
## [67] GenomeInfoDb_1.41.1         GenomicRanges_1.57.1       
## [69] UCSC.utils_1.1.0            tibble_3.2.1               
## [71] pillar_1.9.0                htmltools_0.5.8.1          
## [73] rhdf5filters_1.17.0         GenomeInfoDbData_1.2.12    
## [75] R6_2.5.1                    textshaping_0.4.0          
## [77] evaluate_0.24.0             Biobase_2.65.1             
## [79] lattice_0.22-6              bslib_0.8.0                
## [81] MetaboCoreUtils_1.13.0      Rcpp_1.0.13                
## [83] SparseArray_1.5.31          xfun_0.47                  
## [85] MsCoreUtils_1.17.1          fs_1.6.4                   
## [87] MatrixGenerics_1.17.0       pkgconfig_2.0.3