Package: MsBackendSql
Authors: Johannes Rainer [aut, cre] (https://orcid.org/0000-0002-6977-7147), Chong Tang
[ctb], Laurent Gatto [ctb] (https://orcid.org/0000-0002-1520-2268)
Compiled: Fri Sep 27 05:10:02 2024
Introduction
The Spectra
Bioconductor package provides a flexible and expandable infrastructure
for Mass Spectrometry (MS) data. The package supports interchangeable
use of different backends that provide additional file support
or different ways to store and represent MS data. The MsBackendSql
package provides backends to store data from whole MS experiments in SQL
databases. The data in such databases can be easily (and efficiently)
accessed using Spectra
objects that use the
MsBackendSql
class as an interface to the data in the
database. Such Spectra
objects have a minimal memory
footprint and hence allow analysis of very large data sets even on
computers with limited hardware capabilities. For certain operations,
the performance of this data representation is superior to that of other
low-memory (on-disk) data representations such as
Spectra's MsBackendMzR backend. Finally, the MsBackendSql also supports
remote data access, e.g. to a central database server hosting several
large MS data sets.
Installation
The package can be installed with the BiocManager
package. To install BiocManager
use
install.packages("BiocManager")
and, after that,
BiocManager::install("MsBackendSql")
to install this
package.
Creating and using MsBackendSql SQL databases
MsBackendSql SQL databases can be created either by importing (raw) MS
data from MS data files using the createMsBackendSqlDatabase()
function, or by using the backendInitialize() function and providing,
in addition to the database connection, the full MS data to import as a
DataFrame. In the first example we use the
createMsBackendSqlDatabase() function, which takes a connection to an
(empty) database and the names of the files from which the data should
be imported as input parameters, creates all necessary database tables
and stores the full data into the database. Below we create an empty
SQLite database (in a temporary file) and fill it with MS data from two
mzML files (from the msdata package).
library(RSQLite)
dbfile <- tempfile()
con <- dbConnect(SQLite(), dbfile)
library(MsBackendSql)
fls <- dir(system.file("sciex", package = "msdata"), full.names = TRUE)
createMsBackendSqlDatabase(con, fls)
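The second route mentioned above, backendInitialize() with a DataFrame, can be sketched as follows. This is a minimal sketch under stated assumptions, not a definitive recipe: the toy spectra and the names toy_df, toy, toy_data, con_toy and toy_sql are made up for illustration, and the sketch assumes that the connection and the data are passed with parameters named dbcon and data and that the DataFrame should have the format returned by spectraData(), including the mz and intensity columns.

```r
library(Spectra)
library(MsBackendSql)
library(RSQLite)

## Hypothetical toy data: three spectra with a few (sorted) m/z values each
toy_df <- DataFrame(msLevel = c(1L, 2L, 2L), rtime = c(1.2, 3.4, 5.6))
toy_df$mz <- list(c(100.1, 200.2, 300.3), c(50.5, 150.5),
                  c(75.2, 120.4, 210.9, 333.3))
toy_df$intensity <- list(c(10, 200, 30), c(45, 80), c(5, 12, 60, 8))
toy <- Spectra(toy_df)

## Extract the full data, including the peak data, as a DataFrame
toy_data <- spectraData(
    toy, columns = union(spectraVariables(toy), c("mz", "intensity")))

## Assumption: backendInitialize() fills the empty database behind `dbcon`
## with `data` and returns a backend for that database
con_toy <- dbConnect(SQLite(), tempfile())
be <- backendInitialize(MsBackendSql(), dbcon = con_toy, data = toy_data)
toy_sql <- Spectra(be)
length(toy_sql)
```

Taking the DataFrame from spectraData() of an existing Spectra object, rather than building it by hand, keeps all standard spectra variables present with their expected data types.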
By default the m/z and intensity values are stored as BLOB data types in the database. This speeds up the extraction of peaks data from the database but does not allow, for example, filtering peaks by m/z values directly in the database. As an alternative it is also possible to store the individual m/z and intensity values in separate rows of the database table. This long table format results however in considerably larger databases (with potentially poorer performance). Note also that the code and backend are optimized for MySQL/MariaDB databases by taking advantage of table partitioning and specialized table storage options. Any other SQL database server is however also supported (including portable, self-contained SQLite databases).
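The long table format can be requested when the database is created. The following is a minimal sketch, assuming the blob parameter of createMsBackendSqlDatabase() (with blob = FALSE selecting the long format); the names con_long and sps_long are made up for illustration. Only the first example file is imported to keep the database small, and the table and column names are listed rather than assumed because they are not described in this document.

```r
library(RSQLite)
library(MsBackendSql)

fls <- dir(system.file("sciex", package = "msdata"), full.names = TRUE)

## Assumption: blob = FALSE stores each m/z and intensity value in its own row
con_long <- dbConnect(SQLite(), tempfile())
createMsBackendSqlDatabase(con_long, fls[1], blob = FALSE)

## Inspect which tables and columns were created
for (tbl in dbListTables(con_long))
    print(dbListFields(con_long, tbl))

## The database is used through a Spectra object like any other
sps_long <- Spectra(con_long, source = MsBackendSql())
length(sps_long)
```

With the individual values in separate rows, peaks could be selected with a plain SQL condition on the m/z column, which is not possible for BLOB-encoded values.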
The MsBackendSql package provides two backends to interact with such
databases: the (default) MsBackendSql class and the
MsBackendOfflineSql, which inherits all properties and functions from
the former but does not store the connection to the database within the
object. Instead, it connects to (and disconnects from) the database in
each function call. This allows the latter to be used also in parallel
processing setups and allows the object to be saved and loaded (e.g.
using save and saveRDS). Thus, for most applications the
MsBackendOfflineSql might be used as the preferred backend to SQL
databases.
To access the data in the database we create below a
Spectra
object providing the connection to the database in
the constructor call and specifying to use the MsBackendSql
as backend using the source
parameter.
sps <- Spectra(con, source = MsBackendSql())
sps
## MSn data (Spectra) with 1862 spectra in a MsBackendSql backend:
## msLevel precursorMz polarity
## <integer> <numeric> <integer>
## 1 1 NA 1
## 2 1 NA 1
## 3 1 NA 1
## 4 1 NA 1
## 5 1 NA 1
## ... ... ... ...
## 1858 1 NA 1
## 1859 1 NA 1
## 1860 1 NA 1
## 1861 1 NA 1
## 1862 1 NA 1
## ... 34 more variables/columns.
## Use 'spectraVariables' to list all of them.
## Database: /tmp/RtmpWP4Rr5/fileb286077e2da
As an alternative, the MsBackendOfflineSql
backend could
also be used to interface with MS data in a SQL database. In contrast to
the MsBackendSql
, the MsBackendOfflineSql
does
not contain an active (open) connection to the database and hence
supports serializing (saving) the object to disk using e.g. the
save()
function, or parallel processing (if supported by
the database system). Thus, for most use cases the
MsBackendOfflineSql
should be used instead of the
MsBackendSql
. See further below for more information on the
MsBackendOfflineSql
.
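Serializing such an object can be sketched as follows. This is a minimal sketch with made-up toy data (toy_df and the other toy_* names are hypothetical); the database is created with the same setBackend() call form that is used later in this document.

```r
library(Spectra)
library(MsBackendSql)
library(RSQLite)

## Hypothetical toy data: three spectra with a few (sorted) m/z values each
toy_df <- DataFrame(msLevel = c(1L, 2L, 2L), rtime = c(1.2, 3.4, 5.6))
toy_df$mz <- list(c(100.1, 200.2, 300.3), c(50.5, 150.5),
                  c(75.2, 120.4, 210.9, 333.3))
toy_df$intensity <- list(c(10, 200, 30), c(45, 80), c(5, 12, 60, 8))
toy <- Spectra(toy_df)

## Store the toy data in a SQLite file, accessed through MsBackendOfflineSql
toy_off <- setBackend(toy, MsBackendOfflineSql(), drv = SQLite(),
                      dbname = tempfile())

## Save the object to disk and load it again
rds_file <- tempfile(fileext = ".rds")
saveRDS(toy_off, rds_file)
toy_restored <- readRDS(rds_file)

## The restored object reconnects to the database on each call
rtime(toy_restored)
```

Only the connection information is saved with the object, so the database file has to remain available at the stored location for the restored object to work.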
Spectra objects also allow changing the backend to any other backend
(extending MsBackend) using the setBackend() function. Below we use
this function to first load all data into memory by changing from the
MsBackendSql to a MsBackendMemory.
sps_mem <- setBackend(sps, MsBackendMemory())
sps_mem
## MSn data (Spectra) with 1862 spectra in a MsBackendMemory backend:
## msLevel rtime scanIndex
## <integer> <numeric> <integer>
## 1 1 0.280 1
## 2 1 0.559 2
## 3 1 0.838 3
## 4 1 1.117 4
## 5 1 1.396 5
## ... ... ... ...
## 1858 1 258.636 927
## 1859 1 258.915 928
## 1860 1 259.194 929
## 1861 1 259.473 930
## 1862 1 259.752 931
## ... 34 more variables/columns.
## Processing:
## Switch backend from MsBackendSql to MsBackendMemory [Fri Sep 27 05:10:08 2024]
With this function it is also possible to change from any backend to
a MsBackendSql, in which case a new database is created and all data
from the originating backend is stored in this database. To change the
backend to an MsBackendOfflineSql we need to provide the connection
information to the SQL database as additional parameters. These are the
same parameters that need to be passed to a dbConnect() call to
establish the connection to the database. They include the database
driver (parameter drv), the database name and possibly the user name,
host etc. (see ?dbConnect for more information). In the simple example
below we store the data into a SQLite database and thus only need to
provide the database name, which corresponds to the SQLite database
file. In our example we store the data into a temporary file.
sps2 <- setBackend(sps_mem, MsBackendOfflineSql(), drv = SQLite(),
dbname = tempfile())
sps2
## MSn data (Spectra) with 1862 spectra in a MsBackendOfflineSql backend:
## msLevel precursorMz polarity
## <integer> <numeric> <integer>
## 1 1 NA 1
## 2 1 NA 1
## 3 1 NA 1
## 4 1 NA 1
## 5 1 NA 1
## ... ... ... ...
## 1858 1 NA 1
## 1859 1 NA 1
## 1860 1 NA 1
## 1861 1 NA 1
## 1862 1 NA 1
## ... 34 more variables/columns.
## Use 'spectraVariables' to list all of them.
## Database: /tmp/RtmpWP4Rr5/fileb2839a46d9b
## Processing:
## Switch backend from MsBackendSql to MsBackendMemory [Fri Sep 27 05:10:08 2024]
## Switch backend from MsBackendMemory to MsBackendOfflineSql [Fri Sep 27 05:10:08 2024]
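For the plain MsBackendSql the target database is given as an open connection rather than as driver and database name. The following is a minimal sketch with made-up toy data (all toy_* names and con_new are hypothetical), assuming that the connection to the empty database is passed with a parameter named dbcon.

```r
library(Spectra)
library(MsBackendSql)
library(RSQLite)

## Hypothetical toy data: three spectra with a few (sorted) m/z values each
toy_df <- DataFrame(msLevel = c(1L, 2L, 2L), rtime = c(1.2, 3.4, 5.6))
toy_df$mz <- list(c(100.1, 200.2, 300.3), c(50.5, 150.5),
                  c(75.2, 120.4, 210.9, 333.3))
toy_df$intensity <- list(c(10, 200, 30), c(45, 80), c(5, 12, 60, 8))
toy <- Spectra(toy_df)

## Assumption: `dbcon` is an open connection to an empty database that
## setBackend() fills with the data of the originating backend
con_new <- dbConnect(SQLite(), tempfile())
toy_sql <- setBackend(toy, MsBackendSql(), dbcon = con_new)
toy_sql
```

Because the connection is kept inside the object, toy_sql stays usable only as long as con_new is open, in contrast to the MsBackendOfflineSql variant shown above.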
Similar to any other Spectra
object we can retrieve the
available spectra variables using the
spectraVariables()
function.
spectraVariables(sps)
## [1] "msLevel" "rtime"
## [3] "acquisitionNum" "scanIndex"
## [5] "dataStorage" "dataOrigin"
## [7] "centroided" "smoothed"
## [9] "polarity" "precScanNum"
## [11] "precursorMz" "precursorIntensity"
## [13] "precursorCharge" "collisionEnergy"
## [15] "isolationWindowLowerMz" "isolationWindowTargetMz"
## [17] "isolationWindowUpperMz" "peaksCount"
## [19] "totIonCurrent" "basePeakMZ"
## [21] "basePeakIntensity" "ionisationEnergy"
## [23] "lowMZ" "highMZ"
## [25] "mergedScan" "mergedResultScanNum"
## [27] "mergedResultStartScanNum" "mergedResultEndScanNum"
## [29] "injectionTime" "filterString"
## [31] "spectrumId" "ionMobilityDriftTime"
## [33] "scanWindowLowerLimit" "scanWindowUpperLimit"
## [35] "spectrum_id_"
The MS peak data can be accessed using either the mz(), intensity() or
peaksData() functions. Below we extract the peaks matrix of the 5th
spectrum and display the first 6 rows.
head(peaksData(sps)[[5]])
## mz intensity
## [1,] 105.0347 0
## [2,] 105.0362 164
## [3,] 105.0376 0
## [4,] 105.0391 0
## [5,] 105.0405 328
## [6,] 105.0420 0
All data (peaks data or spectra variables) are always retrieved on the
fly from the database, which results in a minimal memory footprint for
the Spectra object.
print(object.size(sps), units = "KB")
## 91.5 Kb
The backend also supports adding additional spectra variables or changing their values. Below we add 10 seconds to the retention time of each spectrum.
sps$rtime <- sps$rtime + 10
Such operations do not, however, change the data in the database (which is always considered read-only); the changes are cached locally within the backend object (in memory). The size in memory of the object is thus higher after changing that spectra variable.
print(object.size(sps), units = "KB")
## 106.2 Kb
Such $<- operations can also be used to cache spectra variables
(temporarily) in memory, which can potentially improve performance.
Below we test the time it takes to extract the MS level from each
spectrum from the database, then cache the MS levels in memory using
$msLevel <- and test the time it takes to extract this cached variable.
system.time(msLevel(sps))
## user system elapsed
## 0.007 0.001 0.008
sps$msLevel <- msLevel(sps)
system.time(msLevel(sps))
## user system elapsed
## 0.002 0.000 0.002
We can also use the reset()
function to reset
the data to its original state (this will cause any local spectra
variables to be deleted and the backend to be initialized with the
original data in the database).
sps <- reset(sps)
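The interplay of local caching and reset() can be checked in a few lines. The following is a minimal sketch that repeats the steps from above on a small database built from the first example file only; the variable names rt_db, rt_cached and rt_reset are made up for illustration.

```r
library(RSQLite)
library(MsBackendSql)

## Small database with the data from one example file
fls <- dir(system.file("sciex", package = "msdata"), full.names = TRUE)
con <- dbConnect(SQLite(), tempfile())
createMsBackendSqlDatabase(con, fls[1])
sps <- Spectra(con, source = MsBackendSql())

rt_db <- rtime(sps)           # values as stored in the database
sps$rtime <- sps$rtime + 10   # cached in the object, database untouched
rt_cached <- rtime(sps)

sps <- reset(sps)             # drop the locally cached values
rt_reset <- rtime(sps)

## The cached values are shifted, the values after reset() are not
all.equal(rt_cached, rt_db + 10)
all.equal(rt_reset, rt_db)
```

This also shows that the database itself was never modified: the original retention times are available again after reset().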
To use the MsBackendOfflineSql
backend we need to
provide all information required to connect to the database along with
the database driver to the Spectra
function. Which
parameters are required to connect to the database depends on the SQL
database and the used driver. In our example the data is stored in a
SQLite database, hence we use the SQLite()
database driver
and only need to provide the database name with the dbname
parameter. For a MySQL/MariaDB database we would use the
MariaDB()
driver and would have to provide the database
name, user name, password as well as the host name and port through
which the database is accessible.
sps_off <- Spectra(dbfile, drv = SQLite(),
source = MsBackendOfflineSql())
sps_off
## MSn data (Spectra) with 1862 spectra in a MsBackendOfflineSql backend:
## msLevel precursorMz polarity
## <integer> <numeric> <integer>
## 1 1 NA 1
## 2 1 NA 1
## 3 1 NA 1
## 4 1 NA 1
## 5 1 NA 1
## ... ... ... ...
## 1858 1 NA 1
## 1859 1 NA 1
## 1860 1 NA 1
## 1861 1 NA 1
## 1862 1 NA 1
## ... 34 more variables/columns.
## Use 'spectraVariables' to list all of them.
## Database: /tmp/RtmpWP4Rr5/fileb286077e2da
This backend provides exactly the same functionality as MsBackendSql,
with the difference that the connection to the database is opened and
closed for each function call. While this leads to a slightly lower
performance, it allows the object to be serialized (i.e. saved to and
loaded from disk) and allows it (and hence the Spectra object) to be
used in a parallel processing setup. In contrast, for the MsBackendSql
parallel processing is disabled since it is not possible to share the
active backend connection within the object across different parallel
processes.
Below we compare the performance of the two backends. The performance difference results from opening and closing the database connection for each call. Note that this will also depend on the SQL server that is being used. For SQLite databases there is almost no overhead.
library(microbenchmark)
microbenchmark(msLevel(sps), msLevel(sps_off))
## Unit: milliseconds
## expr min lq mean median uq max neval
## msLevel(sps) 5.332318 5.406762 5.649273 5.476532 5.542425 10.26601 100
## msLevel(sps_off) 6.649957 6.818596 6.973694 6.867898 6.938871 10.00390 100
Performance comparison with other backends
The need to retrieve any spectra data on-the-fly from the database
will have an impact on the performance of the data access functions of
Spectra objects using the MsBackendSql backends. To evaluate this
impact we next compare the performance of the
MsBackendSql
to other Spectra
backends,
specifically, the MsBackendMzR
which is the default backend
to read and represent raw MS data, and the MsBackendMemory
backend that keeps all MS data in memory (and is thus not suggested for
larger MS experiments). Similar to the MsBackendMzR
, also
the MsBackendSql
keeps only a limited amount of data in
memory. These on-disk backends need thus to retrieve spectra
and MS peaks data on-the-fly from either the original raw data files (in
the case of the MsBackendMzR
) or from the SQL database (in
the case of the MsBackendSql
). The in-memory backend
MsBackendMemory
is supposed to provide the fastest data
access since all data is kept in memory.
Below we thus create Spectra
objects from the same data
but using the different backends.
sps <- Spectra(con, source = MsBackendSql())
sps_mzr <- Spectra(fls, source = MsBackendMzR())
sps_im <- setBackend(sps_mzr, backend = MsBackendMemory())
At first we compare the memory footprint of the 3 backends.
print(object.size(sps), units = "KB")
## 91.5 Kb
print(object.size(sps_mzr), units = "KB")
## 386.7 Kb
print(object.size(sps_im), units = "KB")
## 54494.5 Kb
The MsBackendSql
has the lowest memory footprint of all
3 backends because it does not keep any data in memory. The
MsBackendMzR
keeps all spectra variables, except the MS
peaks data, in memory and has thus a larger size. The
MsBackendMemory
keeps all data (including the MS peaks
data) in memory and has thus the largest size in memory.
Next we compare the performance to extract the MS level for each
spectrum from the 3 different Spectra objects.
library(microbenchmark)
microbenchmark(msLevel(sps),
msLevel(sps_mzr),
msLevel(sps_im))
## Unit: microseconds
## expr min lq mean median uq max
## msLevel(sps) 5350.232 5537.7015 5771.88027 5626.0110 5741.3965 9305.801
## msLevel(sps_mzr) 367.796 400.6125 432.38209 420.9550 439.0395 650.835
## msLevel(sps_im) 10.830 13.4655 21.04269 20.3735 23.6145 72.585
## neval
## 100
## 100
## 100
Extracting MS levels is thus slowest for the
MsBackendSql
, which is not surprising because both other
backends keep this data in memory while the MsBackendSql
needs to retrieve it from the database.
We next compare the performance to access the full peaks data from
each Spectra
object.
microbenchmark(peaksData(sps, BPPARAM = SerialParam()),
peaksData(sps_mzr, BPPARAM = SerialParam()),
peaksData(sps_im, BPPARAM = SerialParam()), times = 10)
## Unit: microseconds
## expr min lq mean
## peaksData(sps, BPPARAM = SerialParam()) 106775.187 120725.956 285957.611
## peaksData(sps_mzr, BPPARAM = SerialParam()) 451588.290 456703.688 751336.931
## peaksData(sps_im, BPPARAM = SerialParam()) 357.737 405.517 2317.596
## median uq max neval
## 350136.025 388083.791 514076.63 10
## 770254.801 1077836.403 1114988.63 10
## 546.204 653.339 18531.44 10
As expected, the MsBackendMemory has the fastest access to the full
peaks data. The MsBackendSql, however, outperforms the MsBackendMzR,
providing faster access to the m/z and intensity values.
Performance can be improved for the MsBackendMzR
using
parallel processing. Note that the MsBackendSql
does
not support parallel processing and thus parallel
processing is (silently) disabled in functions such as
peaksData()
.
m2 <- MulticoreParam(2)
microbenchmark(peaksData(sps, BPPARAM = m2),
peaksData(sps_mzr, BPPARAM = m2),
peaksData(sps_im, BPPARAM = m2), times = 10)
## Unit: microseconds
## expr min lq mean median
## peaksData(sps, BPPARAM = m2) 96809.705 106153.578 171793.8454 125550.1305
## peaksData(sps_mzr, BPPARAM = m2) 463529.496 485036.802 675009.7979 649723.4730
## peaksData(sps_im, BPPARAM = m2) 383.796 618.193 605.1018 644.4725
## uq max neval
## 133632.325 396552.127 10
## 783405.630 1125822.560 10
## 674.268 724.341 10
We next compare the performance of subsetting operations.
microbenchmark(filterRt(sps, rt = c(50, 100)),
filterRt(sps_mzr, rt = c(50, 100)),
filterRt(sps_im, rt = c(50, 100)))
## Unit: microseconds
## expr min lq mean median
## filterRt(sps, rt = c(50, 100)) 2685.972 2748.231 2976.2252 2797.785
## filterRt(sps_mzr, rt = c(50, 100)) 2030.728 2141.570 2333.1756 2206.873
## filterRt(sps_im, rt = c(50, 100)) 452.784 485.421 522.1169 507.597
## uq max neval
## 2861.4985 17459.073 100
## 2421.4225 6680.685 100
## 548.1525 895.580 100
The two on-disk backends MsBackendSql and MsBackendMzR show a
comparable performance for this operation. This filtering involves
access to a spectra variable (the retention time in this case) which,
for the MsBackendSql, first needs to be retrieved from the backend.
The MsBackendSql backend however also allows caching spectra variables
(i.e. they are stored within the MsBackendSql object). Any access to
such cached spectra variables can potentially be faster because no
dedicated SQL query is needed.
To evaluate the performance of a pure subsetting operation
we first define the indices of 10 random spectra and subset the
Spectra
objects to these.
idx <- sample(seq_along(sps), 10)
microbenchmark(sps[idx],
sps_mzr[idx],
sps_im[idx])
## Unit: microseconds
## expr min lq mean median uq max neval
## sps[idx] 132.798 141.6640 152.3949 150.5610 157.0680 280.623 100
## sps_mzr[idx] 643.571 669.1335 685.4696 678.8715 694.1505 927.961 100
## sps_im[idx] 223.306 234.2920 243.9662 241.7510 250.4520 367.947 100
Here the MsBackendSql
outperforms the other backends
because it does not keep any data in memory and hence does not need to
subset these. The two other backends need to subset the data they keep
in memory which is in both cases a data frame with either a reduced set
of spectra variables or the full MS data.
Finally we also compare the extraction of the peaks data from such
subsetted Spectra objects.
sps_10 <- sps[idx]
sps_mzr_10 <- sps_mzr[idx]
sps_im_10 <- sps_im[idx]
microbenchmark(peaksData(sps_10),
peaksData(sps_mzr_10),
peaksData(sps_im_10),
times = 10)
## Unit: microseconds
## expr min lq mean median uq
## peaksData(sps_10) 2647.50 2654.273 3291.1344 3614.394 3715.712
## peaksData(sps_mzr_10) 69819.87 72925.655 73515.5042 73877.762 74639.904
## peaksData(sps_im_10) 348.27 406.358 510.9153 447.791 612.092
## max neval
## 3878.386 10
## 75518.083 10
## 863.561 10
The MsBackendSql outperforms the MsBackendMzR while, not unexpectedly,
the MsBackendMemory provides the fastest access.
Considerations for database systems/servers
The backends from the MsBackendSql package use standard SQL calls to retrieve MS data from the database and hence any SQL database system (for which an R package is available) is supported. SQLite-based databases represent the easiest and most user-friendly solution since no database server administration and user management is required. Indeed, performance of SQLite is very high, even for very large data sets. Server-based databases on the other hand have the advantage of enabling centralized storage and control of MS data (including user management etc.). Such server systems also allow data set or server-specific configurations to improve performance.
A comparison between a SQLite-based and a MariaDB-based MsBackendSql database for a large data set comprising over 8,000 samples and over 15,000,000 spectra is available here. In brief, performance to extract data was comparable and, for individual spectra variables, even faster for the SQLite database. Only when more complex SQL queries were involved (combining several primary keys or data fields) did the more advanced MariaDB database outperform SQLite.
Other properties of the MsBackendSql
The MsBackendSql backend does not support parallel processing since
the database connection cannot be shared across the different
(parallel) processes. Thus, all methods on Spectra objects that use a
MsBackendSql will automatically (and silently) disable parallel
processing even if a dedicated parallel processing setup was passed
along with the BPPARAM parameter.
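Which parallel processing setup a backend actually accepts can be inspected directly. The following is a minimal sketch, assuming the backendBpparam() function of the Spectra infrastructure, which is not mentioned elsewhere in this document; it is called on the backend stored in the backend slot of the Spectra object, and only the first example file is imported to keep the example small.

```r
library(RSQLite)
library(MsBackendSql)
library(BiocParallel)

## Small database with the data from one example file
fls <- dir(system.file("sciex", package = "msdata"), full.names = TRUE)
dbfile <- tempfile()
con <- dbConnect(SQLite(), dbfile)
createMsBackendSqlDatabase(con, fls[1])

sps <- Spectra(con, source = MsBackendSql())
sps_off <- Spectra(dbfile, drv = SQLite(), source = MsBackendOfflineSql())

## Request two parallel processes and check what each backend supports
m2 <- MulticoreParam(2)
class(backendBpparam(sps@backend, m2))      # serial setup for MsBackendSql
class(backendBpparam(sps_off@backend, m2))
```

For the MsBackendSql the requested setup is replaced by a serial one, which is the silent fallback described above.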
Some functions on Spectra objects require loading the MS peak data
(i.e., m/z and intensity values) into memory. For very large data sets
(or computers with limited hardware resources) such function calls can
cause out-of-memory errors. One example is the lengths() function that
determines the number of peaks per spectrum by first loading the peak
matrix into memory. Such functions should ideally be called using the
peaksapply() function with parameter chunkSize (e.g.,
peaksapply(sps, lengths, chunkSize = 5000L)). Instead of processing
the full data set at once, the data is first split into chunks of size
chunkSize that are processed one after the other. Hence, only data from
chunkSize spectra is loaded into memory in one iteration.
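The idea behind chunk-wise processing can also be written out by hand. The following is a minimal sketch that does not use peaksapply() but only subsetting with [ and lengths(); the toy data and the names toy_df, toy_off, chunk_size, chunks and n_peaks are made up for illustration.

```r
library(Spectra)
library(MsBackendSql)
library(RSQLite)

## Hypothetical toy data: three spectra with 3, 2 and 4 peaks
toy_df <- DataFrame(msLevel = c(1L, 2L, 2L), rtime = c(1.2, 3.4, 5.6))
toy_df$mz <- list(c(100.1, 200.2, 300.3), c(50.5, 150.5),
                  c(75.2, 120.4, 210.9, 333.3))
toy_df$intensity <- list(c(10, 200, 30), c(45, 80), c(5, 12, 60, 8))
toy_off <- setBackend(Spectra(toy_df), MsBackendOfflineSql(),
                      drv = SQLite(), dbname = tempfile())

## Split the spectra indices into chunks of (at most) chunk_size spectra
chunk_size <- 2L
idx <- seq_along(toy_off)
chunks <- split(idx, ceiling(idx / chunk_size))

## Process one chunk at a time: only the peak data of the spectra in the
## current chunk is loaded from the database
n_peaks <- unlist(lapply(chunks, function(i) lengths(toy_off[i])),
                  use.names = FALSE)
n_peaks
```

A chunk size of 2 is only sensible for this toy example; for real data a value in the thousands, such as the 5000L mentioned above, keeps the number of database queries low.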
Summary
The MsBackendSql provides an MS data representation and storage mode
with a minimal memory footprint (in R) that is still comparably
efficient for standard processing and subsetting operations. This
backend is especially useful for very large MS data sets, which could
even be hosted on remote (MySQL/MariaDB) servers. A potential use case
for this backend could thus be to set up a central storage place for MS
experiments with data analysts connecting remotely to this server to
perform initial data exploration and filtering. After subsetting to a
smaller data set of interest, users could then retrieve/download this
data by changing the backend to e.g. a MsBackendMemory, which would
result in a download of the full data to the memory of the user's
computer.
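The workflow described above (filter on the database first, then load only the subset) can be sketched as follows. This is a minimal sketch with made-up toy data in a local SQLite file standing in for a remote server; all toy_* names, sub and sub_mem are hypothetical.

```r
library(Spectra)
library(MsBackendSql)
library(RSQLite)

## Hypothetical toy data: three spectra with a few (sorted) m/z values each
toy_df <- DataFrame(msLevel = c(1L, 2L, 2L), rtime = c(1.2, 3.4, 5.6))
toy_df$mz <- list(c(100.1, 200.2, 300.3), c(50.5, 150.5),
                  c(75.2, 120.4, 210.9, 333.3))
toy_df$intensity <- list(c(10, 200, 30), c(45, 80), c(5, 12, 60, 8))
toy_off <- setBackend(Spectra(toy_df), MsBackendOfflineSql(),
                      drv = SQLite(), dbname = tempfile())

## Step 1: subset while the data still resides in the database
sub <- filterRt(toy_off, rt = c(3, 6))

## Step 2: load only the selected spectra into memory
sub_mem <- setBackend(sub, MsBackendMemory())
length(sub_mem)
mz(sub_mem)
```

Only the spectra that pass the filter are transferred in the second step, which is what keeps this approach feasible for databases that are much larger than the available memory.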
Session information
## R version 4.4.1 (2024-06-14)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 22.04.4 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so; LAPACK version 3.10.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] microbenchmark_1.5.0 RSQLite_2.3.7 MsBackendSql_1.5.0
## [4] Spectra_1.15.8 ProtGenerics_1.37.1 BiocParallel_1.39.0
## [7] S4Vectors_0.43.2 BiocGenerics_0.51.1 BiocStyle_2.33.1
##
## loaded via a namespace (and not attached):
## [1] sass_0.4.9 MsCoreUtils_1.17.2 hms_1.1.3
## [4] digest_0.6.37 evaluate_1.0.0 bookdown_0.40
## [7] fastmap_1.2.0 blob_1.2.4 jsonlite_1.8.9
## [10] progress_1.2.3 mzR_2.39.0 DBI_1.2.3
## [13] BiocManager_1.30.25 codetools_0.2-20 textshaping_0.4.0
## [16] jquerylib_0.1.4 cli_3.6.3 rlang_1.1.4
## [19] crayon_1.5.3 Biobase_2.65.1 bit64_4.5.2
## [22] cachem_1.1.0 yaml_2.3.10 tools_4.4.1
## [25] parallel_4.4.1 memoise_2.0.1 ncdf4_1.23
## [28] vctrs_0.6.5 R6_2.5.1 lifecycle_1.0.4
## [31] fs_1.6.4 htmlwidgets_1.6.4 IRanges_2.39.2
## [34] bit_4.5.0 clue_0.3-65 MASS_7.3-61
## [37] ragg_1.3.3 cluster_2.1.6 pkgconfig_2.0.3
## [40] desc_1.4.3 pkgdown_2.1.1.9000 bslib_0.8.0
## [43] Rcpp_1.0.13 data.table_1.16.0 systemfonts_1.1.0
## [46] xfun_0.47 knitr_1.48 htmltools_0.5.8.1
## [49] rmarkdown_2.28 compiler_4.4.1 prettyunits_1.2.0
## [52] MetaboCoreUtils_1.13.0