2.1 Mass spectrometry data handling and (pre-)processing

For all mass spectrometers, the fundamental data generated is a mass spectrum, i.e. mass-signal intensity pairs. MS-based metabolomics data is typically acquired either as a single mass spectrum or a collection of mass spectra over time, with the time axis (retention time) defined by chromatographic (or other time domain) separation. One of the first steps in metabolomics data processing is usually the reduction of the typically large raw data produced by the instrument to a much smaller set of so-called features, which are then subjected to downstream data analysis and interpretation. Features normally represent integrated peaks for a given mass that have been aligned across samples. Establishing these features is called pre-processing. The applicable feature detection approaches and packages depend on the type and characteristics of the input data. This section describes the basic data structure for some of the common analytical approaches and shows appropriate tools in R for pre-processing such data; see Table 1 for an overview of the corresponding packages.

2.1.1 Profile mode and centroided data

The mass spectra can be recorded in profile (also called continuum) mode, but are often ‘centroided’. Centroiding is, in effect, a process of peak detection for a profile mode mass spectrum (hence in the m/z dimension, not in a chromatographic dimension) - a Gaussian region of a continuum spectrum with a sufficiently high signal-to-noise ratio is integrated to give a centroided mass (a “stick” in the mass spectrum as opposed to a continuous signal) and an integrated area under the curve. This results in data of reduced size - what was many m/z-intensity pairs has been reduced to a single m/z-intensity pair. Practically, this reduces the file size considerably, and many data processing tools (e.g. centWave in xcms) require MS data that has been centroided. The centroiding can be done either on the fly during acquisition by the instrument software, or as an initial processing step. Post-acquisition centroiding can be performed during conversion of the vendor data format to open formats, typically using msconvert from ProteoWizard [27,28], which in some cases provides access to vendor centroiding algorithms or can alternatively use its own built-in centroiding method. Dedicated vendor tools can also be used, and the R package MSnbase also provides centroiding capabilities.
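The reduction from many profile points to a single “stick” can be illustrated with a minimal Python sketch. It is not the algorithm used by msconvert, MSnbase or any vendor software: the function names `centroid_profile` and `_collapse`, the simple intensity threshold, and the use of summed intensity in place of a true area integral are all assumptions made for illustration.

```python
def _collapse(run_mz, run_int):
    """Collapse one run of profile points into a single (m/z, intensity) stick."""
    total = sum(run_int)
    # Intensity-weighted mean m/z of the run (the centroid).
    center = sum(m * i for m, i in zip(run_mz, run_int)) / total
    return (center, total)


def centroid_profile(mz, intensity, min_intensity=0.0):
    """Reduce a profile spectrum to centroids.

    Hypothetical helper: every run of consecutive points whose intensity
    exceeds `min_intensity` is treated as one peak. Real centroiding
    additionally handles noise estimation and overlapping peaks.
    """
    centroids = []
    run_mz, run_int = [], []
    for m, i in zip(mz, intensity):
        if i > min_intensity:
            run_mz.append(m)
            run_int.append(i)
        elif run_int:
            centroids.append(_collapse(run_mz, run_int))
            run_mz, run_int = [], []
    if run_int:  # a peak that reaches the end of the spectrum
        centroids.append(_collapse(run_mz, run_int))
    return centroids


# Seven profile points describing one symmetric peak around m/z 100.03.
profile_mz = [100.00, 100.01, 100.02, 100.03, 100.04, 100.05, 100.06]
profile_int = [0.0, 10.0, 50.0, 100.0, 50.0, 10.0, 0.0]
sticks = centroid_profile(profile_mz, profile_int)
```

Here seven m/z-intensity pairs collapse into one pair, which is the size reduction described above.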

2.1.2 Direct infusion mass spectrometry data

Currently, one of the highest-throughput analytical approaches is direct infusion MS, where the sample is directly injected into the mass spectrometer without any chromatographic separation. This approach can be used with high mass resolution or ultra-high resolution mass spectrometers to discriminate isobaric analytes [29]. Summing or averaging the spectra acquired for a sample generates a single mass spectrum that is representative of that sample. Peak picking can be done using MassSpecWavelet, which applies a continuous wavelet transform-based peak detection. xcms provides a wrapper for this approach in the findPeaks.MSW function. In the Flow Injection Analysis (FIA) analytical approach, the sample is transiently injected into the carrier stream flowing directly into the MS instrument. In the absence of chromatographic separation, matrix effects are a challenge for quantification, especially in complex matrices. FIA coupled to High-Resolution Mass Spectrometry data can be processed with the proFIA workflow, which provides efficient and robust peak detection and quantification.
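Averaging several spectra into one representative spectrum can be sketched as follows. This is an illustrative simplification in Python, not the procedure used by MassSpecWavelet, xcms or proFIA: the function name `average_spectra` and the fixed-width m/z binning (`bin_width`) are assumptions introduced here.

```python
from collections import defaultdict


def average_spectra(scans, bin_width=0.01):
    """Average centroided spectra on a fixed m/z grid.

    scans: list of spectra, each a list of (mz, intensity) pairs.
    Returns a list of (bin m/z, mean intensity over all scans), sorted by m/z.
    A scan that has no signal in a bin contributes zero to that bin's mean.
    """
    sums = defaultdict(float)
    for scan in scans:
        for mz, inten in scan:
            # Assign each point to the nearest grid position.
            sums[round(mz / bin_width)] += inten
    n_scans = len(scans)
    return [(idx * bin_width, total / n_scans) for idx, total in sorted(sums.items())]


infusion_scans = [
    [(100.001, 10.0), (200.0, 4.0)],
    [(100.002, 20.0)],
]
mean_spectrum = average_spectra(infusion_scans)
```

Fixed-width bins are the simplest choice; because mass accuracy usually scales with m/z, a ppm-based grid would be a reasonable alternative.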

2.1.3 Hyphenated MS and non-targeted data

Chromatographic separation before MS enables better measurement of complex samples and the ability to separate isobaric compounds. Here, the mass spectra are acquired over time as the sample components separate on the chromatography column. The mass spectrum at any given time has the same data structure as any mass spectrum - pairs of mass-to-charge ratio and intensity. As can be inferred from the above descriptions, chromatographically coupled mass spectrometry data is three-dimensional, with dimensions of retention time, m/z, and intensity.

For the pre-processing of LC-MS and GC-MS data, xcms is widely used. A recent paper reviewed some of the “xcms family” packages [30], though many more packages exist that build on xcms by providing tools for specialised analyses, while others provide improvements of some of the xcms processing steps such as improved peak picking (xMSanalyzer, warpgroup, cosmiq). xcms itself provides a number of different algorithms for peak picking such as matchedFilter [31], centWave [32] and massifquant [33]. apLCMS, yamss, KPIC2 and enviPick also provide peak picking for LC-MS data independently of xcms. In cases where the alignment of the peak data of different samples is considered (e.g. in cohort studies), xcms and apLCMS include methods to group the peaks by their m/z and retention times within tolerance levels. The groups are split into sub-groups using density functions, and a consensus m/z and retention time are assigned to each sub-group.
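The idea of grouping peaks across samples within m/z and retention time tolerances can be sketched in Python. Note that this swaps in a plain greedy tolerance match for the density-based splitting used by xcms and apLCMS; the function name `group_peaks` and the parameters `mz_tol` and `rt_tol` are hypothetical.

```python
def group_peaks(peaks, mz_tol=0.01, rt_tol=5.0):
    """Greedily group peaks from several samples into features.

    peaks: list of (sample, mz, rt) tuples.
    A peak joins the first existing group whose consensus m/z and retention
    time are both within tolerance; otherwise it starts a new group.
    The consensus values are running means over the group members.
    """
    groups = []
    for sample, mz, rt in sorted(peaks, key=lambda p: p[1]):
        for group in groups:
            if abs(group["mz"] - mz) <= mz_tol and abs(group["rt"] - rt) <= rt_tol:
                group["members"].append((sample, mz, rt))
                n = len(group["members"])
                group["mz"] += (mz - group["mz"]) / n  # update running mean m/z
                group["rt"] += (rt - group["rt"]) / n  # update running mean rt
                break
        else:
            groups.append({"mz": mz, "rt": rt, "members": [(sample, mz, rt)]})
    return groups


picked = [
    ("A", 150.000, 60.0),
    ("B", 150.004, 62.0),
    ("A", 150.002, 120.0),  # same m/z, but elutes a minute later
    ("B", 300.100, 61.0),
]
features = group_peaks(picked)
```

In this example the two peaks near m/z 150.00 and 60 s form one feature, while the peak at 120 s stays separate even though its m/z is within tolerance.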

2.1.4 Targeted data and alternative representations of data

In addition to the most standard “spectra over time” representation of chromatographically separated MS data, there are a number of alternative ways to represent or simplify the data. The signal intensity for a given mass (or mass range) over chromatographic time can be represented as two equal-length vectors, one holding retention times and the other the corresponding intensities. Examples of these vector pairs include the extracted ion chromatogram (EIC, sometimes also referred to as selected ion chromatogram, SIC, or eXtracted ion chromatogram, XIC), where these chromatograms represent the intensity of a given mass over (retention) time. The data thus contains no spectra but a number of SICs. Frequently this is accomplished by summarizing the raw data in a two-dimensional matrix consisting of m/z and time dimensions, with each cell holding the signal intensity for that m/z and retention time range (or bin). Low mass resolution mass spectrometers often represent the data natively as a SIC, and targeted data are also usually represented this way. Recent versions of xcms are also able to process such data, and additional xcms-based functionalities for analysis of targeted data can be found in the packages TargetSearch and SWATHtoMRM, while analysis of isotope-labeled data can be found in the packages X13CMS, geoRge, and IsotopicLabelling. SIMAT also provides processing for targeted data and does not rely on xcms.
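The relationship between the “spectra over time” representation and a chromatogram for a single mass can be sketched in Python. The function name `extract_ion_chromatogram` and the absolute m/z tolerance `tol` are assumptions for illustration and do not correspond to any package mentioned above.

```python
def extract_ion_chromatogram(scans, target_mz, tol=0.01):
    """Build an extracted ion chromatogram from a list of scans.

    scans: list of (retention_time, spectrum), where spectrum is a list of
    (mz, intensity) pairs.
    Returns two equal-length lists: retention times and, for each scan, the
    summed intensity of all points within `tol` of `target_mz` (zero if none).
    """
    rts, intensities = [], []
    for rt, spectrum in scans:
        rts.append(rt)
        intensities.append(
            sum(inten for mz, inten in spectrum if abs(mz - target_mz) <= tol)
        )
    return rts, intensities


lc_scans = [
    (1.0, [(100.000, 5.0), (200.0, 7.0)]),
    (2.0, [(100.005, 9.0)]),
    (3.0, [(200.000, 3.0)]),
]
eic_rt, eic_int = extract_ion_chromatogram(lc_scans, target_mz=100.0)
```

Repeating the extraction over a grid of target masses yields the two-dimensional m/z by time matrix mentioned in the text.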

2.1.5 Additional dimensionality

The vast majority of data collected for metabolomics comprises three dimensions: retention time, m/z, and intensity. However, there are more complicated analytical approaches that add further dimensionality to the data. Two-dimensional chromatography offers two separations in the chromatographic (retention time) domain. The eluent from one column is captured by retention time range and transferred to a second column where a fast orthogonal separation occurs. When coupled to a mass spectrometer, this generates four-dimensional data (m/z, first retention time, second retention time, intensity).

Ion mobility separation (IMS) is a gas phase separation method offering resolution of ions based on molecular shape. This separation occurs on timescales of tens of milliseconds, which generates a nested data structure in which there are dozens to hundreds of mass spectra collected across the IMS separation time scale. One can envision this as an ion mobility ‘chromatogram’ - however, this chromatogram is nested within the actual chromatographic separation, thus LC-IMS-MS data is also four-dimensional.

Most MS instruments offer the capability to perform selection (or filtering) of ions for fragmentation. The precursor selection can be performed through a quadrupole or ion trap, and fragmentation is often induced by collisions with an inert collision gas. Because this adds a level of mass spectrometry, it is called tandem MS, MS2 or MS/MS. Ion trap instruments can further select fragment ions and acquire MSn spectra.

There are several data independent MS/MS approaches, whereby MS/MS precursor selection is done, typically, on a scanning basis. These approaches perform precursor selection in a manner which does not depend on any feedback from the instrument control software or the MS level data. In practice, this precursor window can be either m/z or ion mobility based. The processing tools within the R universe (discussed below) are so far underdeveloped for these approaches. With the increased popularity of multidimensional separation, the need for algorithms that can fully utilize the increased separation power is also increasing.

Currently, osd provides peak picking for unit resolution GC×GC-MS. While the msPeak package provides peak picking for GC×GC-MS data, the peak picking is done on the total ion chromatogram, thus not taking advantage of the mass selectivity provided by the MS detector. It does not appear that any package for R exists that provides peak picking for GC×GC-MS, LC×LC-MS or LC-IMS-MS, similar to (or even better than) commercial tools (e.g. ChromaTOF, GC Image, ChromSquare). Also, at least in the case of GC×GC-MS, unit mass resolution still seems to be the most common use-case, even though high-resolution MS could further improve signal deconvolution and, ultimately, analyte identification. Such capabilities are crucial for moving these new powerful analytical approaches into mainstream metabolomics analysis.

2.1.6 Structuring data and metadata

The result from the pre-processing is usually a matrix of abundances, rows being features (or features grouped into compounds/molecules) and columns being the samples. Within the statistical community, it is common nowadays to manipulate data matrices with rows as observations and columns as features. This difference stems from the early days, when spreadsheet programs could only handle a limited number of columns, smaller than the number of e.g. genes. Such matrices can be easily encapsulated into an ExpressionSet class from Bioconductor’s Biobase package [34], the more recent SummarizedExperiment class defined in the SummarizedExperiment [35] package or the mSet class from the metabolomics-focussed Metabase [36] package. The main advantage of such objects is their inherent support for keeping quantitative data aligned with related metadata (i.e. feature definitions/annotations as row metadata and sample annotations as column metadata). As an example, a SummarizedExperiment can be generated from xcms pre-processing results by adding the output from the featureValues function on the xcms result object as quantitative assay and the outputs of the featureDefinitions and pData functions as row and column annotations, respectively. Many Bioconductor packages for omics data analysis have native support for such objects (e.g., pcaMethods, STATegRa, ropls, biosigner, omicade4).
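The benefit of keeping the abundance matrix and its metadata in one object can be illustrated with a small Python sketch. It only mimics the general idea behind containers such as ExpressionSet or SummarizedExperiment and does not reproduce their interfaces; the class name `FeatureTable` and its method `select_samples` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class FeatureTable:
    """Abundance matrix (features x samples) with aligned metadata."""

    values: list        # values[i][j]: abundance of feature i in sample j
    feature_meta: list  # one dict per row (feature annotations)
    sample_meta: list   # one dict per column (sample annotations)

    def __post_init__(self):
        # Refuse to build an object whose metadata is out of step with the data.
        if len(self.values) != len(self.feature_meta):
            raise ValueError("one feature annotation per row is required")
        if any(len(row) != len(self.sample_meta) for row in self.values):
            raise ValueError("one sample annotation per column is required")

    def select_samples(self, keep):
        """Return a new table with the samples for which keep(meta) is true."""
        cols = [j for j, meta in enumerate(self.sample_meta) if keep(meta)]
        return FeatureTable(
            values=[[row[j] for j in cols] for row in self.values],
            feature_meta=list(self.feature_meta),
            sample_meta=[self.sample_meta[j] for j in cols],
        )


table = FeatureTable(
    values=[[10.0, 12.0, 0.5], [3.0, 2.5, 9.0]],
    feature_meta=[{"mz": 150.002, "rt": 61.0}, {"mz": 300.100, "rt": 61.0}],
    sample_meta=[{"id": "s1", "group": "case"},
                 {"id": "s2", "group": "case"},
                 {"id": "s3", "group": "control"}],
)
cases = table.select_samples(lambda meta: meta["group"] == "case")
```

Because subsetting goes through the object, the columns of the matrix and the sample annotations cannot drift apart, which is the alignment guarantee described above.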

For the downstream export of mass spectrometry data from metabolomics or lipidomics experiments, the package rmzTab-M provides support for exporting quantitative and identification results backed by analytical and mass spectrometric evidence into the mzTab-M metabolomics file format [37].

2.1.7 Ion species grouping and annotation

In MS-based metabolomics, the characterisation and identification of metabolites involves several steps and approaches. After peak (feature) table generation, several tools can be used for grouping features that are postulated to originate from the same molecule. These include the widely-used CAMERA for MS1 data, as well as RAMClustR (particularly for DIA data), MetTailor, nontarget, CliqueMS and peakANOVA. Packages that support interpretation of the relationship between the ion species, including adducts, isotopes and in-source fragmentation, are InterpretMSSpectrum, CAMERA, nontarget and mzMatch [38]. See Table 2 for a summary of these packages.

Detailed reconstructed isotope patterns can be used to determine the molecular formula of potential candidates. In the case of molecular formula and isotope analysis, the m/z and intensities for a given (set of) features can be used to calculate a ranked list of possible molecular formulas, based on the accurate mass and relative isotope abundances. The Rdisop, GenFormR and enviPat packages are able to simulate and decompose isotopic patterns into molecular formula candidates. Post-processing can calculate e.g. the double bond equivalents (DBE) and similar characteristics to reduce the number of false positive assignments. An additional source of information to improve molecular formula estimation is MS/MS spectra, as used in MFAssignR, InterpretMSSpectrum or GenFormR.
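As a worked example of such a post-processing characteristic, the double bond equivalents of a candidate formula can be computed with the standard relation DBE = C - (H + X)/2 + N/2 + 1, where X is the number of halogen atoms. The Python sketch below is not taken from any of the packages named above; the function names are hypothetical, and the relation as written assumes standard valences and only covers C, H, N, O, S and halogens.

```python
import re

HALOGENS = ("F", "Cl", "Br", "I")


def parse_formula(formula):
    """Parse a simple molecular formula such as 'C8H10N4O2' into element counts.

    Assumes a flat formula without brackets, charges or isotope labels.
    """
    counts = {}
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] = counts.get(element, 0) + (int(number) if number else 1)
    return counts


def double_bond_equivalents(formula):
    """DBE = C - (H + X)/2 + N/2 + 1; divalent O and S do not contribute."""
    counts = parse_formula(formula)
    monovalent = counts.get("H", 0) + sum(counts.get(x, 0) for x in HALOGENS)
    return counts.get("C", 0) - monovalent / 2 + counts.get("N", 0) / 2 + 1


dbe_benzene = double_bond_equivalents("C6H6")        # one ring + three double bonds
dbe_glucose = double_bond_equivalents("C6H12O6")
dbe_caffeine = double_bond_equivalents("C8H10N4O2")
```

A candidate formula with a negative DBE cannot correspond to a valid neutral structure under these valence assumptions, which is one way such a value helps to discard false positives.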

A typical next step is the annotation of m/z values with putative metabolites using accurate mass lookup or, if the molecular formula was calculated, lookup of the formula in metabolite databases. It has to be noted that annotation by accurate mass search is by no means equivalent to identification. Under the assumption that all the metabolites measured in a sample have some biochemical relation, a global annotation strategy as used in ProbMetab can help as well. Here, the individual ranked lists of formulae are re-evaluated to also maximize the number of (potential) biochemical substrate-product pairs. The masstrixR package contains several utility functions for accurate mass lookup. It enables matching of measured m/z values against a given database or library and can additionally perform matching based on retention times (RT) and/or collision cross sections (CCS) if available.
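Accurate mass lookup within a ppm tolerance can be sketched in Python as follows. This does not reflect the masstrixR interface: the function name `match_mz`, the two-entry database with masses rounded to four decimals, and the restriction to a single [M+H]+ adduct are assumptions made for illustration.

```python
PROTON_MASS = 1.007276  # Da; mass shift applied for an assumed [M+H]+ adduct


def match_mz(measured_mz, database, adduct_shift=PROTON_MASS, ppm=5.0):
    """Return database entries whose expected ion m/z lies within `ppm`.

    database: dict mapping compound name to neutral monoisotopic mass.
    Returns a list of (name, error in ppm), best match first.
    Each hit is a putative annotation only, not an identification.
    """
    hits = []
    for name, neutral_mass in database.items():
        expected = neutral_mass + adduct_shift
        error_ppm = (measured_mz - expected) / expected * 1e6
        if abs(error_ppm) <= ppm:
            hits.append((name, error_ppm))
    return sorted(hits, key=lambda hit: abs(hit[1]))


# Illustrative mini-database (neutral monoisotopic masses, rounded).
toy_database = {"glucose": 180.0634, "caffeine": 194.0804}
annotations = match_mz(195.0877, toy_database)
unmatched = match_mz(250.0, toy_database)
```

Isomers share the same mass, so several database entries may match one measured m/z equally well; additional evidence such as retention time, CCS or MS/MS is needed to narrow the candidates.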