2.8 Multifunctional workflows

When dealing with non-targeted metabolomics data sets, data processing is a key step for obtaining meaningful and consistent results. While the type and number of data processing methods may vary according to the experimental design and aim of the study, some key steps are common to most metabolomics experiments. For this reason, a number of multifunctional R-based workflows have been developed over the years. A key advantage of multifunctional workflows is that most of the functions the user needs are available within the same “environment”, so that the data does not have to be reformatted to comply with functions in other packages. In this respect, a common backbone of R workflows is a pre-processing step that generates an R object that can then be passed as an argument to different functions. Another advantage is that, in most cases, workflows allow a certain degree of flexibility, so that functionalities can be used as standalone functions (modular workflows) to better suit the user’s needs. The packages covering larger parts of metabolomics workflows available in R are listed in Table 9.

These multifunctional packages include comprehensive workflows that cover multiple aspects, such as data pre-processing, data validation, preliminary statistical analysis and data visualisation of large metabolomics datasets. The workflows considered here support both MS-based data (LC-MS and GC-MS) and data generated by other analytical platforms.

MAIT (Metabolite Automatic Identification Toolkit) offers pre-processing, annotation, statistical analysis and data visualization. It relies on xcms for peak picking and on CAMERA for the preliminary annotation. Beyond CAMERA, peak annotation is extended by a function that relates in-source mass losses to specific biotransformations. Human biotransformations are already included; additional biotransformation criteria can be added by the end user. MAIT also provides a number of statistical tools and visual representations (e.g. PCA, boxplot, PLS) as well as a function to perform identifications using accurate mass search in HMDB.

MetMSLine shows some similarities with MAIT in terms of processing stages (xcms-based pre-processing, multivariate statistics, metabolite identification). Functionalities characterizing MetMSLine include normalisation, signal drift correction using a smoothing method, noise transformation and outlier removal.

SimExTargId is a wrapper of different software and R packages for LC-MS data. It includes tools for data conversion (ProteoWizard), peak picking and annotation (xcms and CAMERA), outlier detection and data correction (MetMSLine), and basic statistical analysis. A special feature of SimExTargId is the real-time monitoring of the different workflow stages, aimed at metabolomics core facilities; users are notified by email in case of processing errors (e.g. outlier detection, signal drift).

mzMatch differs slightly from the above-mentioned workflows in that it is itself designed to fit into a broader processing pipeline. The project also includes a dedicated file format (peakML) and a Java environment. The different modules can still be used independently. mzMatch supports peak picking and grouping using xcms, reproducibility calculation and data normalization. The peakMonitor app identifies peaks using a local database. The identification is performed on the basis of m/z and retention time values, with user-defined mass accuracy and retention time deviation values.
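The m/z and retention time matching described above can be illustrated with a minimal Python sketch. It is not taken from mzMatch: the `Entry` class, the `match_peak` function, the default tolerances (5 ppm, 30 s) and the database values are all hypothetical and chosen for illustration only. A peak is reported as a hit when its mass error, expressed in ppm, and its retention time deviation both fall within the user-defined limits.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    """One record of a hypothetical local database."""
    name: str
    mz: float  # reference m/z
    rt: float  # reference retention time in seconds


def ppm_error(observed_mz: float, reference_mz: float) -> float:
    """Signed mass error of an observed m/z relative to a reference, in ppm."""
    return (observed_mz - reference_mz) / reference_mz * 1e6


def match_peak(peak_mz, peak_rt, database, ppm_tol=5.0, rt_tol=30.0):
    """Return names of database entries within both tolerances.

    ppm_tol and rt_tol stand in for the user-defined mass accuracy and
    retention time deviation; the defaults are arbitrary assumptions.
    """
    hits = []
    for entry in database:
        mass_ok = abs(ppm_error(peak_mz, entry.mz)) <= ppm_tol
        rt_ok = abs(peak_rt - entry.rt) <= rt_tol
        if mass_ok and rt_ok:
            hits.append(entry.name)
    return hits


# Invented example records, not real reference data.
DATABASE = [
    Entry("compound_A", 180.0634, 65.0),
    Entry("compound_B", 180.0640, 300.0),
    Entry("compound_C", 132.1019, 70.0),
]

if __name__ == "__main__":
    print(match_peak(180.0636, 70.0, DATABASE))
```

In this sketch, widening `rt_tol` admits entries that agree in mass but elute at a different time, which shows why the two tolerances are usually set together.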

MetaDB is built by integrating the metaMS R package into a web application written in Grails. It has also been designed to be integrated with the MetaboLights database. MetaDB supports both LC-MS and GC-MS datasets and offers a wide range of functionalities, including: data storage and metadata management (using the ISA-Tab format and ISACreator tool [127,128]), peak picking and annotation (via metaMS, an xcms and CAMERA add-on) and QC plots.

MStractor is designed for non-expert users to carry out non-targeted data processing on LC-MS experiments. It gathers xcms and CAMERA functions in a user-friendly pipeline, requiring minimal input and providing graphical QC outputs throughout the workflow. It also includes a manual peak curation step and the possibility of calculating descriptive statistics for each sample class.

patRoon is an interface to different MS-based open-source software for non-targeted data processing. patRoon covers different aspects of metabolomics workflows, such as: file conversion to open data formats (mzXML and mzML), feature extraction and grouping (using a number of open software and R packages: xcms, OpenMS, enviPick), extraction of MS and MS/MS data (mzR), component generation (RAMClustR, CAMERA, nontarget), formula calculation (GenForm) and compound identification through automatic annotation of MS/MS spectra (MetFrag and SIRIUS with CSI:FingerID). Other functionalities include (interactive) visualization and reporting of workflow data, comparison and combination of results from different workflow algorithms, and several data reduction and selection strategies.

Specmine provides a general framework that addresses a variety of analytical platforms, such as LC-MS, GC-MS, NMR, IR and UV-Vis. The package supports many data formats and includes the possibility of adding metadata in a tabular format. It relies on xcms for LC-MS and GC-MS data pre-processing, on hyperSpec for NMR, IR and UV-Vis data processing and on MAIT for metabolite identification. Specmine provides scripts for missing value imputation, univariate and multivariate statistics and machine learning methods. A number of case studies are available for testing purposes.

mQTL.NMR is a package specific for the systematic analysis of 1H NMR metabolomics in quantitative genetics. The package mainly focuses on NMR spectral data pre-processing (normalization, scaling and peak alignment), mQTL mapping in different model organisms, structural assignment of marker metabolites, and result visualization.

enviMass is a comprehensive workflow for the data-mining of LC-MS and GC-MS datasets, which also supports MS/MS experiments. It provides the user with a graphical interface and a flexible workflow structure covering common processing steps such as data conversion, peak picking, noise removal, mass re-calibration, data normalisation, and blank subtraction. It also offers a number of more specific and advanced functionalities, including: isotopologue and adduct grouping, homologous series detection and visualization, estimation of atom counts for nontarget components, temporal sequences, profile trend detection and processing of both data-dependent and data-independent acquisition MS/MS experiments.

RMassScreening is a workflow for batch processing of LC-HRMS datasets using a script interface, YAML-based setting configuration and visual interactive data evaluation. It provides wrappers for script-based usage of enviPick and basic enviMass components, and implements suspect screening and combinatorial prediction of possible metabolites (transformation products) from parent compounds. A GUI provides facilities to analyze the results, grouped by sample groups and experimental timepoints, by applying freely adjustable filters.
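Blank subtraction, one of the common processing steps listed for enviMass above, can be sketched generically in Python. The text does not describe how enviMass implements it, so the following is only an assumed ratio-based filter: the function name `blank_subtract`, the `min_ratio` parameter, its default of 3 and the feature intensities are all hypothetical. A feature is kept only when its sample intensity exceeds the blank intensity by the chosen factor; features absent from the blank are treated as having a blank intensity of zero.

```python
def blank_subtract(sample_intensities, blank_intensities, min_ratio=3.0):
    """Keep features whose sample intensity exceeds min_ratio times the blank.

    Both inputs map a feature identifier to an intensity. min_ratio is an
    assumed, user-adjustable threshold, not a documented enviMass setting.
    """
    kept = {}
    for feature, intensity in sample_intensities.items():
        # Features not detected in the blank get a blank intensity of 0.
        blank = blank_intensities.get(feature, 0.0)
        if intensity > min_ratio * blank:
            kept[feature] = intensity
    return kept


# Invented intensities for illustration only.
SAMPLE = {"F1": 1000.0, "F2": 250.0, "F3": 400.0}
BLANK = {"F1": 100.0, "F2": 200.0}

if __name__ == "__main__":
    print(blank_subtract(SAMPLE, BLANK))
```

Here F2 is discarded because its sample signal is barely above the blank, whereas F3 is retained because it does not occur in the blank at all.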

MetaboNexus is an interactive data analysis platform for metabolomics experiments, which provides a user-friendly R Shiny-based GUI designed to work without the need for web server connections. It allows pre-processing (using xcms and MZmine), data scaling, univariate and multivariate statistics (t-test, ANOVA, PCA, PLS-DA, Random Forest, Heatmap), putative metabolite identification (library matching of MS and MS/MS adduct with METLIN, HMDB and MassBank databases), and a number of functions for data visualization.