2.5 Statistical analysis of metabolomics data

Following the feature detection and grouping steps outlined in the sections above, different paths to statistical analysis are available in R and Bioconductor. Once the “sample versus variable” feature matrix of molecule intensities or abundances has been generated, comprehensive statistical analyses can be performed using the wide range of packages provided by the R statistical software and the Bioconductor project (see Table 6); see, for instance, the StatisticalMethod biocViews [64] and the ExperimentalDesign [65], Cluster [66], Multivariate [67] and MachineLearning [68] CRAN Task Views [69]. As mentioned in the introduction, we only cover common statistical approaches used in metabolomics. Areas such as time-series analysis, clustering methods, machine learning and visualisation of high-dimensional data are covered in various books and literature reviews [70–78].

With regard to statistical analyses in untargeted metabolomics, two strategies can be distinguished, each requiring different methods. The first strategy, “metabolite profiling”, is used in most untargeted metabolomics studies. Here, a bottom-up approach is taken in which sets or classes of pre-defined metabolites are studied, usually in different phenotypes of the same biological species, and differences in metabolites are related to coarser functional or biological levels (e.g. to phenotype or to control vs. treatment in biomedical studies) [79]. Exploratory data analysis, univariate methods, Hierarchical Cluster Analysis (HCA), Principal Component Analysis (PCA) and Multi-Dimensional Scaling (MDS)-like methods are very common in metabolite profiling approaches. Feature/variable selection, usually based on univariate methods, is performed to retain only the most significant metabolite candidates that address the research question of the study [80–83].
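To make the univariate selection step concrete, the following minimal sketch tests each feature for a difference between two groups and then corrects for multiple testing with the Benjamini-Hochberg procedure. It is a language-neutral illustration in plain Python rather than the interface of any R package discussed here. The feature names, the intensities, the choice of an exact permutation test and the 0.05 cut-off are all assumptions made for the example.

```python
from itertools import combinations
from statistics import mean


def exact_permutation_pvalue(group_a, group_b):
    """Two-sided exact permutation p-value for a difference in group means."""
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    observed = abs(mean(group_a) - mean(group_b))
    hits = total = 0
    # Enumerate every way of relabelling n_a of the pooled samples as group A.
    for idx in combinations(range(len(pooled)), n_a):
        chosen = set(idx)
        a = [pooled[i] for i in idx]
        b = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        # Small tolerance so that ties are not lost to floating-point noise.
        if abs(mean(a) - mean(b)) >= observed - 1e-12:
            hits += 1
        total += 1
    return hits / total


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical feature matrix: feature name -> (control, treatment) intensities.
features = {
    "feature_1": ([10.1, 9.8, 10.3, 10.0], [14.9, 15.2, 15.1, 14.8]),
    "feature_2": ([5.0, 5.4, 4.9, 5.2], [5.1, 5.3, 5.0, 5.2]),
    "feature_3": ([7.9, 8.1, 8.0, 8.2], [6.0, 6.1, 5.9, 6.2]),
}
names = list(features)
raw = [exact_permutation_pvalue(*features[name]) for name in names]
adjusted = dict(zip(names, benjamini_hochberg(raw)))
# Assumed significance cut-off of 0.05 on the adjusted p-values.
selected = [name for name in names if adjusted[name] < 0.05]
print(selected)
```

Exact enumeration is only practical for small group sizes. For larger studies a parametric test or a random subset of permutations would normally take its place, with the multiple-testing correction unchanged.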

The second strategy, “metabolite fingerprinting”, is commonly used in biomedicine, environmental metabolomics and eco-metabolomics to find metabolite patterns across metabolite profiles. Here, metabolites are characterised without necessarily being identified, and characterisation usually proceeds from spatiotemporally coarser scales to intrinsic scales within biological species [84]. The multivariate statistical methods used require reduction of the high-dimensional data; ordination methods are therefore commonly applied, such as (Orthogonal) Partial Least Squares regression, sometimes coupled to Discriminant Analysis ((O)PLS(-DA)), (Linear) Discriminant Analysis ((L)DA) and (Canonical) Correspondence Analysis ((C)CA). These methods make it possible to relate sets of explanatory variables containing species traits or environmental properties (such as soil type, plant height, smoker/non-smoker, gender, etc.) to the metabolite feature matrix [77,85,86]. Other machine learning methods such as Random Forests (RF), Support Vector Machines (SVM) and Neural Networks (NN or ANN) are also applicable [87]. More recently, untargeted metabolomics data have been related to other ‘omics data using network analysis or Procrustes analysis to visualise (dis)similarities between two or more ‘omics data sets [88–91].

Extracting a restricted list of features that still provides high prediction performance (i.e., a molecular signature) is critical for biomarker validation and clinical diagnostics. Several strategies have been described for feature selection [92,93], e.g. wrapper approaches such as Recursive Feature Elimination and Genetic Algorithms, or sparse models such as Lasso, Elastic Net and sparse PLS. Such techniques are implemented in R packages, which also provide detailed comparisons on real datasets in terms of the stability and size of the selected signature, the prediction performance of the final model, and the computation time [94–97].
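The way a sparse model yields a signature can be illustrated with a small Lasso fit by cyclic coordinate descent, in which the L1 penalty sets the coefficients of weakly informative features exactly to zero. This is a didactic sketch in plain Python and not the implementation used by the packages cited above. The function names, the toy design matrix, the response and the penalty value are assumptions chosen so that the result is easy to follow.

```python
def soft_threshold(value, penalty):
    """Shrink value towards zero by penalty; exactly zero inside the band."""
    if value > penalty:
        return value - penalty
    if value < -penalty:
        return value + penalty
    return 0.0


def lasso_coordinate_descent(X, y, penalty, n_sweeps=100):
    """Minimise (1/(2n)) * ||y - Xb||^2 + penalty * ||b||_1 by cyclic updates.

    X is a list of sample rows; its columns and y are assumed to be centred.
    """
    n, p = len(X), len(X[0])
    coef = [0.0] * p
    residual = list(y)  # equals y - Xb while all coefficients are zero
    for _ in range(n_sweeps):
        for j in range(p):
            column = [row[j] for row in X]
            scale = sum(v * v for v in column) / n
            if scale == 0.0:
                continue  # a constant (all-zero) column carries no information
            # Covariance of feature j with the partial residual,
            # i.e. the residual with the contribution of feature j added back.
            rho = sum(v * (r + v * coef[j]) for v, r in zip(column, residual)) / n
            updated = soft_threshold(rho, penalty) / scale
            # Keep the residual in sync with the changed coefficient.
            delta = updated - coef[j]
            residual = [r - v * delta for v, r in zip(column, residual)]
            coef[j] = updated
    return coef


# Hypothetical centred data: 4 samples, 3 mutually orthogonal features.
X = [
    [1.0, 1.0, 1.0],
    [1.0, -1.0, -1.0],
    [-1.0, 1.0, -1.0],
    [-1.0, -1.0, 1.0],
]
# Response built as 2.0 * feature 0 + 0.1 * feature 1 (feature 2 is irrelevant).
y = [2.1, 1.9, -1.9, -2.1]

coef = lasso_coordinate_descent(X, y, penalty=0.5)
signature = [j for j, b in enumerate(coef) if b != 0.0]
print(coef, signature)
```

With this penalty only the first feature is retained, and its coefficient is shrunk from 2.0 to 1.5. In practice the penalty is usually tuned by cross-validation, and the stability of the resulting signature is assessed by resampling.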

A great number of packages are available to perform statistics on metabolomics datasets. Some of them focus on specific tasks, such as sample size estimation, batch normalization, exploratory data analysis, univariate hypothesis testing, multivariate modeling and omics data integration. Others, listed in the section ‘Multi workflow steps’ in Table 9, adopt a more comprehensive approach, providing statistical toolboxes that cover different methods and functionalities.

muma is a package designed to be compatible with MS- and NMR-generated data. The package mainly focuses on statistics. It does not contain functions for data extraction, and the user has to provide values arranged in a data.frame format. Pre-processing is limited to missing value imputation, noise filtering, variable scaling and normalization. The package also provides tools for outlier detection as well as univariate and multivariate analysis. Notably, the package offers a script for Statistical TOtal Correlation SpectroscopY (STOCSY) on NMR data.
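The kind of pre-processing and correlation analysis described here can be sketched generically. The example below imputes missing values with the variable mean, applies unit-variance scaling, and then computes a one-dimensional STOCSY-style profile, i.e. the Pearson correlation of one driver variable with every variable across samples. It is written in plain Python for illustration and does not reproduce muma's functions. The variable names, the intensities and the choice of mean imputation are assumptions.

```python
from math import sqrt
from statistics import mean, stdev


def impute_with_mean(column):
    """Replace missing values (None) with the mean of the observed values."""
    observed = [v for v in column if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in column]


def autoscale(column):
    """Mean-centre a variable and scale it to unit sample standard deviation."""
    centre, spread = mean(column), stdev(column)
    return [(v - centre) / spread for v in column]


def pearson(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def stocsy_1d(variables, driver):
    """Correlate one driver variable with every variable across samples."""
    reference = variables[driver]
    return {name: pearson(reference, values) for name, values in variables.items()}


# Hypothetical binned NMR intensities: variable -> one value per sample.
raw_table = {
    "ppm_1.33": [1.0, 2.0, 3.0, 4.0, 5.0],
    "ppm_4.11": [2.1, 4.0, None, 8.0, 9.9],  # co-varies with the driver
    "ppm_3.05": [5.0, 4.0, 3.0, 2.0, 1.0],   # varies inversely with the driver
}
processed = {name: autoscale(impute_with_mean(col)) for name, col in raw_table.items()}
correlations = stocsy_1d(processed, driver="ppm_1.33")
for name, r in correlations.items():
    print(name, round(r, 3))
```

Variables that correlate strongly and positively with the driver are candidates for signals of the same molecule, whereas strong negative correlations may point to molecules that vary inversely with it.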

MOFA provides tools for the integration of data from different omics disciplines (multi-omics). Using factor analysis, it infers hidden factors that capture the biological variation between samples across multi-omics datasets, thus supporting marker discovery. MOFA also provides various tools for the visualization of results. IntLIM also supports the integration of other omics datasets with metabolomics data, using linear modeling to identify gene-metabolite pairs whose relationship differs from one phenotype to another (e.g. positive correlation in one phenotype, negative or no correlation in another). IntLIM includes a user-friendly web interface for quality control of the input data, identification of phenotype-dependent gene-metabolite pairs, and interactive visualization of results. This tool is particularly useful for integrating transcriptomic and metabolomic or other omics data, as it generates novel hypotheses in a data-driven manner.
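A phenotype-dependent gene-metabolite relationship of this kind can be formalised as a linear model with an interaction term, metabolite ~ gene + phenotype + gene:phenotype, in which a non-zero interaction coefficient indicates that the slope differs between phenotypes. The sketch below is a plain-Python illustration of that idea and should not be read as IntLIM's implementation. It assumes a binary phenotype coded 0/1, in which case the interaction coefficient equals the difference between the slopes fitted separately in each phenotype. The data and function names are hypothetical, and no significance test is included.

```python
from statistics import mean


def ols_slope(x, y):
    """Least-squares slope of y regressed on x (model with intercept)."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


def interaction_coefficient(gene, metabolite, phenotype):
    """Interaction term of metabolite ~ gene + phenotype + gene:phenotype.

    For a binary phenotype coded 0/1, the full interaction model fits each
    phenotype with its own line, so the interaction coefficient is the slope
    in phenotype 1 minus the slope in phenotype 0.
    """
    slopes = {}
    for label in (0, 1):
        g = [v for v, p in zip(gene, phenotype) if p == label]
        m = [v for v, p in zip(metabolite, phenotype) if p == label]
        slopes[label] = ols_slope(g, m)
    return slopes[1] - slopes[0], slopes


# Hypothetical expression and abundance values for one gene-metabolite pair.
gene = [1.0, 2.0, 3.0, 4.0, 1.0, 2.0, 3.0, 4.0]
metabolite = [2.0, 4.1, 5.9, 8.0, 8.1, 5.9, 4.0, 2.0]
phenotype = [0, 0, 0, 0, 1, 1, 1, 1]

beta_interaction, slopes = interaction_coefficient(gene, metabolite, phenotype)
print(slopes, beta_interaction)
```

In this toy pair the association is positive in one phenotype and negative in the other, which gives a large interaction coefficient. A real analysis would test this coefficient for every gene-metabolite pair and correct the resulting p-values for multiple testing.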

MetaboDiff is presented as an entry-level, user-friendly package for differential metabolomics analysis. The information contained in the input data (metabolomics measurements and metadata) is stored in S4 objects, which are used for downstream processing. Pre-processing consists of missing value imputation, outlier removal and data normalization, while the data analysis part offers a variety of statistical methods, including tools to explore how metabolites relate to each other in sub-pathways.

MetaboAnalystR is a toolbox built on several R packages and contains more than 500 functions organised in eleven modules. The package was created to overcome limitations of the web application of the same name, notably by enabling flexible customized workflows (including xcms interoperability) and the handling of large data sets. MetaboAnalystR covers a wide range of functionality: exploratory statistical analysis, biomarker analysis, power analysis, biomarker meta-analysis, functional enrichment analysis, and pathway and joint pathway analysis. Through an implementation of the mummichog algorithm [98], MetaboAnalystR can also infer pathways from user-generated m/z peak lists. Using the MetaboAnalyst knowledgebase, MetaboAnalystR provides access to metabolite set libraries, compound libraries and pathway libraries.