2.7 Network analysis and biochemical pathways
The R environment offers packages to analyse networks of metabolomics data and metabolic pathways (see Table 8). Within this section, ‘pathway’ refers to a linked series of chemical reactions between molecules, mediated by enzymes, that leads to a product or a change in a cell. These molecules are also known as metabolites, and the transformations occur in the same cellular compartment or in close vicinity. The term ‘network’ refers to the set of metabolites that are connected biologically, chemically or structurally (e.g. by similarity between MS/MS spectra of two metabolites), functionally or by any other measure (e.g. statistical correlation).
2.7.1 Network infrastructure and analysis
The R environment offers general infrastructure for network analysis. Functionality is implemented in a plethora of software packages, amongst others igraph, tidygraph or the statnet suite. These packages offer functions to generate networks from respective data input (e.g. adjacency matrices), to analyse networks, to calculate network properties and to visualize networks. Generally, any kind of metabolomics data that can be converted to an interpretable format for one of these packages can be analyzed by generic network analysis tools. For example, MSnbase offers functionality to calculate similarity scores between MS/MS spectral data that can be readily interpreted as a spectral similarity network (see [113] for the pioneering work of mass spectral molecular networking for biological systems). Such networks can be analysed by the functions provided by the above-mentioned packages or by packages tailored more towards the analysis of biological data (e.g. RedeR). Of specific interest for metabolomics applications is DiffCorr, an R package that compares correlation networks from two different experimental conditions and builds on an association measure such as Pearson’s correlation coefficient to identify distinctive properties. DiffCorr enables testing of differential correlation in high-dimensional data sets by identifying the first principal component-based ‘eigen-molecules’ in the correlation networks. DiffCorr then tests these differential correlation values based on Fisher’s z-transformation to identify discriminating metabolite pairs that respond differently to the conditions. Another R package, more tailored towards the analysis of metabolomics data, is BioNetStat, which creates correlation-based networks from metabolite concentration data and analyses the networks based on graph spectra (the set of eigenvalues of an adjacency matrix), spectral entropy, degree distribution and node centralities. BioNetStat also allows for KEGG pathway visualization of metabolite data.
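The test of differential correlation based on Fisher’s z-transformation mentioned above can be illustrated with a minimal sketch. It is written in Python using only the standard library and is not DiffCorr’s implementation; the function names and the intensity values are hypothetical and chosen for illustration only.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def differential_correlation(r1, n1, r2, n2):
    """Compare two correlation coefficients via Fisher's z-transformation.

    Returns the z statistic and a two-sided p-value under the normal
    approximation; n1 and n2 are the sample sizes (each must exceed 3).
    """
    z1 = math.atanh(r1)  # Fisher's z-transformation
    z2 = math.atanh(r2)
    standard_error = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    z = (z1 - z2) / standard_error
    p = math.erfc(abs(z) / math.sqrt(2.0))  # equals 2 * (1 - Phi(|z|))
    return z, p

# Hypothetical intensities of one metabolite pair under two conditions.
cond_a_m1 = [1, 2, 3, 4, 5, 6, 7, 8]
cond_a_m2 = [1.1, 2.0, 2.9, 4.2, 5.1, 5.8, 7.2, 8.1]
cond_b_m1 = [1, 2, 3, 4, 5, 6, 7, 8]
cond_b_m2 = [8.0, 7.1, 6.2, 4.9, 4.1, 3.0, 2.2, 0.9]

r_a = pearson(cond_a_m1, cond_a_m2)
r_b = pearson(cond_b_m1, cond_b_m2)
z_stat, p_value = differential_correlation(r_a, len(cond_a_m1), r_b, len(cond_b_m1))
```

A pair whose correlation changes sign between the two conditions yields a large absolute z statistic and a small p-value; in a real analysis the p-values of all metabolite pairs would additionally require correction for multiple testing.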
2.7.2 Metabolite annotation
As mentioned above in section 2.2, a major challenge in metabolomics is metabolite annotation, spanning the annotation of known compounds (dereplication) and the annotation of unknown metabolites by proposing hypotheses of their structures. Network and pathway analysis can be employed to putatively annotate metabolites in metabolomics data sets. The Bioconductor package MetNet aims at facilitating detection and putative annotation of unknown MS1 features in untargeted metabolomic studies. MetNet infers metabolic networks from an ensemble of statistical associations between intensity values across samples and from structural information (matching of mass differences between features to a list of enzymatic transformations, retention time adjustment), and thereby guides the annotation of, in particular, specialized metabolites from plant, fungal or bacterial samples. Another package to improve annotation is xMSAnnotator, which incorporates a multi-criteria scoring algorithm to assign mass features to different confidence levels. xMSAnnotator uses coelution, correlation and pathway-level correlations together with KEGG [114–116], HMDB, the Toxin and Toxin Target Database (T3DB) [117,118], LipidMaps [119] and ChemSpider [106] for annotation and incorporates several filter steps, e.g. defining modules of co-expressed m/z features using WGCNA and a topological overlap-based dissimilarity matrix, thereby grouping related metabolites into the same network modules.
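The idea of matching mass differences between features to a list of enzymatic transformations can be sketched as follows. This is a simplified illustration in Python, not MetNet’s code: the transformation list, the tolerance and the m/z values are assumptions made for this example, and the statistical associations and retention time information described above are left out.

```python
# Hypothetical transformation list: name -> monoisotopic mass difference (Da).
TRANSFORMATIONS = {
    "hydroxylation (+O)": 15.9949,
    "methylation (+CH2)": 14.0157,
    "glucosylation (+C6H10O5)": 162.0528,
}

def match_mass_differences(mz_values, transformations, tol=0.005):
    """Return edges (i, j, name) for feature pairs whose m/z difference
    matches a known transformation within an absolute tolerance (Da)."""
    edges = []
    for i in range(len(mz_values)):
        for j in range(i + 1, len(mz_values)):
            diff = abs(mz_values[j] - mz_values[i])
            for name, delta in transformations.items():
                if abs(diff - delta) <= tol:
                    edges.append((i, j, name))
    return edges

# Illustrative m/z values of three MS1 features (same adduct assumed).
features = [287.0550, 303.0499, 465.1028]
edges = match_mass_differences(features, TRANSFORMATIONS)
```

Each edge is a candidate biochemical relation between two features; combining such structural edges with statistical associations, as described above, reduces the number of chance matches.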
Molecular networking starting from MS/MS data can enhance the annotation of metabolites. MetDNA, implemented in R, JavaScript and Python (available via a web interface on http://metdna.zhulab.cn), combines MS1 and MS/MS data to putatively annotate features in metabolomics data sets [120]. MetDNA uses a metabolic reaction network-based recursive algorithm for metabolite annotation that employs spectral matching of MS/MS spectra in an automatic fashion. A reaction pair consists of a substrate metabolite and its product metabolite, which display similar chemical structures. The iterated application of similarity matching between reaction pairs allows the annotation to be expanded from seed metabolites or previously annotated metabolites.
MetCirc, designed for the annotation of MS/MS features in untargeted metabolomics data, visualizes the spectral similarity matrix (e.g. the normalized dot product) between MS/MS spectra in a Circos-like interactive shiny application. Within the shiny application, similarity scores can be thresholded, and MS/MS spectra can be interactively explored and annotated based on expert knowledge given the similarity score and displayed spectral features. MetCirc relies on the MSnbase framework to store MS/MS spectral data and to calculate similarities between spectra. Similarly, CluMSID employs spectral similarity matching to guide annotation of MS/MS spectra and incorporates functionality to calculate correlation networks and to perform hierarchical and density-based clustering. compMS2Miner is another R package for MS/MS feature annotation and offers functionality for noise filtering, MS/MS substructure annotation, calculation of correlation- and spectral similarity-based networks and interactive visualization.
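A spectral similarity score such as the normalized dot product can be sketched in a few lines. The following Python example is an assumption-laden simplification: it uses fixed-width m/z bins and an unweighted cosine score, whereas the packages above may use different binning and weighting of m/z and intensity. The function names and peak lists are hypothetical.

```python
import math
from collections import defaultdict

def bin_spectrum(peaks, bin_width=1.0):
    """Sum intensities of (m/z, intensity) peaks falling into the same bin."""
    binned = defaultdict(float)
    for mz, intensity in peaks:
        binned[int(mz // bin_width)] += intensity
    return binned

def normalized_dot_product(spec_a, spec_b, bin_width=1.0):
    """Cosine similarity between two binned MS/MS spectra (0 to 1)."""
    a = bin_spectrum(spec_a, bin_width)
    b = bin_spectrum(spec_b, bin_width)
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values()) * sum(v * v for v in b.values()))
    return dot / norm if norm > 0 else 0.0

# Hypothetical MS/MS peak lists: (m/z, intensity).
spectrum_1 = [(100.1, 10.0), (150.2, 5.0)]
spectrum_2 = [(100.3, 10.0), (200.0, 5.0)]
spectrum_3 = [(300.0, 7.0)]

score_12 = normalized_dot_product(spectrum_1, spectrum_2)
score_13 = normalized_dot_product(spectrum_1, spectrum_3)
```

Computing the score for all pairs of spectra gives the similarity matrix, and thresholding that matrix gives the edges of a spectral similarity network.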
2.7.3 Generation of metabolic networks
Several R packages implement the functionality to generate metabolic networks. These networks can afterwards be analysed by their topological properties, be used to identify motifs that differ between experimental conditions or be queried to find associations between metabolic features. MetaMapR generates metabolic networks by integrating enzymatic transformation, structural similarity between metabolites, mass spectral similarity and empirical correlation information. To this end, MetaMapR queries biochemical reactions in KEGG and molecular fingerprints for structural similarities in PubChem. Furthermore, MetaMapR aims at incorporating metabolites with unknown biochemistry and unknown structures, and integrates other data sources (genomic, proteomic, clinical data). The package Metabox offers a pipeline for metabolomics data analysis, including functionality for data-driven network construction using correlation and for estimation of chemical structure similarity networks using substructure fingerprints. Its statistical analysis highlights metabolites that are altered between the groups of the experimental design, which can be further interrogated by network and pathway analysis tools. Furthermore, the package MetabNet includes functionality to perform targeted metabolome-wide association studies (MWAS) and to guide the association of unknowns to a specific metabolic pathway, followed by mapping a target metabolite to the metabolic network structure.
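A structural similarity network based on substructure fingerprints, as mentioned above, can be sketched with the Tanimoto coefficient. The example is a minimal Python illustration under stated assumptions: fingerprints are represented as sets of ‘on’ bit positions, and the metabolite labels, bit sets and threshold are hypothetical rather than taken from PubChem or from any of the packages.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bits."""
    union = len(fp_a | fp_b)
    return len(fp_a & fp_b) / union if union else 0.0

def similarity_network(fingerprints, threshold=0.7):
    """Return {(name_a, name_b): similarity} for pairs at or above threshold."""
    names = sorted(fingerprints)
    edges = {}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            score = tanimoto(fingerprints[a], fingerprints[b])
            if score >= threshold:
                edges[(a, b)] = score
    return edges

# Hypothetical substructure fingerprints (positions of set bits).
fingerprints = {
    "M1": {1, 2, 3, 4},
    "M2": {1, 2, 3, 5},
    "M3": {7, 8, 9},
}
network = similarity_network(fingerprints, threshold=0.5)
```

The resulting edge list can be merged with edges derived from correlation, mass spectral similarity or known biochemical reactions to obtain an integrated network.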
2.7.4 Pathway analysis
Several R packages enable pathway analysis that uses quantitative data of metabolites and maps these to biological pathways. The Bioconductor package pwOmics analyses proteomics, transcriptomics and other -omics data in combination to highlight molecular mechanisms for single-point and time-series experiments. In downstream analyses, pwOmics allows for pathway, transcription factor and target gene identification.
Another commonly performed analysis is enrichment analysis, which identifies pathways that are up- or downregulated under a given experimental condition. The R environment offers a whole range of enrichment analysis packages (e.g. tmod for metabolite data). Targeted more towards pathway analysis, FELLA is a Bioconductor package for enrichment analysis. FELLA detects discriminative metabolic features, maps these to known biological pathways of the KEGG database and detects enriched terms by a diffusion algorithm. CePa offers enrichment analysis tools that extend conventional gene set enrichment methods by incorporating pathway topologies. CePa takes nodes rather than terms for analysis, uses network centralities as node weights and incorporates pathways from the Pathway Interaction Database (PID, [121]), including NCI/Nature Pathway Interaction Database, BioCarta [122], Reactome [123] and KEGG [114–116].
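For orientation, the conventional over-representation test that topology- and diffusion-based methods extend can be sketched with the hypergeometric distribution. This Python example is a generic illustration and not the algorithm of FELLA or CePa; the function name and the counts are assumptions made for this sketch.

```python
from math import comb

def overrepresentation_pvalue(k, K, n, N):
    """Upper-tail hypergeometric p-value P(X >= k).

    N: number of metabolites in the background,
    K: number of background metabolites annotated to the pathway,
    n: number of significant metabolites,
    k: number of significant metabolites annotated to the pathway.
    """
    upper = min(K, n)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / comb(N, n)

# Hypothetical counts: 5 of 10 significant metabolites fall into a pathway
# that contains 10 of the 100 background metabolites.
p_pathway = overrepresentation_pvalue(k=5, K=10, n=10, N=100)
```

This test treats all pathway members as equal and independent, which is the simplification that topology-aware approaches address by weighting nodes according to their position in the pathway.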
MetaboDiff offers functionality to pinpoint metabolome-wide differences using PCA and t-distributed stochastic neighbor embedding (tSNE), building on the MultiAssayExperiment S4 class. Using t-test or ANOVA, MetaboDiff identifies metabolites that differ in their abundance between groups and, by using WGCNA, identifies modules/sub-pathways that indicate changes in biological pathways. SDAMS (Semi-parametric differential abundance analysis method for proteomics and metabolomics data from mass spectrometry), building upon the SummarizedExperiment S4 class, performs differential abundance analysis on metabolomics data by linking metabolite levels, which may contain zero and possibly non-normally distributed non-zero intensity values, to phenotypic data.
Many R packages guide the discovery of biomarkers for specific phenotypes. Among these is lilikoi, which maps features to pathways by using standardized HMDB IDs and transforms metabolomic profiles to pathway-based profiles using pathway deregulation scores, a measure of how much a sample deviates from a normal level, followed by feature selection, classification and prediction. INDEED (INtegrated DiffErential Expression and Differential network analysis) aims to detect biomarkers by performing a differential expression analysis, which is combined with a differential network analysis based on partial correlation and followed by a network topology analysis. Subsequently, activity scores are calculated based on the differences detected in the differential expression analysis and on the topology of the differential network; these scores guide the selection of biomarkers. Another R package for biomarker and feature selection is MoDentify, which finds modules that are regulated with respect to a given phenotype, i.e. groups of correlating molecules that can span from a few metabolites to entire pathways. These groups are possibly functionally coordinated, coregulated or driven by a similar or the same biological process. The modules are identified by score maximization using a multivariable linear regression model with the candidate module as dependent variable and the phenotype and optional covariates as independent variables. Furthermore, MoDentify implements Gaussian graphical models, in which nodes reflect metabolites or entire pathways depending on the resolution.
PAPi (Pathway activity profiling) assigns pathway activity scores to samples to represent the potential pathway activity and statistically detects affected pathways by applying t-test or ANOVA. PAPi uses KEGG pathway identifiers. pathwayPCA, with gene selection in mind, offers multi-omics data analysis by estimating sample-specific pathway activities, e.g. taken from the rWikiPathways interface. pathwayPCA takes continuous, binary or survival outcomes as input and estimates contributions of individual genes towards pathway significance.
R offers packages to analyze metabolic systems and to estimate biochemical reaction rates in metabolic networks using flux balance analysis, e.g. BiGGR, abcdeFBA, sybil, and fbar. For example, BiGGR interfaces with the BiGG database, which contains reconstructions of metabolic networks. After importing pathways from the database, flux balance analysis and downstream routines can be performed, e.g. linear optimization routines or likelihood-based ensembles of calculated flux distributions fitting experimental data.
The package MetaboLouise simulates longitudinal metabolomics data. The simulation builds on a mathematical representation that is parameterized according to underlying biological networks, i.e. by defining metabolites and the relations between them and by initializing enzyme rates. Optionally, the package implements functionality to vary the rates depending on the network state, to add external fluxes and to analyze results based on different parameters.
2.7.5 Pathway resources and interfaces
A plethora of pathway resources exist, aptly aggregated by Pathguide.org. A number of these resources can be accessed by R packages, which were partly reviewed in [124]: rBiopaxParser, graphite, NCIgraph, pathview, KEGGgraph, SBMLR, rsbml, gaggle, and PSICQUIC. Of these, graphite stores pathway information for proteins and metabolites of currently fourteen species (version 1.28.0). Available databases are KEGG, Biocarta, Reactome, NCI/Nature Pathway Interaction Database, HumanCyc, Panther, SMPDB and PharmGKB. In addition, graphite offers topological and statistical pathway analysis tools for metabolomics data through interfaces to the Bioconductor packages SPIA and clipper and supports building custom pathways. Furthermore, RPathVisio enables creating and editing biological pathways, visualising data on pathways and performing statistics on pathway data, and provides an interface to WikiPathways. KEGGREST provides a client interface to the KEGG REST API. The package provides utilities to search keywords, convert identifiers and link across databases. The package can also return amino acid sequences as AAStringSet or nucleotide sequences as DNAStringSet objects (from the Biostrings [125] package). Another package, paxtoolsr, provides literature-curated pathways in the Biological Pathway Exchange (BioPAX) format through an interface to the Pathway Commons database (including data from the NCI Pathway Interaction Database (PID), PantherDB, HumanCyc, Reactome, PhosphoSitePlus and HPRD). rWikiPathways is an interface between R and WikiPathways.org. Pathways can be queried, interrogated and downloaded to the R session. Furthermore, rWikiPathways associates metabolite information with pathways when the system code of a chemical database (e.g. for HMDB, ChEBI, or ChemSpider) is provided.
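To illustrate what a client for the KEGG REST API does, the following Python sketch builds request URLs of the form operation/argument and parses the tab-delimited text that the service returns. It is not KEGGREST; the helper names are hypothetical, the identifiers in the comments are illustrative only, and the network call is wrapped in a function that is not executed here.

```python
from urllib.parse import quote
from urllib.request import urlopen

KEGG_BASE = "https://rest.kegg.jp"  # public KEGG REST endpoint

def kegg_url(operation, *args):
    """Build a KEGG REST URL, e.g. operation 'find', 'conv' or 'link'."""
    parts = [quote(str(arg), safe=":+") for arg in args]
    return "/".join([KEGG_BASE, operation] + parts)

def parse_tab_delimited(text):
    """Split a tab-delimited KEGG response into one tuple per line."""
    return [tuple(line.split("\t")) for line in text.splitlines() if line.strip()]

def kegg_query(operation, *args):
    """Send the request and parse the response (requires network access)."""
    with urlopen(kegg_url(operation, *args)) as response:
        return parse_tab_delimited(response.read().decode("utf-8"))

# Link a compound identifier to pathways (URL only, no request is sent here).
url = kegg_url("link", "pathway", "cpd:C00031")

# Parsing a response body of the expected shape (illustrative content).
example_body = "cpd:C00031\tpath:map00010\ncpd:C00031\tpath:map00030\n"
links = parse_tab_delimited(example_body)
```

A dedicated client package additionally handles operations such as identifier conversion and the conversion of returned sequences into appropriate objects, which this sketch omits.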
RaMP provides a relational database of metabolomics pathways and integrates pathway, gene, and metabolite annotations from KEGG, HMDB, Reactome, and WikiPathways. The database is downloadable as a standalone MySQL dump for integration with other software and is also accessible through an R package. The package includes a shiny [126] web interface that supports four basic queries: 1) retrieve analytes (genes or metabolites) given a pathway name; 2) retrieve pathways for one or more analytes; 3) retrieve analytes involved in the same reaction; 4) retrieve ontologies (cellular location, biofluid locations, etc.) from metabolites. The web interface also supports pathway overrepresentation analysis on genes, metabolites, or genes and metabolites combined (query 3) and includes clustering of significantly enriched pathways according to the percentage of overlapping analytes between pathways. Further, the web interface provides network visualization of gene-metabolite relationships (query 4).