2.6 Handling of molecule structures and chemical structure databases

Several packages deal with cheminformatics tasks such as property calculation, metabolite lookup in (web) databases, mapping between databases and structure format conversion (see Table 7).

A well-established package is rcdk, which provides a comprehensive subset of functions from the Chemistry Development Kit [99]. rcdk provides a computer-readable representation of molecular structures and a wealth of functions to import structures from different molecule structure description formats, manipulate structures, visualize structures and calculate properties and molecular fingerprints. The package fingerprint can then be used to compare fingerprints. rinchi provides reading and writing of InChI and InChIKeys [100]. ChemmineR is an alternative to rcdk that provides many similar functions, with additional tools for fingerprints, clustering and other tasks made available by querying the ChemMine Tools web service [101]. ChemmineR also parses SDF files significantly faster, which can be an advantage when reading large databases. A large number of additional descriptors are available in the package camb, which focuses on quantitative predictive models. ChemmineOB provides conversion between a large number of chemical structure formats using OpenBabel [102]. A notable exception is InChI/InChIKey, which neither ChemmineOB nor ChemmineR supports directly; offline import from InChI into ChemmineR or ChemmineOB therefore has to go through rinchi and rcdk. RChemMass is a package that combines the functionality of rcdk with that of RMassBank and enviPat. The package RRDKit makes (part of) the functionality of the RDKit [103] toolkit available from within R.
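To make the idea of comparing fingerprints concrete, the following minimal sketch computes the Tanimoto coefficient, one commonly used similarity measure for binary fingerprints. It is a language-neutral illustration in Python and not the API of the fingerprint or rcdk packages; the function name `tanimoto` and the bit positions are hypothetical and were not computed from real structures.

```python
def tanimoto(fp_a: set, fp_b: set) -> float:
    """Tanimoto similarity of two fingerprints given as sets of on-bit positions."""
    if not fp_a and not fp_b:
        # Convention chosen for this sketch: two empty fingerprints count as identical.
        return 1.0
    shared = len(fp_a & fp_b)  # bits set in both fingerprints
    return shared / (len(fp_a) + len(fp_b) - shared)  # shared bits / bits set in either


# Hypothetical on-bit positions (illustrative only, not derived from real molecules).
fp_query = {1, 5, 9, 12, 40}
fp_similar = {1, 5, 9, 12, 41}
fp_different = {2, 3, 77}

sim_close = tanimoto(fp_query, fp_similar)    # 4 shared bits out of 6 distinct bits
sim_far = tanimoto(fp_query, fp_different)    # no shared bits

print(f"similar pair:   {sim_close:.3f}")
print(f"different pair: {sim_far:.3f}")
```

Representing a fingerprint as a set of on-bit positions keeps the sketch short; real fingerprint objects are usually fixed-length bit vectors, for which the same ratio applies.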

A number of existing compound databases are useful for metabolomics. These can supply metadata such as common names and synonyms, database identifiers and experimental or predicted properties. The Rpubchem package provides lookup of information available in PubChem [104,105], while the webchem package provides queries to a large number of databases including PubChem, ChemSpider [106], Wikidata [107], Chemical Translation Service [108], PHYSPROP [109], Chemical Identifier Resolver [110] and others. BridgeDbR can be used to map identifiers (of metabolites, but also of genes, proteins and interactions) between databases, e.g. PubChem to ChemSpider identifiers. RMassBank and RChemMass also provide some useful web-retrieval functions. rgoslin [111] normalizes lipid shorthand nomenclature names from different dialects and allows mapping of those names to different levels of the structural hierarchy, e.g. from subspecies to species level.
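As an illustration of what mapping a lipid name from subspecies to species level involves, the sketch below sums the carbon atoms and double bonds of the individual fatty acyl chains. It is a deliberately simplified, language-neutral example in Python and not rgoslin's parser: the function name `to_species_level` is hypothetical, and only plain `carbons:double_bonds` chain notation is handled (no hydroxylation or other modifications).

```python
import re


def to_species_level(name: str) -> str:
    """Collapse a chain-resolved lipid name, e.g. 'PC 16:0/18:1', to species level."""
    # Assumption of this sketch: the head group and the chains are separated by one space.
    head_group, chains = name.split(" ", 1)
    # Each chain is written as <carbon atoms>:<double bonds>.
    pairs = re.findall(r"(\d+):(\d+)", chains)
    if not pairs:
        raise ValueError(f"no fatty acyl chains found in {name!r}")
    carbons = sum(int(c) for c, _ in pairs)
    double_bonds = sum(int(d) for _, d in pairs)
    return f"{head_group} {carbons}:{double_bonds}"


print(to_species_level("PC 16:0/18:1"))
print(to_species_level("TG 16:0_18:1_18:2"))
```

Dialect handling, validation of lipid classes and the treatment of modified chains are exactly the parts that a dedicated nomenclature parser adds on top of this arithmetic.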

Analysing identified compounds at the level of substance classes can give biochemical insights that are not obvious from the individual structures, or that remain accessible when the structures are not fully elucidated. The web tool ClassyFire annotates a given structure with compound classes from its ChemOnt taxonomy as well as with different substituents [112]. The classyfireR package supports InChIKey-based retrieval of substance classes through the RESTful API of the ClassyFire tool.
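Because such lookups are keyed on InChIKeys, it can be useful to check that an identifier is syntactically well formed before sending it to a web service. The following sketch, again in Python and independent of classyfireR or rinchi, tests the standard layout of 14, 10 and 1 uppercase letters separated by hyphens. The helper names are hypothetical, and the check is purely syntactic: it does not verify that the key corresponds to an existing structure.

```python
import re

# Layout of an InChIKey: 14 letters, hyphen, 10 letters, hyphen, 1 letter (27 characters).
INCHIKEY_PATTERN = re.compile(r"[A-Z]{14}-[A-Z]{10}-[A-Z]")


def is_wellformed_inchikey(key: str) -> bool:
    """Return True if the string has the layout of an InChIKey (syntax check only)."""
    return INCHIKEY_PATTERN.fullmatch(key) is not None


def connectivity_block(key: str) -> str:
    """Return the first 14-letter block, which encodes the molecular skeleton."""
    if not is_wellformed_inchikey(key):
        raise ValueError(f"not a well-formed InChIKey: {key!r}")
    return key.split("-")[0]


ethanol_key = "LFQSCWFLJHTTHZ-UHFFFAOYSA-N"  # InChIKey of ethanol
print(is_wellformed_inchikey(ethanol_key))
print(connectivity_block(ethanol_key))
print(is_wellformed_inchikey("not-a-key"))
```

Comparing only the first block is one way to group compounds that share a skeleton but differ in stereochemistry or isotopic labelling.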