2.9 User interfaces and workflow management systems
Visualisation is an important part of data analysis. Traditionally, graphics in R have focussed on creating static plots, whereas exploratory studies generally require interactive visualisation to fully investigate the data. User interactions can range from simply zooming in on chromatographic or spectroscopic data to temporarily excluding data from a complex plot for clarity. Several R packages are available for making interactive plots, e.g. the Plotly library [129], which creates interactive graphics from the static plots generated by the popular plotting framework ggplot2 [130]. The use of interactive plots in R is growing, helped by an increasing number of available code examples.
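As a minimal sketch of the ggplot2-to-Plotly route described above, the following R code builds a static plot and converts it with `ggplotly()`. The chromatogram is simulated purely for illustration, and the variable names are assumptions rather than part of any metabolomics package.

```r
library(ggplot2)
library(plotly)

# Simulated chromatographic peak (illustrative data, not a real measurement)
chromatogram <- data.frame(retention_time = seq(0, 60, by = 0.5))
chromatogram$intensity <-
  1e5 * exp(-(chromatogram$retention_time - 30)^2 / (2 * 2^2))

# Ordinary static ggplot2 figure
static_plot <- ggplot(chromatogram, aes(x = retention_time, y = intensity)) +
  geom_line() +
  labs(x = "Retention time (s)", y = "Intensity")

# ggplotly() converts the ggplot object into an interactive htmlwidget
# that supports zooming, panning and hover tooltips
interactive_plot <- ggplotly(static_plot)
```

Printing `interactive_plot` in an interactive R session or an R Markdown document renders the widget; the static plot object remains usable for print-quality figures.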
Another way in which interactive plots, and even full GUI tools, are being introduced into R is the shiny framework, which creates web apps that use the full power of R packages as the backend. Many such tools for metabolomics data analysis are becoming available, which considerably flattens the learning curve for the typical metabolomics scientist without a computational background. A current gap in the shiny metabolomics landscape is the lack of powerful and re-usable widget collections, e.g. for spectra viewers, molecular structures or metabolic networks.
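To illustrate how little code a simple shiny app needs, the sketch below defines a toy spectrum viewer with an m/z range slider. The spectrum values and all input and output identifiers (`mz_range`, `spectrum_plot`) are hypothetical choices for this example, not taken from an existing tool.

```r
library(shiny)

# Simulated spectrum (illustrative m/z and intensity values)
spectrum <- data.frame(
  mz = c(85.03, 127.04, 145.05, 163.06, 181.07),
  intensity = c(12, 35, 100, 48, 20)
)

# User interface: a range slider and a plot area
ui <- fluidPage(
  titlePanel("Spectrum viewer (sketch)"),
  sliderInput("mz_range", "m/z range",
              min = 50, max = 200, value = c(50, 200)),
  plotOutput("spectrum_plot")
)

# Server logic: redraw the stick spectrum whenever the slider changes
server <- function(input, output, session) {
  output$spectrum_plot <- renderPlot({
    visible <- spectrum[spectrum$mz >= input$mz_range[1] &
                          spectrum$mz <= input$mz_range[2], ]
    plot(visible$mz, visible$intensity, type = "h",
         xlim = input$mz_range,
         ylim = c(0, max(spectrum$intensity)),
         xlab = "m/z", ylab = "Relative intensity")
  })
}

app <- shinyApp(ui = ui, server = server)
# Start the app from an interactive session with: runApp(app)
```

The same `ui`/`server` pair can be run locally or deployed on a server without changes, which is what makes the framework attractive for sharing analysis tools.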
Developers and users can create, share and use data analyses in R in several ways, each with different strengths and weaknesses. Table 10 summarizes several ways to create and run a data analysis, with some interpretation and comparative comments. Note that in some cases it is difficult to quantify “implementation simplicity”, e.g. in the case of shiny apps, which can range from rather straightforward to highly complex.
Table 10: Categorization of approaches to creating and sharing R code and data analysis functionality. Symbols indicate strengths (+, ++), weaknesses (-, --) or a neutral assessment (o).
Framework | Implementation simplicity (low to high) | User-friendliness (low to high) | Interactivity | Examples |
---|---|---|---|---|
R script | ++ | -- | - | write.mzTab |
R Markdown vignette | o | o | -- | xcms, patRoon |
Jupyter Notebook | o | + | + | MSEAp |
LearnR (CRAN) | - | ++ | + | LearnR Examples |
shiny app | -- | ++ | ++ | MetFamily and apps in e.g. RaMP-DB, IntLIM |
All of these environments can be run locally, or installed on a (local or cloud-based) server. Recently, several initiatives have started to provide publicly available computing resources. One example is the previously mentioned rdrr.io, which lets users paste R code into an online console for execution. The console can also be embedded into individual websites. The same project also hosts rnotebook.io, which allows users to create and run R notebooks. The shinyapps.io platform operated by RStudio Inc has free and paid options to host shiny apps. The binder project (involving members from large academic institutions and companies such as UC Berkeley, Cal Poly San Luis Obispo, Wild Tree Tech Switzerland, Netflix and Simula Research Lab) is an infrastructure to create and use shareable, interactive and reproducible data analysis (not only) with R [131]: it takes any GitHub repository, turns it into a Docker image and launches it on a cloud service. The package holepunch [132] simplifies preparing an R project for launching on binder. A public instance is the mybinder.org service, which provides (limited) resources to execute R-based scripts in a hosted RStudio, a Jupyter notebook or applications written with e.g. shiny. The binder infrastructure code is available on GitHub, so universities and research groups can offer the service to their own users, lifting the resource limitations of the public instance.
In some cases an R package can provide bindings to existing tools and libraries written in other languages (see Table 11). This is the case, for example, for the packages rcdk and MetFragR, which use the rJava bindings, and for mzR, which is a wrapper around the ProteoWizard C++ library built with the Rcpp package. The fairly new reticulate package provides the corresponding infrastructure to execute Python code from R.
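The following sketch shows the kind of cross-language call that reticulate enables. It assumes that a Python installation is available to reticulate; the Python function `mean_mz` is a hypothetical helper defined only for this illustration and is not part of any package.

```r
library(reticulate)

# Define a small Python function in the embedded interpreter.
# `mean_mz` is a hypothetical helper used only for illustration.
py_run_string("
def mean_mz(values):
    return sum(values) / len(values)
")

# Objects in Python's main module are reachable from R through `py`.
# The R numeric vector is converted to a Python list on the way in,
# and the Python float is converted back to an R numeric on the way out.
mz_values <- c(100.1, 100.3, 100.2)
result <- py$mean_mz(mz_values)
```

In a package, the same mechanism would typically be hidden behind an ordinary R function, so that users never see the Python layer.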
Several workflow systems support workflow nodes and tools that can wrap and execute R code, and in turn build on the huge number of R packages available (not only) for metabolomics. In this way, systems like KNIME [133,134] and Galaxy [135,136] provide a graphical user interface and visual programming on top of the wrapped R functionality, and can combine it with tools developed in other programming frameworks.
Galaxy is a web-based environment for omics data analysis [137]. The Workflow4metabolomics.org (W4M) online Galaxy infrastructure dedicated to metabolomics [136] includes wrappers of xcms, CAMERA, metaMS, proFIA, ropls and biosigner, and is open to new contributions. W4M is supported by two national infrastructures: the French Institute of Bioinformatics (www.france-bioinformatique.fr) and the Infrastructure for Metabolomics and Fluxomics (www.metabohub.fr) [138]. Wrapping R code into a Galaxy module is quite straightforward: examples can be found on the toolshed central repository (toolshed.g2.bx.psu.edu) and in the RGalaxy Bioconductor package. An additional benefit is that workflow developers need to ensure seamless data flow through the workflow steps, and often contribute the glue code that bridges objects and data structures which are not always directly compatible across different packages and software, thus also improving interoperability beyond the use in workflow systems.
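Workflow systems such as Galaxy typically call a wrapped tool as a command-line program with input and output file paths. As a rough sketch of the R side of such a wrapper, the script below reads a tab-separated intensity table, log-transforms its numeric columns and writes the result. The function name `log_transform_table`, the script name and the file layout are assumptions for this example; the tool definition that Galaxy itself requires is not shown.

```r
# Hypothetical tool script, intended to be called by a workflow system as:
#   Rscript log_transform.R input.tsv output.tsv

log_transform_table <- function(input_path, output_path) {
  # Read a tab-separated table; keep column names (e.g. sample names) as-is
  intensities <- read.delim(input_path, check.names = FALSE)

  # Apply log2(x + 1) to numeric columns only; annotation columns are untouched
  numeric_cols <- vapply(intensities, is.numeric, logical(1))
  intensities[numeric_cols] <- lapply(
    intensities[numeric_cols],
    function(column) log2(column + 1)
  )

  write.table(intensities, output_path,
              sep = "\t", quote = FALSE, row.names = FALSE)
  invisible(intensities)
}

# Command-line entry point: run only when exactly two paths are supplied
args <- commandArgs(trailingOnly = TRUE)
if (length(args) == 2) {
  log_transform_table(args[1], args[2])
}
```

Keeping the logic in a plain function with file paths as arguments makes the same code usable interactively, in tests and inside a workflow node.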
Workflows and input/output data can be publicly referenced [139,140] on the Workflow4metabolomics platform, thus enabling fully reproducible research. Workflow systems strongly encourage the reuse and reprocessing of data sets, as well as the tracking of data provenance [141]. In this way, workflows help to promote the FAIR principles, which were originally formulated for data [142].