Dataset investigation: What to do when you get your data • Metabonaut

Introduction

So, you (or your amazing lab mate) have finally finished the data acquisition, and now you have a dataset in hand. What’s next? Unfortunately, the work isn’t over yet. Before diving into any analysis, it’s crucial to understand the dataset itself. This is the first step in any data analysis workflow, ensuring that the data is of good quality and is well-prepared for preprocessing and any downstream analysis you plan to perform.

In this vignette, we present the dataset used throughout the different vignettes of this website. It’s far from a perfect dataset, which actually mirrors the reality of most datasets you’ll encounter in research.

Some issues will indeed be specific to this described dataset. However, the purpose of this vignette is to encourage you to think critically about your data and guide you through steps that can help you avoid spending hours on an analysis, only to realize later that some samples or features should have been removed or flagged earlier on.

Dataset Description

In this workflow, two datasets are used:

An LC-MS-based (MS1 level only) untargeted metabolomics dataset to quantify small polar metabolites in human plasma samples.
An additional LC-MS/MS dataset of selected samples from the former study for the identification and annotation of significant features.

The samples were randomly selected from a larger study aimed at identifying metabolites with varying abundances between individuals suffering from cardiovascular disease (CVD) and healthy controls (CTR). The subset analyzed here includes data for three CVD patients, three CTR individuals, and four quality control (QC) samples. The QC samples, representing a pooled serum sample from a large cohort, were measured repeatedly throughout the experiment to monitor signal stability.

The data and metadata for this workflow are available on the MetaboLights database under the ID: MTBLS8735.

The detailed materials and methods used for the sample analysis are also available in the MetaboLights entry. This is particularly important for understanding the analysis and the parameters used. It should be noted that the samples were analyzed using ultra-high-performance liquid chromatography (UHPLC) coupled to a Q-TOF mass spectrometer (TripleTOF 5600+), and chromatographic separation was achieved using hydrophilic interaction liquid chromatography (HILIC).

Provide more in-depth visualizations to explore and understand the dataset quality.
Compare pool lc-ms and pool lc-ms/ms and show that we have better separation on the second run.

Package

## Data Import and handling
library(MsExperiment)
library(MsIO)
library(MsBackendMetaboLights)
library(SummarizedExperiment)

## Preprocessing of LC-MS data
library(xcms)
library(Spectra)
library(MetaboCoreUtils)
library(Biobase)

## Visualisation
library(pander)
library(RColorBrewer)
library(pheatmap)

MS1 level data

We first load the raw data from MetaboLights:

param <- MetaboLightsParam(mtblsId = "MTBLS8735",
                           assayName = paste0("a_MTBLS8735_LC-MS_positive_",
                                              "hilic_metabolite_profiling.txt"),
                           filePattern = ".mzML")

lcms1 <- readMsObject(MsExperiment(),
                      param,
                      keepOntology = FALSE,
                      keepProtocol = FALSE,
                      simplify = TRUE)

We set up parallel processing to facilitate futher computations.

#' Set up parallel processing using 2 cores
if (.Platform$OS.type == "unix") {
    register(MulticoreParam(2))
} else {
    register(SnowParam(2))
}

We update the metadata to make the column names easier to access:

# Let's rename the column for easier access
colnames(sampleData(lcms1)) <- c("sample_name", "derived_spectra_data_file",
                                "metabolite_asssignment_file",
                                "source_name",
                                "organism",
                                "blood_sample_type",
                                "sample_type", "age", "unit", "phenotype")

# Add "QC" to the phenotype of the QC samples
sampleData(lcms1)$phenotype[sampleData(lcms1)$sample_name == "POOL"] <- "QC"
sampleData(lcms1)$sample_name[sampleData(lcms1)$sample_name == "POOL" ] <- c("POOL1", "POOL2", "POOL3", "POOL4")

#  Add injection index column
sampleData(lcms1)$injection_index <- seq_len(nrow(sampleData(lcms1)))

#let's look at the updated sample data
sampleData(lcms1)[, c("derived_spectra_data_file",
                     "phenotype", "sample_name", "age",
                     "injection_index")] |>
    kable(format = "pipe")

Table 2. Simplified sample data.
derived_spectra_data_file	phenotype	sample_name	age	injection_index
FILES/MS_QC_POOL_1_POS.mzML	QC	POOL1	NA	1
FILES/MS_A_POS.mzML	CVD	A	53	2
FILES/MS_B_POS.mzML	CTR	B	30	3
FILES/MS_QC_POOL_2_POS.mzML	QC	POOL2	NA	4
FILES/MS_C_POS.mzML	CTR	C	66	5
FILES/MS_D_POS.mzML	CVD	D	36	6
FILES/MS_QC_POOL_3_POS.mzML	QC	POOL3	NA	7
FILES/MS_E_POS.mzML	CTR	E	66	8
FILES/MS_F_POS.mzML	CVD	F	44	9
FILES/MS_QC_POOL_4_POS.mzML	QC	POOL4	NA	10

#' Define colors for the different phenotypes
col_phenotype <- brewer.pal(9, name = "Set1")[c(9, 5, 4)]
names(col_phenotype) <- c("QC", # grey
                          "CVD", # orange
                          "CTR") # purple
col_sample <- col_phenotype[sampleData(lcms1)$phenotype]

Spectra exploration

Quick reminder that we access the spectra data as below:

#' Access Spectra Object
spectra(lcms1)

MSn data (Spectra) with 17210 spectra in a MsBackendMetaboLights backend:
        msLevel     rtime scanIndex
      <integer> <numeric> <integer>
1             1     0.274         1
2             1     0.553         2
3             1     0.832         3
4             1     1.111         4
5             1     1.390         5
...         ...       ...       ...
17206         1   479.052      1717
17207         1   479.331      1718
17208         1   479.610      1719
17209         1   479.889      1720
17210         1   480.168      1721
 ... 37 more variables/columns.

file(s):
MS_QC_POOL_1_POS.mzML
MS_A_POS.mzML
MS_B_POS.mzML
 ... 7 more files

One of the first check that should be done is evaluatating the number of spectra per sample. Below we summarize the number of spectra and their respective MS level (extracted with the msLevel() function). The fromFile() function returns for each spectrum the index of its sample (data file) and can thus be used to split the information (MS level in this case) by sample to further summarize using the base R table() function and combine the result into a matrix.

#' Count the number of spectra with a specific MS level per file.
spectra(lcms1) |>
    msLevel() |>
    split(fromFile(lcms1)) |>
    lapply(table) |>
    do.call(what = cbind)

     1    2    3    4    5    6    7    8    9   10
1 1721 1721 1721 1721 1721 1721 1721 1721 1721 1721

The present data set thus contains only MS1 data, which is ideal for quantification of the signal. A second (LC-MS/MS) data set also with fragment (MS2) spectra of the same samples will be used later on in the workflow. We also cannot see any large difference in number of spectra between the samples, which is a good sign that the data is of good quality. If one sample had a significantly lower number of spectra, it would be a sign of a potential issue with the sample.

Data obtained from LC-MS experiments are typically analyzed along the retention time axis, while MS data is organized by spectrum, orthogonal to the retention time axis. As another example, we below determine the retention time range for the entire data set.

#' Retention time range for entire dataset
spectra(lcms1) |>
    rtime() |>
    range()

[1]   0.273 480.169

Data visualization and general quality assessment

Effective visualization is paramount for inspecting and assessing the quality of MS data. For a general overview of our LC-MS data, we can:

Combine all mass peaks from all (MS1) spectra of a sample into a single spectrum in which each mass peak then represents the maximum signal of all mass peaks with a similar m/z. This spectrum might then be called Base Peak Spectrum (BPS), providing information on the most abundant ions of a sample.
Aggregate mass peak intensities for each spectrum, resulting in the Base Peak Chromatogram (BPC). The BPC shows the highest measured intensity for each distinct retention time (hence spectrum) and is thus orthogonal to the BPS.
Sum the mass peak intensities for each spectrum to create a Total Ion Chromatogram (TIC).
Compare the BPS of all samples in an experiment to evaluate similarity of their ion content.
Compare the BPC of all samples in an experiment to identify samples with similar or dissimilar chromatographic signal.

In addition to such general data evaluation and visualization, it is also crucial to investigate specific signal of e.g. internal standards or compounds/ions known to be present in the samples. By providing a reliable reference, internal standards help achieve consistent and accurate analytical results.

Spectra Data Visualization: BPS

The BPS collapses data in the retention time dimension and reveals the most prevalent ions present in each of the samples, creation of such BPS is however not straightforward. Mass peaks, even if representing signals from the same ion, will never have identical m/z values in consecutive spectra due to the measurement error/resolution of the instrument.

Below we use the combineSpectra function to combine all spectra from one file (defined using parameter f = fromFile(data)) into a single spectrum. All mass peaks with a difference in m/z value smaller than 3 parts-per-million (ppm) are combined into one mass peak, with an intensity representing the maximum of all such grouped mass peaks. To reduce memory requirement, we in addition first bin each spectrum combining all mass peaks within a spectrum, aggregating mass peaks into bins with 0.01 m/z width. In case of large datasets, it is also recommended to set the processingChunkSize() parameter of the MsExperiment object to a finite value (default is Inf) causing the data to be processed (and loaded into memory) in chunks of processingChunkSize() spectra. This can reduce memory demand and speed up the process.

#' Setting the chunksize
chunksize <- 1000
processingChunkSize(spectra(lcms1)) <- chunksize

We can now generate BPS for each sample and plot() them.

#' Combining all spectra per file into a single spectrum
bps <- spectra(lcms1) |>
    bin(binSize = 0.01) |>
    combineSpectra(f = fromFile(lcms1), intensityFun = max, ppm = 3)

#' Plot the base peak spectra
par(mar = c(2, 1, 1, 1))
plotSpectra(bps, main= "")

Here, there is observable overlap in ion content between the files, particularly around 300 m/z and 700 m/z. There are however also differences between sets of samples. In particular, BPS 1, 4, 7 and 10 (counting row-wise from left to right) seem different than the others. In fact, these four BPS are from QC samples, and the remaining six from the study samples. The observed differences might be explained by the fact that the QC samples are pools of serum samples from a different cohort, while the study samples represent plasma samples, from a different sample collection.

Next to the visual inspection above, we can also calculate and express the similarity between the BPS with a heatmap. Below we use the compareSpectra() function to calculate pairwise similarities between all BPS and use then the pheatmap() function from the pheatmap package to cluster and visualize this result.

#' Calculate similarities between BPS
sim_matrix <- compareSpectra(bps)

#' Add sample names as rownames and colnames
rownames(sim_matrix) <- colnames(sim_matrix) <- sampleData(lcms1)$sample_name
ann <- data.frame(phenotype = sampleData(lcms1)[, "phenotype"])
rownames(ann) <- rownames(sim_matrix)

#' Plot the heatmap
pheatmap(sim_matrix, annotation_col = ann,
         annotation_colors = list(phenotype = col_phenotype))

Figure 2. Heatmap of MS signals similarities.

We get a first glance at how our different samples distribute in terms of similarity. The heatmap confirms the observations made with the BPS, showing distinct clusters for the QCs and the study samples, owing to the different matrices and sample collections.

It is also strongly recommended to delve deeper into the data by exploring it in more detail. This can be accomplished by carefully assessing our data and extracting spectra or regions of interest for further examination. In the next chunk, we will look at how to extract information for a specific spectrum from distinct samples.

#' Accessing a single spectrum - comparing with QC
par(mfrow = c(1,2), mar = c(2, 2, 2, 2))
spec1 <- spectra(lcms1[1])[125]
spec2 <- spectra(lcms1[3])[125]
plotSpectra(spec1, main = "QC sample")
plotSpectra(spec2, main = "CTR sample")

Figure 3. Intensity and m/z values of the 125th spectrum of two CTR samples.

The significant dissimilarities in peak distribution and intensity confirm the difference in composition between QCs and study samples. We next compare a full MS1 spectrum from a CVD and a CTR sample.

#' Accessing a single spectrum - comparing CVD and CTR
par(mfrow = c(1,2), mar = c(2, 2, 2, 2))
spec1 <- spectra(lcms1[2])[125]
spec2 <- spectra(lcms1[3])[125]
plotSpectra(spec1, main = "CVD sample")
plotSpectra(spec2, main = "CTR sample")

Figure 4. Intensity and m/z values of the 125th spectrum of one CVD and one CTR sample.

Above, we can observe that the spectra between CVD and CTR samples are not entirely similar, but they do exhibit similar main peaks between 200 and 600 m/z with a general higher intensity in control samples. However the peak distribution (or at least intensity) seems to vary the most between an m/z of 10 to 210 and after an m/z of 600.

The CTR spectrum above exhibits significant peaks around an m/z of 150 - 200 that have a much lower intensity in the CVD sample. To delve into more details about this specific spectrum, a wide range of functions can be employed:

#' Checking its intensity
intensity(spec2)

NumericList of length 1
[[1]] 18.3266733266736 45.1666666666667 ... 27.1048951048951 34.9020979020979

#' Checking its rtime
rtime(spec2)

[1] 34.872

#' Checking its m/z
mz(spec2)

NumericList of length 1
[[1]] 51.1677328505635 53.0461968245186 ... 999.139446289161 999.315208803072

#' Filtering for a specific m/z range and viewing in a tabular format
filt_spec <- filterMzRange(spec2,c(50,200))

data.frame(intensity = unlist(intensity(filt_spec)),
           mz = unlist(mz(filt_spec))) |>
  head() |> kable(format = "markdown")

Table 3. Intensity and m/z values of the 125th spectrum of one CTR sample.
intensity	mz
18.32667	51.16773
45.16667	53.04620
23.07692	53.70114
41.36364	53.84408
32.86713	53.88130
1159.03030	54.01002

Chromatographic info

#' Filter the data based on retention time
lcms1 <- filterRt(lcms1, c(10, 240))

Filter spectra

bpc <- chromatogram(lcms1, aggregationFun = "max")

#' Plot after filtering
plot(bpc, col = paste0(col_sample, 80),
     main = "BPC after filtering retention time", lwd = 1.5)
grid()
legend("topright", col = col_phenotype,
       legend = names(col_phenotype), lty = 1, lwd = 2, horiz = TRUE, bty = "n")

Figure 6. BPC after filtering retention time.

Initially, we examined the entire BPC and subsequently filtered it based on the desired retention times. This not only results in a smaller file size but also facilitates a more straightforward interpretation of the BPC.

The final plot illustrates the BPC for each sample colored by phenotype, providing insights on the signal measured along the retention times of each sample. It reveals the points at which compounds eluted from the LC column. In essence, a BPC condenses the three-dimensional LC-MS data (m/z by retention time by intensity) into two dimensions (retention time by intensity).

We can also here compare similarities of the BPCs in a heatmap. The retention times will however not be identical between different samples. Thus we bin() the chromatographic signal per sample along the retention time axis into bins of two seconds resulting in data with the same number of bins/data points. We can then calculate pairwise similarities between these data vectors using the cor() function and visualize the result using pheatmap().

#' Total ion chromatogram
tic <- chromatogram(lcms1, aggregationFun = "sum") |>
  bin(binSize = 2)

#' Calculate similarity (Pearson correlation) between BPCs
ticmap <- do.call(cbind, lapply(tic, intensity)) |>
  cor()

rownames(ticmap) <- colnames(ticmap) <- sampleData(lcms1)$sample_name
ann <- data.frame(phenotype = sampleData(lcms1)[, "phenotype"])
rownames(ann) <- rownames(ticmap)

#' Plot heatmap
pheatmap(ticmap, annotation_col = ann,
         annotation_colors = list(phenotype = col_phenotype))

The heatmap above reinforces what our exploration of spectra data showed, which is a strong separation between the QC and study samples. This is important to bear in mind for later analyses.

Additionally, study samples group into two clusters, cluster I containing samples C and F and cluster II with all other samples. Below we plot the TIC of all samples, using a different color for each cluster.

cluster_I_idx <- sampleData(lcms1)$sample_name %in% c("F", "C")
cluster_II_idx <- sampleData(lcms1)$sample_name %in% c("A", "B", "D", "E")

temp_col <- c("grey", "red")
names(temp_col) <- c("Cluster II", "Cluster I")
col <- rep(temp_col[1], length(lcms1))
col[cluster_I_idx] <- temp_col[2]
col[sampleData(lcms1)$phenotype == "QC"] <- NA

lcms1 |>
    chromatogram(aggregationFun = "sum") |>
    plot( col = col,
     main = "TIC after filtering retention time", lwd = 1.5)
grid()
legend("topright", col = temp_col,
       legend = names(temp_col), lty = 1, lwd = 2,
       horiz = TRUE, bty = "n")

Figure 8. Example of TIC of unusual signal.

While the TIC of all samples look similar, the samples from cluster I show a different signal in the retention time range from about 40 to 160 seconds. Whether, and how strong this difference will impact the following analysis remains to be determined.

known compounds

While the artificially isotope labeled compounds were spiked to the individual samples, there should also be the signal from the endogenous compounds in serum (or plasma) samples. Thus, we calculate next the mass and m/z of an [M+H]+ ion of the endogenous cystine from its chemical formula and extract also the EIC from this ion. For calculation of the exact mass and the m/z of the selected ion adduct we use the calculateMass() and mass2mz() functions from the MetaboCoreUtils package.

#' extract endogenous cystine mass and EIC and plot.
cysmass <- calculateMass("C6H12N2O4S2")
cys_endo <- mass2mz(cysmass, adduct = "[M+H]+")[, 1]
eic_cys_endo <- chromatogram(lcms1, mz = cys_endo + c(-0.005, 0.005),
                             rt = c(199, 219), aggregationFun = "max")
eic_cys_spiked <-  chromatogram(lcms1 , mz = c(249.040276, 249.050276), 
                                rt = c(199,219))

#' Plot versus spiked
par(mfrow = c(1, 2))
plot(eic_cys_endo, col = paste0(col_sample, 80)) 
grid()

plot(eic_cys_spiked, col = paste0(col_sample, 80))
grid()
legend("topright", col = col_phenotype, legend = names(col_phenotype), 
       lty = 1, bty = "n")

Figure 10. EIC of endogenous cystine vs spiked.

The two cystine EICs above look highly similar (the endogenous shown left, the isotope labeled right in the plot above), if not for the shift in m/z, which arises from the artificial labeling. This shift allows us to discriminate between the endogenous and non-endogenous compound.

Further post-processing analysis

Below we load the lcms1 object that we saved after preprocessing.

#load preprocessed xcmsExperiment
lcms1 <- readMsObject(XcmsExperiment(),
    AlabasterParam(system.file("extdata", "preprocessed_lcms1",
                               package = "Metabonaut")))

res <- readObject(system.file("extdata", "preprocessed_res",
                               package = "Metabonaut"))

Noise analysis

Below we plot the backgrounds signal for each study group. This can be interesting in cases on technical evaluation. In our cases we expect very similar background noise in both CVD and CTR.

# overall signal in the dataset 
#' - for each file calculate the sum of intensities 
background  <- spectra(lcms1) |>
    split(fromFile(lcms1)) |>
    lapply(tic) |>
    lapply(sum) |>
    unlist()

# Overall signal that is in the chromatographic peaks detection 
detected <- apply(assay(res), 2, function(x) sum(x, na.rm = TRUE))

names(background) <- names(detected) <- res$phenotype
idx_qc <- sampleData(lcms1)$phenotype == "QC"
noise <- background[!idx_qc] - detected[!idx_qc]

f <- factor(names(noise), levels = unique(names(noise)))
group <- split(log2(noise), f)

plot(NULL, xlim = c(1, length(group)), ylim = range(unlist(group)), 
     xaxt = "n", xlab = "Devices", ylab = "Noise", 
     main = "log2 background signal comparison between study group")
for (i in seq_along(group)) {
  points(rep(i, length(group[[i]])), group[[i]], pch = 19)
}
axis(1, at = seq_along(group), labels = names(group))

There seems to be more background noise in the CVD samples…

More coming soon…