Computes the GNPS (Global Natural Products Social molecular networking)
modified cosine similarity score between two mass spectra using a fused
join + score algorithm based on Chain-DP (Chain Dynamic Programming).
This function combines peak matching and scoring in a single C call,
achieving consequent speedup over the standard gnps()
implementation while maintaining exact mathematical equivalence
(differences <= 2.2e-16).
Algorithm: Chain-DP optimal assignment. When spectra are sanitized, the bipartite matching graph forms simple chains (not arbitrary networks). This enables O(n+m) greedy scoring for most of the pairs, with exact Hungarian solver O(k³) only for rare conflicts (k about 3–5).
Complexity: O(n+m) time, O(n+m) memory (vs. O(n³) time, O(n²) memory for full Hungarian).
Usage
gnps_chain_dp(
x,
y,
xPrecursorMz = NA_real_,
yPrecursorMz = NA_real_,
tolerance = 0,
ppm = 0,
...,
matchedPeaksCount = FALSE
)Arguments
- x
Numeric
matrixwith query spectrum peaks (2 columns: mz, intensity). Must be sorted by mz in ascending order.- y
Numeric
matrixwith library spectrum peaks (2 columns: mz, intensity). Must be sorted by mz in ascending order.- xPrecursorMz
numeric(1), precursor m/z for query spectrum.- yPrecursorMz
numeric(1), precursor m/z for library spectrum.- tolerance
numeric(1), absolute tolerance in Daltons.- ppm
numeric(1), relative tolerance in ppm.- ...
ignored.
- matchedPeaksCount
logical(1); ifTRUE, return both score and matched-peak count, otherwise return score only.
Value
A numeric vector of length 1 by default (score), or length 2 when
matchedPeaksCount = TRUE (c(score, matched_peaks)).
Details
The modified cosine score is computed as:
$$ \text{score}(i,j) = \frac{\sqrt{I_x(i)}}{\sqrt{\sum I_x}} \times \frac{\sqrt{I_y(j)}}{\sqrt{\sum I_y}} $$
where the sum is over unique m/z values (first occurrence of duplicates).
The total score is the sum of all optimally assigned peak pairs, found via:
Direct matching:
join(x, y, type="outer")- closest one-to-oneShifted matching:
join(x + pdiff, y, type="outer")wherepdiff = yPrecursorMz - xPrecursorMzOptimal assignment via Chain-DP: For each query peak, pick the better of its direct and shifted matches. When spectra are sanitized, conflicts are rare (~1%) and resolved optimally with exact Hungarian.
Precursor threshold: Shifted matching is skipped when
|pdiff| <= tolerance + ppm × max(xPrecursorMz, yPrecursorMz) × 1e-6,
i.e., when the precursor difference is within the peak matching tolerance.
This is scientifically correct (no meaningful neutral loss when
pdiff close to tolerance) but differs from the existing gnps()
implementation, which only skips when pdiff == 0.0 exactly. See
References for details.
Prerequisites
CRITICAL: Input spectra MUST be sanitized before calling this function:
Unique m/z values: no two peaks in the same spectrum should have m/z values close enough to match each other (i.e.,
|mz_i - mz_j| > tolerancefor all peak pairs i,j within the same spectrum)Non-negative intensities (no NaN/NA/Inf)
Sorted by m/z in ascending order
The chain-DP algorithm assumes at most one direct match and one shifted match per peak — a property that holds when peaks are well-separated. Unsanitized spectra will produce incorrect scores silently.
How to sanitize:
library(Spectra)
sps <- reduceSpectra(sps) # Remove peaks closer than tolerance
sps <- combinePeaks(sps) # Merge duplicate m/z
sps <- scalePeaks(sps) # Normalize intensitiesReferences
Wang M, Carver JJ, Phelan VV, et al. (2016). "Sharing and community curation of mass spectrometry data with Global Natural Products Social Molecular Networking." Nature Biotechnology 34:828–837. doi:10.1038/nbt.3597
Dührkop K, Fleischauer M, Ludwig M, et al. (2019). "SIRIUS 4: a rapid tool for turning tandem mass spectra into metabolite structure information." Nature Methods 16:299–302. doi:10.1038/s41592-019-0344-8
Chain-DP algorithm implementation: https://github.com/sirius-ms/sirius/blob/stable/spectral_alignment/src/main/java/de/unijena/bionf/fastcosine/FastCosine.java
See also
gnps() for the standard (backward-compatible) implementation.
join_gnps() for peak matching only.
Examples
# Example spectra (sanitized: sorted, unique m/z, no NAs)
x <- cbind(mz = c(10, 36, 63, 91, 93), intensity = c(14, 15, 999, 650, 1))
y <- cbind(mz = c(10, 12, 50, 63, 105), intensity = c(35, 5, 16, 999, 450))
# Compute modified cosine via chain-DP (hot path)
result <- gnps_chain_dp(x, y,
xPrecursorMz = 91.0,
yPrecursorMz = 105.0,
tolerance = 0.01,
ppm = 10,
matchedPeaksCount=TRUE)
result[1L]
#> [1] 0.9923501
result[2L]
#> [1] 4
# Compare with standard implementation (should agree within 1e-15)
matches <- join_gnps(x[,1], y[,1], 91.0, 105.0, 0.01, 10)
score_std <- gnps(x[matches$x, ], y[matches$y, ])
abs(result[1L] - score_std) < 1e-10 # TRUE
#> [1] TRUE
