These functions provide different normalized similariy/distance measurements.
Usage
ndotproduct(x, y, m = 0L, n = 0.5, na.rm = TRUE, ...)
dotproduct(x, y, m = 0L, n = 0.5, na.rm = TRUE, ...)
neuclidean(x, y, m = 0L, n = 0.5, na.rm = TRUE, ...)
navdist(x, y, m = 0L, n = 0.5, na.rm = TRUE, ...)
nspectraangle(x, y, m = 0L, n = 0.5, na.rm = TRUE, ...)
Arguments
- x
matrix
, two-columns e.g. m/z, intensity- y
matrix
, two-columns e.g. m/z, intensity- m
numeric
, weighting for the first column ofx
andy
(e.g. "mz"), default:0
means don't weight by the first column. For more details see thendotproduct
details section.- n
numeric
, weighting for the second column ofx
andy
(e.g. "intensity"), default:0.5
means effectly usingsqrt(x[,2])
andsqrt(y[,2])
. For more details see thendotproduct
details section.- na.rm
logical(1)
, shouldNA
be removed prior to calculation (defaultTRUE
).- ...
ignored.
Details
All functions that calculate normalized similarity/distance measurements are prefixed with a n.
ndotproduct
: the normalized dot product is described in Stein and Scott
1994 as: \(NDP = \frac{\sum(W_1 W_2)^2}{\sum(W_1)^2 \sum(W_2)^2}\); where
\(W_i = x^m * y^n\), where \(x\) and \(y\) are the m/z and intensity
values, respectively. Please note also that \(NDP = NCos^2\); where NCos
is the cosine value (i.e. the orthodox normalized dot product) of the
intensity vectors as described in Yilmaz et al. 2017. Stein and Scott 1994
empirically determined the optimal exponents as m = 3
and n = 0.6
by
analyzing ca. 12000 EI-MS data of 8000 organic compounds in the NIST Mass
Spectral Library.
MassBank (Horai et al. 2010) uses m = 2
and n = 0.5
for small compounds. In general with increasing values for m
,
high m/z values will be taken more into account for similarity calculation.
Especially when working with small molecules, a value m > 0
can be set
to give a weight on the m/z values to accommodate that shared fragments
with higher m/z are less likely and will mean that molecules might be more
similar. Increasing n
will result in a higher importance of the intensity
values. Most commonly m = 0
and n = 0.5
are used.
neuclidean
: the normalized euclidean distance is described in Stein and
Scott 1994 as:
\(NED = (1 + \frac{\sum((W_1 - W_2)^2)}{sum((W_2)^2)})^{-1}\); where
\(W_i = x^m * y^n\), where \(x\) and \(y\) are the m/z and intensity
values, respectively. See the details section about ndotproduct
for an
explanation how to set m
and n
.
navdist
: the normalized absolute values distance is described in Stein and
Scott 1994 as:
\(NED = (1 + \frac{\sum(|W_1 - W_2|)}{sum((W_2))})^{-1}\); where
\(W_i = x^m * y^n\), where \(x\) and \(y\) are the m/z and intensity
values, respectively. See the details section about ndotproduct
for an
explanation how to set m
and n
.
nspectraangle
: the normalized spectra angle is described in Toprak et al
2014 as:
\(NSA = 1 - \frac{2*\cos^{-1}(W_1 \cdot W_2)}{\pi}\); where
\(W_i = x^m * y^n\), where \(x\) and \(y\) are the m/z and intensity
values, respectively. The weighting was not originally proposed by Toprak et
al. 2014. See the details section about ndotproduct
for an explanation how
to set m
and n
.
Note
These methods are implemented as described in Stein and Scott 1994
(navdist
, ndotproduct
, neuclidean
) and Toprak et al. 2014
(nspectraangle
) but because there is no reference implementation available
we are unable to guarantee that the results are identical.
Note that the Stein and Scott 1994 normalized dot product method (and by
extension ndotproduct
) corresponds to the square of the orthodox
normalized dot product (or cosine distance) used also commonly as spectrum
similarity measure (Yilmaz et al. 2017).
Please see also the corresponding discussion at the github pull request
linked below. If you find any problems or reference implementation please
open an issue at
https://github.com/rformassspectrometry/MsCoreUtils/issues.
References
Stein, S. E., and Scott, D. R. (1994). Optimization and testing of mass spectral library search algorithms for compound identification. Journal of the American Society for Mass Spectrometry, 5(9), 859–866. doi:10.1016/1044-0305(94)87009-8 .
Yilmaz, S., Vandermarliere, E., and Lennart Martens (2017). Methods to Calculate Spectrum Similarity. In S. Keerthikumar and S. Mathivanan (eds.), Proteome Bioinformatics: Methods in Molecular Biology, vol. 1549 (pp. 81). doi:10.1007/978-1-4939-6740-7_7 .
Horai et al. (2010). MassBank: a public repository for sharing mass spectral data for life sciences. Journal of mass spectrometry, 45(7), 703–714. doi:10.1002/jms.1777 .
Toprak et al. (2014). Conserved peptide fragmentation as a benchmarking tool for mass spectrometers and a discriminating feature for targeted proteomics. Molecular & Cellular Proteomics : MCP, 13(8), 2056–2071. doi:10.1074/mcp.O113.036475 .
Pull Request for these distance/similarity measurements: https://github.com/rformassspectrometry/MsCoreUtils/pull/33
See also
Other distance/similarity functions:
gnps()
Author
navdist
, neuclidean
, nspectraangle
: Sebastian Gibb
ndotproduct
: Sebastian Gibb and
Thomas Naake, thomasnaake@googlemail.com
Examples
x <- matrix(c(1:5, 1:5), ncol = 2, dimnames = list(c(), c("mz", "intensity")))
y <- matrix(c(1:5, 5:1), ncol = 2, dimnames = list(c(), c("mz", "intensity")))
ndotproduct(x, y)
#> [1] 0.7660906
ndotproduct(x, y, m = 2, n = 0.5)
#> [1] 0.9074293
ndotproduct(x, y, m = 3, n = 0.6)
#> [1] 0.9127553
neuclidean(x, y)
#> [1] 0.8003406
navdist(x, y)
#> [1] 0.6970151
nspectraangle(x, y)
#> [1] 0.5556013