Matching MS Spectra against a reference
Source:R/CompDbSource.R
, R/matchSpectra.R
CompareSpectraParam.Rd
matchSpectra
compares experimental (query) MS2 spectra against
reference (target) MS2 spectra and reports matches with a similarity that
passing a specified threshold. The function performs the similarity
calculation between each query spectrum against each target spectrum.
Parameters query
and target
can be used to define the query and target
spectra, respectively, while parameter param
allows to define and configure
the similarity calculation and matching condition. Parameter query
takes
a Spectra object while target
can be either a Spectra object, a
CompDb (reference library) object defined in the CompoundDb
package or
a CompAnnotationSource (e.g. a CompDbSource()
)
with the reference or connection information to a supported annotation
resource).
Some notes on performance and information on parallel processing are provided in the vignette.
Currently supported parameter objects defining the matching are:
CompareSpectraParam
: the generic parameter object allowing to set all settings for thecompareSpectra()
call that is used to perform the similarity calculation. This includesMAPFUN
andFUN
defining the peak-mapping and similarity calculation functions andppm
andtolerance
to define an acceptable difference between m/z values of the compared peaks. Additional parameters to thecompareSpectra
call can be passed along with...
. See the help ofSpectra()
for more information on these parameters. ParametersrequirePrecursor
(defaultTRUE
) andrequirePrecursorPeak
(defaultFALSE
) allow to pre-filter the target spectra prior to the actual similarity calculation for each individual query spectrum. Target spectra can also be pre-filtered based on retention time if parametertoleranceRt
is set to a value different than the defaulttoleranceRt = Inf
. Only target spectra with a retention time within the query's retention time +/- (toleranceRt
+percentRt
% of the query's retention time) are considered. Note that while forppm
andtolerance
only a single value is accepted,toleranceRt
andpercentRt
can be also of length equal to the number of query spectra hence allowing to define different rt boundaries for each query spectrum. While these pre-filters can considerably improve performance, it should be noted that no matches will be found between query and target spectra with missing values in the considered variable (precursor m/z or retention time). For target spectra without retention times (such as forSpectra
from a public reference database such as MassBank) the defaulttoleranceRt = Inf
should thus be used. Finally, parameterTHRESHFUN
allows to define a function to be applied to the similarity scores to define which matches to report. See below for more details.MatchForwardReverseParam
: performs spectra matching as withCompareSpectraParam
but reports, similar to MS-DIAL, also the reverse similarity score and the presence ratio. In detail, the matching of query spectra to target spectra is performed by considering all peaks from the query and all peaks from the target (reference) spectrum (i.e. forward matching using an outer join-based peak matching strategy). For matching spectra also the reverse similarity is calculated considering only peaks present in the target (reference) spectrum (i.e. using a right join-based peak matching). This is reported as spectra variable"reverse_score"
. In addition, the ratio between the number of matched peaks and the total number of peaks in the target (reference) spectra is reported as the presence ratio (spectra variable"presence_ratio"
) and the total number of matched peaks as"matched_peaks_count"
. See examples below for details. ParameterTHRESHFUN_REVERSE
allows to define an additional threshold function to filter matches. IfTHRESHFUN_REVERSE
is defined only matches with a spectra similarity fulfilling bothTHRESHFUN
andTHRESHFUN_REVERSE
are returned. With the defaultTHRESHFUN_REVERSE = NULL
all matches passingTHRESHFUN
are reported.
Usage
# S4 method for class 'Spectra,CompDbSource,Param'
matchSpectra(
query,
target,
param,
BPPARAM = BiocParallel::SerialParam(),
addOriginalQueryIndex = TRUE
)
CompareSpectraParam(
MAPFUN = joinPeaks,
tolerance = 0,
ppm = 5,
FUN = MsCoreUtils::ndotproduct,
requirePrecursor = TRUE,
requirePrecursorPeak = FALSE,
THRESHFUN = function(x) which(x >= 0.7),
toleranceRt = Inf,
percentRt = 0,
...
)
MatchForwardReverseParam(
MAPFUN = joinPeaks,
tolerance = 0,
ppm = 5,
FUN = MsCoreUtils::ndotproduct,
requirePrecursor = TRUE,
requirePrecursorPeak = FALSE,
THRESHFUN = function(x) which(x >= 0.7),
THRESHFUN_REVERSE = NULL,
toleranceRt = Inf,
percentRt = 0,
...
)
# S4 method for class 'Spectra,Spectra,CompareSpectraParam'
matchSpectra(
query,
target,
param,
rtColname = c("rtime", "rtime"),
BPPARAM = BiocParallel::SerialParam(),
addOriginalQueryIndex = TRUE
)
# S4 method for class 'Spectra,CompDb,Param'
matchSpectra(
query,
target,
param,
rtColname = c("rtime", "rtime"),
BPPARAM = BiocParallel::SerialParam(),
addOriginalQueryIndex = TRUE
)
# S4 method for class 'Spectra,Spectra,MatchForwardReverseParam'
matchSpectra(
query,
target,
param,
rtColname = c("rtime", "rtime"),
BPPARAM = BiocParallel::SerialParam(),
addOriginalQueryIndex = TRUE
)
Arguments
- query
for
matchSpectra
: Spectra object with the query spectra.- target
for
matchSpectra
: Spectra, CompDb or object extending CompAnnotationSource (such as CompDbSource) with the target (reference) spectra to comparequery
against.- param
for
matchSpectra
: parameter object (such asCompareSpectraParam
) defining the settings for the matching.- BPPARAM
for
matchSpectra
: parallel processing setup (see theBiocParallel
package for more information). Parallel processing is disabled by default (with the default settingBPPARAM = SerialParam()
).- addOriginalQueryIndex
for
matchSpectra()
:logical(1)
whether an additional spectra variable".original_query_index"
should be added to thequery
Spectra
object providing the index of the spectrum in this originally provided object. This spectra variable can be useful to link back to the originalSpectra
object if theMatchedSpectra
object gets subsetted/processed.- MAPFUN
function
used to map peaks between the compared spectra. Defaults forCompareSpectraParam
tojoinPeaks()
. SeecompareSpectra()
for details.- tolerance
numeric(1)
for an absolute maximal accepted difference between m/z values. This will be used incompareSpectra
as well as for eventual precursor m/z matching.- ppm
numeric(1)
for a relative, m/z-dependent, maximal accepted difference between m/z values. This will be used incompareSpectra
as well as for eventual precursor m/z matching.- FUN
function
used to calculate similarity between spectra. Defaults forCompareSpectraParam
toMsCoreUtils::ndotproduct()
. SeeMsCoreUtils::ndotproduct()
for details.- requirePrecursor
logical(1)
whether only target spectra are considered in the similarity calculation with a precursor m/z that matches the precursor m/z of the query spectrum (considering alsoppm
andtolerance
). WithrequirePrecursor = TRUE
(the default) the function will complete much faster, but will not find any hits for target (or query spectra) with missing precursor m/z. It is suggested to check first the availability of the precursor m/z intarget
andquery
.- requirePrecursorPeak
logical(1)
whether only target spectra will be considered in the spectra similarity calculation that have a peak with an m/z matching the precursor m/z of the query spectrum. Defaults torequirePrecursorPeak = FALSE
. It is suggested to check first the availability of the precursor m/z inquery
, as no match will be reported for query spectra with missing precursor m/z.- THRESHFUN
function
applied to the similarity score to define which target spectra are considered matching. Defaults toTHRESHFUN = function(x) which(x >= 0.7)
hence selects all target spectra matching a query spectrum with a similarity higher or equal than0.7
. Any function that takes a numeric vector with similarity scores from the comparison of a query spectrum with all target spectra (as returned bycompareSpectra()
) as input and returns alogical
vector (same dimensions as the similarity scores) or an integer with the matches is supported.- toleranceRt
numeric
of length 1 or equal to the number of query spectra defining the maximal accepted (absolute) difference in retention time between query and target spectra. By default (withtoleranceRt = Inf
) the retention time-based filter is not considered. See help ofCompareSpectraParam
above for more information.- percentRt
numeric
of length 1 or equal to the number of query spectra defining the maximal accepted relative difference in retention time between query and target spectra expressed in percentage of the query rt. ForpercentRt = 10
, similarities are defined between the query spectrum and all target spectra with a retention time within query rt +/- 10% of the query. By default (withtoleranceRt = Inf
) the retention time-based filter is not considered. Thus, to consider thepercentRt
parameter,toleranceRt
should be set to a value different than that. See help ofCompareSpectraParam
above for more information.- ...
for
CompareSpectraParam
: additional parameters passed along to thecompareSpectra()
call.- THRESHFUN_REVERSE
for
MatchForwardReverseParam
: optional additional thresholding function to filter the results on the reverse score. If specified the same format thanTHRESHFUN
is expected.- rtColname
character(2)
with the name of the spectra variable containing the retention time information for compounds to be used in retention time matching (only used iftoleranceRt
is notInf
). It can also becharacter(1)
if the two names are the same. Defaults tortColname = c("rtime", "rtime")
.
Value
matchSpectra
returns a MatchedSpectra()
object with the matching
results. If target
is a CompAnnotationSource
only matching target
spectra will be reported.
Constructor functions return an instance of the class.
Examples
library(Spectra)
#> Loading required package: S4Vectors
#> Loading required package: stats4
#> Loading required package: BiocGenerics
#> Loading required package: generics
#>
#> Attaching package: ‘generics’
#> The following objects are masked from ‘package:base’:
#>
#> as.difftime, as.factor, as.ordered, intersect, is.element, setdiff,
#> setequal, union
#>
#> Attaching package: ‘BiocGenerics’
#> The following objects are masked from ‘package:stats’:
#>
#> IQR, mad, sd, var, xtabs
#> The following objects are masked from ‘package:base’:
#>
#> Filter, Find, Map, Position, Reduce, anyDuplicated, aperm, append,
#> as.data.frame, basename, cbind, colnames, dirname, do.call,
#> duplicated, eval, evalq, get, grep, grepl, is.unsorted, lapply,
#> mapply, match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,
#> rank, rbind, rownames, sapply, saveRDS, table, tapply, unique,
#> unsplit, which.max, which.min
#>
#> Attaching package: ‘S4Vectors’
#> The following object is masked from ‘package:MetaboAnnotation’:
#>
#> endoapply
#> The following object is masked from ‘package:utils’:
#>
#> findMatches
#> The following objects are masked from ‘package:base’:
#>
#> I, expand.grid, unname
#> Loading required package: BiocParallel
library(msdata)
fl <- system.file("TripleTOF-SWATH", "PestMix1_DDA.mzML", package = "msdata")
pest_ms2 <- filterMsLevel(Spectra(fl), 2L)
## subset to selected spectra.
pest_ms2 <- pest_ms2[c(808, 809, 945:955)]
## Load a small example MassBank data set
load(system.file("extdata", "minimb.RData", package = "MetaboAnnotation"))
## Match spectra with the default similarity score (normalized dot product)
csp <- CompareSpectraParam(requirePrecursor = TRUE, ppm = 10)
mtches <- matchSpectra(pest_ms2, minimb, csp)
mtches
#> Object of class MatchedSpectra
#> Total number of matches: 16
#> Number of query objects: 13 (5 matched)
#> Number of target objects: 100 (11 matched)
## Are there any matching spectra for the first query spectrum?
mtches[1]
#> Object of class MatchedSpectra
#> Total number of matches: 0
#> Number of query objects: 1 (0 matched)
#> Number of target objects: 100 (0 matched)
## No
## And for the second query spectrum?
mtches[2]
#> Object of class MatchedSpectra
#> Total number of matches: 4
#> Number of query objects: 1 (1 matched)
#> Number of target objects: 100 (4 matched)
## The second query spectrum matches 4 target spectra. The scores for these
## matches are:
mtches[2]$score
#> [1] 0.7869556 0.8855473 0.7234894 0.7219942
## To access the score for the full data set
mtches$score
#> [1] NA 0.7869556 0.8855473 0.7234894 0.7219942 NA 0.7769746
#> [8] 0.7577286 NA 0.7433718 0.7019807 0.7081274 NA 0.7320465
#> [15] 0.8106258 0.7290458 0.8168876 0.7247800 0.7412586 0.7198787 NA
#> [22] NA NA NA
## Below we use a THRESHFUN that returns for each query spectrum the (first)
## best matching target spectrum.
csp <- CompareSpectraParam(requirePrecursor = FALSE, ppm = 10,
THRESHFUN = function(x) which.max(x))
mtches <- matchSpectra(pest_ms2, minimb, csp)
mtches
#> Object of class MatchedSpectra
#> Total number of matches: 13
#> Number of query objects: 13 (13 matched)
#> Number of target objects: 100 (10 matched)
## Each of the query spectra is matched to one target spectrum
length(mtches)
#> [1] 13
matches(mtches)
#> query_idx target_idx score
#> 1 1 1 0.000000e+00
#> 2 2 73 8.855473e-01
#> 3 3 2 6.313687e-01
#> 4 4 44 7.769746e-01
#> 5 5 74 1.772117e-05
#> 6 6 2 7.433718e-01
#> 7 7 5 1.906998e-03
#> 8 8 53 8.168876e-01
#> 9 9 44 7.412586e-01
#> 10 10 86 4.085289e-04
#> 11 11 53 4.323403e-01
#> 12 12 47 3.469648e-03
#> 13 13 71 7.612480e-06
## Match spectra considering also measured retention times. This requires
## that both query and target spectra have non-missing retention times.
rtime(pest_ms2)
#> [1] 361.651 361.741 377.609 377.699 378.120 378.539 378.779 378.869 378.959
#> [10] 379.379 380.059 380.609 381.029
rtime(minimb)
#> [1] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
#> [26] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
#> [51] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
#> [76] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## Target spectra don't have retention times. Below we artificially set
## retention times to show how an additional retention time filter would
## work.
rtime(minimb) <- rep(361, length(minimb))
## Matching spectra requiring a matching precursor m/z and the difference
## of retention times between query and target spectra to be <= 2 seconds.
csp <- CompareSpectraParam(requirePrecursor = TRUE, ppm = 10,
toleranceRt = 2)
mtches <- matchSpectra(pest_ms2, minimb, csp)
mtches
#> Object of class MatchedSpectra
#> Total number of matches: 4
#> Number of query objects: 13 (1 matched)
#> Number of target objects: 100 (4 matched)
matches(mtches)
#> query_idx target_idx score
#> 1 2 70 0.7869556
#> 2 2 73 0.8855473
#> 3 2 75 0.7234894
#> 4 2 76 0.7219942
## Note that parameter `rtColname` can be used to define different spectra
## variables with retention time information (such as retention indices etc).
## A `CompDb` compound annotation database could also be used with
## parameter `target`. Below we load the test `CompDb` database from the
## `CompoundDb` Bioconductor package.
library(CompoundDb)
#> Loading required package: AnnotationFilter
fl <- system.file("sql", "CompDb.MassBank.sql", package = "CompoundDb")
cdb <- CompDb(fl)
res <- matchSpectra(pest_ms2, cdb, CompareSpectraParam())
## We do however not find any matches since the used compound annotation
## database contains only a very small subset of the MassBank.
res
#> Object of class MatchedSpectra
#> Total number of matches: 0
#> Number of query objects: 13 (0 matched)
#> Number of target objects: 70 (0 matched)
## As `target` we have now however the MS2 spectra data from the compound
## annotation database
target(res)
#> MSn data (Spectra) with 70 spectra in a MsBackendCompDb backend:
#> msLevel precursorMz polarity
#> <integer> <numeric> <integer>
#> 1 2 179.07 1
#> 2 2 179.07 1
#> 3 2 179.07 1
#> 4 2 179.07 1
#> 5 2 179.07 1
#> ... ... ... ...
#> 66 2 337.091 1
#> 67 2 337.091 1
#> 68 2 337.091 1
#> 69 2 337.091 1
#> 70 2 337.091 1
#> ... 46 more variables/columns.
#> Use 'spectraVariables' to list all of them.
#> data source: MassBank
#> version: 2020.09
#> organism: NA
## See the package vignette for details, descriptions and more examples,
## also on how to retrieve e.g. MassBank reference databases from
## Bioconductor's AnnotationHub.