Package: PSMatch
Authors: Laurent Gatto [aut, cre] (ORCID: https://orcid.org/0000-0002-1520-2268), Johannes Rainer
[aut] (ORCID: https://orcid.org/0000-0002-6977-7147), Sebastian Gibb
[aut] (ORCID: https://orcid.org/0000-0001-7406-4443), Samuel Wieczorek
[ctb], Thomas Burger [ctb]
Last modified:
2024-12-12 14:59:07.975702
Compiled: Thu Dec 12
15:34:32 2024
To install the package from Bioconductor, make sure you have the
BiocManager
package, available from CRAN, and then run
BiocManager::install("PSMatch")
This vignette is one among several illustrating how to use the
PSMatch
package, focusing on the handling and processing of
proteomics identification data using the PSM
class. For a
general overview of the package, see the PSMatch
package
manual page (?PSMatch
) and references therein.
We are going to use an mzid
file from the
msdata
package.
## [1] "TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzid"
The PSM()
function parses one or multiple
mzid
files and returns an object of class PSM
.
This class is a simple extension of the DFrame
class,
representing the peptide-spectrum matches in a tabular format.
## Loading required namespace: mzR
id
## PSM with 5802 rows and 35 columns.
## names(35): sequence spectrumID ... subReplacementResidue subLocation
This table contains 5802 matches for 5343 scans and 4938 peptides sequences, each annotated by 35 variables.
nrow(id) ## number of matches
## [1] 5802
## [1] 5343
## [1] 4938
names(id)
## [1] "sequence" "spectrumID"
## [3] "chargeState" "rank"
## [5] "passThreshold" "experimentalMassToCharge"
## [7] "calculatedMassToCharge" "peptideRef"
## [9] "modNum" "isDecoy"
## [11] "post" "pre"
## [13] "start" "end"
## [15] "DatabaseAccess" "DBseqLength"
## [17] "DatabaseSeq" "DatabaseDescription"
## [19] "scan.number.s." "acquisitionNum"
## [21] "spectrumFile" "idFile"
## [23] "MS.GF.RawScore" "MS.GF.DeNovoScore"
## [25] "MS.GF.SpecEValue" "MS.GF.EValue"
## [27] "MS.GF.QValue" "MS.GF.PepQValue"
## [29] "modPeptideRef" "modName"
## [31] "modMass" "modLocation"
## [33] "subOriginalResidue" "subReplacementResidue"
## [35] "subLocation"
The PSM data are read as is, without any filtering. As we can see below, we still have all the hits from the forward and reverse (decoy) databases.
table(id$isDecoy)
##
## FALSE TRUE
## 2906 2896
The data also contains multiple matches for several spectra. The table below shows the number of individual MS scans that have 1, 2, … up to 5 matches.
##
## 1 2 3 4 5
## 4936 369 26 10 2
More specifically, we can see below how scan 1774 has 4 matches, all
to sequence RTRYQAEVR
, which itself matches to 4 different
proteins:
i <- grep("scan=1774", id$spectrumID)
id[i, ]
## PSM with 4 rows and 35 columns.
## names(35): sequence spectrumID ... subReplacementResidue subLocation
id[i, "DatabaseAccess"]
## [1] "ECA2104" "ECA2867" "ECA3427" "ECA4142"
If the goal is to keep all the matches, but arranged by
scan/spectrum, one can reduce the DataFrame
object
by the spectrumID
variable, so that each scan correponds to
a single row that still stores all values1:
id2 <- reducePSMs(id, id$spectrumID)
rownames(id2) <- NULL ## rownames not needed here
dim(id2)
## [1] 5343 35
The resulting object contains a single entrie for scan 1774 with information for the multiple matches stored as a list within the table cell.
j <- grep("scan=1774", id2$spectrumID)
id2[j, ]
## Reduced PSM with 1 rows and 35 columns.
## names(35): sequence spectrumID ... subReplacementResidue subLocation
id2[j, "DatabaseAccess"]
## CharacterList of length 1
## [["controllerType=0 controllerNumber=1 scan=1774"]] ECA2104 ECA2867 ECA3427 ECA4142
The identification data could be used to annotate an raw mass
spectrometry Spectra
object (see the
Spectra::joinSpectraData()
function for details).
Often, the PSM data is filtered to only retain reliable matches. The
MSnID
package can be used to set thresholds to attain
user-defined PSM, peptide or protein-level FDRs. Here, we will simply
filter out wrong or the least reliable identifications.
id <- filterPsmDecoy(id)
## Removed 2896 decoy hits.
id
## PSM with 2906 rows and 35 columns.
## names(35): sequence spectrumID ... subReplacementResidue subLocation
id <- filterPsmRank(id)
## Removed 155 PSMs with rank > 1.
id
## PSM with 2751 rows and 35 columns.
## names(35): sequence spectrumID ... subReplacementResidue subLocation
The data still contains shared peptides, i.e. those that different
proteins. For example QKTRCATRAFKANKGRAR
matches proteins
ECA2869
and ECA4278
.
id[id$sequence == "QKTRCATRAFKANKGRAR", "DatabaseAccess"]
## [1] "ECA2869" "ECA4278"
We can filter these out to retain unique peptides.
id <- filterPsmShared(id)
## Removed 85 shared peptides.
id
## PSM with 2666 rows and 35 columns.
## names(35): sequence spectrumID ... subReplacementResidue subLocation
This last filtering leaves us with 2666 PSMs.
Note that the ConnectedComponents
approach defined in
this package allows one to explore protein groups defined by such shared
peptides more thoroughly and make informed decision as to which shared
peptides to retain and which ones to drop. For details see the related
vignette or the ConnectedComponents
manual page.
This can also be achieved with the filterPSMs()
function:
id <- PSM(f)
filterPSMs(id)
## Starting with 5802 PSMs:
## Removed 2896 decoy hits.
## Removed 155 PSMs with rank > 1.
## Removed 85 shared peptides.
## 2666 PSMs left.
## PSM with 2666 rows and 35 columns.
## names(35): sequence spectrumID ... subReplacementResidue subLocation
mzR
and mzID
parsers
The PSM()
function can take two different values for the
parser
parameter, namely "mzR"
(the default
value) and "mzID"
.
mzR uses the openIDfile()
function
from the mzR to parse
the mzId
file(s), and then coerces the data to a
data.frame
which is eventually returned as a
PSM
object. The parser function uses dedicated code from
the Proteowizard project (included in mzR
) and is generally
the fastest approach.
mzID parses the mzId
file with
mzID()
function from the mzID
package, and then flattens the data to a data.frame
with
mzID::flatten()
and eventuelly returns a PSM
object. The mzID
package relies on the XML package. Is
is slower but is is more robust to variations in the mzID
implementation, as is thus a useful backup when the mzR
backend fails.
system.time(id1 <- PSM(f, parser = "mzR"))
## user system elapsed
## 0.158 0.003 0.160
system.time(id2 <- PSM(f, parser = "mzID"))
## Loading required namespace: mzID
## reading TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzid...
## DONE!
## user system elapsed
## 5.859 0.070 5.939
Other differences in the two parsers include the columns that are
returned, the way they name them, and, as will shown below the matches
that are returned. Note for instance (and this will be important later),
that there is no equivalent of "modLocation"
in
id2
.
names(id1)
## [1] "sequence" "spectrumID"
## [3] "chargeState" "rank"
## [5] "passThreshold" "experimentalMassToCharge"
## [7] "calculatedMassToCharge" "peptideRef"
## [9] "modNum" "isDecoy"
## [11] "post" "pre"
## [13] "start" "end"
## [15] "DatabaseAccess" "DBseqLength"
## [17] "DatabaseSeq" "DatabaseDescription"
## [19] "scan.number.s." "acquisitionNum"
## [21] "spectrumFile" "idFile"
## [23] "MS.GF.RawScore" "MS.GF.DeNovoScore"
## [25] "MS.GF.SpecEValue" "MS.GF.EValue"
## [27] "MS.GF.QValue" "MS.GF.PepQValue"
## [29] "modPeptideRef" "modName"
## [31] "modMass" "modLocation"
## [33] "subOriginalResidue" "subReplacementResidue"
## [35] "subLocation"
names(id2)
## [1] "spectrumid" "scan number(s)"
## [3] "acquisitionnum" "passthreshold"
## [5] "rank" "calculatedmasstocharge"
## [7] "experimentalmasstocharge" "chargestate"
## [9] "ms-gf:denovoscore" "ms-gf:evalue"
## [11] "ms-gf:pepqvalue" "ms-gf:qvalue"
## [13] "ms-gf:rawscore" "ms-gf:specevalue"
## [15] "assumeddissociationmethod" "isotopeerror"
## [17] "isdecoy" "post"
## [19] "pre" "end"
## [21] "start" "accession"
## [23] "length" "description"
## [25] "pepseq" "modified"
## [27] "modification" "idFile"
## [29] "spectrumFile" "databaseFile"
We also have different number of matches in the two tables:
nrow(id1)
## [1] 5802
nrow(id2)
## [1] 5759
table(id1$isDecoy)
##
## FALSE TRUE
## 2906 2896
table(id2$isdecoy)
##
## FALSE TRUE
## 2886 2873
Let’s first filter the PSM tables to facilitate focus the comparison
of relevant scans. Note that the default filterPSMs()
arguments are set to work with both parser.
id1_filtered <- filterPSMs(id1)
## Starting with 5802 PSMs:
## Removed 2896 decoy hits.
## Removed 155 PSMs with rank > 1.
## Removed 85 shared peptides.
## 2666 PSMs left.
id2_filtered <- filterPSMs(id2)
## Starting with 5759 PSMs:
## Removed 2873 decoy hits.
## Removed 155 PSMs with rank > 1.
## Removed 85 shared peptides.
## 2646 PSMs left.
As can be seen, we are also left with 2666 vs 2646 PSMs after filtering.
The difference doesn’t stem from different scans, given that the spectum identifiers are identical in both tables:
## [1] TRUE
The difference is obvious when we tally a table of spectrum id
occurences in the filtered tables. In id2_filtered
, each
scan is unique, i.e matched only once.
anyDuplicated(id2_filtered$spectrumid)
## [1] 0
However, for id1_filtered
, we see that some scans are
still repeat up to 4 times in the table:
##
## 1 2 3 4
## 2630 13 2 1
The example below shows that these differences stem from the
modification location ("modLocation"
), that is not report
by the mzID
parser:
k <- names(which(table(id1_filtered$spectrumID) == 4))
id1_filtered[id1_filtered$spectrumID == k, "sequence"]
## [1] "KCNQCLKVACTLFYCK" "KCNQCLKVACTLFYCK" "KCNQCLKVACTLFYCK" "KCNQCLKVACTLFYCK"
id1_filtered[id1_filtered$spectrumID == k, "modLocation"]
## [1] 2 5 10 15
id1_filtered[id1_filtered$spectrumID == k, "modName"]
## [1] "Carbamidomethyl" "Carbamidomethyl" "Carbamidomethyl" "Carbamidomethyl"
If we remove the "modLocation"
column, we recoved the
same number of PSMs than with the mzID
parser.
## [1] 2646
## [1] 2646
## R Under development (unstable) (2024-12-04 r87420)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] PSMatch_1.11.1 S4Vectors_0.45.2 BiocGenerics_0.53.3
## [4] generics_0.1.3 BiocStyle_2.35.0
##
## loaded via a namespace (and not attached):
## [1] tidyselect_1.2.1 dplyr_1.1.4
## [3] fastmap_1.2.0 lazyeval_0.2.2
## [5] XML_3.99-0.17 digest_0.6.37
## [7] lifecycle_1.0.4 cluster_2.1.8
## [9] ProtGenerics_1.39.0 magrittr_2.0.3
## [11] compiler_4.5.0 rlang_1.1.4
## [13] sass_0.4.9 tools_4.5.0
## [15] igraph_2.1.2 utf8_1.2.4
## [17] yaml_2.3.10 knitr_1.49
## [19] S4Arrays_1.7.1 htmlwidgets_1.6.4
## [21] DelayedArray_0.33.3 plyr_1.8.9
## [23] abind_1.4-8 BiocParallel_1.41.0
## [25] purrr_1.0.2 desc_1.4.3
## [27] grid_4.5.0 fansi_1.0.6
## [29] iterators_1.0.14 MASS_7.3-61
## [31] MultiAssayExperiment_1.33.1 SummarizedExperiment_1.37.0
## [33] cli_3.6.3 mzR_2.41.1
## [35] rmarkdown_2.29 crayon_1.5.3
## [37] ragg_1.3.3 httr_1.4.7
## [39] reshape2_1.4.4 ncdf4_1.23
## [41] cachem_1.1.0 stringr_1.5.1
## [43] zlibbioc_1.53.0 parallel_4.5.0
## [45] AnnotationFilter_1.31.0 BiocManager_1.30.25
## [47] XVector_0.47.0 matrixStats_1.4.1
## [49] vctrs_0.6.5 Matrix_1.7-1
## [51] jsonlite_1.8.9 bookdown_0.41
## [53] IRanges_2.41.2 clue_0.3-66
## [55] systemfonts_1.1.0 foreach_1.5.2
## [57] jquerylib_0.1.4 tidyr_1.3.1
## [59] glue_1.8.0 pkgdown_2.1.1.9000
## [61] codetools_0.2-20 QFeatures_1.17.0
## [63] stringi_1.8.4 GenomeInfoDb_1.43.2
## [65] GenomicRanges_1.59.1 UCSC.utils_1.3.0
## [67] mzID_1.45.0 tibble_3.2.1
## [69] pillar_1.9.0 htmltools_0.5.8.1
## [71] GenomeInfoDbData_1.2.13 R6_2.5.1
## [73] textshaping_0.4.1 doParallel_1.0.17
## [75] evaluate_1.0.1 lattice_0.22-6
## [77] Biobase_2.67.0 msdata_0.47.0
## [79] bslib_0.8.0 Rcpp_1.0.13-1
## [81] SparseArray_1.7.2 xfun_0.49
## [83] MsCoreUtils_1.19.0 fs_1.6.5
## [85] MatrixGenerics_1.19.0 pkgconfig_2.0.3
The rownames aren’t needed here and are removed to
reduce to output in the the next code chunks displaying parts of
id2
.↩︎