These functions convert tabular data into dedicated data
objects. The readSummarizedExperiment() function takes a file
name or data.frame and converts it into a
SummarizedExperiment() object. The readQFeatures() function
takes a data.frame and converts it into a QFeatures object
(see QFeatures() for details). For the latter, two use-cases
exist:
- The single-set case will generate a QFeatures object with a single SummarizedExperiment containing all features of the input table.
- The multi-set case will generate a QFeatures object containing multiple SummarizedExperiments, resulting from splitting the input table. This multi-set case is generally used when the input table contains data from multiple runs/batches.
Usage
readSummarizedExperiment(
assayData,
quantCols = NULL,
fnames = NULL,
ecol = NULL,
...
)
readQFeatures(
assayData,
colData = NULL,
quantCols = NULL,
runCol = NULL,
name = "quants",
removeEmptyCols = FALSE,
verbose = TRUE,
ecol = NULL,
fnames = NULL,
...
)
Arguments
- assayData
A data.frame, or any object that can be coerced into a data.frame, holding the quantitative assay. For readSummarizedExperiment(), this can also be a character(1) pointing to a filename. This data.frame is typically generated by an identification and quantification software, such as Sage, Proteome Discoverer, MaxQuant, ...
- quantCols
A numeric(), logical() or character() defining the columns of the assayData that contain the quantitative data. This information can also be defined in colData (see details).
- fnames
For the single- and multi-set cases, an optional character(1) or numeric(1) indicating the column to be used as feature names. Note that rownames must be unique within QFeatures sets. Default is NULL. See also section 'Feature names'.
- ecol
Same as quantCols. Available for backwards compatibility. Default is NULL. If both ecol and colData are set, an error is thrown.
- ...
Further arguments that can be passed on to read.csv() except stringsAsFactors, which is always FALSE. Only applicable to readSummarizedExperiment().
- colData
A data.frame (or any object that can be coerced to a data.frame) containing sample/column annotations, including quantCols and runCol (see details).
- runCol
For the multi-set case, a numeric(1) or character(1) pointing to the column of assayData (and colData, if set) that contains the runs/batches. Make sure that the column names in both tables are identical and syntactically valid (if you supply a character) or have the same index (if you supply a numeric). Note that characters are converted to syntactically valid names using make.names().
- name
For the single-set case, an optional character(1) to name the set in the QFeatures object. Default is "quants".
- removeEmptyCols
A logical(1). If TRUE, quantitative columns that contain only missing values are removed.
- verbose
A logical(1) indicating whether the progress of the data reading and formatting should be printed to the console. Default is TRUE.
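The runCol description mentions that character column names are passed through base R's make.names(). As a small illustration (the column names below are made up for this sketch and do not come from any particular software export), the following shows how non-syntactic names are rewritten, which may be worth checking before supplying a character runCol:

```r
## Hypothetical column names, as they might appear in an exported table
raw_names <- c("file name", "1run", "run-id", "Run")

## make.names() replaces invalid characters with "." and prepends "X"
## to names that do not start with a letter or a dot
valid_names <- make.names(raw_names)

## Side-by-side view of which names would change
data.frame(raw = raw_names,
           valid = valid_names,
           changed = raw_names != valid_names)
```

Names that are already syntactically valid, such as "Run", are left untouched.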
Value
An instance of class QFeatures or
SummarizedExperiment::SummarizedExperiment(). For the
former, the quantitative sets of each run are stored as
SummarizedExperiment::SummarizedExperiment() objects.
Details
The single- and multi-set cases are defined by the quantCols and
runCol parameters, whether they are passed through the quantCols and
runCol arguments and/or through the colData data.frame (see below).
Single-set case
The quantitative data variables are defined by the quantCols.
The single-set case can be represented schematically as shown
below.
|------+----------------+-----------|
| cols | quantCols 1..N | more cols |
| . | ... | ... |
| . | ... | ... |
| . | ... | ... |
|------+----------------+-----------|
Note that every quantCols column contains data for a single
sample. The single-set case is defined by the absence of any
runCol input (see next section). We here provide a
(non-exhaustive) list of typical data sets that fall under the
single-set case:
- Peptide- or protein-level label-free data (bulk or single-cell).
- Peptide- or protein-level multiplexed (e.g. TMT) data (bulk or single-cell).
- PSM-level multiplexed data acquired in a single MS run (bulk or single-cell).
- PSM-level data from fractionation experiments, where each fraction of the same sample was acquired with the same multiplexing label.
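According to the quantCols argument description, the quantitative columns can be designated by index, by name or by a logical vector. The base R sketch below, which uses a made-up toy table and does not call QFeatures itself, illustrates that the three forms can point to the same columns in a single-set layout:

```r
## Toy single-set table (hypothetical data): one identifier column,
## three quantitative columns (one per sample) and one score column
toy <- data.frame(
    peptide = c("PEPA", "PEPB", "PEPC"),
    S1 = c(10.1, 11.3, NA),
    S2 = c(9.8, 12.0, 8.7),
    S3 = c(10.4, NA, 9.1),
    score = c(0.99, 0.95, 0.90)
)

## Three equivalent ways of designating the quantitative columns
by_index <- 2:4
by_name <- c("S1", "S2", "S3")
by_flag <- grepl("^S[0-9]+$", names(toy))

## Resolve each designation to column names for comparison
resolved <- list(index = names(toy)[by_index],
                 name = by_name,
                 flag = names(toy)[by_flag])

## The quantitative part of the table: features in rows, samples in columns
quant <- as.matrix(toy[, by_name])
dim(quant)
```

Designating columns by name is usually the most robust choice, since indices shift when columns are added to or removed from the input table.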
Multi-set case
A run/batch variable, runCol, is required to import multi-set
data. The multi-set case can be represented schematically as shown
below.
|--------+------+----------------+-----------|
| runCol | cols | quantCols 1..N | more cols |
| 1 | . | ... | ... |
| 1 | . | ... | ... |
|--------+------+----------------+-----------|
| 2 | . | ... | ... |
|--------+------+----------------+-----------|
| . | . | ... | ... |
|--------+------+----------------+-----------|
Every quantCols column contains data for multiple samples
acquired in different runs. The multi-set case applies when
runCol is provided, which will determine how the table is split
into multiple sets.
We here provide a (non-exhaustive) list of typical data sets that fall under the multi-set case:
- PSM- or precursor-level multiplexed data acquired in multiple runs (bulk or single-cell).
- PSM- or precursor-level label-free data acquired in multiple runs (bulk or single-cell).
- DIA-NN data (see also readQFeaturesFromDIANN()).
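To make the splitting idea concrete, here is a conceptual base R sketch. It is not the implementation used by readQFeatures(); the toy table and the column name run are assumptions made for illustration. The sample naming pattern, run followed by the quantitative column name, mirrors the row names of the colData shown in the multi-set examples.

```r
## Toy multi-run table (hypothetical data); 'run' plays the role of runCol
toy <- data.frame(
    run = c("File1", "File1", "File2", "File2", "File2"),
    peptide = c("PEPA", "PEPB", "PEPA", "PEPC", "PEPD"),
    X126 = c(0.1, 0.2, 0.3, 0.4, 0.5),
    X127 = c(0.6, 0.7, 0.8, 0.9, 1.0)
)
quant_cols <- c("X126", "X127")

## One sub-table per run
per_run <- split(toy, toy$run)

## One quantitative matrix per run, with samples named <run>_<quantCol>
sets <- lapply(names(per_run), function(run) {
    m <- as.matrix(per_run[[run]][, quant_cols])
    dimnames(m) <- list(per_run[[run]]$peptide,
                        paste(run, quant_cols, sep = "_"))
    m
})
names(sets) <- names(per_run)

## Number of features in each set
vapply(sets, nrow, integer(1))
```

Each run ends up with its own set of features, while every quantitative column is repeated once per run, which is why the number of samples is the number of runs times the number of quantitative columns.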
Adding sample annotations with colData
We recommend providing sample annotations when creating a
QFeatures object. The colData is a table in which each row
corresponds to a sample and each column provides information about
the samples. There is no restriction on the number of columns and
on the type of data they should contain. However, we require one or
two columns (depending on the use case) that link the
annotations of each sample to its quantitative data:
- Single-set case: the colData must contain a column named quantCols that provides the names of the columns in assayData containing quantitative values for each sample (see single-set cases in the examples).
- Multi-set case: the colData must contain a column named quantCols that provides the names of the columns in assayData with the quantitative values for each sample, and a column named runCol that provides the MS runs/batches in which each sample has been acquired. The entries in colData[["runCol"]] are matched against the entries provided by assayData[[runCol]].
When the quantCols argument is not provided to
readQFeatures(), the function will automatically determine the
quantCols from colData[["quantCols"]]. Therefore, quantCols
and colData cannot both be missing.
Samples that are present in assayData but absent from
colData will lead to a warning, and the missing entries will be
automatically added to the colData and filled with NAs.
When using the quantCols and runCol arguments only
(without colData), the colData contains zero
columns/variables.
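For the multi-set case, a colData table needs one row per run and quantitative column combination. The following base R sketch builds such a table and checks that no sample is left unannotated. The run names, channel names and the condition variable are invented for illustration, and the run_channel sample naming is inferred from the row names of the colData shown in the examples.

```r
## Hypothetical runs and quantitative columns
runs <- c("File1", "File2", "File3")
channels <- c("X126", "X127C", "X127N")

## One row per run/channel combination; the column names 'quantCols'
## and 'runCol' are the ones required by readQFeatures()
coldat <- expand.grid(quantCols = channels, runCol = runs,
                      stringsAsFactors = FALSE)

## A made-up sample annotation
coldat$condition <- rep(c("ctrl", "treated", "blank"), times = length(runs))

## Samples expected from the data versus samples annotated in coldat
expected <- as.vector(outer(channels, runs,
                            function(ch, run) paste(run, ch, sep = "_")))
annotated <- paste(coldat$runCol, coldat$quantCols, sep = "_")

## Samples that would be missing from the annotation table
missing_samples <- setdiff(expected, annotated)
missing_samples
```

An empty result indicates that every sample is annotated; any sample listed here would be one of those added to the colData with NAs, as described above.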
Feature names
Assay feature names (i.e. rownames) are important as they are used when assays are
joined with joinAssays(). They can be set upon creation of the
QFeatures() object by setting the fnames argument. See also
createPrecursorId() in case a precursor identifier is not readily
available and should be created from other, existing rowData variables.
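As a minimal sketch of setting feature names at import, the snippet below uses a made-up toy table whose peptide column holds unique identifiers. It only relies on arguments documented on this page (quantCols, fnames, name, verbose) and is skipped when the QFeatures package is not installed.

```r
## Toy table (hypothetical data) with a unique identifier column
toy <- data.frame(
    peptide = c("PEPA", "PEPB", "PEPC"),
    S1 = c(10.1, 11.3, NA),
    S2 = c(9.8, 12.0, 8.7)
)

## Rownames must be unique within a set, so check before importing
stopifnot(anyDuplicated(toy$peptide) == 0)

if (requireNamespace("QFeatures", quietly = TRUE)) {
    suppressPackageStartupMessages(library(QFeatures))
    ## Use the 'peptide' column as feature names of the 'peptides' set
    qf <- readQFeatures(toy, quantCols = c("S1", "S2"),
                        fnames = "peptide", name = "peptides",
                        verbose = FALSE)
    print(rownames(qf[["peptides"]]))
}
```

Checking for duplicates beforehand is a defensive step added for this sketch; when the candidate column is not unique, another column or a constructed identifier may be a better choice.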
See also
- The QFeatures class (see QFeatures()) to read about how to manipulate the resulting QFeatures object.
- The readQFeaturesFromDIANN() function to import DIA-NN quantitative data.
Examples
######################################
## Single-set case.
## Load a data.frame with PSM-level data
data(hlpsms)
hlpsms[1:10, c(1, 2, 10:11, 14, 17)]
#> X126 X127C X131 Sequence ProteinGroupAccessions PEP
#> 383 0.12283431 0.08045915 0.11961594 SQGEIDk Q8BYY4 0.11800
#> 475 0.35268185 0.14162381 0.02957384 YEAQGDk P46467 0.01070
#> 478 0.01546089 0.16142297 0.04370403 TTScDTk Q64449 0.11800
#> 552 0.04702854 0.09288723 0.10014038 aEELESR P60469 0.04450
#> 596 0.01044693 0.15866147 0.02307803 aQEEAIk P13597-2 0.00850
#> 610 0.04955362 0.01215244 0.29732174 dGAVDGcR Q6P5D8 0.00322
#> 731 0.04007112 0.06632932 0.10188731 AcDSAEVk Q01237 0.04090
#> 786 0.16122744 0.10251588 0.04884985 VSSDEDLk Q9D8U8 0.00130
#> 795 0.60288497 0.11022069 0.02182222 TDQNYEk Q8BMJ2 0.01880
#> 816 0.10298287 0.05818306 0.07723716 QEEIQQk Q3URD3 0.02900
## Create a QFeatures object with a single psms set
qf1 <- readQFeatures(hlpsms, quantCols = 1:10, name = "psms")
#> Checking arguments.
#> Loading data as a 'SummarizedExperiment' object.
#> Formatting sample annotations (colData).
#> Formatting data as a 'QFeatures' object.
qf1
#> An instance of class QFeatures (type: bulk) with 1 set:
#>
#> [1] psms: SummarizedExperiment with 3010 rows and 10 columns
colData(qf1)
#> DataFrame with 10 rows and 0 columns
######################################
## Single-set case with colData.
(coldat <- data.frame(var = rnorm(10),
quantCols = names(hlpsms)[1:10]))
#> var quantCols
#> 1 0.8497468 X126
#> 2 0.6412677 X127C
#> 3 -0.8533205 X127N
#> 4 1.4502467 X128C
#> 5 -0.5834776 X128N
#> 6 -1.5154718 X129C
#> 7 1.4507215 X129N
#> 8 -0.8785367 X130C
#> 9 0.4751676 X130N
#> 10 -1.7268766 X131
qf2 <- readQFeatures(hlpsms, colData = coldat)
#> Checking arguments.
#> Loading data as a 'SummarizedExperiment' object.
#> Formatting sample annotations (colData).
#> Formatting data as a 'QFeatures' object.
qf2
#> An instance of class QFeatures (type: bulk) with 1 set:
#>
#> [1] quants: SummarizedExperiment with 3010 rows and 10 columns
colData(qf2)
#> DataFrame with 10 rows and 2 columns
#> var quantCols
#> <numeric> <character>
#> X126 0.849747 X126
#> X127C 0.641268 X127C
#> X127N -0.853321 X127N
#> X128C 1.450247 X128C
#> X128N -0.583478 X128N
#> X129C -1.515472 X129C
#> X129N 1.450721 X129N
#> X130C -0.878537 X130C
#> X130N 0.475168 X130N
#> X131 -1.726877 X131
######################################
## Multi-set case.
## Let's simulate 3 different files/batches for that same input
## data.frame, and define a colData data.frame.
hlpsms$file <- paste0("File", sample(1:3, nrow(hlpsms), replace = TRUE))
hlpsms[1:10, c(1, 2, 10:11, 14, 17, 29)]
#> X126 X127C X131 Sequence ProteinGroupAccessions PEP
#> 383 0.12283431 0.08045915 0.11961594 SQGEIDk Q8BYY4 0.11800
#> 475 0.35268185 0.14162381 0.02957384 YEAQGDk P46467 0.01070
#> 478 0.01546089 0.16142297 0.04370403 TTScDTk Q64449 0.11800
#> 552 0.04702854 0.09288723 0.10014038 aEELESR P60469 0.04450
#> 596 0.01044693 0.15866147 0.02307803 aQEEAIk P13597-2 0.00850
#> 610 0.04955362 0.01215244 0.29732174 dGAVDGcR Q6P5D8 0.00322
#> 731 0.04007112 0.06632932 0.10188731 AcDSAEVk Q01237 0.04090
#> 786 0.16122744 0.10251588 0.04884985 VSSDEDLk Q9D8U8 0.00130
#> 795 0.60288497 0.11022069 0.02182222 TDQNYEk Q8BMJ2 0.01880
#> 816 0.10298287 0.05818306 0.07723716 QEEIQQk Q3URD3 0.02900
#> file
#> 383 File2
#> 475 File2
#> 478 File1
#> 552 File2
#> 596 File1
#> 610 File2
#> 731 File2
#> 786 File2
#> 795 File3
#> 816 File1
qf3 <- readQFeatures(hlpsms, quantCols = 1:10, runCol = "file")
#> Checking arguments.
#> Loading data as a 'SummarizedExperiment' object.
#> Splitting data in runs.
#> Formatting sample annotations (colData).
#> Formatting data as a 'QFeatures' object.
qf3
#> An instance of class QFeatures (type: bulk) with 3 sets:
#>
#> [1] File1: SummarizedExperiment with 987 rows and 10 columns
#> [2] File2: SummarizedExperiment with 1000 rows and 10 columns
#> [3] File3: SummarizedExperiment with 1023 rows and 10 columns
colData(qf3)
#> DataFrame with 30 rows and 0 columns
######################################
## Multi-set case with colData.
(coldat <- data.frame(runCol = rep(paste0("File", 1:3), each = 10),
var = rnorm(10),
quantCols = names(hlpsms)[1:10]))
#> runCol var quantCols
#> 1 File1 -0.5998708 X126
#> 2 File1 1.4683437 X127C
#> 3 File1 -0.6198368 X127N
#> 4 File1 0.5668461 X128C
#> 5 File1 1.3854334 X128N
#> 6 File1 1.3984792 X129C
#> 7 File1 -1.7610803 X129N
#> 8 File1 -0.4927893 X130C
#> 9 File1 0.8543945 X130N
#> 10 File1 0.3859256 X131
#> 11 File2 -0.5998708 X126
#> 12 File2 1.4683437 X127C
#> 13 File2 -0.6198368 X127N
#> 14 File2 0.5668461 X128C
#> 15 File2 1.3854334 X128N
#> 16 File2 1.3984792 X129C
#> 17 File2 -1.7610803 X129N
#> 18 File2 -0.4927893 X130C
#> 19 File2 0.8543945 X130N
#> 20 File2 0.3859256 X131
#> 21 File3 -0.5998708 X126
#> 22 File3 1.4683437 X127C
#> 23 File3 -0.6198368 X127N
#> 24 File3 0.5668461 X128C
#> 25 File3 1.3854334 X128N
#> 26 File3 1.3984792 X129C
#> 27 File3 -1.7610803 X129N
#> 28 File3 -0.4927893 X130C
#> 29 File3 0.8543945 X130N
#> 30 File3 0.3859256 X131
qf4 <- readQFeatures(hlpsms, colData = coldat, runCol = "file")
#> Checking arguments.
#> Loading data as a 'SummarizedExperiment' object.
#> Splitting data in runs.
#> Formatting sample annotations (colData).
#> Formatting data as a 'QFeatures' object.
qf4
#> An instance of class QFeatures (type: bulk) with 3 sets:
#>
#> [1] File1: SummarizedExperiment with 987 rows and 10 columns
#> [2] File2: SummarizedExperiment with 1000 rows and 10 columns
#> [3] File3: SummarizedExperiment with 1023 rows and 10 columns
colData(qf4)
#> DataFrame with 30 rows and 3 columns
#> runCol var quantCols
#> <character> <numeric> <character>
#> File1_X126 File1 -0.599871 X126
#> File1_X127C File1 1.468344 X127C
#> File1_X127N File1 -0.619837 X127N
#> File1_X128C File1 0.566846 X128C
#> File1_X128N File1 1.385433 X128N
#> ... ... ... ...
#> File3_X129C File3 1.398479 X129C
#> File3_X129N File3 -1.761080 X129N
#> File3_X130C File3 -0.492789 X130C
#> File3_X130N File3 0.854394 X130N
#> File3_X131 File3 0.385926 X131
