QFeatures from tabular data

These functions convert tabular data into dedicated data objets. The readSummarizedExperiment() function takes a file name or data.frame and converts it into a SummarizedExperiment() object. The readQFeatures() function takes a data.frame and converts it into a QFeatures object (see QFeatures() for details). For the latter, two use-cases exist:

The single-set case will generate a QFeatures object with a single SummarizedExperiment containing all features of the input table.
The multi-set case will generate a QFeatures object containing multiple SummarizedExperiments, resulting from splitting the input table. This multi-set case is generally used when the input table contains data from multiple runs/batches.

Usage

readSummarizedExperiment(
  assayData,
  quantCols = NULL,
  fnames = NULL,
  ecol = NULL,
  ...
)

readQFeatures(
  assayData,
  colData = NULL,
  quantCols = NULL,
  runCol = NULL,
  name = "quants",
  removeEmptyCols = FALSE,
  verbose = TRUE,
  ecol = NULL,
  fnames = NULL,
  ...
)

Arguments

assayData: A data.frame, or any object that can be coerced into a data.frame, holding the quantitative assay. For readSummarizedExperiment(), this can also be a character(1) pointing to a filename. This data.frame is typically generated by an identification and quantification software, such as Sage, Proteome Discoverer, MaxQuant, ...
quantCols: A numeric(), logical() or character() defining the columns of the assayData that contain the quantitative data. This information can also be defined in colData (see details).
fnames: For the single- and multi-set cases, an optional character(1) or numeric(1) indicating the column to be used as feature names. Note that rownames must be unique within QFeatures sets. Default is NULL.
ecol: Same as quantCols. Available for backwards compatibility. Default is NULL. If both ecol and colData are set, an error is thrown.
...: Further arguments that can be passed on to read.csv() except stringsAsFactors, which is always FALSE. Only applicable to readSummarizedExperiment().
colData: A data.frame (or any object that can be coerced to a data.frame) containing sample/column annotations, including quantCols and runCol (see details).
runCol: For the multi-set case, a numeric(1) or character(1) pointing to the column of assayData (and colData, is set) that contains the runs/batches. Make sure that the column name in both tables are identical and syntactically valid (if you supply a character) or have the same index (if you supply a numeric). Note that characters are converted to syntactically valid names using make.names
name: For the single-set case, an optional character(1) to name the set in the QFeatures object. Default is quants.
removeEmptyCols: A logical(1). If TRUE, quantitative columns that contain only missing values are removed.
verbose: A logical(1) indicating whether the progress of the data reading and formatting should be printed to the console. Default is TRUE.

Value

An instance of class QFeatures or SummarizedExperiment::SummarizedExperiment(). For the former, the quantitative sets of each run are stored in SummarizedExperiment::SummarizedExperiment() object.

Details

The single- and multi-set cases are defined by the quantCols and runCol parameters, whether passed by the quantCols and runCol vectors and/or the colData data.frame (see below).

Single-set case

The quantitative data variables are defined by the quantCols. The single-set case can be represented schematically as shown below.

|------+----------------+-----------|
| cols | quantCols 1..N | more cols |
| .    | ...            | ...       |
| .    | ...            | ...       |
| .    | ...            | ...       |
|------+----------------+-----------|

Note that every quantCols column contains data for a single sample. The single-set case is defined by the absence of any runCol input (see next section). We here provide a (non-exhaustive) list of typical data sets that fall under the single-set case:

Peptide- or protein-level label-free data (bulk or single-cell).
Peptide- or protein-level multiplexed (e.g. TMT) data (bulk or single-cell).
PSM-level multiplexed data acquired in a single MS run (bulk or single-cell).
PSM-level data from fractionation experiments, where each fraction of the same sample was acquired with the same multiplexing label.

Multi-set case

A run/batch variable, runCol, is required to import multi-set data. The multi-set case can be represented schematically as shown below.

|--------+------+----------------+-----------|
| runCol | cols | quantCols 1..N | more cols |
|   1    | .    | ...            | ...       |
|   1    | .    | ...            | ...       |
|--------+------+----------------+-----------|
|   2    | .    | ...            | ...       |
|--------+------+----------------+-----------|
|   .    | .    | ...            | ...       |
|--------+------+----------------+-----------|

Every quantCols column contains data for multiple samples acquired in different runs. The multi-set case applies when runCol is provided, which will determine how the table is split into multiple sets.

We here provide a (non-exhaustive) list of typical data sets that fall under the multi-set case:

PSM- or precursor-level multiplexed data acquired in multiple runs (bulk or single-cell)
PSM- or precursor-level label-free data acquired in multiple runs (bulk or single-cell)
DIA-NN data (see also readQFeaturesFromDIANN()).

Adding sample annotations with `colData`

We recommend providing sample annotations when creating a QFeatures object. The colData is a table in which each row corresponds to a sample and each column provides information about the samples. There is no restriction on the number of columns and on the type of data they should contain. However, we impose one or two columns (depending on the use case) that allow to link the annotations of each sample to its quantitative data:

Single-set case: the colData must contain a column named quantCols that provides the names of the columns in assayData containing quantitative values for each sample (see single-set cases in the examples).
Multi-set case: the colData must contain a column named quantCols that provides the names of the columns in assayData with the quantitative values for each sample, and a column named runCol that provides the MS runs/batches in which each sample has been acquired. The entries in colData[["runCol"]] are matched against the entries provided by assayData[[runCol]].

When the quantCols argument is not provided to readQFeatures(), the function will automatically determine the quantCols from colData[["quantCols"]]. Therefore, quantCols and colData cannot be both missing.

Samples that are present in assayData but absent colData will lead to a warning, and the missing entries will be automatically added to the colData and filled with NAs.

When using the quantCols and runCol arguments only (without colData), the colData contains zero columns/variables.

Author

Laurent Gatto, Christophe Vanderaa

Examples


######################################
## Single-set case.

## Load a data.frame with PSM-level data
data(hlpsms)
hlpsms[1:10, c(1, 2, 10:11, 14, 17)]
#>           X126      X127C       X131 Sequence ProteinGroupAccessions     PEP
#> 383 0.12283431 0.08045915 0.11961594  SQGEIDk                 Q8BYY4 0.11800
#> 475 0.35268185 0.14162381 0.02957384  YEAQGDk                 P46467 0.01070
#> 478 0.01546089 0.16142297 0.04370403  TTScDTk                 Q64449 0.11800
#> 552 0.04702854 0.09288723 0.10014038  aEELESR                 P60469 0.04450
#> 596 0.01044693 0.15866147 0.02307803  aQEEAIk               P13597-2 0.00850
#> 610 0.04955362 0.01215244 0.29732174 dGAVDGcR                 Q6P5D8 0.00322
#> 731 0.04007112 0.06632932 0.10188731 AcDSAEVk                 Q01237 0.04090
#> 786 0.16122744 0.10251588 0.04884985 VSSDEDLk                 Q9D8U8 0.00130
#> 795 0.60288497 0.11022069 0.02182222  TDQNYEk                 Q8BMJ2 0.01880
#> 816 0.10298287 0.05818306 0.07723716  QEEIQQk                 Q3URD3 0.02900

## Create a QFeatures object with a single psms set
qf1 <- readQFeatures(hlpsms, quantCols = 1:10, name = "psms")
#> Checking arguments.
#> Loading data as a 'SummarizedExperiment' object.
#> Formatting sample annotations (colData).
#> Formatting data as a 'QFeatures' object.
qf1
#> An instance of class QFeatures (type: bulk) with 1 set:
#> 
#>  [1] psms: SummarizedExperiment with 3010 rows and 10 columns 
colData(qf1)
#> DataFrame with 10 rows and 0 columns

######################################
## Single-set case with colData.

(coldat <- data.frame(var = rnorm(10),
                      quantCols = names(hlpsms)[1:10]))
#>           var quantCols
#> 1   0.7543883      X126
#> 2  -1.4626257     X127C
#> 3  -0.1084145     X127N
#> 4   0.7761296     X128C
#> 5   0.6055939     X128N
#> 6   0.4827772     X129C
#> 7   1.6805106     X129N
#> 8  -0.1508190     X130C
#> 9   0.6795999     X130N
#> 10  0.4499002      X131
qf2 <- readQFeatures(hlpsms, colData = coldat)
#> Checking arguments.
#> Loading data as a 'SummarizedExperiment' object.
#> Formatting sample annotations (colData).
#> Formatting data as a 'QFeatures' object.
qf2
#> An instance of class QFeatures (type: bulk) with 1 set:
#> 
#>  [1] quants: SummarizedExperiment with 3010 rows and 10 columns 
colData(qf2)
#> DataFrame with 10 rows and 2 columns
#>             var   quantCols
#>       <numeric> <character>
#> X126   0.754388        X126
#> X127C -1.462626       X127C
#> X127N -0.108415       X127N
#> X128C  0.776130       X128C
#> X128N  0.605594       X128N
#> X129C  0.482777       X129C
#> X129N  1.680511       X129N
#> X130C -0.150819       X130C
#> X130N  0.679600       X130N
#> X131   0.449900        X131

######################################
## Multi-set case.

## Let's simulate 3 different files/batches for that same input
## data.frame, and define a colData data.frame.

hlpsms$file <- paste0("File", sample(1:3, nrow(hlpsms), replace = TRUE))
hlpsms[1:10, c(1, 2, 10:11, 14, 17, 29)]
#>           X126      X127C       X131 Sequence ProteinGroupAccessions     PEP
#> 383 0.12283431 0.08045915 0.11961594  SQGEIDk                 Q8BYY4 0.11800
#> 475 0.35268185 0.14162381 0.02957384  YEAQGDk                 P46467 0.01070
#> 478 0.01546089 0.16142297 0.04370403  TTScDTk                 Q64449 0.11800
#> 552 0.04702854 0.09288723 0.10014038  aEELESR                 P60469 0.04450
#> 596 0.01044693 0.15866147 0.02307803  aQEEAIk               P13597-2 0.00850
#> 610 0.04955362 0.01215244 0.29732174 dGAVDGcR                 Q6P5D8 0.00322
#> 731 0.04007112 0.06632932 0.10188731 AcDSAEVk                 Q01237 0.04090
#> 786 0.16122744 0.10251588 0.04884985 VSSDEDLk                 Q9D8U8 0.00130
#> 795 0.60288497 0.11022069 0.02182222  TDQNYEk                 Q8BMJ2 0.01880
#> 816 0.10298287 0.05818306 0.07723716  QEEIQQk                 Q3URD3 0.02900
#>      file
#> 383 File3
#> 475 File3
#> 478 File2
#> 552 File3
#> 596 File1
#> 610 File2
#> 731 File3
#> 786 File3
#> 795 File2
#> 816 File2

qf3 <- readQFeatures(hlpsms, quantCols = 1:10, runCol = "file")
#> Checking arguments.
#> Loading data as a 'SummarizedExperiment' object.
#> Splitting data in runs.
#> Formatting sample annotations (colData).
#> Formatting data as a 'QFeatures' object.
qf3
#> An instance of class QFeatures (type: bulk) with 3 sets:
#> 
#>  [1] File1: SummarizedExperiment with 986 rows and 10 columns 
#>  [2] File2: SummarizedExperiment with 998 rows and 10 columns 
#>  [3] File3: SummarizedExperiment with 1026 rows and 10 columns 
colData(qf3)
#> DataFrame with 30 rows and 0 columns


######################################
## Multi-set case with colData.

(coldat <- data.frame(runCol = rep(paste0("File", 1:3), each = 10),
                      var = rnorm(10),
                      quantCols = names(hlpsms)[1:10]))
#>    runCol         var quantCols
#> 1   File1  0.51626857      X126
#> 2   File1  1.02704216     X127C
#> 3   File1  0.73166309     X127N
#> 4   File1 -0.11245031     X128C
#> 5   File1  1.42917302     X128N
#> 6   File1  0.89530083     X129C
#> 7   File1 -1.50043104     X129N
#> 8   File1 -0.70734051     X130C
#> 9   File1 -0.11555948     X130N
#> 10  File1  0.08755882      X131
#> 11  File2  0.51626857      X126
#> 12  File2  1.02704216     X127C
#> 13  File2  0.73166309     X127N
#> 14  File2 -0.11245031     X128C
#> 15  File2  1.42917302     X128N
#> 16  File2  0.89530083     X129C
#> 17  File2 -1.50043104     X129N
#> 18  File2 -0.70734051     X130C
#> 19  File2 -0.11555948     X130N
#> 20  File2  0.08755882      X131
#> 21  File3  0.51626857      X126
#> 22  File3  1.02704216     X127C
#> 23  File3  0.73166309     X127N
#> 24  File3 -0.11245031     X128C
#> 25  File3  1.42917302     X128N
#> 26  File3  0.89530083     X129C
#> 27  File3 -1.50043104     X129N
#> 28  File3 -0.70734051     X130C
#> 29  File3 -0.11555948     X130N
#> 30  File3  0.08755882      X131
qf4 <- readQFeatures(hlpsms, colData = coldat, runCol = "file")
#> Checking arguments.
#> Loading data as a 'SummarizedExperiment' object.
#> Splitting data in runs.
#> Formatting sample annotations (colData).
#> Formatting data as a 'QFeatures' object.
qf4
#> An instance of class QFeatures (type: bulk) with 3 sets:
#> 
#>  [1] File1: SummarizedExperiment with 986 rows and 10 columns 
#>  [2] File2: SummarizedExperiment with 998 rows and 10 columns 
#>  [3] File3: SummarizedExperiment with 1026 rows and 10 columns 
colData(qf4)
#> DataFrame with 30 rows and 3 columns
#>                  runCol        var   quantCols
#>             <character>  <numeric> <character>
#> File1_X126        File1   0.516269        X126
#> File1_X127C       File1   1.027042       X127C
#> File1_X127N       File1   0.731663       X127N
#> File1_X128C       File1  -0.112450       X128C
#> File1_X128N       File1   1.429173       X128N
#> ...                 ...        ...         ...
#> File3_X129C       File3  0.8953008       X129C
#> File3_X129N       File3 -1.5004310       X129N
#> File3_X130C       File3 -0.7073405       X130C
#> File3_X130N       File3 -0.1155595       X130N
#> File3_X131        File3  0.0875588        X131