Abstract
This vignette describes the different ways to perform imputation
using the impute() methods and the underlying support
functions from the MsCoreUtils package. This vignette is
distributed under a CC BY-SA license.
Introduction
This vignette provides a technical description of the imputation functionality available in the R for Mass Spectrometry packages, in particular MsCoreUtils for the implementation and QFeatures for the high-level application. These packages in turn depend on other packages for the implementation of specific imputation methods.
This vignette focuses on the technical aspects of imputation, without delving too much into the scientific motivations - see (Webb-Robertson et al. 2015; Lazar et al. 2016; Bramer et al. 2021) for the necessary background. We will simply introduce important concepts when needed and refer to some relevant papers for further reading.
An important concept, described among others in (Lazar et al. 2016), is that data can be missing at random (MAR) or missing not at random (MNAR). An MNAR feature is assumed to be missing from the data because it was effectively absent or below the limit of detection in the biological sample. MAR features, on the other hand, have not been detected or identified due to technological limitations such as poor ionisation, competition among precursors (in data-dependent acquisition), or absence of identification or mis-identification. Given the different underlying causes of the missingness, they should be imputed using different approaches. Typically, MNAR features can be imputed using left-censored methods, which impute using a small value reflective of the absence of the feature, while MAR features should be imputed using hot deck approaches, i.e. methods that impute using similar or matching values.
We would like to caution users on the risks of imputation, in
particular when a high proportion of values are missing. Given the
different types of missingness, wrongly imputing values can
substantially distort downstream analyses and their validity. In such
situations, it might be safer to avoid imputation altogether, and
keep the missing values. It can also be helpful to filter out features
with too many missing values - the filterNA()
function can be used to that effect.
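The idea behind filtering features on their proportion of missing values can be sketched in base R on a plain matrix. This is only an illustration of the principle, not the filterNA() implementation; the toy matrix x and the threshold name max_prop_na are assumptions made for this example.

```r
# Toy quantitative matrix: 4 features (rows) x 5 samples (columns).
x <- matrix(c( 1, NA, NA, NA,  5,
               2,  3,  4,  5,  6,
              NA,  8,  9, 10, 11,
              NA, NA, NA, NA, 16),
            nrow = 4, byrow = TRUE,
            dimnames = list(paste0("F", 1:4), paste0("S", 1:5)))

# Proportion of missing values in each feature (row).
prop_na <- rowMeans(is.na(x))

# Hypothetical threshold: keep features with at most 50% missing values.
max_prop_na <- 0.5
x_filtered <- x[prop_na <= max_prop_na, , drop = FALSE]
rownames(x_filtered)
```

Features F1 (60% missing) and F4 (80% missing) are dropped, so that any later imputation only has to fill in a small fraction of each remaining feature.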
The imputation methods available in the MsCoreUtils
package can be listed programmatically with the
imputeMethods() function and are documented in the
?impute_matrix documentation
page.
## [1] "bpca" "knn" "QRILC" "MLE" "MLE2" "MinDet" "MinProb"
## [8] "min" "zero" "mixed" "nbavg" "with" "RF" "none"
Note that a mass spectrometer cannot technically record 0s, so they
should never be observed in a dataset. If present, they are the result
of a prior zero-imputation by the pre-processing software, which
erroneously suggests that the feature was of the MNAR type and
effectively absent in the sample. We advise starting your processing
by replacing these misleading 0s with missing values. This can be
achieved with the zeroIsNA() method if your data is
formatted as a SummarizedExperiment object (Morgan et al. 2019).
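For a plain matrix, the same replacement can be sketched in base R. This is a minimal illustration of the principle rather than the zeroIsNA() implementation, and the toy matrix x is an assumption made for this example.

```r
# Toy matrix containing two recorded zeros and one existing missing value.
x <- matrix(c(0, 2, 3,
              4, 0, NA),
            nrow = 2, byrow = TRUE)

# Replace recorded zeros by NA; existing NAs are left untouched.
x[!is.na(x) & x == 0] <- NA

# Number of missing values after the replacement.
sum(is.na(x))
```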
Example data
m <- matrix(1:50, nrow = 10)
diag(m) <- NA
m[which(is.na(m)) + 5] <- NA
dimnames(m) <- list(paste0("F", 1:10), paste0("S", 1:5))
randna <- rep(c(TRUE, FALSE), each = 5)
se <- SummarizedExperiment(assays = m,
rowData = data.frame(randna))
se
## class: SummarizedExperiment
## dim: 10 5
## metadata(0):
## assays(1): ''
## rownames(10): F1 F2 ... F9 F10
## rowData names(1): randna
## colnames(5): S1 S2 S3 S4 S5
## colData names(0):
We are going to use this small SummarizedExperiment to
illustrate the different imputation approaches and their
parametrisation. It is composed of 10 features and 5 samples, and
contains 10 missing values: 5 aligned diagonally in the top half of the
data matrix, and 5 in the bottom half.
assay(se)
## S1 S2 S3 S4 S5
## F1 NA 11 21 31 41
## F2 2 NA 22 32 42
## F3 3 13 NA 33 43
## F4 4 14 24 NA 44
## F5 5 15 25 35 NA
## F6 NA 16 26 36 46
## F7 7 NA 27 37 47
## F8 8 18 NA 38 48
## F9 9 19 29 NA 49
## F10 10 20 30 40 NA
In the following sections, we will use the impute()
method (see ?QFeatures::impute) and apply it to the
SummarizedExperiment instance se above. The
method is also applicable to QFeatures objects. The
individual impute_matrix() and other impute_*() functions
from the MsCoreUtils package can be applied directly to
matrix objects.
Simple imputation
We refer to simple (or single) imputation when a
single imputation method is used across the whole dataset. For example,
if we want to replace all missing values by 0, we can use the
impute() method as shown below.
impute(se, method = "zero") |> assay()
## S1 S2 S3 S4 S5
## F1 0 11 21 31 41
## F2 2 0 22 32 42
## F3 3 13 0 33 43
## F4 4 14 24 0 44
## F5 5 15 25 35 0
## F6 0 16 26 36 46
## F7 7 0 27 37 47
## F8 8 18 0 38 48
## F9 9 19 29 0 49
## F10 10 20 30 40 0
Setting the method to "zero" will apply the
MsCoreUtils::impute_zero() function on the object’s
assay.
Passing parameters to the imputation function
Or, if we want to impute all missing values with a specific value
such as 0.5, we can use the "with" method to apply the
MsCoreUtils::impute_with() function. This function requires
an additional argument, val, that defines the specific
value that should be used to replace missing values.
impute(se, method = "with", val = 0.5) |> assay()
## S1 S2 S3 S4 S5
## F1 0.5 11.0 21.0 31.0 41.0
## F2 2.0 0.5 22.0 32.0 42.0
## F3 3.0 13.0 0.5 33.0 43.0
## F4 4.0 14.0 24.0 0.5 44.0
## F5 5.0 15.0 25.0 35.0 0.5
## F6 0.5 16.0 26.0 36.0 46.0
## F7 7.0 0.5 27.0 37.0 47.0
## F8 8.0 18.0 0.5 38.0 48.0
## F9 9.0 19.0 29.0 0.5 49.0
## F10 10.0 20.0 30.0 40.0 0.5
The details of each underlying function are documented in the
?impute_zero, ?impute_with, … manual pages,
which all lead to the main ?impute_matrix documentation.
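Conceptually, both the "zero" and "with" methods replace every missing value by one constant. The base R sketch below reproduces that behaviour on the example matrix; the helper impute_constant() is a hypothetical name introduced for illustration and is not part of MsCoreUtils.

```r
# Rebuild the example matrix used throughout this vignette.
m <- matrix(1:50, nrow = 10,
            dimnames = list(paste0("F", 1:10), paste0("S", 1:5)))
diag(m) <- NA
m[which(is.na(m)) + 5] <- NA

# Hypothetical helper: replace every missing value by a single constant.
impute_constant <- function(x, val) {
    x[is.na(x)] <- val
    x
}

m_zero <- impute_constant(m, 0)    # analogous to method = "zero"
m_half <- impute_constant(m, 0.5)  # analogous to method = "with", val = 0.5
m_half[c("F1", "F6"), ]
```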
The MARGIN argument
In the two simple examples above, there is no sense of direction when
imputing, as every missing value is replaced by a single, pre-defined
value. For many methods, however, the direction matters. To illustrate this,
let’s use the "MinDet" method (see
?impute_MinDet). MinDet performs the imputation of
left-censored missing data using a deterministic minimal value approach.
Considering an expression dataset with n samples and p
features, for each sample, the missing entries are replaced
with a minimal value observed in that sample. The minimal value observed
is estimated as being the q-th quantile (default q = 0.01)
of the observed values in that sample.
Below, we are going to set q = 0 to impute with the
minimal value within each sample.
impute(se, method = "MinDet", q = 0) |> assay()
## Imputing along margin 2 (samples/columns).
## S1 S2 S3 S4 S5
## F1 2 11 21 31 41
## F2 2 11 22 32 42
## F3 3 13 21 33 43
## F4 4 14 24 31 44
## F5 5 15 25 35 41
## F6 2 16 26 36 46
## F7 7 11 27 37 47
## F8 8 18 21 38 48
## F9 9 19 29 31 49
## F10 10 20 30 40 41
As can be seen, the missing values in sample S1, namely F1 and F6, have been imputed by the smallest observed value in S1, namely 2. And similarly for the four other samples.
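The per-sample computation can be reproduced by hand in base R. The sketch below assumes that the minimal value is obtained with stats::quantile() and its default settings; it illustrates the definition given above and is not the package's code.

```r
# Rebuild the example matrix used throughout this vignette.
m <- matrix(1:50, nrow = 10,
            dimnames = list(paste0("F", 1:10), paste0("S", 1:5)))
diag(m) <- NA
m[which(is.na(m)) + 5] <- NA

# q-th quantile of the observed values in each sample (column);
# q = 0 gives the smallest observed value.
q <- 0
col_min <- apply(m, 2, quantile, probs = q, na.rm = TRUE, names = FALSE)

# Replace the missing entries of each sample by that sample's value.
imputed <- m
for (j in seq_len(ncol(m))) {
    imputed[is.na(m[, j]), j] <- col_min[j]
}
col_min
```

The five per-sample values are 2, 11, 21, 31 and 41, which are the values that fill the missing cells in the output above.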
In the definition above, it is explicitly stated that the imputation
is done for each sample, i.e. along the columns of the quantitative
matrix, also called the second margin. We can repeat the same imputation
by explicitly setting MARGIN = 2.
impute(se, method = "MinDet", q = 0, MARGIN = 2) |> assay()
## Imputing along margin 2 (samples/columns).
## S1 S2 S3 S4 S5
## F1 2 11 21 31 41
## F2 2 11 22 32 42
## F3 3 13 21 33 43
## F4 4 14 24 31 44
## F5 5 15 25 35 41
## F6 2 16 26 36 46
## F7 7 11 27 37 47
## F8 8 18 21 38 48
## F9 9 19 29 31 49
## F10 10 20 30 40 41
And indeed, the default margin for the "MinDet" method
is 2:
getImputeMargin("impute_MinDet")
## [1] 2
The imputation margin is not always 2. The nearest neighbour imputation method chooses a certain number of similar features. By similar features, we explicitly refer to other rows, i.e. the first margin:
getImputeMargin("impute_knn")
## [1] 1
It is possible to change the margin from its default value. Below, we
now use "MinDet" and choose the smallest value within each
feature/row.
impute(se, method = "MinDet", q = 0, MARGIN = 1) |> assay()
## Imputing along margin 1 (features/rows).
## S1 S2 S3 S4 S5
## F1 11 11 21 31 41
## F2 2 2 22 32 42
## F3 3 13 3 33 43
## F4 4 14 24 4 44
## F5 5 15 25 35 5
## F6 16 16 26 36 46
## F7 7 7 27 37 47
## F8 8 18 8 38 48
## F9 9 19 29 9 49
## F10 10 20 30 40 10
Now, we see that the missing F1 value in S1 has been imputed by the smallest observed value along the first row, namely 11.
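To make the role of the margin explicit, here is a base R sketch of a quantile-based imputation that takes the margin as an argument. The function impute_quantile() is a hypothetical helper written for this illustration (it is not a MsCoreUtils function), and it assumes quantiles computed with the defaults of stats::quantile().

```r
# Rebuild the example matrix used throughout this vignette.
m <- matrix(1:50, nrow = 10,
            dimnames = list(paste0("F", 1:10), paste0("S", 1:5)))
diag(m) <- NA
m[which(is.na(m)) + 5] <- NA

# Hypothetical helper: impute with the q-th quantile computed along
# the rows (MARGIN = 1) or along the columns (MARGIN = 2).
impute_quantile <- function(x, q = 0.01, MARGIN = 2) {
    # One imputation value per row or per column.
    vals <- apply(x, MARGIN, quantile, probs = q, na.rm = TRUE, names = FALSE)
    # Row and column indices of the missing cells.
    idx <- which(is.na(x), arr.ind = TRUE)
    # Look up the value of the row (or column) each missing cell belongs to.
    x[idx] <- vals[idx[, MARGIN]]
    x
}

by_row <- impute_quantile(m, q = 0, MARGIN = 1)
by_col <- impute_quantile(m, q = 0, MARGIN = 2)
c(by_row = by_row["F1", "S1"], by_col = by_col["F1", "S1"])
```

The same missing cell, F1 in S1, receives 11 when imputing along the rows and 2 when imputing along the columns.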
We can extract all default margin values for all
MsCoreUtils::impute_* functions as shown below.
## $impute_bpca
## [1] 1
##
## $impute_fun
## [1] 1
##
## $impute_knn
## [1] 1
##
## $impute_matrix
## [1] NA
##
## $impute_min
## [1] NA
##
## $impute_MinDet
## [1] 2
##
## $impute_MinProb
## [1] 2
##
## $impute_mixed
## c(1L, 1L)
##
## $impute_mle
## [1] 2
##
## $impute_neighbour_average
## [1] 1
##
## $impute_QRILC
## [1] 2
##
## $impute_RF
## [1] 2
##
## $impute_with
## [1] NA
##
## $impute_zero
## [1] NA
A missing margin means that, as for "with" or
"zero" above, there is no margin along which the
imputation is performed. Mixed imputation is a special case that has two
margins, which we will describe in the next section.
The relevance of the imputation margin can also depend on downstream analyses. In (Vanderaa and Gatto 2023), the authors illustrate that imputation along the first margin increases the correlation between features, while imputation along the second margin increases the correlation between samples. These artificially improved correlations can then in turn impact any analyses that rely on the identification of sample or protein clusters.
Mixed imputation
As we have seen above, different underlying processes can lead to
different types of missing values, namely MAR and MNAR. One view of this
is to define these processes at the feature level. In such cases, one
might want to impute different sets of features in a mixed way:
MAR features with a MAR-appropriate method, and MNAR features with a
MNAR-appropriate method. This is possible with the "mixed"
method.
To be able to apply mixed imputation, we need to define features that are MAR, and features that are MNAR. This is done using a logical vector whose length is equal to the number of features.
rowData(se)$randna
## [1] TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE
The TRUE values define the MAR features (F1 to F5 in our
case) and FALSE defines MNAR features (F6 to F10).
To use mixed imputation, we need to specify the MAR and MNAR features and two imputation methods: one for the MAR features and another one for the MNAR features.
impute(se, method = "mixed",
randna = rowData(se)$randna,
mar = "MinDet",
mnar = "zero") |>
assay()
## Imputing along margin 1 (features/rows).
## S1 S2 S3 S4 S5
## F1 11.3 11.0 21.0 31.0 41.0
## F2 2.0 2.6 22.0 32.0 42.0
## F3 3.0 13.0 3.3 33.0 43.0
## F4 4.0 14.0 24.0 4.3 44.0
## F5 5.0 15.0 25.0 35.0 5.3
## F6 0.0 16.0 26.0 36.0 46.0
## F7 7.0 0.0 27.0 37.0 47.0
## F8 8.0 18.0 0.0 38.0 48.0
## F9 9.0 19.0 29.0 0.0 49.0
## F10 10.0 20.0 30.0 40.0 0.0
We can see that the bottom half of the matrix, corresponding to the MNAR
features, has been imputed with zero, while the top half has been imputed by
"MinDet". We also see that the margin used for
"MinDet" was 1 (along the rows). Indeed, the default
margins are 1 for both MAR and MNAR features: the first one is for MAR,
and the second one for MNAR.
getImputeMargin("impute_mixed")
## c(1L, 1L)
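The mixed result above can be reproduced step by step in base R: split the rows according to randna, impute each part with its own method, and put the parts back together. This is a sketch of the principle under the assumption that "MinDet" uses the default quantile definition of stats::quantile() with q = 0.01; it is not the impute_mixed() implementation.

```r
# Rebuild the example matrix and the MAR/MNAR annotation.
m <- matrix(1:50, nrow = 10,
            dimnames = list(paste0("F", 1:10), paste0("S", 1:5)))
diag(m) <- NA
m[which(is.na(m)) + 5] <- NA
randna <- rep(c(TRUE, FALSE), each = 5)

# MAR features: impute with the 1% quantile of each row (margin 1).
mar <- m[randna, , drop = FALSE]
row_q <- apply(mar, 1, quantile, probs = 0.01, na.rm = TRUE, names = FALSE)
idx <- which(is.na(mar), arr.ind = TRUE)
mar[idx] <- row_q[idx[, 1]]

# MNAR features: impute with zero.
mnar <- m[!randna, , drop = FALSE]
mnar[is.na(mnar)] <- 0

# Reassemble the two parts in the original row order.
mixed <- m
mixed[randna, ] <- mar
mixed[!randna, ] <- mnar
mixed[c("F1", "F2", "F6"), ]
```

For F1, the observed values are 11, 21, 31 and 41, and their 1% quantile is 11 + 0.03 × (21 − 11) = 11.3, which is the value shown for F1 in S1.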
Different margins
It is of course possible to change the margins when performing mixed imputation:
impute(se, method = "mixed",
randna = rowData(se)$randna,
mar = "MinDet",
mnar = "zero",
MARGIN = c(2, NA)) |>
assay()
## Imputing along margin 2 (samples/columns).
## S1 S2 S3 S4 S5
## F1 2.03 11.00 21.00 31.00 41.00
## F2 2.00 11.06 22.00 32.00 42.00
## F3 3.00 13.00 21.03 33.00 43.00
## F4 4.00 14.00 24.00 31.03 44.00
## F5 5.00 15.00 25.00 35.00 41.03
## F6 0.00 16.00 26.00 36.00 46.00
## F7 7.00 0.00 27.00 37.00 47.00
## F8 8.00 18.00 0.00 38.00 48.00
## F9 9.00 19.00 29.00 0.00 49.00
## F10 10.00 20.00 30.00 40.00 0.00
We set NA for zero-imputation (it could also have been
1, as it is irrelevant anyway) and 2 for MinDet-imputation. And we can
confirm that this time, the MAR features have been imputed using a value
close to the smallest one in each sample/column (the default q = 0.01 quantile).
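The imputed values in the top half can be checked by hand. The sketch below computes the 1% quantile of each column of the MAR sub-matrix in base R, assuming the default quantile definition of stats::quantile(); it is a verification of the numbers, not the package's code.

```r
# Rebuild the example matrix and the MAR/MNAR annotation.
m <- matrix(1:50, nrow = 10,
            dimnames = list(paste0("F", 1:10), paste0("S", 1:5)))
diag(m) <- NA
m[which(is.na(m)) + 5] <- NA
randna <- rep(c(TRUE, FALSE), each = 5)

# 1% quantile of each sample (column), using the MAR features only.
mar <- m[randna, , drop = FALSE]
col_q <- apply(mar, 2, quantile, probs = 0.01, na.rm = TRUE, names = FALSE)
round(col_q, 2)
```

In S1, the observed MAR values are 2, 3, 4 and 5, giving 2 + 0.03 × (3 − 2) = 2.03, the value imputed for F1.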
Passing parameters to the imputation functions
It is possible to pass arguments to the respective MAR and MNAR
functions using the marArgs and mnarArgs
arguments as named lists. Below, we are going to use MinDet in
both cases, with different parameters.
impute(se,
method = "mixed",
randna = rowData(se)$randna,
mar = "MinDet",
mnar = "MinDet",
marArgs = list(q = 0),
mnarArgs = list(q = 1),
MARGIN = c(1, 1)) |>
assay()
## Imputing along margin 1 (features/rows).
## Imputing along margin 1 (features/rows).
## S1 S2 S3 S4 S5
## F1 11 11 21 31 41
## F2 2 2 22 32 42
## F3 3 13 3 33 43
## F4 4 14 24 4 44
## F5 5 15 25 35 5
## F6 46 16 26 36 46
## F7 7 47 27 37 47
## F8 8 18 48 38 48
## F9 9 19 29 49 49
## F10 10 20 30 40 40
In both cases, we impute along the rows. For the MAR features (top
half of the matrix), we impute using the minimal value of that row
(using q = 0), while for the MNAR features (bottom half of
the matrix), we impute using the maximal value of that row (using
q = 1). As anticipated, F1 in S1 gets 11, and
F6 in S1 gets 46.
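Named lists of arguments are convenient because they can be forwarded to a function with do.call(). The sketch below shows this general R idiom with a hypothetical helper, impute_quantile(), and hypothetical list names mar_args and mnar_args; whether impute_mixed() forwards its arguments in exactly this way is an assumption.

```r
# Rebuild the example matrix and the MAR/MNAR annotation.
m <- matrix(1:50, nrow = 10,
            dimnames = list(paste0("F", 1:10), paste0("S", 1:5)))
diag(m) <- NA
m[which(is.na(m)) + 5] <- NA
randna <- rep(c(TRUE, FALSE), each = 5)

# Hypothetical helper: quantile-based imputation along a margin.
impute_quantile <- function(x, q = 0.01, MARGIN = 2) {
    vals <- apply(x, MARGIN, quantile, probs = q, na.rm = TRUE, names = FALSE)
    idx <- which(is.na(x), arr.ind = TRUE)
    x[idx] <- vals[idx[, MARGIN]]
    x
}

# Method-specific arguments, stored as named lists.
mar_args  <- list(q = 0)  # smallest value of the row
mnar_args <- list(q = 1)  # largest value of the row

# Forward each list to the helper, together with the data and margin.
res <- m
res[randna, ] <- do.call(
    impute_quantile,
    c(list(x = m[randna, , drop = FALSE], MARGIN = 1), mar_args))
res[!randna, ] <- do.call(
    impute_quantile,
    c(list(x = m[!randna, , drop = FALSE], MARGIN = 1), mnar_args))
res[c("F1", "F6"), ]
```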
Using the whole matrix to compute imputed values
When doing mixed imputation, the data matrix is by default split into
MAR and MNAR sub-matrices that are imputed separately. It is also possible to use the whole
data matrix to compute the MAR and MNAR imputed values. This is
controlled by the split argument that, by default, is set
to TRUE.
Below, we are going to repeat a mixed imputation, imputing the MAR
values (the top half of the matrix) with the highest value of the
whole columns using MARGIN = 2 and
split = FALSE. The MNAR values (the bottom half of the
matrix) are imputed using the smallest value along the rows using
MARGIN = 1, and are hence not impacted by the
split value.
impute(se,
method = "mixed",
randna = rowData(se)$randna,
mar = "MinDet",
mnar = "MinDet",
marArgs = list(q = 1),
mnarArgs = list(q = 0),
MARGIN = c(2, 1),
split = FALSE) |>
assay()
## Imputing along margin 2 (samples/columns).
## Imputing along margin 1 (features/rows).
## S1 S2 S3 S4 S5
## F1 10 11 21 31 41
## F2 2 20 22 32 42
## F3 3 13 30 33 43
## F4 4 14 24 40 44
## F5 5 15 25 35 49
## F6 16 16 26 36 46
## F7 7 7 27 37 47
## F8 8 18 8 38 48
## F9 9 19 29 9 49
## F10 10 20 30 40 10
We see that F1 in S1 gets 10, the highest value in S1, observed for
F10. Had we kept the default split = TRUE, it would have
gotten 5 from F5, the highest S1 value among the MAR features. The MNAR
imputation isn’t affected by the split and gets the smallest value in
each row.
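The difference between the two settings comes down to which rows contribute to the per-column value. The base R sketch below compares the column maxima computed on the whole matrix with those computed on the MAR sub-matrix only; it uses max() as a stand-in for the q = 1 quantile and is an illustration, not the package's code.

```r
# Rebuild the example matrix and the MAR/MNAR annotation.
m <- matrix(1:50, nrow = 10,
            dimnames = list(paste0("F", 1:10), paste0("S", 1:5)))
diag(m) <- NA
m[which(is.na(m)) + 5] <- NA
randna <- rep(c(TRUE, FALSE), each = 5)

# Whole matrix: all 10 features contribute (analogous to split = FALSE).
col_max_whole <- apply(m, 2, max, na.rm = TRUE)

# MAR sub-matrix only: features F1 to F5 contribute (analogous to split = TRUE).
col_max_mar <- apply(m[randna, , drop = FALSE], 2, max, na.rm = TRUE)

rbind(whole = col_max_whole, mar_only = col_max_mar)
```

The first row (10, 20, 30, 40, 49) matches the values imputed in the top half of the output above, while the second row (5, 15, 25, 35, 44) is what the MAR features would have received from their own sub-matrix.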
Session information
## R version 4.6.0 (2026-04-24)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.4 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] QFeatures_1.23.1 MultiAssayExperiment_1.38.0
## [3] SummarizedExperiment_1.42.0 Biobase_2.72.0
## [5] GenomicRanges_1.64.0 Seqinfo_1.2.0
## [7] IRanges_2.46.0 S4Vectors_0.50.0
## [9] BiocGenerics_0.58.0 generics_0.1.4
## [11] MatrixGenerics_1.24.0 matrixStats_1.5.0
## [13] MsCoreUtils_1.25.3 BiocStyle_2.40.0
##
## loaded via a namespace (and not attached):
## [1] tidyr_1.3.2 sass_0.4.10 SparseArray_1.12.2
## [4] stringi_1.8.7 lattice_0.22-9 magrittr_2.0.5
## [7] digest_0.6.39 evaluate_1.0.5 grid_4.6.0
## [10] bookdown_0.46 fastmap_1.2.0 plyr_1.8.9
## [13] jsonlite_2.0.0 Matrix_1.7-5 ProtGenerics_1.44.0
## [16] BiocManager_1.30.27 purrr_1.2.2 lazyeval_0.2.3
## [19] textshaping_1.0.5 jquerylib_0.1.4 abind_1.4-8
## [22] cli_3.6.6 rlang_1.2.0 XVector_0.52.0
## [25] cachem_1.1.0 DelayedArray_0.38.1 yaml_2.3.12
## [28] otel_0.2.0 S4Arrays_1.12.0 tools_4.6.0
## [31] reshape2_1.4.5 dplyr_1.2.1 vctrs_0.7.3
## [34] R6_2.6.1 lifecycle_1.0.5 stringr_1.6.0
## [37] fs_2.1.0 htmlwidgets_1.6.4 clue_0.3-68
## [40] MASS_7.3-65 ragg_1.5.2 cluster_2.1.8.2
## [43] pkgconfig_2.0.3 desc_1.4.3 pillar_1.11.1
## [46] pkgdown_2.2.0.9000 bslib_0.10.0 Rcpp_1.1.1-1.1
## [49] glue_1.8.1 systemfonts_1.3.2 tidyselect_1.2.1
## [52] tibble_3.3.1 xfun_0.57 knitr_1.51
## [55] AnnotationFilter_1.36.0 igraph_2.3.0 htmltools_0.5.9
## [58] rmarkdown_2.31 compiler_4.6.0
License
This vignette is distributed under a CC BY-SA license.
