
Group rows of a symmetric matrix using a threshold
Source: R/grouping-functions.R
groupSimilarityMatrix.Rd
This function groups elements (rows or columns) of a symmetric matrix, such as
a pairwise correlation matrix or similarity matrix, whose values are
>= threshold. It creates clusters in which every element has a value
>= threshold with every other element of the same cluster. On a
correlation matrix (such as one created with cor) it generates small
clusters of highly correlated elements. Note, however, that a single element
in one cluster can also have a correlation >= threshold with an element
in another cluster. Its average similarity to its own cluster will, however,
be higher than its average similarity to the other cluster.
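The two properties described above can be checked directly with base R. The following is a minimal sketch, assuming the example matrix from the Examples section and the grouping reported there for threshold = 0.9; the variable names `grp` and `same` are illustrative only.

```r
# Example similarity matrix (same values as in the Examples section).
x <- rbind(
    c(1, 0.9, 0.6, 0.8, 0.5),
    c(0.9, 1, 0.7, 0.92, 0.8),
    c(0.6, 0.7, 1, 0.91, 0.7),
    c(0.8, 0.92, 0.91, 1, 0.9),
    c(0.5, 0.8, 0.7, 0.9, 1)
)
# Grouping reported in the Examples for threshold = 0.9.
grp <- c(2, 1, 3, 1, 4)

# TRUE where row element and column element share a group.
same <- outer(grp, grp, "==")

# Within each group, every pairwise value reaches the threshold.
within_ok <- all(x[same] >= 0.9)

# Pairs from different groups that nevertheless reach the threshold.
cross <- which(!same & upper.tri(x) & x >= 0.9, arr.ind = TRUE)

within_ok
cross
```

Here `cross` lists the element pairs that sit in different groups even though their similarity reaches the threshold, which is the situation the note above describes.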
Details
The algorithm is defined as follows:
- All pairs of values in x which are >= threshold are identified and sorted decreasingly.
- Starting with the pair with the highest correlation, groups are defined:
  - If neither of the two is in a group, both are put into the same new group.
  - If one of the two is already in a group, the other is put into the same group if all correlations of it to that group are >= threshold (and are not NA).
  - If both are already in the same group, nothing is done.
  - If both are in different groups: an element is put into the group of the other if a) all correlations of it to members of the other's group are not NA and >= threshold and b) the average correlation to the other group is larger than the average correlation to its own group.
This ensures that groups are defined in which all elements have a correlation
>= threshold with each other and the correlation between members of the
same group is maximized.
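The steps above can be sketched in base R as follows. This is an illustrative re-implementation under stated assumptions, not the package's actual code: the name `group_similarity_sketch` is hypothetical, ties between equal values are processed in column-major order, elements that never join a group receive their own group id at the end, and an element whose own group has no other members is treated as free to move.

```r
# Hypothetical helper illustrating the described algorithm.
group_similarity_sketch <- function(x, threshold = 0.9) {
    stopifnot(is.matrix(x), nrow(x) == ncol(x))
    groups <- rep(NA_integer_, nrow(x))
    next_id <- 1L

    # All pairs (upper triangle) with a non-NA value >= threshold,
    # sorted decreasingly; order() on negated values keeps ties stable.
    idx <- which(upper.tri(x) & !is.na(x) & x >= threshold, arr.ind = TRUE)
    idx <- idx[order(-x[idx]), , drop = FALSE]

    # TRUE if element i has a non-NA value >= threshold to all members.
    fits <- function(i, members) {
        v <- x[i, members]
        !anyNA(v) && all(v >= threshold)
    }
    # TRUE if element el may move into the group of element other.
    can_move <- function(el, other) {
        target <- which(groups == groups[other])
        own <- setdiff(which(groups == groups[el]), el)
        if (!fits(el, target)) return(FALSE)
        own_mean <- mean(x[el, own], na.rm = TRUE)
        # Assumption: no usable own-group average means el may move.
        is.nan(own_mean) || mean(x[el, target]) > own_mean
    }

    for (k in seq_len(nrow(idx))) {
        i <- idx[[k, 1L]]
        j <- idx[[k, 2L]]
        gi <- groups[i]
        gj <- groups[j]
        if (is.na(gi) && is.na(gj)) {
            # Neither is grouped: open a new group for both.
            groups[c(i, j)] <- next_id
            next_id <- next_id + 1L
        } else if (is.na(gi)) {
            if (fits(i, which(groups == gj))) groups[i] <- gj
        } else if (is.na(gj)) {
            if (fits(j, which(groups == gi))) groups[j] <- gi
        } else if (gi != gj) {
            # Both grouped differently: move one only if it fits better.
            if (can_move(i, j)) {
                groups[i] <- gj
            } else if (can_move(j, i)) {
                groups[j] <- gi
            }
        }
    }
    # Assumption: remaining elements each get their own group id.
    ungrouped <- is.na(groups)
    groups[ungrouped] <- next_id - 1L + seq_len(sum(ungrouped))
    groups
}

x <- rbind(
    c(1, 0.9, 0.6, 0.8, 0.5),
    c(0.9, 1, 0.7, 0.92, 0.8),
    c(0.6, 0.7, 1, 0.91, 0.7),
    c(0.8, 0.92, 0.91, 1, 0.9),
    c(0.5, 0.8, 0.7, 0.9, 1)
)
res <- group_similarity_sketch(x, threshold = 0.9)
res
#> [1] 2 1 3 1 4
```

With this matrix the pair (2, 4) is processed first and forms group 1; elements 1, 3 and 5 each fail the all-members check against that group and end up in groups of their own.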
See also
Other grouping operations:
groupClosest(),
groupConsecutive(),
groupSimilarityMatrixTree()
Examples
x <- rbind(
c(1, 0.9, 0.6, 0.8, 0.5),
c(0.9, 1, 0.7, 0.92, 0.8),
c(0.6, 0.7, 1, 0.91, 0.7),
c(0.8, 0.92, 0.91, 1, 0.9),
c(0.5, 0.8, 0.7, 0.9, 1)
)
groupSimilarityMatrix(x, threshold = 0.9)
#> [1] 2 1 3 1 4
groupSimilarityMatrix(x, threshold = 0.1)
#> [1] 1 1 1 1 1
## Add also a correlation between 3 and 2
x[2, 3] <- 0.9
x[3, 2] <- 0.9
x
#>      [,1] [,2] [,3] [,4] [,5]
#> [1,]  1.0 0.90 0.60 0.80  0.5
#> [2,]  0.9 1.00 0.90 0.92  0.8
#> [3,]  0.6 0.90 1.00 0.91  0.7
#> [4,]  0.8 0.92 0.91 1.00  0.9
#> [5,]  0.5 0.80 0.70 0.90  1.0
groupSimilarityMatrix(x, threshold = 0.9)
#> [1] 2 1 1 1 3
## Add a higher correlation between 4 and 5
x[4, 5] <- 0.99
x[5, 4] <- 0.99
x
#>      [,1] [,2] [,3] [,4] [,5]
#> [1,]  1.0 0.90 0.60 0.80 0.50
#> [2,]  0.9 1.00 0.90 0.92 0.80
#> [3,]  0.6 0.90 1.00 0.91 0.70
#> [4,]  0.8 0.92 0.91 1.00 0.99
#> [5,]  0.5 0.80 0.70 0.99 1.00
groupSimilarityMatrix(x, threshold = 0.9)
#> [1] 2 2 3 1 1
## Increase correlation between 2 and 3
x[2, 3] <- 0.92
x[3, 2] <- 0.92
x
#>      [,1] [,2] [,3] [,4] [,5]
#> [1,]  1.0 0.90 0.60 0.80 0.50
#> [2,]  0.9 1.00 0.92 0.92 0.80
#> [3,]  0.6 0.92 1.00 0.91 0.70
#> [4,]  0.8 0.92 0.91 1.00 0.99
#> [5,]  0.5 0.80 0.70 0.99 1.00
groupSimilarityMatrix(x, threshold = 0.9) ## Don't break previous cluster!
#> [1] 3 2 2 1 1