A long DataFrame can be reduced by merging certain rows into a
single one. The variables of the reduced DataFrame are constructed
as a SimpleList containing all the original values. Invariant
columns, i.e. columns that have the same value along all the rows
that need to be merged, can be shrunk into new variables containing
that invariant value (rather than into list columns). The grouping
of rows, i.e. the rows that need to be shrunk together as one, is
defined by a vector.
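To make the idea concrete, the grouping step can be sketched in base R. This is only an illustration of the concept, not the implementation behind reduceDataFrame: the toy vectors k, x and inv and the helper is_invariant() are made up for the example, and plain lists stand in for the list columns of a reduced DataFrame.

```r
## Toy data (hypothetical): a grouping vector and two columns.
k   <- c(1, 1, 2, 2, 2)
x   <- c("a", "b", "c", "d", "e")   # varies within groups
inv <- c("u", "u", "v", "v", "v")   # constant within each group

## Rows sharing the same value of k are collected together.
x_by_group   <- split(x, k)
inv_by_group <- split(inv, k)

## Hypothetical helper: a column is invariant if every group
## holds a single distinct value.
is_invariant <- function(groups)
    all(vapply(groups, function(v) length(unique(v)) == 1L, logical(1)))

is_invariant(x_by_group)    # FALSE: stays a list, one element per group
is_invariant(inv_by_group)  # TRUE: can be shrunk to one value per group

## Shrinking the invariant column to its single value per group.
inv_reduced <- vapply(inv_by_group, function(v) v[1], character(1))
```

Here x_by_group plays the role of a list column, while inv_reduced is the kind of plain vector an invariant column can be shrunk into.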
The opposite operation is expand. Note, however, that for a DataFrame to be expanded back, it must not have been simplified.
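One way to see why simplification gets in the way of expanding is the following base R sketch. It uses made-up values and plain lists, and it is not the code used by expandDataFrame: it only shows that a list column still records how many rows each group came from, whereas a simplified column keeps a single value per group.

```r
## A reduced, non-simplified column: one list element per group
## (hypothetical values).
x_reduced <- list(`1` = c("a", "b"), `2` = c("c", "d", "e"))

## Expanding amounts to unlisting; the group sizes are still available.
group_sizes <- lengths(x_reduced)
x_expanded  <- unlist(x_reduced, use.names = FALSE)
k_expanded  <- rep(names(x_reduced), times = group_sizes)

## A simplified (invariant) column holds one value per group, so on
## its own it no longer says how many rows each group had.
inv_simplified <- c(`1` = "u", `2` = "v")
lengths(as.list(inv_simplified))   # 1 for every group
```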
reduceDataFrame(x, k, count = FALSE, simplify = TRUE, drop = FALSE)
expandDataFrame(x, k = NULL)
x: The DataFrame to be reduced or expanded.
k: A ‘vector’ of length nrow(x) defining the grouping based on which the DataFrame will be shrunk.
count: A logical(1) specifying whether an additional column (called .n by default) with the tally of rows shrunk into each new row should be added. Note that if a column named .n already exists, it will be silently overwritten. Default is FALSE.
simplify: A logical(1) defining whether invariant columns should be simplified, i.e. stored as plain vectors holding the invariant value rather than as list columns. Default is TRUE.
drop: A logical(1) specifying whether the non-invariant columns should be dropped altogether. Default is FALSE.
An expanded (reduced) DataFrame.
Missing values have an important effect on reduce. Unless all values to be reduced are missing, they will result in a non-invariant column, which will be dropped with drop = TRUE. See the example below.
The presence of missing values can have side effects in higher-level functions that rely on the reduction of DataFrame objects.
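The effect of a single missing value can be illustrated in base R. The sketch assumes that invariance amounts to a group holding exactly one distinct value, with NA counted as a value of its own; this matches the behaviour described above but is not taken from the implementation, and n_distinct() is a hypothetical helper.

```r
## Hypothetical helper: number of distinct values in a group,
## counting NA as a value.
n_distinct <- function(v) length(unique(v))

n_distinct(c("a", "a", "a"))   # 1: invariant
n_distinct(c(NA, "a", "a"))    # 2: one NA breaks invariance
n_distinct(c(NA, NA, NA))      # 1: all missing, still invariant
```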
library("IRanges")
k <- sample(100, 1e3, replace = TRUE)
df <- DataFrame(k = k,
                x = round(rnorm(length(k)), 2),
                y = seq_along(k),
                z = sample(LETTERS, length(k), replace = TRUE),
                ir = IRanges(seq_along(k), width = 10),
                r = Rle(sample(5, length(k), replace = TRUE)),
                invar = k + 1)
df
#> DataFrame with 1000 rows and 7 columns
#> k x y z ir r invar
#> <integer> <numeric> <integer> <character> <IRanges> <Rle> <numeric>
#> 1 62 1.11 1 C 1-10 1 63
#> 2 25 -0.74 2 W 2-11 2 26
#> 3 96 -0.25 3 U 3-12 2 97
#> 4 91 -0.10 4 X 4-13 2 92
#> 5 40 -0.57 5 T 5-14 5 41
#> ... ... ... ... ... ... ... ...
#> 996 22 -0.65 996 E 996-1005 1 23
#> 997 99 -0.10 997 V 997-1006 5 100
#> 998 63 -0.94 998 E 998-1007 4 64
#> 999 100 0.43 999 X 999-1008 4 101
#> 1000 25 0.27 1000 J 1000-1009 1 26
## Shrinks the DataFrame
df2 <- reduceDataFrame(df, df$k)
df2
#> DataFrame with 100 rows and 7 columns
#> k x y z
#> <integer> <NumericList> <IntegerList> <CharacterList>
#> 1 1 -0.17, 0.37,-1.05,... 58,193,323,... C,F,G,...
#> 2 2 1.55,-0.08, 0.12,... 118,295,439,... E,P,I,...
#> 3 3 0.29,-2.80, 1.29,... 32,140,477,... X,E,U,...
#> 4 4 0.87,0.24,1.20,... 18,194,319,... Q,Q,V,...
#> 5 5 0.86,-0.89,-0.28,... 15,63,121,... J,Q,V,...
#> ... ... ... ... ...
#> 96 96 -0.25,-0.17,-0.13,... 3,261,410,... U,M,C,...
#> 97 97 0.31,-0.27,-0.57,... 37,42,104,... L,C,C,...
#> 98 98 1.14, 0.84,-0.17,... 144,163,221,... W,T,P,...
#> 99 99 0.04,2.16,0.47,... 33,82,159,... L,L,A,...
#> 100 100 0.38, 0.27,-0.60,... 43,66,489,... X,F,L,...
#> ir r invar
#> <IRangesList> <RleList> <numeric>
#> 1 58-67,193-202,323-332,... 2,4,2,... 2
#> 2 118-127,295-304,439-448,... 3,5,1,... 3
#> 3 32-41,140-149,477-486,... 4,2,4,... 4
#> 4 18-27,194-203,319-328,... 1,2,3,... 5
#> 5 15-24,63-72,121-130,... 2,5,3,... 6
#> ... ... ... ...
#> 96 3-12,261-270,410-419,... 2,4,1,... 97
#> 97 37-46,42-51,104-113,... 3,4,1,... 98
#> 98 144-153,163-172,221-230,... 4,1,4,... 99
#> 99 33-42,82-91,159-168,... 4,1,4,... 100
#> 100 43-52,66-75,489-498,... 1,1,4,... 101
## With a tally of the number of members in each group
reduceDataFrame(df, df$k, count = TRUE)
#> DataFrame with 100 rows and 8 columns
#> k x y z
#> <integer> <NumericList> <IntegerList> <CharacterList>
#> 1 1 -0.17, 0.37,-1.05,... 58,193,323,... C,F,G,...
#> 2 2 1.55,-0.08, 0.12,... 118,295,439,... E,P,I,...
#> 3 3 0.29,-2.80, 1.29,... 32,140,477,... X,E,U,...
#> 4 4 0.87,0.24,1.20,... 18,194,319,... Q,Q,V,...
#> 5 5 0.86,-0.89,-0.28,... 15,63,121,... J,Q,V,...
#> ... ... ... ... ...
#> 96 96 -0.25,-0.17,-0.13,... 3,261,410,... U,M,C,...
#> 97 97 0.31,-0.27,-0.57,... 37,42,104,... L,C,C,...
#> 98 98 1.14, 0.84,-0.17,... 144,163,221,... W,T,P,...
#> 99 99 0.04,2.16,0.47,... 33,82,159,... L,L,A,...
#> 100 100 0.38, 0.27,-0.60,... 43,66,489,... X,F,L,...
#> ir r invar .n
#> <IRangesList> <RleList> <numeric> <integer>
#> 1 58-67,193-202,323-332,... 2,4,2,... 2 12
#> 2 118-127,295-304,439-448,... 3,5,1,... 3 9
#> 3 32-41,140-149,477-486,... 4,2,4,... 4 9
#> 4 18-27,194-203,319-328,... 1,2,3,... 5 12
#> 5 15-24,63-72,121-130,... 2,5,3,... 6 9
#> ... ... ... ... ...
#> 96 3-12,261-270,410-419,... 2,4,1,... 97 7
#> 97 37-46,42-51,104-113,... 3,4,1,... 98 12
#> 98 144-153,163-172,221-230,... 4,1,4,... 99 10
#> 99 33-42,82-91,159-168,... 4,1,4,... 100 16
#> 100 43-52,66-75,489-498,... 1,1,4,... 101 11
## Much faster, but more crowded result
df3 <- reduceDataFrame(df, df$k, simplify = FALSE)
df3
#> DataFrame with 100 rows and 7 columns
#> k x y z
#> <IntegerList> <NumericList> <IntegerList> <CharacterList>
#> 1 1,1,1,... -0.17, 0.37,-1.05,... 58,193,323,... C,F,G,...
#> 2 2,2,2,... 1.55,-0.08, 0.12,... 118,295,439,... E,P,I,...
#> 3 3,3,3,... 0.29,-2.80, 1.29,... 32,140,477,... X,E,U,...
#> 4 4,4,4,... 0.87,0.24,1.20,... 18,194,319,... Q,Q,V,...
#> 5 5,5,5,... 0.86,-0.89,-0.28,... 15,63,121,... J,Q,V,...
#> ... ... ... ... ...
#> 96 96,96,96,... -0.25,-0.17,-0.13,... 3,261,410,... U,M,C,...
#> 97 97,97,97,... 0.31,-0.27,-0.57,... 37,42,104,... L,C,C,...
#> 98 98,98,98,... 1.14, 0.84,-0.17,... 144,163,221,... W,T,P,...
#> 99 99,99,99,... 0.04,2.16,0.47,... 33,82,159,... L,L,A,...
#> 100 100,100,100,... 0.38, 0.27,-0.60,... 43,66,489,... X,F,L,...
#> ir r invar
#> <IRangesList> <RleList> <NumericList>
#> 1 58-67,193-202,323-332,... 2,4,2,... 2,2,2,...
#> 2 118-127,295-304,439-448,... 3,5,1,... 3,3,3,...
#> 3 32-41,140-149,477-486,... 4,2,4,... 4,4,4,...
#> 4 18-27,194-203,319-328,... 1,2,3,... 5,5,5,...
#> 5 15-24,63-72,121-130,... 2,5,3,... 6,6,6,...
#> ... ... ... ...
#> 96 3-12,261-270,410-419,... 2,4,1,... 97,97,97,...
#> 97 37-46,42-51,104-113,... 3,4,1,... 98,98,98,...
#> 98 144-153,163-172,221-230,... 4,1,4,... 99,99,99,...
#> 99 33-42,82-91,159-168,... 4,1,4,... 100,100,100,...
#> 100 43-52,66-75,489-498,... 1,1,4,... 101,101,101,...
## Drop all non-invariant columns
reduceDataFrame(df, df$k, drop = TRUE)
#> DataFrame with 100 rows and 2 columns
#> k invar
#> <integer> <numeric>
#> 1 1 2
#> 2 2 3
#> 3 3 4
#> 4 4 5
#> 5 5 6
#> ... ... ...
#> 96 96 97
#> 97 97 98
#> 98 98 99
#> 99 99 100
#> 100 100 101
## Missing values
d <- DataFrame(k = rep(1:3, each = 3),
               x = letters[1:9],
               y = rep(letters[1:3], each = 3),
               y2 = rep(letters[1:3], each = 3))
d
#> DataFrame with 9 rows and 4 columns
#> k x y y2
#> <integer> <character> <character> <character>
#> 1 1 a a a
#> 2 1 b a a
#> 3 1 c a a
#> 4 2 d b b
#> 5 2 e b b
#> 6 2 f b b
#> 7 3 g c c
#> 8 3 h c c
#> 9 3 i c c
## y is invariant and can be simplified
reduceDataFrame(d, d$k)
#> DataFrame with 3 rows and 4 columns
#> k x y y2
#> <integer> <CharacterList> <character> <character>
#> 1 1 a,b,c a a
#> 2 2 d,e,f b b
#> 3 3 g,h,i c c
## y is invariant and thus isn't dropped
reduceDataFrame(d, d$k, drop = TRUE)
#> DataFrame with 3 rows and 3 columns
#> k y y2
#> <integer> <character> <character>
#> 1 1 a a
#> 2 2 b b
#> 3 3 c c
## BUT with a missing value
d[1, "y"] <- NA
d
#> DataFrame with 9 rows and 4 columns
#> k x y y2
#> <integer> <character> <character> <character>
#> 1 1 a NA a
#> 2 1 b a a
#> 3 1 c a a
#> 4 2 d b b
#> 5 2 e b b
#> 6 2 f b b
#> 7 3 g c c
#> 8 3 h c c
#> 9 3 i c c
## y isn't invariant/simplified anymore
reduceDataFrame(d, d$k)
#> DataFrame with 3 rows and 4 columns
#> k x y y2
#> <integer> <CharacterList> <CharacterList> <character>
#> 1 1 a,b,c NA,a,a a
#> 2 2 d,e,f b,b,b b
#> 3 3 g,h,i c,c,c c
## y now gets dropped
reduceDataFrame(d, d$k, drop = TRUE)
#> DataFrame with 3 rows and 2 columns
#> k y2
#> <integer> <character>
#> 1 1 a
#> 2 2 b
#> 3 3 c