A long DataFrame can be reduced by merging certain rows into a
single one. The merged columns are constructed as a SimpleList
containing all the original values. Invariant columns, i.e. columns
that have the same value along all the rows that need to be
merged, can be shrunk into new columns containing that
invariant value (rather than into list columns). The grouping of
rows, i.e. the rows that need to be shrunk together as one, is
defined by a vector.
The opposite operation is expand. Note, however, that for a DataFrame
to be expanded back, it must not have been simplified.
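The reduce and expand operations described above can be illustrated with a small base R sketch. It is not the package's implementation: a plain data.frame and a named list of list columns stand in for the DataFrame and SimpleList objects, and the object names (long, reduced, n, expanded) are made up for illustration.

```r
# Base R sketch only; this is not how the package implements the functions.
long <- data.frame(k = c(1, 1, 2, 2, 2),
                   x = c(0.1, 0.2, 0.3, 0.4, 0.5),
                   z = c("a", "b", "c", "d", "e"))

# Reduce: split every column by the grouping vector, giving one list
# element per group (the analogue of a list column).
reduced <- lapply(long, function(col) unname(split(col, long$k)))

# Tally of rows merged into each new row (the analogue of count = TRUE).
n <- lengths(reduced$k)

# Expand: concatenate the groups of every column again.
expanded <- as.data.frame(lapply(reduced, unlist))
```

Because every column is kept as a list of groups (nothing is simplified), concatenating the groups recovers all the original values. In this sketch the rows come back ordered by group, which matches the input only because long is already sorted by k.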
reduceDataFrame(x, k, count = FALSE, simplify = TRUE, drop = FALSE)
expandDataFrame(x, k = NULL)
x: The DataFrame to be reduced or expanded.
k: A vector of length nrow(x) defining the grouping based on which the DataFrame will be shrunk.
count: A logical(1) specifying whether an additional column (called .n by default) with the tally of rows shrunk into one new row should be added. Note that if a column named .n already exists, it will be silently overwritten.
simplify: A logical(1) defining whether invariant columns should be simplified to plain vectors holding the invariant value rather than kept as list columns. Default is TRUE.
drop: A logical(1) specifying whether the non-invariant columns should be dropped altogether. Default is FALSE.
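As a rough illustration of what simplifying and dropping mean for invariant and non-invariant columns, the following base R sketch works on a named list of list columns that stands in for a reduced DataFrame. All object names (groups, invariant, simplified, dropped) are hypothetical, and the code only mimics the behaviour described for the simplify and drop arguments; it is not the package's code.

```r
# Illustrative base R only; the list below stands in for a reduced DataFrame.
groups <- list(k = list(c(1, 1), c(2, 2, 2)),
               x = list(c(0.1, 0.2), c(0.3, 0.4, 0.5)))

# A column is invariant when every group holds a single distinct value.
invariant <- vapply(groups,
                    function(col) all(lengths(lapply(col, unique)) == 1L),
                    logical(1))

# simplify = TRUE analogue: collapse invariant columns to plain vectors.
simplified <- groups
simplified[invariant] <- lapply(groups[invariant],
                                function(col) unlist(lapply(col, `[`, 1)))

# drop = TRUE analogue: keep only the (simplified) invariant columns.
dropped <- simplified[invariant]
```

Here k is invariant within each group and collapses to a plain vector, while x stays a list of groups and is the only column removed by the drop step.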
An expanded (reduced) DataFrame.
Missing values have an important effect on reduce. Unless all
values to be reduced are missing, they will result in a
non-invariant column, which will be dropped with drop = TRUE. See
the example below.
The presence of missing values can have side effects in higher-level
functions that rely on the reduction of DataFrame objects.
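The effect of a missing value on invariance can be sketched in base R. The helper is_invariant_group is hypothetical and not part of the package; it assumes that a group counts as invariant when it contains a single distinct value, with NA treated as a value of its own, which is consistent with the behaviour described above.

```r
# Hypothetical helper, not part of the package: a group is invariant when
# it holds a single distinct value; NA counts as a value of its own.
is_invariant_group <- function(v) length(unique(v)) == 1L

is_invariant_group(c("a", "a", "a"))   # TRUE: one distinct value
is_invariant_group(c(NA, "a", "a"))    # FALSE: NA and "a" differ
is_invariant_group(c(NA, NA, NA))      # TRUE: all values are missing
```

A single NA among otherwise identical values is therefore enough to turn a column into a non-invariant one.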
library("IRanges")
k <- sample(100, 1e3, replace = TRUE)
df <- DataFrame(k = k,
x = round(rnorm(length(k)), 2),
y = seq_len(length(k)),
z = sample(LETTERS, length(k), replace = TRUE),
ir = IRanges(seq_along(k), width = 10),
r = Rle(sample(5, length(k), replace = TRUE)),
invar = k + 1)
df
#> DataFrame with 1000 rows and 7 columns
#> k x y z ir r invar
#> <integer> <numeric> <integer> <character> <IRanges> <Rle> <numeric>
#> 1 83 0.43 1 D 1-10 3 84
#> 2 60 0.02 2 C 2-11 5 61
#> 3 5 -0.86 3 K 3-12 1 6
#> 4 50 0.29 4 O 4-13 1 51
#> 5 89 -2.36 5 K 5-14 1 90
#> ... ... ... ... ... ... ... ...
#> 996 18 0.64 996 E 996-1005 4 19
#> 997 27 1.51 997 F 997-1006 5 28
#> 998 52 0.53 998 T 998-1007 2 53
#> 999 52 0.15 999 L 999-1008 4 53
#> 1000 92 -1.36 1000 C 1000-1009 2 93
## Shrinks the DataFrame
df2 <- reduceDataFrame(df, df$k)
df2
#> DataFrame with 100 rows and 7 columns
#> k x y z
#> <integer> <NumericList> <IntegerList> <CharacterList>
#> 1 1 -0.79, 1.72, 0.35,... 63,67,323,... W,G,T,...
#> 2 2 1.02,-0.26, 0.52,... 114,204,324,... G,L,J,...
#> 3 3 -0.31,-0.32, 2.19,... 8,122,222,... K,M,U,...
#> 4 4 -0.91, 0.11, 1.14,... 95,155,242,... S,M,Y,...
#> 5 5 -0.86, 0.96,-0.12,... 3,62,278,... K,S,A,...
#> ... ... ... ... ...
#> 96 96 0.55,-0.54, 0.19,... 40,410,429,... P,D,L,...
#> 97 97 0.50,-0.76, 0.37,... 333,487,536,... X,V,N,...
#> 98 98 0.82,-0.03,-0.77,... 88,274,427,... T,L,Q,...
#> 99 99 -0.18, 1.32,-1.01,... 61,64,288,... B,U,B,...
#> 100 100 -1.50, 0.65, 1.49,... 17,47,231,... H,K,N,...
#> ir r invar
#> <IRangesList> <RleList> <numeric>
#> 1 63-72,67-76,323-332,... 4,2,3,... 2
#> 2 114-123,204-213,324-333,... 2,5,4,... 3
#> 3 8-17,122-131,222-231,... 1,1,4,... 4
#> 4 95-104,155-164,242-251,... 2,3,2,... 5
#> 5 3-12,62-71,278-287,... 1,1,4,... 6
#> ... ... ... ...
#> 96 40-49,410-419,429-438,... 3,1,2,... 97
#> 97 333-342,487-496,536-545,... 4,5,2,... 98
#> 98 88-97,274-283,427-436,... 2,5,2,... 99
#> 99 61-70,64-73,288-297,... 5,3,2,... 100
#> 100 17-26,47-56,231-240,... 1,1,1,... 101
## With a tally of the number of members in each group
reduceDataFrame(df, df$k, count = TRUE)
#> DataFrame with 100 rows and 8 columns
#> k x y z
#> <integer> <NumericList> <IntegerList> <CharacterList>
#> 1 1 -0.79, 1.72, 0.35,... 63,67,323,... W,G,T,...
#> 2 2 1.02,-0.26, 0.52,... 114,204,324,... G,L,J,...
#> 3 3 -0.31,-0.32, 2.19,... 8,122,222,... K,M,U,...
#> 4 4 -0.91, 0.11, 1.14,... 95,155,242,... S,M,Y,...
#> 5 5 -0.86, 0.96,-0.12,... 3,62,278,... K,S,A,...
#> ... ... ... ... ...
#> 96 96 0.55,-0.54, 0.19,... 40,410,429,... P,D,L,...
#> 97 97 0.50,-0.76, 0.37,... 333,487,536,... X,V,N,...
#> 98 98 0.82,-0.03,-0.77,... 88,274,427,... T,L,Q,...
#> 99 99 -0.18, 1.32,-1.01,... 61,64,288,... B,U,B,...
#> 100 100 -1.50, 0.65, 1.49,... 17,47,231,... H,K,N,...
#> ir r invar .n
#> <IRangesList> <RleList> <numeric> <integer>
#> 1 63-72,67-76,323-332,... 4,2,3,... 2 11
#> 2 114-123,204-213,324-333,... 2,5,4,... 3 11
#> 3 8-17,122-131,222-231,... 1,1,4,... 4 8
#> 4 95-104,155-164,242-251,... 2,3,2,... 5 8
#> 5 3-12,62-71,278-287,... 1,1,4,... 6 6
#> ... ... ... ... ...
#> 96 40-49,410-419,429-438,... 3,1,2,... 97 7
#> 97 333-342,487-496,536-545,... 4,5,2,... 98 7
#> 98 88-97,274-283,427-436,... 2,5,2,... 99 14
#> 99 61-70,64-73,288-297,... 5,3,2,... 100 12
#> 100 17-26,47-56,231-240,... 1,1,1,... 101 10
## Much faster, but more crowded result
df3 <- reduceDataFrame(df, df$k, simplify = FALSE)
df3
#> DataFrame with 100 rows and 7 columns
#> k x y z
#> <IntegerList> <NumericList> <IntegerList> <CharacterList>
#> 1 1,1,1,... -0.79, 1.72, 0.35,... 63,67,323,... W,G,T,...
#> 2 2,2,2,... 1.02,-0.26, 0.52,... 114,204,324,... G,L,J,...
#> 3 3,3,3,... -0.31,-0.32, 2.19,... 8,122,222,... K,M,U,...
#> 4 4,4,4,... -0.91, 0.11, 1.14,... 95,155,242,... S,M,Y,...
#> 5 5,5,5,... -0.86, 0.96,-0.12,... 3,62,278,... K,S,A,...
#> ... ... ... ... ...
#> 96 96,96,96,... 0.55,-0.54, 0.19,... 40,410,429,... P,D,L,...
#> 97 97,97,97,... 0.50,-0.76, 0.37,... 333,487,536,... X,V,N,...
#> 98 98,98,98,... 0.82,-0.03,-0.77,... 88,274,427,... T,L,Q,...
#> 99 99,99,99,... -0.18, 1.32,-1.01,... 61,64,288,... B,U,B,...
#> 100 100,100,100,... -1.50, 0.65, 1.49,... 17,47,231,... H,K,N,...
#> ir r invar
#> <IRangesList> <RleList> <NumericList>
#> 1 63-72,67-76,323-332,... 4,2,3,... 2,2,2,...
#> 2 114-123,204-213,324-333,... 2,5,4,... 3,3,3,...
#> 3 8-17,122-131,222-231,... 1,1,4,... 4,4,4,...
#> 4 95-104,155-164,242-251,... 2,3,2,... 5,5,5,...
#> 5 3-12,62-71,278-287,... 1,1,4,... 6,6,6,...
#> ... ... ... ...
#> 96 40-49,410-419,429-438,... 3,1,2,... 97,97,97,...
#> 97 333-342,487-496,536-545,... 4,5,2,... 98,98,98,...
#> 98 88-97,274-283,427-436,... 2,5,2,... 99,99,99,...
#> 99 61-70,64-73,288-297,... 5,3,2,... 100,100,100,...
#> 100 17-26,47-56,231-240,... 1,1,1,... 101,101,101,...
## Drop all non-invariant columns
reduceDataFrame(df, df$k, drop = TRUE)
#> DataFrame with 100 rows and 2 columns
#> k invar
#> <integer> <numeric>
#> 1 1 2
#> 2 2 3
#> 3 3 4
#> 4 4 5
#> 5 5 6
#> ... ... ...
#> 96 96 97
#> 97 97 98
#> 98 98 99
#> 99 99 100
#> 100 100 101
## Missing values
d <- DataFrame(k = rep(1:3, each = 3),
x = letters[1:9],
y = rep(letters[1:3], each = 3),
y2 = rep(letters[1:3], each = 3))
d
#> DataFrame with 9 rows and 4 columns
#> k x y y2
#> <integer> <character> <character> <character>
#> 1 1 a a a
#> 2 1 b a a
#> 3 1 c a a
#> 4 2 d b b
#> 5 2 e b b
#> 6 2 f b b
#> 7 3 g c c
#> 8 3 h c c
#> 9 3 i c c
## y is invariant and can be simplified
reduceDataFrame(d, d$k)
#> DataFrame with 3 rows and 4 columns
#> k x y y2
#> <integer> <CharacterList> <character> <character>
#> 1 1 a,b,c a a
#> 2 2 d,e,f b b
#> 3 3 g,h,i c c
## y is invariant, so it is not dropped
reduceDataFrame(d, d$k, drop = TRUE)
#> DataFrame with 3 rows and 3 columns
#> k y y2
#> <integer> <character> <character>
#> 1 1 a a
#> 2 2 b b
#> 3 3 c c
## BUT with a missing value
d[1, "y"] <- NA
d
#> DataFrame with 9 rows and 4 columns
#> k x y y2
#> <integer> <character> <character> <character>
#> 1 1 a NA a
#> 2 1 b a a
#> 3 1 c a a
#> 4 2 d b b
#> 5 2 e b b
#> 6 2 f b b
#> 7 3 g c c
#> 8 3 h c c
#> 9 3 i c c
## y isn't invariant/simplified anymore
reduceDataFrame(d, d$k)
#> DataFrame with 3 rows and 4 columns
#> k x y y2
#> <integer> <CharacterList> <CharacterList> <character>
#> 1 1 a,b,c NA,a,a a
#> 2 2 d,e,f b,b,b b
#> 3 3 g,h,i c,c,c c
## y now gets dropped
reduceDataFrame(d, d$k, drop = TRUE)
#> DataFrame with 3 rows and 2 columns
#> k y2
#> <integer> <character>
#> 1 1 a
#> 2 2 b
#> 3 3 c