2
votes

I have implemented a simple group-by-operation with the ?stats::aggregate function. It collects elements per group in a vector. I would like to make it faster using the data.table package. However I'm not able to reproduce the wanted behaviour with data.table.

Sample dataset:

df <- data.frame(group = c("a","a","a","b","b","b","b","c","c"), val = c("A","B","C","A","B","C","D","A","B"))

Output to reproduce with data.table:

by_group_aggregate <- aggregate(x = df$val, by = list(df$group), FUN = c)

What I've tried:

data_t <- data.table(df)
# working, but not what I want
by_group_datatable <- data_t[,j = paste(val,collapse=","), by = group]
# no grouping done when using c or as.vector
by_group_datatable <- data_t[,j = c(val), by = group]
by_group_datatable <- data_t[,j = as.vector(val), by = group]
# grouping leads to error when using as.list
by_group_datatable <- data_t[,j = as.list(val), by = group]

Is it possible to have vectors of different size in a data.table column? If yes, how do I achieve it?

2

2 Answers

4
votes

Here's one way:

data_t[, list(list(val)), by = group]
#   group      V1
#1:     a   A,B,C
#2:     b A,B,C,D
#3:     c     A,B

The first list() is used because you want to aggregate the result. The second list is used because you want to aggregate the val column into separate lists per group.

To check the structure:

str(data_t[, list(list(val)), by = group])
#Classes ‘data.table’ and 'data.frame': 3 obs. of  2 variables:
# $ group: Factor w/ 3 levels "a","b","c": 1 2 3
# $ V1   :List of 3
#  ..$ : Factor w/ 4 levels "A","B","C","D": 1 2 3
#  ..$ : Factor w/ 4 levels "A","B","C","D": 1 2 3 4
#  ..$ : Factor w/ 4 levels "A","B","C","D": 1 2
# - attr(*, ".internal.selfref")=<externalptr> 

Using dplyr, you could do the following:

library(dplyr)
df %>% group_by(group) %>% summarise(val = list(val))
#Source: local data frame [3 x 2]
#
#   group         val
#  (fctr)       (chr)
#1      a <S3:factor>
#2      b <S3:factor>
#3      c <S3:factor>

Check the structure:

df %>% group_by(group) %>% summarise(val = list(val)) %>% str
#Classes ‘tbl_df’, ‘tbl’ and 'data.frame':  3 obs. of  2 variables:
# $ group: Factor w/ 3 levels "a","b","c": 1 2 3
# $ val  :List of 3
#  ..$ : Factor w/ 4 levels "A","B","C","D": 1 2 3
#  ..$ : Factor w/ 4 levels "A","B","C","D": 1 2 3 4
#  ..$ : Factor w/ 4 levels "A","B","C","D": 1 2
2
votes

Here is another option with dplyr/tidyr

library(dplyr)
library(tidyr)
res <- df %>% 
        nest(-group)
str(res)
#'data.frame':   3 obs. of  2 variables:
# $ group: Factor w/ 3 levels "a","b","c": 1 2 3
# $ data :List of 3
#  ..$ :'data.frame':    3 obs. of  1 variable:
#  .. ..$ val: Factor w/ 4 levels "A","B","C","D": 1 2 3
#  ..$ :'data.frame':    4 obs. of  1 variable:
#  .. ..$ val: Factor w/ 4 levels "A","B","C","D": 1 2 3 4
#  ..$ :'data.frame':    2 obs. of  1 variable:
#  .. ..$ val: Factor w/ 4 levels "A","B","C","D": 1 2