0
votes

I need to group a data.table by 30 columns.

Rather than list each column name, can I use a regular expression like grepl for the by argument?

Below is a minimum reproducible example

library(dplyr)

library(ggplot2)

library(data.table)

data <- diamonds %>% cbind("Another Column" = diamonds$cut == "Ideal") %>% data.table

data[, lapply(.SD, sum, na.rm = TRUE), by = list(carat, cut, color, clarity, depth, table, price, `Another Column`)]

I have already tried the following

data[, lapply(.SD, sum, na.rm = TRUE), by = colnames(data) %>% .[!grepl(pattern = "^[a-z]$", .)] %>% as.list]

Update:

  • please use pattern ^[a-z]$
  • sum column x, y, z by carat, cut, color, clarity, depth, table, price, Another Column
1
What is the "pattern" in column name that you are looking for? Also note that regex varies for each case, so if this is not your real data and real column names chances are that the answer would not work on your real data. - Ronak Shah
please use the regex from minimum reproducible example, the pattern is under I have already tried the following. - Kevin Ho
Do you want to group by those numerical columns? or sum them? data[, lapply(.SD, sum, na.rm = TRUE), .SDcols = grep(pattern = "^[a-z]$", names(data), value = TRUE)] ? - Ronak Shah
sum column x, y, z by carat, cut, color, clarity, depth, table, price, Another Column - Kevin Ho

1 Answers

2
votes

Not sure if there's a better way, but wrapping an expression returning character column names with eval or c (c used below) works

set.seed(2019)
df <- as.data.table(matrix(sample(1:2, 100, T), 10 , 10)) %>% 
        setnames(c(letters[1:3], paste0('zebra', 1:7)))

names(df)
# [1] "a"      "b"      "c"      "zebra1" "zebra2" "zebra3" "zebra4" "zebra5" "zebra6"
# [10] "zebra7"

df[, lapply(.SD, sum), by = c(grep('^[a-z]$', names(df), value = TRUE))]

#   a b c zebra1 zebra2 zebra3 zebra4 zebra5 zebra6 zebra7
# 1: 2 2 1      2      2      1      1      1      1      2
# 2: 2 2 2      1      1      1      1      1      2      2
# 3: 1 1 2      3      3      2      4      2      4      3
# 4: 2 1 1      3      3      2      3      3      2      3
# 5: 1 2 1      2      3      3      4      4      3      4
# 6: 2 1 2      1      2      2      1      1      2      1
# 7: 1 2 2      1      2      2      1      2      2      2

Edit: For example in question

data[, lapply(.SD, sum), by = c(grep('^[a-z]$', names(data), value = TRUE, invert = TRUE))]

#        carat       cut color clarity depth table price Another Column    x    y    z
#     1:  0.23     Ideal     E     SI2  61.5    55   326           TRUE 3.95 3.98 2.43
#     2:  0.21   Premium     E     SI1  59.8    61   326          FALSE 3.89 3.84 2.31
#     3:  0.23      Good     E     VS1  56.9    65   327          FALSE 4.05 4.07 2.31
#     4:  0.29   Premium     I     VS2  62.4    58   334          FALSE 4.20 4.23 2.63
#     5:  0.31      Good     J     SI2  63.3    58   335          FALSE 4.34 4.35 2.75
#    ---                                                                              
# 53133:  0.72     Ideal     D     SI1  60.8    57  2757           TRUE 5.75 5.76 3.50
# 53134:  0.72      Good     D     SI1  63.1    55  2757          FALSE 5.69 5.75 3.61
# 53135:  0.70 Very Good     D     SI1  62.8    60  2757          FALSE 5.66 5.68 3.56
# 53136:  0.86   Premium     H     SI2  61.0    58  2757          FALSE 6.15 6.12 3.74
# 53137:  0.75     Ideal     D     SI2  62.2    55  2757           TRUE 5.83 5.87 3.64