1
votes

When in interactive mode, exploring aggregations in data.table, I may do dozens, or hundreds of experiments. The default column names after aggregations (V1, V2, etc) obviously aren't very informative, after writing the minimal amount of code needed to generate the aggregations. Often, I'd be happier with the default column name of a simple aggregation of one column, like a mean or sum, to be just the name of the underlying variable.

All the extra column name typing gets tiring, and I want to avoid this.

Is there any easy way to do this in data.table?

e.g. a simplified example to demonstrate if it's not clear:

DT = data.table(x =rep(c("b","a","c"),each=3), y_a_long_name=c(1,3,6), v_a_long_name=1:9)

DT[, .(sum(v_a_long_name), mean(y_a_long_name)), by = x]
#    x V1       V2
# 1: b  6 3.333333
# 2: a 15 3.333333
# 3: c 24 3.333333

When you start working with several columns, using different aggregation type functions, the above labeling, V1, V2, isn't helpful.

All the extra typing of repeating the names is tedious, but I'd like to see something like this:

DT[, .(v_a_long_name = sum(v_a_long_name), y_a_long_name = mean(y_a_long_name)), by = x]
#    x v_a_long_name y_a_long_name
# 1: b             6      3.333333
# 2: a            15      3.333333
# 3: c            24      3.333333

while typing as minimal as possible. e.g. it would be ideal if

DT[, .(sum(v_a_long_name), mean(y_a_long_name)), by = x]

printed this by default:

#    x v_a_long_name y_a_long_name
# 1: b             6      3.333333
# 2: a            15      3.333333
# 3: c            24      3.333333
2
The answers below will stop GForce from optimizing your queries. Does your IDE not have auto-complete? If you're just playing around, I don't see much penalty to using names like V1, V5. If this is not ephemeral, writing explicit code will help others (including future-you) better understand your code. Choosing to eliminate sum and mean from the name of the aggregated variable strikes me as troublesome / begging for confusion/errors down the line.MichaelChirico
With that in mind, there are a few outstanding issues related to auto-naming columns: github.com/Rdatatable/data.table/issues/1604, github.com/Rdatatable/data.table/issues/618, github.com/Rdatatable/data.table/issues/1543, and probably most germane: github.com/Rdatatable/data.table/issues/1227MichaelChirico
Feel free to chime in on those issues. I don't think it makes any sense for auto-naming to exclude the method of aggregation (i.e. any solution I'd find acceptable would be like v_a_long_name_sum and y_a_long_name_mean)MichaelChirico
Yeah, automating including method would be even better (just more typing). Yes, most of the time the code would be kept long term, so auto naming would be niceuser7613376

2 Answers

0
votes

A bit of a roundabout solution, but you can pass a list of functions and a list of columns to Map + do.call. The names for the resulting columns will be taken from the first argument to Map, which means you only have to specify the names in long-form in .SDcols once:

DT[, 
  Map(do.call, args=lapply(.SD,list), what=c(sum,mean)),
  by=x, .SDcols=c("v_a_long_name","y_a_long_name")
]

#   x v_a_long_name y_a_long_name
#1: b             6      3.333333
#2: a            15      3.333333
#3: c            24      3.333333

If the auto-naming is required, you could define the variables ahead of time:

nms  <- c("v_a_long_name","y_a_long_name")
funs <- c("sum","mean")

DT[, 
  setNames(Map(do.call, args=lapply(.SD,list), what=funs), paste(nms,funs,sep="_")),
  by=x, .SDcols=nms
]

#   x v_a_long_name_sum y_a_long_name_mean
#1: b                 6           3.333333
#2: a                15           3.333333
#3: c                24           3.333333
2
votes

We can use setNames to wrap around the list of columns

DT[, setNames(.(sum(v_a_long_name), mean(y_a_long_name)), names(DT)[2:3]), by = x]
#   x y_a_long_name v_a_long_name
#1: b             6      3.333333
#2: a            15      3.333333
#3: c            24      3.333333

Or with setnames after getting the output

setnames(DT[, .(sum(v_a_long_name), mean(y_a_long_name)),
                         by = x], 2:3, names(DT)[2:3])[]

Or make it more compact by extracting the columns from .SD

setnames(DT[, .(sum(.SD[[2]]), mean(.SD[[1]])), by = x], 2:3, names(DT)[2:3])[]