1
votes

I am trying to convert an R data.table code snippet into an appropriate function, but I am not having success.

I would like to summarize a variable using this code:

library(data.table)
mtcars_dt <- 
  data.table(mtcars)

# Expression
mtcars_dt[,list(.N),by=cyl][order(cyl),list(cyl,N,Proportion=N/sum(N))]

The result is a data.table (as intended):

   cyl  N Proportion
1:   4 11    0.34375
2:   6  7    0.21875
3:   8 14    0.43750

Then I wish to create a function and apply this function across multiple variables, with a data.table resulting from each iteration of the loop applying the function:

# turn into function and apply loop, returning DT from each iteration
var_list <- c('cyl','gear')
for (i in var_list){
  # generalize the code above
}

I am not sure the best way to proceed. I tried this solution, but I am losing the variable name in the first column. I wonder if I am on the wrong track with eval(quote(...))

# My attempt, not working yet!
var_list <- c(quote(cyl),quote(gear))
f_numeric_cat <- 
  function(dt,var1) {
    dt[,list(.N),by=eval(var1)][order(eval(var1)),Proportion:=N/sum(N)][]
  }
for (i in var_list){
  print(f_numeric_cat(mtcars_dt,i))
}
   var1  N Proportion
1:    6  7    0.21875
2:    4 11    0.34375
3:    8 14    0.43750
   var1  N Proportion
1:    4 12    0.37500
2:    3 15    0.46875
3:    5  5    0.15625
1

1 Answers

2
votes

The goal is a function of DT & var to do:

DT[, list(.N), by=var][order(var), list(var, N, Proportion=N/sum(N))]

To do a calculation by a variable and then order by it, use keyby=. So your function can become:

f_tabulate <- 
  function(DT, var) {
    DT[, list(.N), keyby=var][, Proportion := N/sum(N)][]
  } 
# usage
for (i in c('cyl', 'gear')) print(f_tabulate(mtcars_dt, i))

keyby= and by= can take arguments in many forms, and you don't need to quote or eval a simple vector of variable names. (Other ways include .(var1, var2), list(var1, var2), "var1,var2".) You could also extend to counting by multiple variables...

f_tabulate2 <- 
  function(DT, ...) {
    DT[, list(.N), keyby=c(...)][, Proportion := N/sum(N)][]
  } 
# usage
f_tabulate2(mtcars_dt, 'cyl', 'gear')

For this operation (excepting the ordering part), you could also use groupingsets():

> groupingsets(mtcars_dt, .N, keyby=c('cyl', 'gear'), sets=list("cyl", "gear"))[, 
    Proportion := N/nrow(mtcars_dt)][]
   cyl gear  N Proportion
1:   6   NA  7    0.21875
2:   4   NA 11    0.34375
3:   8   NA 14    0.43750
4:  NA    4 12    0.37500
5:  NA    3 15    0.46875
6:  NA    5  5    0.15625

As a function (and adding ordering back in)...

f_tabulate_all = function(DT, vars){
  lvars = as.list(vars)
  ocall = as.call(lapply(c("order", vars), as.name))
  groupingsets(DT[eval(ocall)], .N, by=vars, sets=as.list(vars))[, 
    Proportion := N/nrow(DT)][]
}
# usage
f_tabulate_all(mtcars_dt, c('cyl', 'gear'))

The as.name function is achieving the same thing as quote when applied to a string naming a function or other object.