First of all: thanks to @MattDowle; data.table
is among the best things that
ever happened to me since I started using R
.
Second: I am aware of many workarounds for various use cases of variable column
names in data.table
, including:
- Select / assign to data.table variables which names are stored in a character vector
- pass column name in data.table using variable in R
- Referring to data.table columns by names saved in variables
- passing column names to data.table programmatically
- Data.table meta-programming
- How to write a function that calls a function that calls data.table?
- Using dynamic column names in `data.table`
- dynamic column names in data.table, R
- Assign multiple columns using := in data.table, by group
- Setting column name in "group by" operation with data.table
- R summarizing multiple columns with data.table
and probably more I haven't referenced.
But: even if I learned all the tricks documented above to the point that I never had to look them up to remind myself how to use them, I still would find that working with column names that are passed as parameters to a function is an extremely tedious task.
What I'm looking for is a "best-practices-approved" alternative to the following workaround / workflow. Consider that I have a bunch of columns of similar data, and would like to perform a sequence of similar operations on these columns or sets of them, where the operations are of arbitrarily high complexity, and the groups of column names passed to each operation specified in a variable.
I realize this issue sounds contrived, but I run into it with surprising frequency. The examples are usually so messy that it is difficult to separate out the features relevant to this question, but I recently stumbled across one that was fairly straightforward to simplify for use as a MWE here:
library(data.table)
library(lubridate)
library(zoo)
the.table <- data.table(year=1991:1996,var1=floor(runif(6,400,1400)))
the.table[,`:=`(var2=var1/floor(runif(6,2,5)),
var3=var1/floor(runif(6,2,5)))]
# Replicate data across months
new.table <- the.table[, list(asofdate=seq(from=ymd((year)*10^4+101),
length.out=12,
by="1 month")),by=year]
# Do a complicated procedure to each variable in some group.
var.names <- c("var1","var2","var3")
for(varname in var.names) {
#As suggested in an answer to Link 3 above
#Convert the column name to a 'quote' object
quote.convert <- function(x) eval(parse(text=paste0('quote(',x,')')))
#Do this for every column name I'll need
varname <- quote.convert(varname)
anntot <- quote.convert(paste0(varname,".annual.total"))
monthly <- quote.convert(paste0(varname,".monthly"))
rolling <- quote.convert(paste0(varname,".rolling"))
scaled <- quote.convert(paste0(varname,".scaled"))
#Perform the relevant tasks, using eval()
#around every variable columnname I may want
new.table[,eval(anntot):=
the.table[,rep(eval(varname),each=12)]]
new.table[,eval(monthly):=
the.table[,rep(eval(varname)/12,each=12)]]
new.table[,eval(rolling):=
rollapply(eval(monthly),mean,width=12,
fill=c(head(eval(monthly),1),
tail(eval(monthly),1)))]
new.table[,eval(scaled):=
eval(anntot)/sum(eval(rolling))*eval(rolling),
by=year]
}
Of course, the particular effect on the data and variables here is irrelevant, so please do not focus on it or suggest improvements to accomplishing what it accomplishes in this particular case. What I am looking for, rather, is a generic strategy for the workflow of repeatedly applying an arbitrarily complicated procedure of data.table
actions to a list of columns or list of lists-of-columns, specified in a variable or passed as an argument to a function, where the procedure must refer programmatically to columns named in the variable/argument, and possibly includes updates, joins, groupings, calls to the data.table
special objects .I
, .SD
, etc.; BUT one which is simpler, more elegant, shorter, or easier to design or implement or understand than the one above or others that require frequent quote
-ing and eval
-ing.
In particular please note that because the procedures can be fairly complex and involve repeatedly updating the data.table
and then referencing the updated columns, the standard lapply(.SD,...), ... .SDcols = ...
approach is usually not a workable substitute. Also replacing each call of eval(a.column.name)
with DT[[a.column.name]]
neither simplifies much nor works completely in general since that doesn't play nice with the other data.table
operations, as far as I am aware.
get(varname)
on RHS (wherevarname
is e.g. "var1" and not the quoted expression), and for LHS of:=
you can simply do:dt[, paste0(varname, '.rolling') := ...]
– eddiget()
, I considered it, but stackoverflow.com/a/12392269/241643 implied that it was sub-optimal. Is that no longer the case? – Philipget
can be suboptimal if you have a large number of columns, but in practice I find it much easier to use (which generally means faster overall runtime, when you include time spent writing/understanding/maintaining the code). It will likely be optimized to be as efficient aseval(quote
at some point. I also remember an FR that was asking to implement the.
as a function inj
-expressions that would do the equivalent ofget
, but efficiently (it also included using..
as a function to access variables outside of the local scope). – eddiquote.convert()
is justas.name()
– hadley