I am writing a function that rewrites column names so that the output data.table follows a standard format. The inputs are user-provided data.tables whose column names may differ from the standard in several places.

Every input data.table should come out with these column names:

length    width    height    weight

An input data.table might look like this:

input_dt = data.table(
  length = 194,
  wide = 36,
  tall = 340,
  kilogram = 231.2
)

My function would take this data.table (or data.frame) as input and rename the columns, returning this data.table:

length    width    height    weight
194       36       340       231.2

I've created a key that maps each standard name to the alternative names it might appear under:

key = list(
    length = c('long'),
    width = c('girth', 'WIDTH', 'wide'),
    height = c('tall', 'high'),
    weight =  c('Weight', 'WEIGHT', 'kilogram', 'pound', 'kilograms', 'pounds')
)

Within the function, I can find which column names of input_dt need to be changed by intersecting them with the key:

> intersect(names(input_dt), unlist(key))
[1] "wide"     "tall"     "kilogram"

I would then rename these accordingly. My question is:

Written naively, this function would be full of for-loops and quite inefficient. Is there a more data.table-friendly solution, given a custom "key" of values?

1 Answer

Build the key into a data.table instead of using the list directly, then join on it:

library(data.table)

# keeping the list makes it easier to update your keywords later
key_list = list(
  length = c('long'),
  width  = c('girth', 'WIDTH', 'wide'),
  height = c('tall', 'high'),
  weight = c('Weight', 'WEIGHT', 'kilogram', 'pound', 'kilograms', 'pounds')
)
# build into a long data.table: one row per (standard name, synonym) pair
keyDT = data.table(
  # the column can't be called "key", since key= is an argument of data.table()
  key_name = rep(names(key_list), lengths(key_list)),
  synonym = unlist(key_list),
  # key on synonym so the join below needs no on= argument
  key = 'synonym'
)

# nomatch = 0 to skip unmatched columns
keyDT[.(names(input_dt)), setnames(input_dt, synonym, key_name), nomatch = 0L]

Since setnames works by reference, input_dt is renamed in place:

input_dt
#    length width height weight
# 1:    194    36    340  231.2
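
Since the question asks for a function that also accepts a data.frame, the join above could be wrapped along these lines. This is only a sketch: `standardize_names` is a hypothetical name, not something from data.table, and it assumes you want a renamed copy returned rather than the input modified by reference.

```r
library(data.table)

key_list = list(
  length = c('long'),
  width  = c('girth', 'WIDTH', 'wide'),
  height = c('tall', 'high'),
  weight = c('Weight', 'WEIGHT', 'kilogram', 'pound', 'kilograms', 'pounds')
)
keyDT = data.table(
  key_name = rep(names(key_list), lengths(key_list)),
  synonym  = unlist(key_list, use.names = FALSE)
)

# Hypothetical helper (name and signature are assumptions for illustration)
standardize_names = function(dt, keyDT) {
  # work on a copy so the caller's object keeps its original names;
  # a plain data.frame is converted to a data.table on the way
  out = if (is.data.table(dt)) copy(dt) else as.data.table(dt)
  # keep only the rows of keyDT whose synonym is an actual column name
  hits = keyDT[.(synonym = names(out)), on = "synonym", nomatch = 0L]
  # skip the rename entirely when nothing matched
  if (nrow(hits)) setnames(out, hits$synonym, hits$key_name)
  out[]
}

input_dt = data.table(length = 194, wide = 36, tall = 340, kilogram = 231.2)
result = standardize_names(input_dt, keyDT)
names(result)
```

Using `on = "synonym"` here means the helper does not depend on keyDT having been keyed beforehand. Note that the sketch does not guard against two input columns mapping to the same standard name (say, both `wide` and `girth`); that case would need its own check.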

For robustness, you may want to add each standard name to its own entry in key_list (e.g., length = c('length', 'long')). Every acceptable column name then appears in keyDT, so any name of input_dt that finds no match must be an as-yet-unseen synonym, and you can throw an error or warning for it.
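
One way the self-mapping check could look is sketched below. The helper name `rename_or_warn` and the wording of the warning are assumptions made for illustration; whether an unseen name should warn or stop is up to you.

```r
library(data.table)

# each standard name now also lists itself, so already-correct columns match too
key_list = list(
  length = c('length', 'long'),
  width  = c('width', 'girth', 'WIDTH', 'wide'),
  height = c('height', 'tall', 'high'),
  weight = c('weight', 'Weight', 'WEIGHT', 'kilogram', 'pound', 'kilograms', 'pounds')
)
keyDT = data.table(
  key_name = rep(names(key_list), lengths(key_list)),
  synonym  = unlist(key_list, use.names = FALSE),
  key = 'synonym'
)

# Hypothetical helper: renames dt by reference and warns about unseen names
rename_or_warn = function(dt, keyDT) {
  # anything not listed in keyDT is neither a standard name nor a known synonym
  unknown = setdiff(names(dt), keyDT$synonym)
  if (length(unknown)) {
    warning("unrecognized column name(s): ", paste(unknown, collapse = ", "))
  }
  # same join as before; unmatched columns are simply left alone
  keyDT[.(names(dt)), setnames(dt, synonym, key_name), nomatch = 0L]
  invisible(dt)
}

input_dt = data.table(length = 194, wide = 36, tall = 340, kilogram = 231.2)
rename_or_warn(input_dt, keyDT)
names(input_dt)
```

Replacing `warning()` with `stop()` would make an unseen name a hard failure instead, which may be preferable if downstream code relies on all four standard columns being present.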