I find manipulating factor variables in R unduly complicated. Things I frequently want to do when cleaning factors include:
Resorting levels – not just to set a reference category, but also to put all levels in a logical (non-alphabetical) order for summary tables.
x <- factor(x, levels = new.order)
Recode / rename factor levels – to simplify names and/or collapse multiple categories into one group. For one-to-one recoding, use
levels(x) <- new.levels(x)
or plyr::revalue (see here or here for examples). car::recode can perform several one-to-many matches in a single statement, but doesn't support regex matching.
Drop levels – not just drop unused levels, but set some levels to missing (e.g. those with error codes).
x <- factor(as.character(x), exclude = drop.levels)
Add levels – to show categories with zero counts.
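To make the four tasks above concrete, here is a minimal base R sketch. The example data and the level names (mid_high, very_high) are made up for illustration; new.order and drop.levels follow the placeholder names used in the snippets above.

```r
# Illustrative data only; "err" stands in for an error code.
x <- factor(c("low", "high", "medium", "err", "low"))

# 1. Resort levels into a logical (non-alphabetical) order.
new.order <- c("low", "medium", "high", "err")
x1 <- factor(x, levels = new.order)

# 2. Recode / collapse: a named list maps each new level to its old levels.
x2 <- x1
levels(x2) <- list(low = "low", mid_high = c("medium", "high"), err = "err")

# 3. Drop levels: values in drop.levels become NA.
#    Going through as.character() re-sorts the remaining levels alphabetically.
drop.levels <- "err"
x3 <- factor(as.character(x2), exclude = drop.levels)

# 4. Add a level so that a category with zero counts shows up in tables.
x4 <- factor(x3, levels = c(levels(x3), "very_high"))
table(x4, useNA = "ifany")
```

Each step needs a different idiom, which is the inconsistency I am complaining about.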
What would be great is to have a single function that can do all of the above at once, allows fuzzy (regex) matching for recoding and dropping factors, can be used within other functions (e.g. lapply or dplyr::mutate), and has a simple (consistent) syntax.
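As a rough illustration of the kind of interface meant here, the following base R sketch handles regex recoding and dropping in one call. The function name recode_regex and its arguments recode and drop are hypothetical, invented for this example; it is not an existing function from any package and covers only part of the wish list.

```r
# Hypothetical helper: the name and arguments are illustrative assumptions.
recode_regex <- function(x, recode = character(), drop = NULL) {
  x <- as.factor(x)
  old <- levels(x)
  new <- old
  # Each element of `recode` is a regex; its name is the replacement level.
  for (i in seq_along(recode)) {
    new[grepl(recode[[i]], old)] <- names(recode)[i]
  }
  # Levels matching `drop` are set to NA: the level is removed and the
  # corresponding values become missing.
  if (!is.null(drop)) new[grepl(drop, old)] <- NA
  # Assigning duplicated names merges the matching levels.
  levels(x) <- new
  x
}

x <- factor(c("apple_red", "apple_green", "banana", "err_99", "banana"))
y <- recode_regex(x, recode = c(apple = "^apple"), drop = "^err")
levels(y)
table(y, useNA = "ifany")
```

Because the factor is the first argument, a function shaped like this can be passed to lapply with the remaining arguments supplied by name.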
I’ve posted my best attempt at this as an answer below, but please let me know if I've missed a function that already exists or if the code can be improved.
EDIT
I've been made aware of the forcats package, which is subtitled Tools for Working with Categorical Variables (Factors). The package has many options for resorting levels ('fct_infreq', 'fct_reorder', 'fct_relevel', ...), recoding/grouping levels ('fct_recode', 'fct_lump', 'fct_collapse'), dropping levels ('fct_recode'), and adding levels ('fct_expand'). But there are no plans for it to support regex matching (https://github.com/tidyverse/forcats/issues/214).
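For comparison, here is a short sketch of the same four tasks with forcats, assuming the package is installed. The data and the level name cherry are made up. Since forcats does not match by regex itself, one possible workaround (my assumption, not something the package documents as a feature) is to select the levels with grep first and pass the result on.

```r
library(forcats)  # assumed to be installed

x <- factor(c("apple_red", "apple_green", "banana", "err_99", "banana"))

# Regex matching is done in base R, outside forcats.
apple.levels <- grep("^apple", levels(x), value = TRUE)

y <- fct_collapse(x, apple = apple.levels)  # collapse several levels into one
y <- fct_recode(y, NULL = "err_99")         # recoding to NULL sets the level to NA
y <- fct_relevel(y, "banana")               # move "banana" to the front
y <- fct_expand(y, "cherry")                # add a level with zero counts
table(y, useNA = "ifany")
```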
car::recode does but with a friendlier syntax. – JWilliman