I find manipulating factor variables in R unduly complicated. Things I frequently want to do when cleaning factors include:

  • Resorting levels – not just to set a reference category, but also to put all levels in a logical (non-alphabetical) order for summary tables: x <- factor(x, levels = new.order)

  • Recode / rename factor levels – to simplify names and/or collapse multiple categories into one group. For one-to-one recoding, use levels(x) <- new.levels(x) or plyr::revalue. car::recode can perform several many-to-one matches in a single statement, but doesn't support regex matching.

  • Drop levels – not just drop unused levels, but also set some levels to missing (e.g. those with error codes): x <- factor(as.character(x), exclude = drop.levels)

  • Add levels – to show categories with zero counts.
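
For reference, the four operations above can be sketched in base R as separate steps. The data and level names here are made up for illustration, with "err" standing in for an error code.

```r
# Made-up example factor; levels default to alphabetical order
x <- factor(c("low", "high", "mid", "err", "low"))

# Resort: put the levels in a logical order
new.order <- c("low", "mid", "high", "err")
x1 <- factor(x, levels = new.order)

# Recode / collapse: in a named list, each name is a new level and
# each value holds the old level(s) that map to it
x2 <- x1
levels(x2) <- list(Low = "low", MidHigh = c("mid", "high"), Error = "err")

# Drop: values whose level is left out of `levels` become NA
x3 <- factor(x2, levels = setdiff(levels(x2), "Error"))

# Add: append a level that has no observations
x4 <- factor(x3, levels = c(levels(x3), "Unknown"))
table(x4)
```

Each step is short, but the syntax differs from one step to the next, which is the inconsistency described above.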

What would be great is a single function that does all of the above at once, allows fuzzy (regex) matching for recoding and dropping levels, can be used within other functions (e.g. lapply or dplyr::mutate), and has a simple, consistent syntax.
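
To make the request concrete, here is a minimal base R sketch of what such a function might look like. The name recode_regex and its arguments are hypothetical (they do not come from any package). It assumes that map is a named character vector whose names are the new levels and whose values are regular expressions, and that exclude is a single pattern.

```r
# Hypothetical helper, not from any package: recode and drop factor
# levels by regular expression.
recode_regex <- function(x, map, exclude = NULL) {
  x <- as.factor(x)
  old <- levels(x)
  new <- old
  matched <- rep(FALSE, length(old))
  # Apply the patterns in order; the first pattern that matches a level wins
  for (i in seq_along(map)) {
    hit <- grepl(map[[i]], old) & !matched
    new[hit] <- names(map)[i]
    matched <- matched | hit
  }
  # Levels matching `exclude` are set to missing
  if (!is.null(exclude)) {
    new[grepl(exclude, old)] <- NA
  }
  # New levels come first (in the order given), then any untouched old levels
  new_levels <- unique(c(names(map), new[!matched]))
  new_levels <- new_levels[!is.na(new_levels)]
  factor(new[as.integer(x)], levels = new_levels)
}

x <- c("dogfish", "rabbit", "catfish", "mouse", "dirt")
y <- recode_regex(x, c(Sea = "fish", Land = "rab|mou"), exclude = "^di")
```

Because it takes a vector and returns a factor, a function of this shape can be called from lapply or dplyr::mutate without further wrapping.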

I’ve posted my best attempt at this as an answer below, but please let me know if I've missed a function that already exists or if the code can be improved.

EDIT

I've been made aware of the forcats package, which is subtitled Tools for working with Categorical Variables (Factors). The package has many options for resorting levels ('fct_infreq', 'fct_reorder', 'fct_relevel', ...), recoding/grouping levels ('fct_recode', 'fct_lump', 'fct_collapse'), dropping levels ('fct_recode'), and adding levels ('fct_expand'). But there are no plans for it to support regex matching (https://github.com/tidyverse/forcats/issues/214).
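
Although forcats does not match on regex itself, one possible workaround is to build the set of old levels with grep before passing it to fct_collapse. This is a sketch that assumes forcats is installed, and the example data are made up.

```r
library(forcats)

x <- factor(c("dogfish", "rabbit", "catfish", "mouse", "dirt"))

# Use grep() on the existing levels to emulate regex matching
y <- fct_collapse(
  x,
  Sea  = grep("fish", levels(x), value = TRUE),
  Land = grep("rab|mou", levels(x), value = TRUE)
)

# Move the new levels to the front, then add a level with zero counts
y <- fct_relevel(y, "Sea", "Land")
y <- fct_expand(y, "Air")
table(y)
```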

What do you mean by 'in a single step'? – effel
@effel I guess I was thinking of a one-line command that does everything and could be used inside an lapply call or similar, although I acknowledge this can be accomplished in R by wrapping everything in a custom function. I also wondered if I had missed a function from dplyr or another package that does what car::recode does but with a friendlier syntax. – JWilliman

1 Answer

Edit: A few years later, I've added the xfactor function on GitHub to accomplish the above. It is still a work in progress, so please let me know if you find any bugs.

devtools::install_github("jwilliman/xfactor")

library(xfactor)

# Create example factor
x <- xfactor(c("dogfish", "rabbit","catfish", "mouse", "dirt"))
levels(x)
#> [1] "catfish" "dirt"    "dogfish" "mouse"   "rabbit"

# Factor levels can be reordered by passing an unnamed vector to the levels
# argument. Levels not included in that vector are moved to the end,
# or dropped if exclude = TRUE.
xfactor(x, levels = c("mouse", "rabbit"))
#> [1] dogfish rabbit  catfish mouse   dirt   
#> Levels: mouse rabbit catfish dirt dogfish

xfactor(x, levels = c("mouse", "rabbit"), exclude = TRUE)
#> [1] <NA>   rabbit <NA>   mouse  <NA>  
#> Levels: mouse rabbit

# Factor levels can be recoded, collapsed, and reordered by passing a named
# vector to the levels argument, where the vector names are the new factor
# levels and the vector values are regular expressions matching the old levels.
# Duplicated new levels will be collapsed.

xfactor(x, levels = c("Sea" = "fish", "Land" = "rab|mou"))
#> [1] Sea  Land Sea  Land dirt
#> Levels: Sea Land dirt

# Factor levels can be dropped by passing a regular expression (or vector) to
# the exclude argument

xfactor(x, exclude = "fish")
#> [1] <NA>   rabbit <NA>   mouse  dirt  
#> Levels: dirt mouse rabbit

# The function also works inside other functions, such as dplyr::mutate

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
df <- data.frame(n = 1:5, x)
df %>%
  mutate(y = xfactor(x, levels = c("Sea" = "fish", "Land" = "rab|mou", "Air"), exclude = "di"))
#>   n       x    y
#> 1 1 dogfish  Sea
#> 2 2  rabbit Land
#> 3 3 catfish  Sea
#> 4 4   mouse Land
#> 5 5    dirt <NA>

Created on 2020-04-16 by the reprex package (v0.3.0)