12
votes

When doing data analysis, I sometimes need to recode values to factors in order to carry out group analysis. I want the order of the factor levels to be the same as the order of the conditions specified in case_when. In this case, the order should be "Excellent" "Good" "Fail". How can I achieve this without tediously repeating the labels, as in levels=c('Excellent', 'Good', 'Fail')?

Thank you very much.


library(dplyr, warn.conflicts = FALSE)             
                                                   
set.seed(1234)                                     
score <- runif(100, min = 0, max = 100)     
   
Performance <- function(x) {                       
  case_when(                                         
    is.na(x) ~ NA_character_,                          
    x > 80   ~ 'Excellent',                            
    x > 50   ~ 'Good',                                 
    TRUE     ~ 'Fail'                                  
  ) %>% factor(levels=c('Excellent', 'Good', 'Fail'))
}                                                  
                                                   
performance <- Performance(score)                  
levels(performance)                                
#> [1] "Excellent" "Good"      "Fail"
table(performance)                                 
#> performance
#> Excellent      Good      Fail 
#>        15        30        55
3
that's what he doesn't want to do (and is already doing) - De Novo
That's a nice solution! - Luke Hayden
Beautiful, thank you for this! - jzadra
To allow for expressions on the RHS, insert levels = sapply(levels, FUN = eval) on the second-to-last line. This makes it possible to write result = fct_case_when(x < 5 ~ my_vec[3]) without getting "my_vec[3]" as the result. - Jonas Lindeløv

3 Answers

4
votes

By default, factor() sets the levels in lexicographic (sorted) order. If you don't want to specify them, you can set the labels up so that lexicographic order is correct (Performance1), or create a levels vector once and use it both when generating the labels and when setting the levels (Performance2). I don't know how much effort or tedium either of these would save you, but here they are. See my third suggestion, at the end, for what I think is the least tedious way.

Performance1 <- function(x) {
  # factor() without a levels argument sorts the levels alphabetically
  case_when(
    is.na(x) ~ NA_character_,
    x > 80 ~ 'Excellent',
    x <= 50 ~ 'Fail',
    TRUE ~ 'Good'
  ) %>% factor()
}

Performance2 <- function(x, levels = c("Excellent", "Good", "Fail")){
  # The levels vector is defined once and reused for the labels and the level order
  case_when(
    is.na(x) ~ NA_character_,
    x > 80 ~ levels[1],
    x > 50 ~ levels[2],
    TRUE ~ levels[3]
  ) %>% factor(levels = levels)
}
performance1 <- Performance1(score)
levels(performance1)
# [1] "Excellent" "Fail"     "Good"
table(performance1)
# performance1
# Excellent      Fail      Good 
#        15        55        30 

performance2 <- Performance2(score)
levels(performance2)
# [1] "Excellent" "Good"      "Fail"  
table(performance2)
# performance2
# Excellent      Good      Fail 
#        15        30        55 

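If I could suggest an even less tedious way, use cut(). Note that its levels run from the lowest interval to the highest, which is the reverse of the order you asked for: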

performance <- cut(score, breaks = c(0, 50, 80, 100), 
                   labels = c("Fail", "Good", "Excellent"))
levels(performance)
# [1] "Fail"      "Good"      "Excellent"
table(performance)
# performance
#      Fail      Good Excellent 
#        55        30        15
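
If the "Excellent" "Good" "Fail" order matters, the cut() result can be reversed afterwards. The following is a minimal base R sketch, not part of the original answer; the small score vector is made up for illustration. It also sets include.lowest = TRUE, because with the default right-closed intervals a score of exactly 0 would otherwise fall outside (0, 50] and become NA.

```r
# Hypothetical scores, chosen to hit the interval boundaries
score <- c(0, 35, 50, 50.5, 80, 92, 100, NA)

# Intervals are [0, 50], (50, 80], (80, 100]; include.lowest keeps a score of 0
performance <- cut(score,
                   breaks = c(0, 50, 80, 100),
                   labels = c("Fail", "Good", "Excellent"),
                   include.lowest = TRUE)

# Reverse the level order so the best grade comes first
performance <- factor(performance, levels = rev(levels(performance)))

levels(performance)
# [1] "Excellent" "Good"      "Fail"
```

The boundaries match the case_when version: 50 is still "Fail" and 80 is still "Good", because the intervals are closed on the right.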
1
votes

While my solution replaces your piping with a messy intermediate variable, this works:

library(dplyr, warn.conflicts = FALSE)

set.seed(1234)                                     
score <- runif(100, min = 0, max = 100)     

Performance <- function(x) {                       
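  labels <- case_when(
    is.na(x) ~ NA_character_,
    x > 80   ~ 'Excellent',
    x > 50   ~ 'Good',
    TRUE     ~ 'Fail'
  )
  # Mark the first occurrence of each distinct label
  first <- !duplicated(labels)
  # Order the levels by the score at that first occurrence, highest first
  factor(labels, levels = labels[first][order(x[first], decreasing = TRUE)])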
}                                                  
performance <- Performance(score)                
levels(performance)  

Edited to fix!

0
votes

This is an implementation I have been using:

library(dplyr)
library(purrr)
library(rlang)
library(forcats)

factored_case_when <- function(...) {
  args <- list2(...)
  rhs <- map(args, f_rhs)
  
  cases <- case_when(
    !!!args
  )
  
  exec(fct_relevel, cases, !!!rhs)
}


numbers <- c(2, 7, 4, 3, 8, 9, 3, 5, 2, 7, 5, 4, 1, 9, 8)

factored_case_when(
  numbers <= 2 ~ "Very small",
  numbers <= 3 ~ "Small",
  numbers <= 6 ~ "Medium",
  numbers <= 8 ~ "Large",
  TRUE    ~ "Huge!"
)
#>  [1] Very small Large      Medium     Small      Large      Huge!     
#>  [7] Small      Medium     Very small Large      Medium     Medium    
#> [13] Very small Huge!      Large     
#> Levels: Very small Small Medium Large Huge!

This has the advantage of not having to manually specify the factor levels.

I have also submitted a feature request to dplyr for this functionality: https://github.com/tidyverse/dplyr/issues/6029
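
One possible variation on the function above, sketched here as an assumption rather than as part of the original answer: build the factor with factor() directly, taking the levels from the evaluated right-hand sides. The name case_when_fct is hypothetical. This version keeps levels that never occur in the data and drops an NA right-hand side from the levels. It assumes every right-hand side evaluates on its own to a single string, so it would not work with vector-valued right-hand sides or with column names inside mutate().

```r
library(dplyr, warn.conflicts = FALSE)
library(rlang)

# Hypothetical helper: case_when() whose result is a factor with
# levels in the order the right-hand sides were written
case_when_fct <- function(...) {
  args <- list2(...)
  # Evaluate each right-hand side in the environment of its formula
  lvls <- vapply(
    args,
    function(f) as.character(eval_tidy(f_rhs(f), env = f_env(f))),
    character(1)
  )
  # Drop NA labels and repeated labels, keeping first-seen order
  lvls <- unique(lvls[!is.na(lvls)])
  factor(case_when(!!!args), levels = lvls)
}

numbers <- c(2, 7, 4)

result <- case_when_fct(
  numbers <= 2 ~ "Very small",
  numbers <= 3 ~ "Small",
  numbers <= 6 ~ "Medium",
  numbers <= 8 ~ "Large",
  TRUE         ~ "Huge!"
)

levels(result)
# [1] "Very small" "Small"      "Medium"     "Large"      "Huge!"
```

Here "Small" and "Huge!" stay in the level set even though no element of numbers maps to them, which keeps empty groups visible in table().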