29
votes

I have an example data set with a column that reads somewhat like this:

Candy
Sanitizer
Candy
Water
Cake
Candy
Ice Cream
Gum
Candy
Coffee

What I'd like to do is collapse it into just two factor levels: "Candy" and "Non-Candy". I can do this with Python/Pandas, but can't seem to figure out a dplyr-based solution. Thank you!

5 Answers

67
votes

With dplyr (replace() itself is base R):

dat %>% 
    mutate(var = replace(var, var != "Candy", "Not Candy"))

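This is significantly faster than the ifelse approaches. Note that it assumes var is a character column; on a factor column, assigning a value that is not an existing level produces NA with a warning. The initial data frame can be created as follows: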

library(dplyr)
# tibble() replaces the deprecated as_data_frame() and names the column directly
dat <- tibble(var = c("Candy","Sanitizer","Candy","Water","Cake","Candy","Ice Cream","Gum","Candy","Coffee"))
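
As a rough sketch of the character-versus-factor caveat, the base R snippet below applies replace() to a plain character vector and then to a factor that is first converted with as.character(). The vectors and the names var_chr, var_fct, res_chr and res_fct are hypothetical, and the "Non-Candy" label follows the question's wording.

```r
# Hypothetical example vectors; the names are illustrative only.
var_chr <- c("Candy", "Sanitizer", "Candy", "Water")

# Character input: replace() swaps the selected elements directly.
res_chr <- replace(var_chr, var_chr != "Candy", "Non-Candy")

# Factor input: convert to character first, then rebuild the factor,
# because "Non-Candy" is not an existing level of var_fct.
var_fct <- factor(var_chr)
res_fct <- factor(replace(as.character(var_fct), var_fct != "Candy", "Non-Candy"))

print(res_chr)
print(levels(res_fct))
```

The same replace() call can be placed inside mutate() unchanged, as long as the column is character at that point.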
15
votes

Assuming your data frame is dat and your column is var:

dat = dat %>% mutate(candy.flag = factor(ifelse(var == "Candy", "Candy", "Non-Candy")))
7
votes

Another solution with dplyr using case_when:

dat %>%
    mutate(var = case_when(var == 'Candy' ~ 'Candy',
                           TRUE ~ 'Non-Candy'))

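The syntax for case_when is condition ~ replacement value. Conditions are evaluated in order, and the final TRUE ~ ... acts as the catch-all for everything not matched earlier. See ?case_when for the documentation.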

This is probably less efficient than the solution using replace, but it has the advantage that multiple replacements can be performed in a single command that stays readable. For example, to produce three levels:

dat %>%
    mutate(var = case_when(var == 'Candy' ~ 'Candy',
                           var == 'Water' ~ 'Water',
                           TRUE ~ 'Neither-Water-Nor-Candy'))
6
votes

No need for dplyr. Assuming var is stored as a factor already:

non_c <- setdiff(levels(dat$var), "Candy")

levels(dat$var) <- list(Candy = "Candy", "Non-Candy" = non_c)

See ?levels.
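
For illustration, here is a minimal base R sketch of what the list form of levels<- does, using the values from the question. The vector name f is hypothetical; each list name becomes a new level, and every old level listed under it is merged into that level.

```r
# Values from the question, stored as a factor (the name `f` is illustrative).
f <- factor(c("Candy", "Sanitizer", "Candy", "Water", "Cake",
              "Candy", "Ice Cream", "Gum", "Candy", "Coffee"))

# Every existing level except "Candy".
non_c <- setdiff(levels(f), "Candy")

# Assigning a named list merges the listed old levels into the new ones.
levels(f) <- list(Candy = "Candy", "Non-Candy" = non_c)

print(levels(f))
print(table(f))
```

Only the level labels are rewritten here, not the underlying integer codes for each row, which is why this approach scales well with the number of rows.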

This is much more efficient than the ifelse approach, which is bound to be slow:

library(microbenchmark)  # provides get_nanotime()
library(dplyr)           # provides %>% and mutate()
set.seed(01239)
smp <- data.frame(sample(dat$var, 1e6, TRUE))
names(smp) <- "var"

times <- 
  replicate(50, 
            {cop <- smp
            s <- get_nanotime()
            levs <- setdiff(levels(cop$var), "Candy")
            levels(cop$var) <- list(Candy = "Candy", "Non-Candy" = levs)
            d1 <- get_nanotime() - s
            cop <- smp
            s <- get_nanotime()
            cop = cop %>%
              mutate(candy.flag = factor(ifelse(var == "Candy", 
                                                "Candy", "Non-Candy")))
            d2 <- get_nanotime() - s
            cop <- smp
            s <- get_nanotime()
            cop$var <- 
              factor(cop$var == "Candy", labels = c("Non-Candy", "Candy"))
            d3 <- get_nanotime() - s
            c(levels = d1, dplyr = d2, direct = d3)})

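(x <- apply(times, 1, median))[2:3]/x[1]
#    dplyr   direct 
# 8.894303 4.962791 

That is, the levels approach is about 9 times faster than the dplyr/ifelse approach and about 5 times faster than building the factor directly.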

0
votes

When you only need two values, a simple ifelse() is prettiest, I think.

Furthermore, nested ifelse() calls can reproduce the case_when solution proposed by PhJ (I do like its readability, though)!

dat %>%
    mutate(
        var = ifelse(var == "Candy", "Candy", "Non-Candy")
    )
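
As a sketch of the nested variant mentioned above, the base R snippet below reproduces the three-level case_when example on a plain vector. The vector and the name three_way are hypothetical; the same expression could be used as the right-hand side inside mutate().

```r
# Hypothetical input vector for illustration.
var <- c("Candy", "Water", "Gum", "Candy", "Coffee")

# The inner ifelse() handles everything the outer condition did not match.
three_way <- ifelse(var == "Candy", "Candy",
             ifelse(var == "Water", "Water", "Neither-Water-Nor-Candy"))

print(three_way)
```

Each extra level adds another layer of nesting, which is where case_when tends to stay easier to read.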