How to use a loop with mutate dplyr

Question

I have a question regarding a for-loop within R's dplyr. Imagine I have the following dataframe:

id <- c(rep(8, 9))
check <- c(0,1,1,0,0,1,0,0,0)
df <- data.frame(id, check)
df$count_x <- cumsum(df$check)
df$count_y <- NA
df$count_y[1] <- ifelse(df$check[1] == 0, 0, 1)
co <- df$count_y[1]

I want to fill the variable count_y using the adjusted cumulative count below:

for (idx in 2:nrow(df)){
   if(df[idx, 2] == 1 & df[idx - 1, 2] == 0){
     co <- 1
     df[idx, 4] <- co 
   } else if (df[idx, 2] == 1 & df[idx - 1, 2] == 1){
     co <- co + 1
     df[idx, 4] <- co
   } else if (df[idx, 2] == 0){
     df[idx, 4] <- co
   } 
 }

The output of this for-loop is correct. However, in my current data set, I have many IDs, and using a for loop to iterate over the IDs will take too much time. I'm trying to use the functionality of dplyr to speed up the process.

library(dplyr)

id <- c(rep(8, 9))
check <- c(0,1,1,0,0,1,0,0,0)
df <- data.frame(id, check)
df <- df %>% group_by(id) %>% mutate(count_x = cumsum(check),
                                 count_y = NA) %>% ungroup()
df <- df %>% group_by(id) %>% mutate(count_y = replace(count_y, 1, ifelse(check[1] == 0, 0 , 1)))

count_n <- function(df){

  co <- df$count_y[1]

  for (idx in 2:nrow(df)){
     if(df[idx, 2] == 1 & df[idx - 1, 2] == 0){
       co <- 1
       df[idx, 4] <- co 
     } else if (df[idx, 2] == 1 & df[idx - 1, 2] == 1){
       co <- co + 1
       df[idx, 4] <- co
     } else if (df[idx, 2] == 0){
       df[idx, 4] <- co
     }
   }
   df  # return the modified data frame; without this the function returns NULL
 }

I want to use mutate to call the function count_n to fill count_y as described above. I'm aware that I'm passing just one variable, whereas the function needs a whole data frame because it relies on the columns 'check' (column 2) and 'count_y' (column 4). I have tried multiple options (mutate_at, mutate_all, etc.) but couldn't make it work. What can I do differently?

df <- df %>% group_by(id) %>% mutate_at(vars(count_y), ~count_n(.))

Dan Chaltiel · Accepted Answer · 2021-05-05T19:31:13

I think this is the perfect case to use purrr::accumulate2().

purrr::accumulate() is often used to calculate conditional cumulative sums. It takes a function as the second argument. This function should have 2 arguments: the cumulative output co, and the currently evaluated value x.
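As a minimal sketch of that two-argument function (the vectors and the names running_sum and reset_sum are made up for illustration, not taken from the question), here is a plain cumulative sum followed by a conditional one that resets whenever the current value is 0:

```r
library(purrr)

# Plain cumulative sum: co is the running total, x is the current element
running_sum <- accumulate(1:5, function(co, x) co + x)
running_sum
#> [1]  1  3  6 10 15

# Conditional variant: the running total resets to 0 whenever x is 0
reset_sum <- accumulate(c(1, 1, 0, 1, 1, 1),
                        function(co, x) if (x == 0) 0 else co + x)
reset_sum
#> [1] 1 2 0 1 2 3
```

In both calls the first element of the input is used as the starting value of co, because no .init is supplied.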

purrr::accumulate2() allows us to iterate over a second variable as well, and here we use lag(check) as lx. The tricky part is that this second variable must be one item shorter than the first: the first element of check serves as the initial value, so it needs no lagged counterpart.
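To make the length requirement concrete, this small sketch (using the check vector from the question; lagged and lx are illustrative names) shows what lag(check) and lag(check)[-1] contain:

```r
library(dplyr)

check <- c(0, 1, 1, 0, 0, 1, 0, 0, 0)

# lag() shifts the vector by one position and pads the front with NA
lagged <- lag(check)
lagged
#> [1] NA  0  1  1  0  0  1  0  0

# Dropping the leading NA leaves one element fewer than check
lx <- lagged[-1]
lx
#> [1] 0 1 1 0 0 1 0 0
length(check) - length(lx)
#> [1] 1
```

Each element of lx is then the previous value of check for positions 2 to 9, which is what the df[idx - 1, 2] lookup does in the original loop.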

Here is the code, matching your expected output.

library(tidyverse)

df = structure(list(id = c(8, 8, 8, 8, 8, 8, 8, 8, 8), 
                    check = c(0, 1, 1, 0, 0, 1, 0, 0, 0), 
                    count_x = c(0, 1, 2, 2, 2, 3, 3, 3, 3)), 
               row.names = c(NA, -9L), class = "data.frame")


df %>% 
  mutate(
    count_y = accumulate2(check, lag(check)[-1], function(co, x, lx){
      case_when(
        x==0 ~ co,
        x==1 & lx==0 ~ 1,
        x==1 & lx==1 ~ co+1,
        TRUE ~ 999 #error value in case of unexpected input
      )
    })
  )
#>   id check count_x count_y
#> 1  8     0       0       0
#> 2  8     1       1       1
#> 3  8     1       2       2
#> 4  8     0       2       2
#> 5  8     0       2       2
#> 6  8     1       3       1
#> 7  8     0       3       1
#> 8  8     0       3       1
#> 9  8     0       3       1

Created on 2021-05-05 by the reprex package (v2.0.0)
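Since the real data has many IDs, the same call can be placed inside a grouped mutate() so that the count restarts for each id. The following is a sketch under a few assumptions: the second id (9) and its check values are invented for illustration, step is a hypothetical helper that replaces case_when() with a plain if/else, and unlist() is added so that count_y is a numeric column whether or not the installed purrr version simplifies the result of accumulate2().

```r
library(dplyr)
library(purrr)

# Illustrative data: id 8 is from the question, id 9 is made up
df_multi <- data.frame(
  id    = rep(c(8, 9), each = 9),
  check = c(0, 1, 1, 0, 0, 1, 0, 0, 0,
            1, 1, 0, 1, 1, 1, 0, 0, 1)
)

# Hypothetical helper: co = running count, x = current check, lx = previous check
step <- function(co, x, lx) {
  if (x == 0) co else if (lx == 0) 1 else co + 1
}

result <- df_multi %>%
  group_by(id) %>%
  # lag(check) is evaluated per group, so the count never carries across ids
  mutate(count_y = unlist(accumulate2(check, lag(check)[-1], step))) %>%
  ungroup()

result$count_y
#>  [1] 0 1 2 2 2 1 1 1 1 1 2 2 1 2 3 3 3 1
```

This assumes check contains only 0 and 1 with no missing values; an NA in check would make the if() condition fail.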
