I have a question regarding a for-loop within R's dplyr. Imagine I have the following dataframe:

id <- c(rep(8, 9))
check <- c(0,1,1,0,0,1,0,0,0)
df <- data.frame(id, check)
df$count_x <- cumsum(df$check)
df$count_y <- NA
df$count_y[1] <- ifelse(df$check[1] == 0, 0, 1)
co <- df$count_y[1]

I want to fill the variable count_y using the adjusted cumulative count below:

for (idx in 2:nrow(df)){
   if(df[idx, 2] == 1 & df[idx - 1, 2] == 0){
     co <- 1
     df[idx, 4] <- co 
   } else if (df[idx, 2] == 1 & df[idx - 1, 2] == 1){
     co <- co + 1
     df[idx, 4] <- co
   } else if (df[idx, 2] == 0){
     df[idx, 4] <- co
   } 
 }

The output of this for-loop is correct. However, in my current data set, I have many IDs, and using a for loop to iterate over the IDs will take too much time. I'm trying to use the functionality of dplyr to speed up the process.

library(dplyr)

id <- c(rep(8, 9))
check <- c(0,1,1,0,0,1,0,0,0)
df <- data.frame(id, check)
df <- df %>% group_by(id) %>% mutate(count_x = cumsum(check),
                                 count_y = NA) %>% ungroup()
df <- df %>% group_by(id) %>% mutate(count_y = replace(count_y, 1, ifelse(check[1] == 0, 0 , 1)))

count_n <- function(df){

  co <- df$count_y[1]

  for (idx in 2:nrow(df)){
     if(df[idx, 2] == 1 & df[idx - 1, 2] == 0){
       co <- 1
       df[idx, 4] <- co 
     } else if (df[idx, 2] == 1 & df[idx - 1, 2] == 1){
       co <- co + 1
       df[idx, 4] <- co
     } else if (df[idx, 2] == 0){
       df[idx, 4] <- co
     }
   }
 }

I want to use mutate to call the function count_n to fill count_y as described above. I'm aware that I'm passing just one variable, whereas the function needs a whole data frame because it relies on the columns 'check' (column number 2) and 'count_y' (column number 4). I have tried multiple options (mutate_at, mutate_all, etc.) but I couldn't make it work. What can I do differently?

df <- df %>% group_by(id) %>% mutate_at(vars(count_y), ~count_n(.)) 

2 Answers

2
votes

I think this is the perfect case to use purrr::accumulate2().

purrr::accumulate() is often used to calculate conditional cumulative sums. Its second argument is a function of two arguments: the output accumulated so far, co, and the value currently being evaluated, x.
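To make that two-argument contract concrete, here is a minimal sketch on a toy vector. The input and the reset rule are made up for illustration and are not the logic from the question: the running count goes back to zero on every 0 and grows by one on every 1.

```r
library(purrr)

# Toy input, made up for illustration only.
x <- c(0, 1, 1, 0, 1)

# co is the value accumulated so far, x is the element currently visited.
# The first element of the input is used as the starting value of co.
run_length <- accumulate(x, function(co, x) if (x == 0) 0 else co + 1)

run_length
#> [1] 0 1 2 0 1
```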

purrr::accumulate2() lets us iterate over a second variable in parallel, and here we use lag(check) as lx. The tricky part is that this second variable must be one item shorter than the first: the first element of check is only used as the initial value, so the function is never called for it and no lagged value is needed there.
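The alignment of the two inputs can be checked on a small made-up example (the numbers are arbitrary). With x of length 3 and y of length 2, the function is called twice: first with x[2] and y[1], then with x[3] and y[2], while x[1] only serves as the starting value.

```r
library(purrr)

x <- c(1, 2, 3)   # first element is the starting value
y <- c(10, 100)   # one element shorter than x

# Called as f(acc = 1, x = 2, y = 10), then f(acc = 21, x = 3, y = 100).
res <- accumulate2(x, y, function(acc, x, y) acc + x * y)

# unlist() is a precaution: depending on the purrr version, the result may
# come back as a list rather than an atomic vector.
unlist(res)
#> [1]   1  21 321
```

This is why lag(check)[-1] lines up: once the leading NA is dropped, its i-th element is check[i], which is the value preceding check[i + 1].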

Here is the code, matching your expected output.

library(tidyverse)

df = structure(list(id = c(8, 8, 8, 8, 8, 8, 8, 8, 8), 
                    check = c(0, 1, 1, 0, 0, 1, 0, 0, 0), 
                    count_x = c(0, 1, 2, 2, 2, 3, 3, 3, 3)), 
               row.names = c(NA, -9L), class = "data.frame")


df %>% 
  mutate(
    count_y = accumulate2(check, lag(check)[-1], function(co, x, lx){
      case_when(
        x==0 ~ co,
        x==1 & lx==0 ~ 1,
        x==1 & lx==1 ~ co+1,
        TRUE ~ 999 #error value in case of unexpected input
      )
    })
  )
#>   id check count_x count_y
#> 1  8     0       0       0
#> 2  8     1       1       1
#> 3  8     1       2       2
#> 4  8     0       2       2
#> 5  8     0       2       2
#> 6  8     1       3       1
#> 7  8     0       3       1
#> 8  8     0       3       1
#> 9  8     0       3       1

Created on 2021-05-05 by the reprex package (v2.0.0)
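Since the question involves many ids, the same call can presumably be run per group by adding group_by(). The following is a minimal sketch on made-up data with a second id (9), not part of the reprex above. unlist() is added as a precaution because, depending on the purrr version, accumulate2() may return a list, which would make count_y a list column.

```r
library(dplyr)
library(purrr)

# Made-up data: id 9 is added only to show that the count restarts per group.
df2 <- data.frame(
  id    = c(rep(8, 9), rep(9, 4)),
  check = c(0, 1, 1, 0, 0, 1, 0, 0, 0,  1, 1, 0, 1)
)

res <- df2 %>%
  group_by(id) %>%
  mutate(
    # lag() is evaluated within each group, so no value leaks across ids.
    count_y = unlist(accumulate2(check, lag(check)[-1], function(co, x, lx) {
      case_when(
        x == 0           ~ co,
        x == 1 & lx == 0 ~ 1,
        x == 1 & lx == 1 ~ co + 1,
        TRUE             ~ 999  # sentinel value for unexpected input
      )
    }))
  ) %>%
  ungroup()

res$count_y
#>  [1] 0 1 2 2 2 1 1 1 1 1 2 2 1
```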

1
votes

The first issue is that your function doesn't return anything: its last expression is the for loop, which evaluates to NULL, so the modified data frame is discarded. The second issue is that you don't need mutate_at (or even mutate, which would be more appropriate for a single variable) when the function modifies the entire tibble. The simplest way to get it working is to add a return statement and call the function directly in the pipe, like so:

count_n <- function(df){
  
  co <- df$count_y[1]
  
  for (idx in 2:nrow(df)){
    if(df[idx, 2] == 1 & df[idx - 1, 2] == 0){
      co <- 1
      df[idx, 4] <- co 
    } else if (df[idx, 2] == 1 & df[idx - 1, 2] == 1){
      co <- co + 1
      df[idx, 4] <- co
    } else if (df[idx, 2] == 0){
      df[idx, 4] <- co
    }
  }
  
  return(df)
}

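# Note: count_n() receives the whole grouped tibble here; group_by() alone
# does not make it run once per id. With a single id, as in the example,
# that makes no difference.
df %>% group_by(id) %>% count_n(.)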
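If the loop should run separately for each id, one option is dplyr::group_modify(), which calls a function once per group. The sketch below is a variant under stated assumptions rather than the code above: count_n_by_name is a hypothetical rewrite that refers to columns by name, because group_modify() passes each group without its grouping column by default, so positional indices such as 2 and 4 would point at different columns. The second id (9) is made-up data.

```r
library(dplyr)

# Hypothetical by-name variant of count_n(); expects a column called check.
count_n_by_name <- function(d) {
  n <- nrow(d)
  count_y <- numeric(n)

  # Same initial value as in the question: 0 if the first check is 0, else 1.
  co <- if (d$check[1] == 0) 0 else 1
  count_y[1] <- co

  # seq_len(n)[-1] is empty for a one-row group, so the loop is skipped safely.
  for (idx in seq_len(n)[-1]) {
    if (d$check[idx] == 1) {
      # Restart at 1 after a 0, otherwise keep counting.
      co <- if (d$check[idx - 1] == 0) 1 else co + 1
    }
    count_y[idx] <- co
  }

  d$count_y <- count_y
  d
}

# Made-up data with two ids.
df2 <- data.frame(
  id    = c(rep(8, 9), rep(9, 4)),
  check = c(0, 1, 1, 0, 0, 1, 0, 0, 0,  1, 1, 0, 1)
)

res <- df2 %>%
  group_by(id) %>%
  group_modify(~ count_n_by_name(.x)) %>%
  ungroup()

res$count_y
#>  [1] 0 1 2 2 2 1 1 1 1 1 2 2 1
```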

However, I would use Dan's answer above because it's much cleaner and has the advantage of not running a for loop, which isn't very "R". :)