1
votes

I am attempting to create a new column that is a conditional difference based on a column of TRUE and FALSE. If the lag 1 row is FALSE then we should compute a difference from either the beginning or the last TRUE row, whichever is later in the dataframe, however if the lag 1 row is TRUE then the difference should be should be reset.

I would like to use the dplyr::mutate function as much as possible. I'm attempting to use dplyr::lag with an ifelse() but I'm having a hard time with the conditions

 dat <- data.frame(logic_col = c(F, F, T, T, F, F, F, T, F),
                   time_col = c(200, 435, 567, 895, 1012, 1345, 1456, 1700, 1900),
                   expected_col_unseen = c(200, 435, 567, 328, 117, 450, 561, 805, 200))
2
The expected column is inconsistent with "cumulative sum". Since row 2 is false, then row 3 expected should be 200+435+567=1202, not 1002 as you have it. From there, it seems as if your expected column is not even close, as row 3 is true, so row 4 should be 895. I think you may be trying to subtract the previous row's time_col from the expected, but even then the cumulative sum doesn't carry-over correctly. Can you either fix your expected data or expand on how you are calculating it?r2evans
Thank you for pointing this out. I reworded the question to be more accurate, I'm not trying to do a cumulative sum but a difference. from the last {TRUE} value.jackbio
I agree with your rows 1-5, but since rows 5-6 are both false, shouldn't row 6 expected be equal to row 6 time_col?r2evans
Minor detail, but if you are using dplyr, I encourage the use of dplyr::if_else (vice base ifelse) as it will guard against common mistakes (protect you from yourself, so to speak).r2evans

2 Answers

1
votes

We can do something like this using tidyr and dplyr:

library(dplyr)
library(tidyr)

dat %>% 
  mutate(tmp = lag(logic_col * time_col),
         tmp = ifelse(tmp==0, NA,tmp)) %>% 
  tidyr::fill(tmp, .direction = c("down")) %>% 
  mutate(out = time_col - ifelse(is.na(tmp), 0,tmp)) %>% 
  select(-tmp)

#>   logic_col time_col expected_col_unseen out
#> 1     FALSE      200                 200 200
#> 2     FALSE      435                 435 435
#> 3      TRUE      567                 567 567
#> 4      TRUE      895                 328 328
#> 5     FALSE     1012                 117 117
#> 6     FALSE     1345                 450 450
#> 7     FALSE     1456                 561 561
#> 8      TRUE     1700                 805 805
#> 9     FALSE     1900                 200 200
1
votes

I'll throw this out there in case manually calculating it is just confusing things:

library(dplyr)
dat %>%
  group_by(grp = cumsum(lag(!logic_col, default=FALSE))) %>%
  mutate(out = c(time_col[1], diff(time_col))) %>%
  ungroup()
# # A tibble: 9 x 5
#   logic_col time_col expected_col_unseen   grp   out
#   <lgl>        <dbl>               <dbl> <int> <dbl>
# 1 FALSE          200                 200     0   200
# 2 FALSE          435                 435     1   435
# 3 TRUE           567                 567     2   567
# 4 TRUE           895                 328     2   328
# 5 FALSE         1012                 117     2   117
# 6 FALSE         1345                 450     3  1345
# 7 FALSE         1456                 561     4  1456
# 8 TRUE          1700                 805     5  1700
# 9 FALSE         1900                 200     5   200