1
votes

I have a dataset that looks something like this

df <- data.frame("id" = c("Alpha", "Alpha", "Alpha","Alpha","Beta","Beta","Beta","Beta"), 
                 "Year" = c(1970,1971,1972,1973,1974,1977,1978,1990), 
                 "Group" = c(1,NA,1,NA,NA,2,2,NA),
                 "Val" = c(2,3,3,5,2,5,3,5))

And I would like to create a cumulative sum of "Val". I know how to do the simple cumulative sum

df <- df %>% group_by(id) %>% mutate(cumval=cumsum(Val))

However, I would like my final data to look like this

final <- data.frame("id" = c("Alpha", "Alpha", "Alpha","Alpha","Beta","Beta","Beta","Beta"), 
                 "Year" = c(1970,1971,1972,1973,1974,1977,1978,1990), 
                 "Group" = c(1,NA,1,NA,NA,2,2,NA),
                 "Val" = c(2,3,3,5,2,5,3,5),
                 "cumval" = c(2,5,6,11,2,7,5,10))

The basic idea is that when two "Val"'s are of the same "Group" the one happening later (Year) substitutes the previous one. For instance, in the sample dataset, observation 3 has a "cumval" of 6 rather than 8 because of the "Val" at time 1972 replaced the "Val" at time 1970. similarly for Beta.

I thank you in advance for your help

1
Something like df %>% group_by(id) %>% mutate(cumval = cumsum(replace(Val, duplicated(Val), first(Val)))) - akrun
Because val at time 1972 (3) is "group" 1 like val at time 1970 (2). Basically, for val within the same group the latter value takes the place of the previous. Hence here val 1972 takes the place of val 1970 ( and it is summed with val 1971) that is a different group (NA) - Alex
sure: obs 1: cumval = val =2, obs 2: cumval = val(obs1) + val(obs2) =2+3=5, obs 3: cumval = val (obs2) +val(obs3) = 3+3 = 6... etc..the idea is that because obs 1 and 3 are of the same group, and id obs 3 takes the place of obs 1 - Alex
Ok, I tried couple of things, but still the 'Beta' part is not matching - akrun
that is unfortunate! thanks for trying someting anyways.. - Alex

1 Answers

1
votes

In my head, this requires a for loop. First we split the dataframe by the id column into a list of two. Then we create two empty lists. In the og list, we will put the row where the first unique non NA group identifier occurs. For alpha this is the first row and for Beta this is the second row. We will use this to subtract from the cumulative sum when the value gets substituted.

mylist <- split(df, f = df$id)

og <- list()
vals <- list()

df_num <- 1

We shall use a nested loop, the outer loop loops over each object (dataframe in this case) in the list and the inner loop loops over each value in the Group column.

We need to keep track of the row numbers, which we do with the r variable. We initially set it to 0 outside the for loop so we add 1. First we check if we are in the first row of the data frame, in which case the cumulative sum is simply equal to the value in the first row of the Val column. Then within the if test, we use another if test to check if the Group id is an NA. If it isn't then this is the first occurrence of the number that will indicate a substitution of the current value if this number appears again. So we save the number to the temporary variable temp. We also extract and save the row that contains the value to the og list.

After this it, goes to the next iteration. We check if the current Group value is NA. If it is, then we just add the value to the cumulative sum. If it isn't equal to NA, we check if the value is NA and is equal to the value stored in temp. If both are true, then this means we need to substitute. We extract the original value stored in the og list and save it as old. We then subtract the old value from the cumulative sum and add the current value. We also replace the orginal value in og with the current replacement value. This is because if the value needs to replaced again, we will need to subtract the current value and not the original value.

If j is NA but it is not equal to temp, then this is a new instance of Group. So we save the row with the original value to og list, and save the Group. The sum continues as normal as this is not an instance of replacing a value. Note that the variable x that is used to count the elements in the og list is only incremented when a new occurrence is added to the list. Thus, og[[x-1]] will always be the replacement value.

for (my_df in mylist) {

  x <- 1
  r <- 0

  for (j in my_df$Group) {

    r <- r + 1

    if (r == 1) {

      vals[[1]] <- my_df$Val[1]

      if (is.na(j)==FALSE) {
        og[[x]] <- df[r, c('Group', 'Val'), drop = FALSE]
        temp <- j 
        x <- x + 1
      }

      next
    }

    if (is.na(j)==TRUE) {

      vals[[r]] <- vals[[r-1]] + my_df$Val[r]

    } else if (is.na(j)==FALSE & j==temp) {

      old <- og[[x-1]]
      old <- old[,2]

      vals[[r]] <- vals[[r-1]] - old + df$Val[r]
      og[[x-1]] <- df[r, c('Group', 'Val'), drop = FALSE]

    } else {

      vals[[r]] <- vals[[r-1]] + my_df$Val[r]
      og[[x]] <- my_df[r, c('Group', 'Val')] 
      temp <- j
      x <- x + 1

    }

    }

  cumval <- unlist(vals) %>% as.data.frame()
  colnames(cumval) <- 'cumval'
  my_df <- cbind(my_df, cumval)
  mylist[[df_num]] <- my_df
  df_num <- df_num + 1
}

Lastly, we combine the two dataframes in the list by binding them on rows with bind_rows from the dplyr package. Then I check if the Final dataframe is identical to your desired output with identical() and it evaluates to TRUE

final_df <- bind_rows(mylist)
identical(final_df, final)
[1] TRUE