1
votes

I have the following table:

day_1 = as.Date('2020-03-02')
day_2 = as.Date('2021-03-02')
day_3 = as.Date('2021-02-02')
data_example <- data.frame(pay_date = c(rep(day_1,4),rep(day_2,3),rep(day_3,3)),
                           arrears = c(0,0:2,rep(0:2,2)),
                           to_pay = c(2:5,2:4,4:6),
                           paid = c(2:5,2,3,0,4,0,0))

The data is:

> data_example
     pay_date arrears to_pay paid
1  2020-03-02       0      2    2
2  2020-03-02       0      3    3
3  2020-03-02       1      4    4
4  2020-03-02       2      5    5
5  2021-03-02       0      2    2
6  2021-03-02       1      3    3
7  2021-03-02       2      4    0
8  2021-02-02       0      4    4
9  2021-02-02       1      5    0
10 2021-02-02       2      6    0

I want to aggregate the data by pay_date and arrears, and accumulate the amounts to_pay (in col1) and paid (in col2). Then I want to create col 3, which is the cumulate sum of col 1. For example, for '2020-03-02', for arrears 0, the cumulative sum is 5. For arrears 1, the cumulative sum is 5 (from arrears 0) + 4 (from arrears 1) = 9. For arrears 2, the cumulative sum is 5 + 4 + 5 (from arrears 2:

  pay_date   arrears  col1  col2  col3  
* <date>       <dbl> <int> <dbl> <int> 
1 2020-03-02       0     5     5     5  
2 2020-03-02       1     4     4     9 
3 2020-03-02       2     5     5    14 

The next calculation is col 4, which is the cumulative sum of col 2.

Col 5 is the total sum of to_pay by pay_date.

The payment performance is col4/col5. This is the expected result:

  pay_date   arrears  col1  col2  col3  col4  col5 `Payment performance`
* <date>       <dbl> <int> <dbl> <int> <dbl> <int>                 <dbl>
1 2020-03-02       0     5     5     5     5    14                  0.36
2 2020-03-02       1     4     4     9     9    14                  0.64
3 2020-03-02       2     5     5    14    14    14                  1.00
4 2021-02-02       0     4     4     4     4    15                  0.27
5 2021-02-02       1     5     0     9     4    15                  0.27
6 2021-02-02       2     6     0    15     4    15                  0.27
7 2021-03-02       0     2     2     2     2     9                  0.22
8 2021-03-02       1     3     3     5     5     9                  0.56
9 2021-03-02       2     4     0     9     5     9                  0.56

I tried this code but I failed to group the sums:

data_example %>% 
  group_by(pay_date, arrears) %>%
  summarise(col1 = sum(to_pay),
            col2 = sum(paid),
            .groups = "drop") %>%
  mutate(col3 = cumsum(col1),
         col4 = cumsum(col2),
         col5 = sum(col1),
         `Payment performance` = round(col4/col5, digits = 2))

Please, can you help me? In this case my problem is cumsum, because it sums all the values in the column, but I want to sum it by pay_date. Any help will be greatly appreciated.

2
You have groups = "drop", but then it sounds like you want to do some sort of aggregation. It's not clear from your question how you want col3 to be summed. Perhaps you could add some clarification?Ian Campbell
Hello @IanCampbell, let me clarify the question as requested. Thank you!Manu

2 Answers

3
votes

Isn't it just removing .groups = "drop" which gets your desired result

data_example %>% 
  group_by(pay_date, arrears) %>%
  summarise(col1 = sum(to_pay),
            col2 = sum(paid)
            ) %>%
  mutate(col3 = cumsum(col1),
         col4 = cumsum(col2),
         col5 = sum(col1),
         `Payment performance` = round(col4/col5, digits = 2))

 #A tibble: 9 x 8
# Groups:   pay_date [3]
  pay_date   arrears  col1  col2  col3  col4  col5 `Payment performance`
  <date>       <dbl> <int> <dbl> <int> <dbl> <int>                 <dbl>
1 2020-03-02       0     5     5     5     5    14                  0.36
2 2020-03-02       1     4     4     9     9    14                  0.64
3 2020-03-02       2     5     5    14    14    14                  1   
4 2021-02-02       0     4     4     4     4    15                  0.27
5 2021-02-02       1     5     0     9     4    15                  0.27
6 2021-02-02       2     6     0    15     4    15                  0.27
7 2021-03-02       0     2     2     2     2     9                  0.22
8 2021-03-02       1     3     3     5     5     9                  0.56
9 2021-03-02       2     4     0     9     5     9                  0.56

Another solution using data.table :

library(data.table)
setDT(data_example)
df1 <- data_example[, .(col1 = sum(to_pay),
                        col2 = sum(paid)),
                    .(pay_date, arrears)][order(pay_date)]

df1[, `:=`(col3 = cumsum(col1), col4 = cumsum(col2)), pay_date]
df1[, col5 := sum(col1), pay_date]

#Finally
df1[, `Payment performance` := round(col4 / col5, digits = 2)]
df1
#Same results
     pay_date arrears col1 col2 col3 col4 col5 Payment performance
1: 2020-03-02       0    5    5    5    5   14                0.36
2: 2020-03-02       1    4    4    9    9   14                0.64
3: 2020-03-02       2    5    5   14   14   14                1.00
4: 2021-02-02       0    4    4    4    4   15                0.27
5: 2021-02-02       1    5    0    9    4   15                0.27
6: 2021-02-02       2    6    0   15    4   15                0.27
7: 2021-03-02       0    2    2    2    2    9                0.22
8: 2021-03-02       1    3    3    5    5    9                0.56
9: 2021-03-02       2    4    0    9    5    9                0.56
1
votes

add one more level group_by(tmp = data.table::rleid(pay_date)) after summarise

data_example %>% 
  group_by(pay_date, arrears) %>%
  summarise(col1 = sum(to_pay),
            col2 = sum(paid),
            .groups = "drop") %>%
  group_by(tmp = data.table::rleid(pay_date)) %>% 
  mutate(col3 = cumsum(col1),
         col4 = cumsum(col2),
         col5 = sum(col1),
         `Payment performance` = round(col4/col5, digits = 2))

# A tibble: 9 x 9
# Groups:   tmp [3]
  pay_date   arrears  col1  col2   tmp  col3  col4  col5 `Payment performance`
  <date>       <dbl> <int> <dbl> <int> <int> <dbl> <int>                 <dbl>
1 2020-03-02       0     5     5     1     5     5    14                  0.36
2 2020-03-02       1     4     4     1     9     9    14                  0.64
3 2020-03-02       2     5     5     1    14    14    14                  1   
4 2021-02-02       0     4     4     2     4     4    15                  0.27
5 2021-02-02       1     5     0     2     9     4    15                  0.27
6 2021-02-02       2     6     0     2    15     4    15                  0.27
7 2021-03-02       0     2     2     3     2     2     9                  0.22
8 2021-03-02       1     3     3     3     5     5     9                  0.56
9 2021-03-02       2     4     0     3     9     5     9                  0.56