Calculating difference between dates based on grouping one or more columns

Question

A sample of my dataset is as below:

| id |       Date | Buyer    |
|:--:|-----------:|----------|
|  9 | 11/29/2018 | Jenny    |
|  9 | 11/29/2018 | Jenny    |
|  9 | 11/29/2018 | Jenny    |
| 4  | 5/30/2018  | Chang    |
| 4  | 7/4/2018   | Chang    |
| 4  | 8/17/2018  | Chang    |
| 5  | 5/25/2018  | Chunfei  |
| 5  | 2/13/2019  | Chunfei  |
| 5  | 2/16/2019  | Chunfei  |
| 5  | 2/16/2019  | Chunfei  |
| 5  | 2/23/2019  | Chunfei  |
| 5  | 2/25/2019  | Chunfei  |
| 8  | 2/28/2019  | Chunfei  |
| 8  | 2/28/2019  | Chunfei  |

I have two sets of questions with this dataset:

I need to calculate the difference between dates but this difference will be calculated based on grouping 'Buyer' and 'id', which means, the date difference for the Buyer 'Jenny' and Id '9' will be one group, Buyer 'Chang' with Id '4' will be another group and Buyer 'Chunfei' with Id '5' will be another group and 'Chunfei' with Id '8' will be another group. So, the output will be:

| id |       Date | Buyer_id | Diff |
|:--:|-----------:|----------|------|
|  9 | 11/29/2018 | Jenny    | NA   |
|  9 | 11/29/2018 | Jenny    | 0    |
|  9 | 11/29/2018 | Jenny    | 0    |
| 4  | 5/30/2018  | Chang    | NA   |
| 4  | 7/4/2018   | Chang    | 35   |
| 4  | 8/17/2018  | Chang    | 44   |
| 5  | 5/25/2018  | Chunfei  | NA   |
| 5  | 2/13/2019  | Chunfei  | 264  |
| 5  | 2/16/2019  | Chunfei  | 3    |
| 5  | 2/16/2019  | Chunfei  | 0    |
| 5  | 2/23/2019  | Chunfei  | 7    |
| 5  | 2/25/2019  | Chunfei  | 2    |
| 8  | 2/28/2019  | Chunfei  | NA   |
| 8  | 2/28/2019  | Chunfei  | 0    |

The issue is that I'm not understanding why the group_by isn't working. The following code subtracts the consecutive rows rather than grouping them for same buyer and id and then subtracting.

df=data.frame(id=c("9","9","9","4","4","4","5","5","5","5","5","5","8","8"), 
              Date=c("11/29/2018","11/29/2018","11/29/2018","5/30/2018","7/4/2018", 
                      "8/17/2018","5/25/2018","2/13/2019","2/16/2019","2/16/2019","2/23/2019",
                      "2/25/2019","2/28/2019","2/28/2019"),Buyer=c("Jenny","Jenny","Jenny",
                      "Chang","Chang","Chang","Chunfei","Chunfei","Chunfei","Chunfei","Chunfei",
                      "Chunfei","Chunfei","Chunfei"))
df$id=as.numeric(as.character(df$id))
df$Date=as.Date(df$Date, "%m/%d/%Y")
df$Buyer=as.character(df$Buyer)

df1=df %>% group_by(Buyer,id) %>%
  mutate(diff=as.numeric(difftime(Date,lag(Date),units='days')))

After calculating the date difference, I need to filter those records whose differences between dates are 5 days. In the above example, the date difference between "5/25/2018", "2/13/2019","2/16/2019","2/16/2019","2/23/2019","2/25/2019" will be NA,264,3,0,7,2. However, if I provide a filter for n<6, I would miss on the dates "2/13/2019" and "2/23/2019". These dates will be important to retain in the final output, because even though the difference between dates "2/13/2019" and "5/25/2018" is 264, the difference between "2/16/2019" and "2/13/2019" is 3. Similarly, even though the difference between "2/16/2019" and "2/23/2019" is 7, the difference between "2/23/2019" and "2/25/2019" is 2. So,I need to retain these dates. How can this be achieved?

We can mask the column 'diff' in the final output and it should look like below:

| id |    Date    | Buyer_id |
|----|:----------:|---------:|
| 9  | 11/29/2018 |    Jenny |
| 9  | 11/29/2018 |    Jenny |
| 9  | 11/29/2018 |    Jenny |
| 5  | 2/13/2019  | Chunfei  |
| 5  | 2/16/2019  | Chunfei  |
| 5  | 2/16/2019  | Chunfei  |
| 5  | 2/23/2019  | Chunfei  |
| 5  | 2/25/2019  | Chunfei  |
| 8  | 2/28/2019  | Chunfei  |
| 8  | 2/28/2019  | Chunfei  |

Ronak Shah Ronak Shah · Accepted Answer · 2019-09-11T15:09:21

We can use diff to subtract Date and select groups where there is at-least one value which is less than equal to 5 days.

library(dplyr)

df %>%
  group_by(id, Buyer) %>%
  filter(any(diff(Date) <= 5))

#      id Date       Buyer  
#   <dbl> <date>     <chr>  
# 1     9 2018-11-29 Jenny  
# 2     9 2018-11-29 Jenny  
# 3     9 2018-11-29 Jenny  
# 4     5 2018-05-25 Chunfei
# 5     5 2019-02-13 Chunfei
# 6     5 2019-02-16 Chunfei
# 7     5 2019-02-16 Chunfei
# 8     5 2019-02-23 Chunfei
# 9     5 2019-02-25 Chunfei
#10     8 2019-02-28 Chunfei
#11     8 2019-02-28 Chunfei

After re-reading the question I think you might not be looking to filter entire groups but only those rows which have difference of 5 days. We can get indices which have diff value of less than 5 and select it's previous index as well.

df %>%
  group_by(id, Buyer) %>%
  mutate(diff = c(NA, diff(Date))) %>%
  slice({i1 <- which(diff <= 5); unique(c(i1, i1-1))}) %>%
  select(-diff)

#      id Date       Buyer  
#   <dbl> <date>     <chr>  
# 1     5 2019-02-16 Chunfei
# 2     5 2019-02-16 Chunfei
# 3     5 2019-02-25 Chunfei
# 4     5 2019-02-13 Chunfei
# 5     5 2019-02-23 Chunfei
# 6     8 2019-02-28 Chunfei
# 7     8 2019-02-28 Chunfei
# 8     9 2018-11-29 Jenny  
# 9     9 2018-11-29 Jenny  
#10     9 2018-11-29 Jenny

data

df <- structure(list(id = c(9, 9, 9, 4, 4, 4, 5, 5, 5, 5, 5, 5, 8, 
8), Date = structure(c(17864, 17864, 17864, 17681, 17716, 17760, 
17676, 17940, 17943, 17943, 17950, 17952, 17955, 17955), class = "Date"), 
Buyer = c("Jenny", "Jenny", "Jenny", "Chang", "Chang", "Chang", 
"Chunfei", "Chunfei", "Chunfei", "Chunfei", "Chunfei", "Chunfei", 
"Chunfei", "Chunfei")), row.names = c(NA, -14L), class = "data.frame")

Calculating difference between dates based on grouping one or more columns

1 Answers