A sample of my dataset is shown below:
| id | Date | Buyer |
|:--:|-----------:|----------|
| 9 | 11/29/2018 | Jenny |
| 9 | 11/29/2018 | Jenny |
| 9 | 11/29/2018 | Jenny |
| 4 | 5/30/2018 | Chang |
| 4 | 7/4/2018 | Chang |
| 4 | 8/17/2018 | Chang |
| 5 | 5/25/2018 | Chunfei |
| 5 | 2/13/2019 | Chunfei |
| 5 | 2/16/2019 | Chunfei |
| 5 | 2/16/2019 | Chunfei |
| 5 | 2/23/2019 | Chunfei |
| 5 | 2/25/2019 | Chunfei |
| 8 | 2/28/2019 | Chunfei |
| 8 | 2/28/2019 | Chunfei |
I have two questions about this dataset:
- I need to calculate the difference between consecutive dates within each group defined by 'Buyer' and 'id'. That is, Buyer 'Jenny' with id '9' is one group, Buyer 'Chang' with id '4' is another, Buyer 'Chunfei' with id '5' is a third, and Buyer 'Chunfei' with id '8' is a fourth. The expected output is:
| id | Date | Buyer | Diff |
|:--:|-----------:|----------|------|
| 9 | 11/29/2018 | Jenny | NA |
| 9 | 11/29/2018 | Jenny | 0 |
| 9 | 11/29/2018 | Jenny | 0 |
| 4 | 5/30/2018 | Chang | NA |
| 4 | 7/4/2018 | Chang | 35 |
| 4 | 8/17/2018 | Chang | 44 |
| 5 | 5/25/2018 | Chunfei | NA |
| 5 | 2/13/2019 | Chunfei | 264 |
| 5 | 2/16/2019 | Chunfei | 3 |
| 5 | 2/16/2019 | Chunfei | 0 |
| 5 | 2/23/2019 | Chunfei | 7 |
| 5 | 2/25/2019 | Chunfei | 2 |
| 8 | 2/28/2019 | Chunfei | NA |
| 8 | 2/28/2019 | Chunfei | 0 |
The problem is that group_by does not seem to take effect. The following code subtracts consecutive rows across the whole data frame instead of grouping by Buyer and id first and then subtracting within each group.

    library(dplyr)  # provides %>%, group_by, mutate, lag

    # Sample data; id and Date start out as character strings
    df <- data.frame(id = c("9","9","9","4","4","4","5","5","5","5","5","5","8","8"),
                     Date = c("11/29/2018","11/29/2018","11/29/2018","5/30/2018","7/4/2018",
                              "8/17/2018","5/25/2018","2/13/2019","2/16/2019","2/16/2019",
                              "2/23/2019","2/25/2019","2/28/2019","2/28/2019"),
                     Buyer = c("Jenny","Jenny","Jenny","Chang","Chang","Chang","Chunfei",
                               "Chunfei","Chunfei","Chunfei","Chunfei","Chunfei","Chunfei","Chunfei"))
    df$id <- as.numeric(as.character(df$id))
    df$Date <- as.Date(df$Date, "%m/%d/%Y")
    df$Buyer <- as.character(df$Buyer)

    # Intended: difference in days from the previous row within each (Buyer, id) group
    df1 <- df %>%
      group_by(Buyer, id) %>%
      mutate(diff = as.numeric(difftime(Date, lag(Date), units = "days")))

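One thing worth ruling out, as an assumption since the post does not show which packages are attached: if plyr is loaded after dplyr, plyr's mutate masks dplyr's and ignores the grouping, which would give exactly this row-by-row subtraction. A minimal sketch that sidesteps any masking by namespace-qualifying the calls (the data frame is rebuilt here only to keep the example self-contained):

```r
library(dplyr)

# Same sample data as in the question, with types set up front
df <- data.frame(
  id = c(9, 9, 9, 4, 4, 4, 5, 5, 5, 5, 5, 5, 8, 8),
  Date = as.Date(
    c("11/29/2018", "11/29/2018", "11/29/2018", "5/30/2018", "7/4/2018",
      "8/17/2018", "5/25/2018", "2/13/2019", "2/16/2019", "2/16/2019",
      "2/23/2019", "2/25/2019", "2/28/2019", "2/28/2019"),
    format = "%m/%d/%Y"
  ),
  Buyer = c("Jenny", "Jenny", "Jenny", "Chang", "Chang", "Chang",
            "Chunfei", "Chunfei", "Chunfei", "Chunfei", "Chunfei",
            "Chunfei", "Chunfei", "Chunfei"),
  stringsAsFactors = FALSE
)

# Explicit dplyr:: prefixes guarantee the grouped versions are used,
# regardless of what other packages are attached
df1 <- df %>%
  dplyr::group_by(Buyer, id) %>%
  dplyr::mutate(
    diff = as.numeric(difftime(Date, dplyr::lag(Date), units = "days"))
  ) %>%
  dplyr::ungroup()
```

If this version gives NA at the start of every (Buyer, id) group while the unqualified one does not, masking is the likely culprit. Running `conflicts()` in the session lists which function names are masked.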
- After calculating the date differences, I need to keep only the records that lie within 5 days of a neighbouring record in the same group. In the example above, the differences for the dates "5/25/2018", "2/13/2019", "2/16/2019", "2/16/2019", "2/23/2019", "2/25/2019" are NA, 264, 3, 0, 7, 2. If I simply filter on diff < 6, I lose the dates "2/13/2019" and "2/23/2019". These dates need to stay in the final output: although the difference between "2/13/2019" and "5/25/2018" is 264, the difference between "2/16/2019" and "2/13/2019" is 3. Similarly, although the difference between "2/16/2019" and "2/23/2019" is 7, the difference between "2/23/2019" and "2/25/2019" is 2. How can this be achieved?
The 'diff' column can be dropped from the final output, which should look like this:
| id | Date | Buyer |
|----|:----------:|---------:|
| 9 | 11/29/2018 | Jenny |
| 9 | 11/29/2018 | Jenny |
| 9 | 11/29/2018 | Jenny |
| 5 | 2/13/2019 | Chunfei |
| 5 | 2/16/2019 | Chunfei |
| 5 | 2/16/2019 | Chunfei |
| 5 | 2/23/2019 | Chunfei |
| 5 | 2/25/2019 | Chunfei |
| 8 | 2/28/2019 | Chunfei |
| 8 | 2/28/2019 | Chunfei |
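
To make the retention rule concrete, here is a rough sketch of one way it could be expressed: a row is kept when it is within 5 days of either the previous or the next row in its (Buyer, id) group. The column names `diff_prev` and `diff_next` and the `max_gap` variable are illustrative names, not part of the original code, and the sketch assumes the rows are already sorted by Date within each group.

```r
library(dplyr)

df <- data.frame(
  id = c(9, 9, 9, 4, 4, 4, 5, 5, 5, 5, 5, 5, 8, 8),
  Date = as.Date(
    c("11/29/2018", "11/29/2018", "11/29/2018", "5/30/2018", "7/4/2018",
      "8/17/2018", "5/25/2018", "2/13/2019", "2/16/2019", "2/16/2019",
      "2/23/2019", "2/25/2019", "2/28/2019", "2/28/2019"),
    format = "%m/%d/%Y"
  ),
  Buyer = c("Jenny", "Jenny", "Jenny", "Chang", "Chang", "Chang",
            "Chunfei", "Chunfei", "Chunfei", "Chunfei", "Chunfei",
            "Chunfei", "Chunfei", "Chunfei"),
  stringsAsFactors = FALSE
)

max_gap <- 5  # hypothetical threshold name; "within 5 days" from the question

result <- df %>%
  dplyr::group_by(Buyer, id) %>%
  dplyr::mutate(
    # days since the previous row in the group (NA for the first row)
    diff_prev = as.numeric(difftime(Date, dplyr::lag(Date), units = "days")),
    # days until the next row in the group (NA for the last row)
    diff_next = as.numeric(difftime(dplyr::lead(Date), Date, units = "days"))
  ) %>%
  # keep a row if it is close to either neighbour; NA gaps never qualify
  dplyr::filter(
    (!is.na(diff_prev) & diff_prev <= max_gap) |
      (!is.na(diff_next) & diff_next <= max_gap)
  ) %>%
  dplyr::ungroup() %>%
  # drop the helper columns so only the original columns remain
  dplyr::select(id, Date, Buyer)
```

Looking at both neighbours is what keeps "2/13/2019" (3 days before the next record) and "2/23/2019" (2 days before the next record), while "5/25/2018" and all of Chang's rows are dropped because neither gap is within the threshold.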