4
votes

I am trying to add rows to a data frame based on the minimum and maximum dates within each group. Suppose this is my original data frame:

df = data.frame(Date = as.Date(c("2017-12-01", "2018-01-01", "2017-12-01", "2018-01-01", "2018-02-01","2017-12-01", "2018-02-01")),
            Group = c(1,1,2,2,2,3,3),
            Value = c(100, 200, 150, 125, 200, 150, 175))

Notice that Group 1 has 2 consecutive dates, Group 2 has 3 consecutive dates, and Group 3 is missing the date in the middle (2018-01-01). I'd like to complete the data frame by adding rows for the missing dates, but only for dates that fall between the minimum and maximum date within each group. The completed data frame would look like this:

df_complete = data.frame(Date = as.Date(c("2017-12-01", "2018-01-01", "2017-12-01", "2018-01-01", "2018-02-01","2017-12-01","2018-01-01", "2018-02-01")),
            Group = c(1,1,2,2,2,3,3,3),
            Value = c(100, 200, 150, 125, 200, 150,NA, 175))

Only one row was added because Group 3 was missing one date. There was no date added for Group 1 because it had all the dates between its minimum (2017-12-01) and maximum date (2018-01-01).

2 Answers

12
votes

You can solve this with tidyr::complete together with dplyr. The interval between consecutive dates appears to be one month, so the approach is:

library(dplyr)
library(tidyr)

df %>% group_by(Group) %>%
  complete(Group, Date = seq.Date(min(Date), max(Date), by = "month"))

# A tibble: 8 x 3
# Groups:   Group [3]
#   Group Date       Value
#   <dbl> <date>     <dbl>
# 1  1.00 2017-12-01   100
# 2  1.00 2018-01-01   200
# 3  2.00 2017-12-01   150
# 4  2.00 2018-01-01   125
# 5  2.00 2018-02-01   200
# 6  3.00 2017-12-01   150
# 7  3.00 2018-01-01    NA
# 8  3.00 2018-02-01   175
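
To see what the sequence argument produces for a single group, here is a minimal sketch using Group 3's minimum and maximum dates from the question. The variable names start, end and months_seq are only illustrative.

```r
# Group 3's minimum and maximum dates, taken from the question's data
start <- as.Date("2017-12-01")
end   <- as.Date("2018-02-01")

# Monthly sequence between the two dates; complete() fills in any of
# these dates that the group does not already have
months_seq <- seq.Date(start, end, by = "month")
print(months_seq)
# [1] "2017-12-01" "2018-01-01" "2018-02-01"
```

The middle date, 2018-01-01, is the one missing from Group 3, which is why that row appears with Value = NA.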

Data

df = data.frame(Date = as.Date(c("2017-12-01", "2018-01-01", "2017-12-01", "2018-01-01",
               "2018-02-01","2017-12-01", "2018-02-01")),
                Group = c(1,1,2,2,2,3,3),
                Value = c(100, 200, 150, 125, 200, 150, 175))
1
votes

@MKR's approach of using tidyr::complete with dplyr is good, but it fails if the group column is not numeric. The column is then converted to a factor, and complete() returns a tibble with a row for every factor level/date combination within each group.

complete() does not need the group variable as its first argument, because on a grouped data frame it already operates within each group. So the solution is:

library(dplyr)
library(tidyr)

df %>% group_by(Group) %>%
  complete(Date = seq.Date(min(Date), max(Date), by = "month"))
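
As a quick check, here is a sketch of the same call on a copy of the question's data with a character group column. The name df_chr and the labels "a", "b", "c" are made up for illustration, and a reasonably recent tidyr/dplyr is assumed.

```r
library(dplyr)
library(tidyr)

# Same dates and values as the question, but with a character Group column
# (hypothetical labels "a", "b", "c" stand in for groups 1, 2, 3)
df_chr <- data.frame(
  Date  = as.Date(c("2017-12-01", "2018-01-01", "2017-12-01", "2018-01-01",
                    "2018-02-01", "2017-12-01", "2018-02-01")),
  Group = c("a", "a", "b", "b", "b", "c", "c"),
  Value = c(100, 200, 150, 125, 200, 150, 175),
  stringsAsFactors = FALSE
)

# Fill in missing months between each group's own min and max date
res <- df_chr %>%
  group_by(Group) %>%
  complete(Date = seq.Date(min(Date), max(Date), by = "month")) %>%
  ungroup()

nrow(res)             # 8 rows: only group "c" gains a row
sum(is.na(res$Value)) # 1 missing Value, for "c" on 2018-01-01
```

Groups "a" and "b" are left unchanged because each already has every month between its own minimum and maximum date.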