I have 2 dates columns in a dataframe with more than 100k observations
| date1 | startdate |
|---|---|
| 2020-07-30 23:00:00 | NA |
| 2020-12-10 04:00:00 | 2021-06-30 20:00:00 |
| 2020-10-26 21:00:00 | NA |
| 2019-12-03 03:01:00 | 2020-02-01 01:00:00 |
| NA | 2020-06-28 07:30:00 |
I have to fill the missing values in startdate column, so my idea is to compute the average of days between date1 and startdate and to replace the NA in startdate after by doing an addition between this average and the date1 date.
DESIRED OUTPUT
For instance, if the average of days is 70, then :
| date1 | startdate |
|---|---|
| 2020-07-30 23:00:00 | 2020-10-08 23:00:00 |
| 2020-12-10 04:00:00 | 2021-06-30 20:00:00 |
| 2020-10-26 21:00:00 | 2021-01-04 21:00:00 |
| 2019-12-03 03:01:00 | 2020-02-01 01:00:00 |
| NA | 2020-06-28 07:30:00 |
Reproducible example :
structure(list(date1 = structure(c(1594069500, 1575320400, 1603742400, NA, 1574975100, 1570845660, 1575061500, 1564714860, 1576544400, 1574802300, 1576198800, 1575338460, 1575666180, NA, 1594327800, 1595365200, 1594069800, 1591905600, 1594414800, NA), class = c("POSIXct", "POSIXt"), tzone = ""), startdate = structure(c(1599242400, 1577127600, NA, 1603396800, 1577516400, 1573714800, 1577689200, 1566374400, 1577343600, 1577516400, 1577343600, NA, 1577257200, NA, 1605193200, 1605106800, 1600358400, 1600358400, 1600272000, NA), class = c("POSIXct", "POSIXt"), tzone = "UTC")), row.names = c(1L, 2L, 7591L, 8301L, 8692L, 8694L, 8699L, 8703L, 8706L, 8709L, 8710L, 8714L, 8715L, 8730L, 8732L, 8733L, 8736L, 8740L, 8745L, 8749L ), class = "data.frame")