1
votes

I have 2 dates columns in a dataframe with more than 100k observations

date1 startdate
2020-07-30 23:00:00 NA
2020-12-10 04:00:00 2021-06-30 20:00:00
2020-10-26 21:00:00 NA
2019-12-03 03:01:00 2020-02-01 01:00:00
NA 2020-06-28 07:30:00

I have to fill the missing values in startdate column, so my idea is to compute the average of days between date1 and startdate and to replace the NA in startdate after by doing an addition between this average and the date1 date.

DESIRED OUTPUT

For instance, if the average of days is 70, then :

date1 startdate
2020-07-30 23:00:00 2020-10-08 23:00:00
2020-12-10 04:00:00 2021-06-30 20:00:00
2020-10-26 21:00:00 2021-01-04 21:00:00
2019-12-03 03:01:00 2020-02-01 01:00:00
NA 2020-06-28 07:30:00

Reproducible example :

structure(list(date1 = structure(c(1594069500, 1575320400, 1603742400, NA, 1574975100, 1570845660, 1575061500, 1564714860, 1576544400, 1574802300, 1576198800, 1575338460, 1575666180, NA, 1594327800, 1595365200, 1594069800, 1591905600, 1594414800, NA), class = c("POSIXct", "POSIXt"), tzone = ""), startdate = structure(c(1599242400, 1577127600, NA, 1603396800, 1577516400, 1573714800, 1577689200, 1566374400, 1577343600, 1577516400, 1577343600, NA, 1577257200, NA, 1605193200, 1605106800, 1600358400, 1600358400, 1600272000, NA), class = c("POSIXct", "POSIXt"), tzone = "UTC")), row.names = c(1L, 2L, 7591L, 8301L, 8692L, 8694L, 8699L, 8703L, 8706L, 8709L, 8710L, 8714L, 8715L, 8730L, 8732L, 8733L, 8736L, 8740L, 8745L, 8749L ), class = "data.frame")

2

2 Answers

1
votes

You can use difftime to calculate average duration between startdate and date1. Replace NA values in startdate by adding the average value to date1.

avg <- as.numeric(mean(difftime(df$startdate, df$date1, units = 'secs'), na.rm = TRUE))
df$startdate[is.na(df$startdate)] <- df$date1[is.na(df$startdate)] + avg
df
0
votes

something like this?? I cannot verify, since your desired output does not match your sample data..

library(data.table)
setDT(mydata)
mydata[is.na(startdate) & !is.na(date1), 
       startdate := date1 + round(mean(abs(DT$date1 - DT$startdate), na.rm = TRUE))]