I've got a dataset with a starting date and an enddate, and I want to split the rows in this dataframe based on the year in the period. Take this data frame for example:
df <- data.frame("starting_date"=as.Date("2015-06-01"),"end_date"=as.Date("2017-09-30"))
It should be split into 3 rows, one with starting date 2015-06-01 and end date 2015-12-31, one with starting date 2016-01-01 and end date 2016-12-31, and one with starting date 2017-01-01 and end date 2017-09-30. Any idea how to do this? It should look like this in the end:
starting_date end_date
1 2015-06-01 2015-12-31
2 2016-01-01 2016-12-31
3 2017-01-01 2017-09-30
EDIT: I have adjusted the code to work in base R.
EDIT2: I tried
library(dplyr)
df2 <- df[1,]
df2 <- df[-1,]
for (i in 1:dim(df)[1]){
for (j in year(df$starting_date[i]):year(df$end_date[i]))
{
df2 <- bind_rows(df2,df[i,])
}
}
It works, but it is impossibly slow.
EDIT3: I managed to replicate the lines equal to the number of years involved:
df2 <- df[rep(seq_len(nrow(df)),year(df$end_date)-year(df$starting_date)+1),]
Now I would need another column with the years like this:
starting_date end_date years
1 2015-06-01 2017-09-30 2015
2 2015-06-01 2017-09-30 2016
3 2015-06-01 2017-09-30 2017
Once I am here it is easy to get the required end result.... Any idea on how to do this? I tried to make a seperate vector with the years, in order to cbind it with df2, but it didn't work....
years <- lapply(df,function(x) seq(x[,"starting_date"],length.out=x[,"year"]))
EDIT4: Finally managed to do it with the help of this post: R Create a time sequence as xts index based on two columns in data.frame The code can probably be improved upon a lot, but it works....
diffs <- abs(with(df, year(starting_date)-year(end_date)))+1
df.rep <- df[rep(1:nrow(df), times=diffs), ]
reps <- rep(diffs, times=diffs)
dates.l <- apply(
df[colnames(df) %in% c("starting_date", "end_date")], 1,
function(x) {
seq(min(year(as.Date(x))), max(year(as.Date(x))))
})
years <- do.call(c, dates.l)
df.long <- cbind(df.rep, reps, years)
df.long$yearstart <- as.Date(paste0(year(df.long$years),"-01-01"))
df.long$yearend <- as.Date(paste0(year(df.long$years),"-12-31"))
df.long$starting_date2 <- pmax(df.long$starting_date,df.long$yearstart)
df.long$end_date2 <- pmin(df.long$end_date,df.long$yearend)
date
is not a base R function... eg,x = as.Date(c("2015-06-01", "2017-09-30"))
seems to fit. – Frankdate
is a base R function, but it does not take any arguments. – lmotimestamp
function at some point recently. – lmo