0
votes

I think I have a pretty simple request. I have the following dataframe, where "place" is a unique identifier, while start_date and end_date may overlap. The values are unique for each ID "place".

place  start_date   end_date value
1     2007-09-01  2010-10-12  0.5
2     2013-09-27  2015-10-11  0.7
...

What I need is to create a year-based variable, where I expand the time series by each year (starting from first of January (i.e. 2011-01-01) starts a new row for that particular "place" and "value". I mean something like this:

place  year  value 
1     2007    0.5   
1     2008    0.5   
1     2009    0.5   
1     2010    0.5   
2     2013    0.7    
2     2014    0.7    
2     2015    0.7    
...

There are some cases with overlap (ie. "place"=1 & "year"=2007) for two separate cases, where one observations starts with one year and the other observation continues from that year. In that case I would prefer the "value" that ends on that specific year. So if one observation for place=1 ends with 2007 in March and another place=1 starts with 2007 in April, year=2007 value for place=1 would be marked with the previous "ending" value if that makes sense.

I've only gotten this far: library(data.table) data <- data.table(dat) data[,:=(start_date = as.Date(start_date), end_date = as.Date(end_date))] data[,num_mons:= length(seq(from=start_date, to=end_date, by='year')),by=1:nrow(data)]

I guess writing a loop makes the most sense?

Thank you for your help and advice.

2

2 Answers

1
votes

Using a tidyverse solution could look like:

library(dplyr)
library(stringr)
library(purrr)
library(tidyr)

data <- tibble(place = c(1, 2),
               start_date = c('2007-09-01',
                              '2013-09-27'),
               end_date = c('2010-10-12',
                            '2015-10-11'),
               value = c(0.5, 0.7))

data %>%
  mutate(year = map2(start_date,
                     end_date,
                     ~ as.character(str_extract(.x, '\\d{4}'): 
                                    str_extract(.y, '\\d{4}')))) %>%
  separate_rows(year) %>%
  filter(!year %in% c('c', '')) %>%
  select(place, year, value)

#   place year  value
#   <dbl> <chr> <dbl>
# 1     1 2007    0.5
# 2     1 2008    0.5
# 3     1 2009    0.5
# 4     1 2010    0.5
# 5     2 2013    0.7
# 6     2 2014    0.7
# 7     2 2015    0.7

I'm having problems understanding the third paragraph of your question ("There are ..."). It seems to me to be a separate question. If that is the case, please consider moving the question to a separate post here on SO. If it is not a separate question, please reformulate the paragraph.

1
votes

You could do the following:

library(lubridate)
library(tidyverse)
df %>%
  group_by(place) %>%
  mutate(year = list(seq(year(ymd(start_date)), year(ymd(end_date)))))%>%
  unnest(year)%>%
  select(place,year,value)

# A tibble: 7 x 3
# Groups:   place [2]
  place  year value
  <int> <int> <dbl>
1     1  2007   0.5
2     1  2008   0.5
3     1  2009   0.5
4     1  2010   0.5
5     2  2013   0.7
6     2  2014   0.7
7     2  2015   0.7