2
votes

I have to transform my dataframe from the current to the new format (see the structure below), and I have no idea how to accomplish that. I want a row for each ID and year from 2013-2018 (so each ID has 6 rows, one for every year). The dates are the date of moving to an address (entry date) and the date of leaving it (end date). So each ID and year should give the zipcode and city they lived in. The place an ID lived in a given year should be wherever they lived the longest that year. I've already set the end date to 31-12-2018 if they still live there (originally shown as NA). Below are the structure and the first rows. Hopefully you guys can help me out!

Current format:

ID  ZIPCODE  CITY     ENTRY_DATE  END_DATE
1   1234AB   NEWYORK  2-1-2014    13-5-2017
1   5678CD   LA       13-3-2017   21-12-2018
2   9012EF   MIAMI    10-11-2011  6-9-2017

New format:

ID  YEAR  ZIPCODE  CITY
1   2013  NA       NA
1   2014  1234AB   NEWYORK
1   2015  1234AB   NEWYORK
1   2016  1234AB   NEWYORK
1   2017  5678CD   LA
1   2018  5678CD   LA
2   2013  9012EF   MIAMI


Can you explain how you assign the values for ZIPCODE and CITY? Also, a better way to share data is usually `dput`. Read about how to give a reproducible example. – Ronak Shah

1 Answer

2
votes

Here is one approach.

First, create a date interval for each location from its entry to end date. Using map2 and unnest, you then get one row for each calendar year covered by that interval.
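
To see what just this first step produces, here is a minimal sketch run on the df from the Data section at the bottom (the intermediate result is not shown in the original answer):

library(tidyverse)
library(lubridate)

df %>%
  mutate(ENTRY_END_INT = interval(ENTRY_DATE, END_DATE),
         YEAR = map2(year(ENTRY_DATE), year(END_DATE), seq)) %>%
  unnest(YEAR)
# one row per address-year combination, e.g. ID 1's first address
# (2014-01-02 to 2017-05-13) expands to the years 2014 through 2017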

Since you wish to include the location where the ID spent the greatest number of days in a given calendar year, you can look at the overlap between two intervals: one is the calendar year, and the other runs from ENTRY_DATE to END_DATE. For each ID and year, you can filter by max(WEEKS) (or, to ensure a single address per year, arrange in descending order by WEEKS and slice(1); with recent dplyr you could also consider slice_max). This keeps the row with the longest overlap between the two intervals.
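
For that selection step, a possible variant with dplyr 1.0.0 or later (not used in the code below; expanded is just a placeholder name for the data after the unnest() and overlap mutate() steps):

expanded %>%
  group_by(ID, YEAR) %>%
  # keep the single row with the longest overlap; with_ties = FALSE
  # guarantees exactly one address per ID and year
  slice_max(WEEKS, n = 1, with_ties = FALSE) %>%
  ungroup()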

The final complete ensures you have a row for every year from 2013 to 2018 for each ID.
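
As a small, self-contained illustration of that last step (toy data, not the question's df): within each ID, complete() adds a row for every missing year in 2013-2018, with NA in the remaining columns.

library(tidyverse)

tibble(ID = c(1, 1), YEAR = c(2014, 2015), CITY = c("NEWYORK", "NEWYORK")) %>%
  group_by(ID) %>%
  # rows for 2013 and 2016-2018 are added with CITY = NA
  complete(YEAR = 2013:2018)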

library(tidyverse)
library(lubridate)

df %>%
  # one interval per address, plus the list of calendar years it spans
  mutate(ENTRY_END_INT = interval(ENTRY_DATE, END_DATE),
         YEAR = map2(year(ENTRY_DATE), year(END_DATE), seq)) %>%
  # one row per address-year combination
  unnest(YEAR) %>%
  # overlap (as a duration) between each calendar year and the stay at the address
  mutate(YEAR_INT = interval(as.Date(paste0(YEAR, '-01-01')), as.Date(paste0(YEAR, '-12-31'))),
         WEEKS = as.duration(intersect(ENTRY_END_INT, YEAR_INT))) %>%
  # keep the address with the longest overlap for each ID and year
  group_by(ID, YEAR) %>%
  arrange(desc(WEEKS)) %>%
  slice(1) %>%
  # add rows for any years between 2013 and 2018 that are still missing
  group_by(ID) %>%
  complete(YEAR = seq(2013, 2018, 1)) %>%
  arrange(ID, YEAR) %>%
  select(-c(ENTRY_DATE, END_DATE, ENTRY_END_INT, YEAR_INT, WEEKS))

Output

# A tibble: 14 x 4
# Groups:   ID [2]
      ID  YEAR ZIPCODE CITY   
   <dbl> <dbl> <chr>   <chr>  
 1     1  2013 NA      NA     
 2     1  2014 1234AB  NEWYORK
 3     1  2015 1234AB  NEWYORK
 4     1  2016 1234AB  NEWYORK
 5     1  2017 5678CD  LA     
 6     1  2018 5678CD  LA     
 7     2  2011 9012EF  MIAMI  
 8     2  2012 9012EF  MIAMI  
 9     2  2013 9012EF  MIAMI  
10     2  2014 9012EF  MIAMI  
11     2  2015 9012EF  MIAMI  
12     2  2016 9012EF  MIAMI  
13     2  2017 9012EF  MIAMI  
14     2  2018 NA      NA    

Data

df <- structure(list(ID = c(1, 1, 2), ZIPCODE = c("1234AB", "5678CD", 
"9012EF"), CITY = c("NEWYORK", "LA", "MIAMI"), ENTRY_DATE = structure(c(16072, 
17238, 15288), class = "Date"), END_DATE = structure(c(17299, 
17896, 17415), class = "Date")), class = "data.frame", row.names = c(NA, 
-3L))