Using survSplit in R for multiple start dates

Question

If I have the data set

ID     DOB         STARTDATE     ENDDATE     FAILURE
1      10/10/75    5/10/84       15/5/03     0
2      15/9/76     10/3/84       19/6/92     0
2      15/9/76     22/2/93       15/1/99     0
2      15/9/76     15/4/99       15/1/03     0

Where ID is the patient
DOB is their date of birth
STARTDATE is when they entered the study
ENDDATE is when they left the study
FAILURE is if they died or not

There are multiple start dates for some individuals

I'm trying to manipulate it so that we have as well as this, a new STARTDATE entry 10 and 20 years after the DOB (until the ENDDATE is reached). So

ID     DOB         STARTDATE     ENDDATE     FAILURE
1      10/10/75    5/10/84       15/5/03     0
2      15/9/76     10/3/84       19/6/92     0
2      15/9/76     14/9/86       19/6/92     0
2      15/9/76     22/2/93       15/1/99     0
2      15/9/76     15/9/96       15/1/99     0
2      15/9/76     15/4/99       15/1/03     0

ID = 1 remains unchanged since it DOB to STARTDATE difference is already less than 10
ID = 2 keeps the three current rows and adds two new rows. One 10 years after DOB and one 20 years after DOB

So far, I've tried to solve it by adding in a new column which calculates the age of entry (STARTDATE - DOB):

library(eeptools)
AGEENTRY <- age_calc(DOB, STARTDATE, units = "years")

And then run survSplit as so

survSplit(DATA, cut = c(10, 20), end = "AGEENTRY", 
                        event = "FAILURE", start = "START")

I know in STATA it can be done quite nicely with stsplit newvariable, at(10(10)20)

However, this is not doing exactly what I was hoping it would. I've been stuck on this problem for over a day now, so any help would be very much appreciated!

1.618 1.618 · Accepted Answer · 2018-06-27T08:26:37

One of the approach could be

library(tidyverse)
library(lubridate)
library(anytime)

#convert character columns of sample data to date columns    
df <- df %>% 
  mutate_if(grepl("/", .), ~ as.Date(., format = "%d/%m/%Y")) 

#identify STARTDATE + 10 (20, 30 ...) years and then process data to have the desired result
df %>%
  group_by(ID, DOB) %>%
  mutate(temp = paste(seq.Date(unique(DOB), max(STARTDATE), by = "10 year")[-1], collapse = ',')) %>%
  separate_rows(temp, sep = ',') %>%
  group_by(STARTDATE, add = T) %>%
  filter(anydate(temp) == min(anydate(temp)[anydate(temp) > STARTDATE])) %>%
  ungroup() %>%   #STARTDATE + 10 (20, 30 ...) years are identified here
  right_join(df) %>%
  mutate(STARTDATE = ifelse(!is.na(temp), paste(STARTDATE, temp, sep = ','), as.character(STARTDATE))) %>%
  separate_rows(STARTDATE, sep = ',') %>%
  select(-temp)

which gives

     ID DOB        STARTDATE  ENDDATE    FAILURE
1     1 1975-10-10 1984-10-05 2003-05-15       0
2     2 1976-09-15 1984-03-10 1992-06-19       0
3     2 1976-09-15 1986-09-15 1992-06-19       0
4     2 1976-09-15 1993-02-22 1999-01-15       0
5     2 1976-09-15 1996-09-15 1999-01-15       0
6     2 1976-09-15 1999-04-15 2003-01-15       0

Sample data:

df <- structure(list(ID = c(1L, 2L, 2L, 2L), DOB = c("10/10/1975", 
"15/9/1976", "15/9/1976", "15/9/1976"), STARTDATE = c("5/10/1984", 
"10/3/1984", "22/2/1993", "15/4/1999"), ENDDATE = c("15/5/2003", 
"19/6/1992", "15/1/1999", "15/1/2003"), FAILURE = c(0L, 0L, 0L, 
0L)), .Names = c("ID", "DOB", "STARTDATE", "ENDDATE", "FAILURE"
), class = "data.frame", row.names = c(NA, -4L))

#  ID        DOB STARTDATE   ENDDATE FAILURE
#1  1 10/10/1975 5/10/1984 15/5/2003       0
#2  2  15/9/1976 10/3/1984 19/6/1992       0
#3  2  15/9/1976 22/2/1993 15/1/1999       0
#4  2  15/9/1976 15/4/1999 15/1/2003       0

Using survSplit in R for multiple start dates

1 Answers