Calculate Duration Omitting Overlapping Dates in R

Question

Sorry I'm very new to R and I'm not a data expert. I'm trying to calculate a duration omitting overlapping dates. I suspect lubridate is the answer. My data set looks like this:

patientnumber  rxnumber                       startdate          stopdate
100                   1                        1/1/2014          1/5/2014
100                   2                        1/1/2014          1/5/2014
100                   3                       1/20/2014         1/22/2014
200                   4                       2/14/2014         2/14/2014
200                   5                       2/15/2014         2/20/2014

I'd like to calculate obtain a value for patient 100 of 8 (5 + 3) and 7 for patient 200 (1 +6) to calculate a total exposure for each patient.

The way I think I need to approach this is. Calculate the minimum start date, and maximum stop date for each patient then use a counter variable to count starting from the minimum start date. If the counter variable overlaps with one of the intervals then add one and move along. If it doesn't, just move along until the max stop date is reached.

I just don't know how to code this. This would be the most complex coding I've done in R and the first time I'd use a loop.Please help!

Update @ Richard Appreciate the help. While scaling this up I noticed some problems.

assuming 1 same patient number and increasing rx#

startdate stopdate duration overlap
3/26/2014 3/26/2014 1 3 (this overlap is coming from the record above) 3/27/2014 3/27/2014 1 0
3/27/2014 3/27/2014 1 1 3/27/2014 3/30/2014 4 1 3/28/2014 3/28/2014 1 3 (this unfortunately I'm not sure hwo to fix)

The code is working, just needs to be fine tuned. Hope you can help. I'll continue to try to figure this out.

SimonG SimonG · Accepted Answer · 2014-08-15T18:22:49

One solution would be to calulate the duration and then correct for an overlap with the preceeding entry.

BEWARE: This solution assumes a certain type of ordering and will not work for unordered data.frames. It is assumed (like in the provided example) that the data are ordered by patientnumber and cronologically.

If the data are in a different format, they have to be sorted accordingly.

# example data
dat <- read.table(header=TRUE, text=
  "patientnumber rxnumber startdate stopdate
   100 1 1/1/2014 1/5/2014
   100 2 1/1/2014 1/5/2014
   100 3 1/20/2014 1/22/2014
   200 4 2/14/2014 2/14/2014
   200 5 2/15/2014 2/20/2014
   300 5 2/19/2014 2/22/2014 
   300 6 3/27/2014 3/27/2014
   300 7 3/27/2014 3/27/2014 
   300 8 3/27/2014 3/30/2014 
   300 9 3/28/2014 3/28/2014")

# convert to date
dat$startdate <- as.Date(dat$startdate, "%m/%d/%Y")
dat$stopdate <- as.Date(dat$stopdate, "%m/%d/%Y")
# base duration
dat$duration <- difftime(dat$stopdate, dat$startdate, units="days")+1

# calculate overlap
dat$overlap <- 0

for(i in 2:nrow(dat)){
  samepat <- dat$patientnumber[i]==dat$patientnumber[i-1]
    curovl <- min(dat$stopdate[i],dat$stopdate[i-1]) - dat$startdate[i]+1
    if(curovl>0 & samepat) dat$overlap[i] <- curovl
}

# aggregate duration and overlap
res <- aggregate(duration ~ patientnumber, data=dat, sum)
res$overlap <- aggregate(overlap ~ patientnumber, data=dat, sum)[,2]

# calculate corrected value
res$corrected <- res$duration - res$overlap

Result:

>     res
  patientnumber duration overlap corrected
1           100      13        5        8 
2           200       7        0        7 
3           300      11        3        8

EDIT:

Fixed some issues regarding the calculation of the overlap (if patientnumber changes, if overlap is only partial). See if the results match your expectations now. Thanks for pointing this out!

Calculate Duration Omitting Overlapping Dates in R

2 Answers