2
votes

Sorry I'm very new to R and I'm not a data expert. I'm trying to calculate a duration omitting overlapping dates. I suspect lubridate is the answer. My data set looks like this:

patientnumber  rxnumber                       startdate          stopdate
100                   1                        1/1/2014          1/5/2014
100                   2                        1/1/2014          1/5/2014
100                   3                       1/20/2014         1/22/2014
200                   4                       2/14/2014         2/14/2014
200                   5                       2/15/2014         2/20/2014

I'd like to calculate obtain a value for patient 100 of 8 (5 + 3) and 7 for patient 200 (1 +6) to calculate a total exposure for each patient.

The way I think I need to approach this is. Calculate the minimum start date, and maximum stop date for each patient then use a counter variable to count starting from the minimum start date. If the counter variable overlaps with one of the intervals then add one and move along. If it doesn't, just move along until the max stop date is reached.

I just don't know how to code this. This would be the most complex coding I've done in R and the first time I'd use a loop.Please help!

Update @ Richard Appreciate the help. While scaling this up I noticed some problems.

assuming 1 same patient number and increasing rx#

startdate stopdate duration overlap
3/26/2014 3/26/2014 1 3 (this overlap is coming from the record above) 3/27/2014 3/27/2014 1 0
3/27/2014 3/27/2014 1 1 3/27/2014 3/30/2014 4 1 3/28/2014 3/28/2014 1 3 (this unfortunately I'm not sure hwo to fix)

The code is working, just needs to be fine tuned. Hope you can help. I'll continue to try to figure this out.

2

2 Answers

3
votes

One solution would be to calulate the duration and then correct for an overlap with the preceeding entry.

BEWARE: This solution assumes a certain type of ordering and will not work for unordered data.frames. It is assumed (like in the provided example) that the data are ordered by patientnumber and cronologically.

If the data are in a different format, they have to be sorted accordingly.

# example data
dat <- read.table(header=TRUE, text=
  "patientnumber rxnumber startdate stopdate
   100 1 1/1/2014 1/5/2014
   100 2 1/1/2014 1/5/2014
   100 3 1/20/2014 1/22/2014
   200 4 2/14/2014 2/14/2014
   200 5 2/15/2014 2/20/2014
   300 5 2/19/2014 2/22/2014 
   300 6 3/27/2014 3/27/2014
   300 7 3/27/2014 3/27/2014 
   300 8 3/27/2014 3/30/2014 
   300 9 3/28/2014 3/28/2014")

# convert to date
dat$startdate <- as.Date(dat$startdate, "%m/%d/%Y")
dat$stopdate <- as.Date(dat$stopdate, "%m/%d/%Y")
# base duration
dat$duration <- difftime(dat$stopdate, dat$startdate, units="days")+1

# calculate overlap
dat$overlap <- 0

for(i in 2:nrow(dat)){
  samepat <- dat$patientnumber[i]==dat$patientnumber[i-1]
    curovl <- min(dat$stopdate[i],dat$stopdate[i-1]) - dat$startdate[i]+1
    if(curovl>0 & samepat) dat$overlap[i] <- curovl
}

# aggregate duration and overlap
res <- aggregate(duration ~ patientnumber, data=dat, sum)
res$overlap <- aggregate(overlap ~ patientnumber, data=dat, sum)[,2]

# calculate corrected value
res$corrected <- res$duration - res$overlap

Result:

>     res
  patientnumber duration overlap corrected
1           100      13        5        8 
2           200       7        0        7 
3           300      11        3        8 

EDIT:

Fixed some issues regarding the calculation of the overlap (if patientnumber changes, if overlap is only partial). See if the results match your expectations now. Thanks for pointing this out!

1
votes

This solution uses the unique function to remove duplicate dates. Before using unique, the original data frame needs to be reformatted to a tall-and-skinny version.

# example data
dat <- read.table(header=TRUE, text=
"patientnumber rxnumber startdate stopdate
   100 1 1/1/2014 1/5/2014
   100 2 1/1/2014 1/5/2014
   100 3 1/20/2014 1/22/2014
   200 4 2/14/2014 2/14/2014
   200 5 2/15/2014 2/20/2014
   300 5 2/19/2014 2/22/2014 
   300 6 3/27/2014 3/27/2014
   300 7 3/27/2014 3/27/2014 
   300 8 3/27/2014 3/30/2014 
   300 9 3/28/2014 3/28/2014")
# convert to date
dat$startdate <- as.Date(dat$startdate, "%m/%d/%Y")
dat$stopdate <- as.Date(dat$stopdate, "%m/%d/%Y")

# Create integer versions of the dates
dat$startdate <- as.integer(dat$startdate)
dat$stopdate <- as.integer(dat$stopdate)

# Initialize a "long" version of the original data frame
dat2 <- data.frame(patientnumber = as.integer(),
                   date = as.integer())

# Loop through each row in the original data frame
for (i in 1:nrow(dat)) {
  # Loop through the days between the startdate and stopdate
  for (j in dat[i, "startdate"]:dat[i, "stopdate"]) {
    # Create a new row for each day
    rowij <- data.frame(patientnumber = dat[i, "patientnumber"],
                        date = j)
    # Concatenate the new row to the "long" version of the original data frame
    dat2 <- rbind(dat2, rowij)
  }
}

# Use the unique() function to get rid of duplicate days
dat3 <- unique(dat2)

# Aggregate the days
dat4 <- aggregate(date ~ patientnumber, data=dat3, length)
names(dat4)[2] <- "numberNonoverlappingDays"
dat4

Result:

> dat4
  patientnumber numberNonoverlappingDays
1           100                        8
2           200                        7
3           300                        8