I'd like to combine rows of a data frame such that the ranges described by a "start" and "end" column include all values from the original data set. There might be overlaps, repeats, and nested ranges. Some ranges might be missing.
Here's an example of the kind of data I'd like to collapse:
data = data.frame(rbind(
c("Roger", 1, 10),
c("Roger", 10, 15),
c("Roger", 16, 17),
c("Roger", 3, 6),
c("Roger", 20, 25),
c("Roger", NA, NA),
c("Susan", 2, 8)))
names(data) = c("name", "start", "end")
data$start = as.numeric(as.character(data$start))
data$end = as.numeric(as.character(data$end))
The desired result would be:
name start end
Roger 1 17
Roger 20 25
Susan 2 8
My attempt has been to expand out every item in the range for each row. This works, but then I'm not sure how to shrink it back. Additionally, the full dataset I'm working with has ~30 million rows and very large ranges, so this method is VERY slow.
pb <- txtProgressBar(min = 0, max = length(data$name), style = 3)
mylist = list()
for(i in 1:length(data$name)){
subdata = data[i,]
if(is.na(subdata$start)){
mylist[[i]] = subdata
mylist[[i]]$each = NA
}
if(!is.na(subdata$start)){
sequence = seq(subdata$start, subdata$end)
mylist[[i]] = subdata[rep(1, each = length(sequence)),]
mylist[[i]]$daily = sequence
}
setTxtProgressBar(pb, i)
}
rbindlist(mylist)