6
votes

I'd like to combine rows of a data frame such that the ranges described by a "start" and "end" column include all values from the original data set. There might be overlaps, repeats, and nested ranges. Some ranges might be missing.

Here's an example of the kind of data I'd like to collapse:

data = data.frame(rbind(
    c("Roger", 1,  10),
    c("Roger", 10, 15),
    c("Roger", 16, 17),
    c("Roger", 3,  6),
    c("Roger", 20, 25),
    c("Roger", NA, NA),
    c("Susan", 2,  8)))
names(data) = c("name", "start", "end")
data$start = as.numeric(as.character(data$start))
data$end = as.numeric(as.character(data$end))

The desired result would be:

name   start end
Roger  1     17
Roger  20    25
Susan  2     8

My attempt has been to expand out every item in the range for each row. This works, but then I'm not sure how to shrink it back. Additionally, the full dataset I'm working with has ~30 million rows and very large ranges, so this method is VERY slow.

pb <- txtProgressBar(min = 0, max = length(data$name), style = 3)
mylist = list()
for(i in 1:length(data$name)){
  subdata = data[i,]
  if(is.na(subdata$start)){
    mylist[[i]] = subdata
    mylist[[i]]$each = NA
  }
  if(!is.na(subdata$start)){
    sequence = seq(subdata$start, subdata$end)  
    mylist[[i]] = subdata[rep(1, each = length(sequence)),]
    mylist[[i]]$daily = sequence
  }
  setTxtProgressBar(pb, i)
}

rbindlist(mylist)
1
Maybe it's obvious but why is Roger appearing twice? and not in one row with start = 1 and end = 25?sindri_baldur
@snoram Good question. Because Roger didn't have an 18 or 19, so the two records reflects the gap in his ranges.Nancy

1 Answers

10
votes

I'm guessing IRanges is much more efficient for this, but...

library(data.table)

# remove missing values
DT = na.omit(setDT(data))

# sort
setorder(DT, name, start)

# mark threshold for a new group
DT[, high_so_far := shift(cummax(end), fill=end[1L]), by=name]

# group and summarise
DT[, .(start[1L], end[.N]), by=.( name, g = cumsum(start > high_so_far + 1L) )]

#     name g V1 V2
# 1: Roger 0  1 17
# 2: Roger 1 20 25
# 3: Susan 1  2  8

How it works:

  • cummax is the cumulative maximum, so the highest value so far, including the current row.
  • To take the value excluding the current row, use shift (which draws from the prior row).
  • cumsum(some_condition) is a standard way of making a grouping variable.
  • .N is the last row of the group determined by by=.

The columns can be named in the last step like .(s = start[1L], e = end[.N]) if desired.


With date intervals. If working with dates, I'd suggest the IDate class; just use as.IDate to convert a Date.

We can +1 on dates, but unfortunately cannot cummax, so...

cummax_idate = function(x) (setattr(cummax(unclass(x)), "class", c("Date", "IDate")))

set.seed(1)
d = sample(as.IDate("2011-11-11") + 1:10)
cummax_idate(d)
#  [1] "2011-11-14" "2011-11-15" "2011-11-16" "2011-11-18" "2011-11-18"
#  [6] "2011-11-19" "2011-11-20" "2011-11-20" "2011-11-21" "2011-11-21"

I think this function can be used in place of cummax.

The extra () in the function are there because setattr won't print its output.