1
votes

How would you subset a time-series data.frame based on time and a threshold value?

I have this data:

year <- seq(2000, 2009, 1)
v1 <- sample(1:10, 10, replace=T)
df <- data.frame(year, v1)

That looks like this:

> df
  year v1
1  2000  9
2  2001  4
3  2002  5
4  2003  4
5  2004  5
6  2005  3
7  2006  3
8  2007  3
9  2008  9
10 2009  6

I want to subset the data by groups of sequential years for which the summed score on v1 exceeds the value of 10.

On this example data the first subset should hold observations of the year 2000 & 2001. The second subset should hold the observations of year 2002, 2003 and 2004.

The real data has about 8 million observations covering 120 years.

1
Please use set.seed when create examples using sampleDavid Arenburg

1 Answers

2
votes

You can implement a customized cumsum using Reduce function, reset the sum when the total exceeds 10 and at same time increment a count as group variable:

library(data.table)
transpose(Reduce(function(x, y) if(x[1] > 10) c(y, x[2]+1) else c(x[1] + y, x[2]), 
                 init = c(0, 1), df$v1, accumulate = T))[[2]][-1]

# here the init parameter will take two parameters, the first one keep track of the cumsum,
# and the second one serves as a group variable, when the sum exceeds 10, reset the sum to 
# zero and increase the group variable by one

# [1] 1 1 2 2 2 3 3 3 3 4

It takes around 20 seconds to run over 10 million observations vector:

v = sample(1:10, 10000000, replace = T)
system.time(transpose(Reduce(function(x, y) if(x[1] > 10) c(y, x[2]+1) else c(x[1] + y, x[2]), init = c(0, 1), v, accumulate = T))[[2]])

#   user  system elapsed 
# 19.509   0.552  20.081