2
votes

I have a data.table in R and I want to create a new column that finds the interval for every price of the respective year/month.

Reproducible example:

set.seed(100)
DT <- data.table(year=2000:2009, month=1:10,  price=runif(5*26^2)*100)
intervals <- list(year=2000:2009, month=1:10, interval = sort(round(runif(9)*100)))
intervals <- replicate(10, (sample(10:100,100, replace=T)))
intervals <- t(apply(intervals, 1, sort))
intervals.dt <- data.table(intervals)
intervals.dt[, c("year", "month") := list(rep(2000:2009, each=10), 1:10)]
setkey(intervals.dt, year, month)
setkey(DT, year, month)

I have just tried:

  • merging the DT and intervals.dt data.tables by month/year,
  • creating a new intervalsstring column consisting of all the V* columns to one column string, (not very elegant, I admit), and finally
  • substringing it to a vector, so as I can use it in findInterval() but the solution does not work for every row (!)

So, after:

DT <- merge(DT, intervals.dt)
DT <- DT[, intervalsstring := paste(V1, V2, V3, V4, V5, V6, V7, V8, V9, V10)]
DT <- DT[, c("V1", "V2", "V3", "V4", "V5", "V6", "V7", "V8", "V9", "V10") := NULL]
DT[, interval := findInterval(price, strsplit(intervalsstring, " ")[[1]])]

I get

> DT
      year month     price               intervalsstring interval
   1: 2000     1 30.776611 12 21 36 46 48 51 63 72 91 95        2
   2: 2000     1 62.499648 12 21 36 46 48 51 63 72 91 95        6
   3: 2000     1 53.581115 12 21 36 46 48 51 63 72 91 95        6
   4: 2000     1 48.830599 12 21 36 46 48 51 63 72 91 95        5
   5: 2000     1 33.066053 12 21 36 46 48 51 63 72 91 95        2
---                                                            
3376: 2009    10 33.635924 12 40 45 48 50 65 75 90 96 97        2
3377: 2009    10 38.993769 12 40 45 48 50 65 75 90 96 97        3
3378: 2009    10 75.065820 12 40 45 48 50 65 75 90 96 97        8
3379: 2009    10  6.277403 12 40 45 48 50 65 75 90 96 97        0
3380: 2009    10 64.189162 12 40 45 48 50 65 75 90 96 97        7

which is correct for the first rows, but not for the last (or other) rows. For example, for the row 3380, the price ~64.19 should be in the 5th interval and not the 7th. I guess my mistake is that by my last command, finding Intervals relies only on the first row of intervalsstring.

Thank you!

1

1 Answers

3
votes

You have to us the argument by = year to apply the function to all subsets:

DT[, interval := findInterval(price, intervals[as.character(year), ]), by = year]

      year     price interval
   1: 2000 30.776611        4
   2: 2001 25.767250        1
   3: 2002 55.232243        4
   4: 2003  5.638315        0
   5: 2004 46.854928        2
  ---                        
3376: 2005 97.497761       10
3377: 2006 50.141227        5
3378: 2007 50.186270        7
3379: 2008 99.229338       10
3380: 2009 64.189162        8

Update (based on edited question):

DT[ , interval := findInterval(price, 
                               unlist(intervals.dt[J(year[1], month[1]), 1:10])), 
   by = c("year", "month")]
      year month     price V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 interval
   1: 2000     1 30.776611 12 21 36 46 48 51 63 72 91  95        2
   2: 2000     1 62.499648 12 21 36 46 48 51 63 72 91  95        6
   3: 2000     1 53.581115 12 21 36 46 48 51 63 72 91  95        6
   4: 2000     1 48.830599 12 21 36 46 48 51 63 72 91  95        5
   5: 2000     1 33.066053 12 21 36 46 48 51 63 72 91  95        2
  ---                                                             
3376: 2009    10 33.635924 12 40 45 48 50 65 75 90 96  97        2
3377: 2009    10 38.993769 12 40 45 48 50 65 75 90 96  97        3
3378: 2009    10 75.065820 12 40 45 48 50 65 75 90 96  97        8
3379: 2009    10  6.277403 12 40 45 48 50 65 75 90 96  97        0
3380: 2009    10 64.189162 12 40 45 48 50 65 75 90 96  97        7