Sliding window over data.frame with nested hierarchy

Question

Description of the data

My data.frame represents the salaries of people living in different cities (city) in different countries (country). City names, country names and salaries are integers. The data.frame is sorted by country, then by city within each country, then by salary within each city. There are two additional columns, arg1 and arg2, which contain doubles.

Goal

For each country and each city, I want to consider a window of size WindowSize over the salaries and calculate D = sum(arg1)/sum(arg2) over the rows whose salary falls in this window. The window then slides by WindowStep and D is recalculated, and so on. For example, with WindowSize = 1000 and WindowStep = 10, within each country and each city I would like to get D for salaries between 0 and 1000, then between 10 and 1010, then between 20 and 1020, etc.

At the end the output should be a data.frame associating a D statistic to each window. If a given window has no entry (for example nobody has a salary between 20 and 1020 in country 1, city 3), then the D statistic should be NA.
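
To make the window definition concrete, here is a minimal sketch of how the window bounds could be generated for a single city. The names max_salary, froms and tos are only for illustration, and the stopping rule (the last window starts at or below the largest salary) is an assumption on my part.

```r
WindowSize = 1000
WindowStep = 10
max_salary = 2500   # hypothetical largest salary in one city

# lower bounds 0, 10, 20, ...; each window spans WindowSize salary units
froms = seq(0, max_salary, by = WindowStep)
tos = froms + WindowSize

# first three windows: 0-1000, 10-1010, 20-1020
head(data.frame(from = froms, to = tos), 3)
```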

Note on performance

I will have to run this algorithm about 10000 times on pretty big tables (that have nothing to do with countries, cities and salaries; I don't yet have a good estimate of the size of these tables), so performance is of concern.

Example data

set.seed(84)
country = rep(1:3, c(30, 22, 51))
city = c(rep(1:5, c(5,5,5,5,10)), rep(1:5, c(1,1,10,8,2)), rep(c(1,3,4,5), c(20, 7, 3, 21)))
tt = paste0(city, country)
salary = c()
for (i in unique(tt)) salary = append(salary, sort(round(runif(sum(tt==i), 0,100000))))

arg1 = rnorm(length(country), 1, 1)
arg2 = rnorm(length(country), 1, 1)
dt = data.frame(country = country, city = city, salary = salary, arg1 = arg1, arg2 = arg2)
head(dt)
  country city salary       arg1        arg2
1       1    1  22791 -1.4606212  1.07084528
2       1    1  34598  0.9244679  1.19519158
3       1    1  76411  0.8288587  0.86737330
4       1    1  76790  1.3013056  0.07380115
5       1    1  87297 -1.4021137  1.62395596
6       1    2  12581  1.3062181 -1.03360620

With this example, if WindowSize = 70000 and WindowStep = 30000, the first two values of D (country 1, city 1) are -0.236604 and 0.439462, which are the results of sum(dt$arg1[1:2])/sum(dt$arg2[1:2]) and sum(dt$arg1[2:5])/sum(dt$arg2[2:5]), respectively.

alexis_laz · Accepted Answer · 2015-09-01T15:18:14

Unless I've misunderstood something, the following might be helpful.

Define a simple function that works on a single group, ignoring the hierarchical grouping:

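ff = function(salary, wSz, wSt, arg1, arg2)
{
    # window starts: 0, wSt, 2 * wSt, ..., up to the first multiple of wSt >= max(salary)
    froms = wSt * (0:ceiling(max(salary) / wSt))
    tos = froms + wSz
    # for each window, D over the salaries strictly inside (from, to)
    Ds = mapply(function(from, to, salaries, args1, args2) {
                  inds = salaries > from & salaries < to
                  sum(args1[inds]) / sum(args2[inds])  # 0 / 0 = NaN if the window is empty
                },
                from = froms, to = tos,
                MoreArgs = list(salaries = salary, args1 = arg1, args2 = arg2))
    list(from = froms, to = tos, D = Ds)
}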
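
Note that ff yields NaN (from 0/0) for a window that contains no salary, whereas the question asks for NA. A possible variant is sketched below; ff_na is just an illustrative name, it keeps the same strict window bounds as ff, and it is shown on a small toy input rather than on the example data.

```r
# Variant of ff (illustrative sketch): returns NA instead of NaN
# when no salary falls inside a window.
ff_na = function(salary, wSz, wSt, arg1, arg2)
{
    froms = wSt * (0:ceiling(max(salary) / wSt))
    tos = froms + wSz
    Ds = vapply(seq_along(froms), function(i) {
        inds = salary > froms[i] & salary < tos[i]   # same strict bounds as ff
        if (any(inds)) sum(arg1[inds]) / sum(arg2[inds]) else NA_real_
    }, numeric(1))
    list(from = froms, to = tos, D = Ds)
}

# toy input: three salaries, windows of width 20 sliding by 10
res = ff_na(c(5, 15, 25), 20, 10, arg1 = c(1, 2, 3), arg2 = c(1, 1, 2))
res$D
# [1] 1.500000 1.666667 1.500000       NA
```

Since ff_na takes the same arguments as ff, it should be usable wherever ff is called on a group.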

Compute on the groups with, for example, data.table:

library(data.table)
dt2 = as.data.table(dt)
ans = dt2[, ff(salary, 70000, 30000, arg1, arg2), by = c("country", "city")]
head(ans, 10)
#    country city  from     to          D
# 1:       1    1     0  70000 -0.2366040
# 2:       1    1 30000 100000  0.4394620
# 3:       1    1 60000 130000  0.2838260
# 4:       1    1 90000 160000        NaN
# 5:       1    2     0  70000  1.8112196
# 6:       1    2 30000 100000  0.6134090
# 7:       1    2 60000 130000  0.5959344
# 8:       1    2 90000 160000        NaN
# 9:       1    3     0  70000  1.3216255
#10:       1    3 30000 100000  1.8812397

This is a faster equivalent of the following base R code:

lapply(split(dt[-c(1, 2)], interaction(dt$country, dt$city, drop = TRUE)),
       function(x) as.data.frame(ff(x$salary, 70000, 30000, x$arg1, x$arg2)))
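
Since the question stresses performance and salary is already sorted within each city, the per-window subsetting could probably be replaced by cumulative sums combined with findInterval. The following is only a sketch under that sortedness assumption and has not been benchmarked here; ff_fast is an illustrative name, left.open needs R >= 3.3.0, and differences of cumulative sums can deviate from direct sums by floating-point rounding.

```r
# Sketch: same windows and strict bounds as ff, but roughly
# O(n + number of windows) work per group.
# ASSUMES salary is sorted in increasing order.
ff_fast = function(salary, wSz, wSt, arg1, arg2)
{
    froms = wSt * (0:ceiling(max(salary) / wSt))
    tos = froms + wSz
    lo = findInterval(froms, salary)                  # number of salaries <= from
    hi = findInterval(tos, salary, left.open = TRUE)  # number of salaries <  to
    c1 = c(0, cumsum(arg1))                           # c1[k + 1] = sum(arg1[1:k])
    c2 = c(0, cumsum(arg2))
    D = (c1[hi + 1] - c1[lo + 1]) / (c2[hi + 1] - c2[lo + 1])
    D[hi <= lo] = NA                                  # empty window
    list(from = froms, to = tos, D = D)
}

# toy input: three salaries, windows of width 20 sliding by 10
res_fast = ff_fast(c(5, 15, 25), 20, 10, arg1 = c(1, 2, 3), arg2 = c(1, 1, 2))
res_fast$D
```

The signature matches ff, so it could be swapped into the grouped call to compare timings and results on real data.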
