0
votes

Description of the data

My data.frame represents the salary of people living in different cities (city) in different countries (country). city names, country names and salaries are integers. In my data.frame, the variable country is ordered, the variable city is ordered within each country and the variable salary is ordered within each city (and country). There are two additional columns called arg1 and arg2, which contain floats/doubles.

Goal

For each country and each city, I want to consider a window of size WindowSize of salaries and calculate D = sum(arg1)/sum(arg2) over this window. Then, the window slide by WindowStep and D should be recalculated and so on. For example, let's consider a WindowSize = 1000 and WindowStep = 10. Within each country and within each city, I would like to get D for the range of salaries between 0 and 1000 and for the range between 10 and 1010 and for the range 20 and 1020, etc...

At the end the output should be a data.frame associating a D statistic to each window. If a given window has no entry (for example nobody has a salary between 20 and 1020 in country 1, city 3), then the D statistic should be NA.

Note on performance

I will have to run this algorithm about 10000 times on pretty big tables (that have nothing to do with countries, cities and salaries; I don't yet have a good estimate of the size of these tables), so performance is of concern.

Example data

set.seed(84)
country = rep(1:3, c(30, 22, 51))
city = c(rep(1:5, c(5,5,5,5,10)), rep(1:5, c(1,1,10,8,2)), rep(c(1,3,4,5), c(20, 7, 3, 21)))
tt = paste0(city, country)
salary = c()
for (i in unique(tt)) salary = append(salary, sort(round(runif(sum(tt==i), 0,100000))))

arg1 = rnorm(length(country), 1, 1)
arg2 = rnorm(length(country), 1, 1)
dt = data.frame(country = country, city = city, salary = salary, arg1 = arg1, arg2 = arg2)
head(dim)
  country city salary       arg1        arg2
1       1    1  22791 -1.4606212  1.07084528
2       1    1  34598  0.9244679  1.19519158
3       1    1  76411  0.8288587  0.86737330
4       1    1  76790  1.3013056  0.07380115
5       1    1  87297 -1.4021137  1.62395596
6       1    2  12581  1.3062181 -1.03360620

With this example, if windowSize = 70000 and windowStep = 30000, the first values of D are -0.236604 and 0.439462 which are the results of sum(dt$arg1[1:2])/sum(dt$arg2[1:2]) and sum(dt$arg1[2:5])/sum(dt$arg2[2:5]), respectively.

3

3 Answers

3
votes

Unless I've misunderstood something, the following might be helpful.

Define a simple function regardless of hierarchical groupings:

ff = function(salary, wSz, wSt, arg1, arg2) 
{
    froms = (wSt * (0:ceiling(max(salary) / wSt)))
    tos = froms + wSz
    Ds = mapply(function(from, to, salaries, args1, args2) {
                  inds = salaries > from & salaries < to
                  sum(args1[inds]) / sum(args2[inds])
                },          
                from = froms, to = tos, 
                MoreArgs = list(salaries = salary, args1 = arg1, args2 = arg2))
    list(from = froms, to = tos, D = Ds)                
}

Compute on the groups with, for example, data.table:

library(data.table)
dt2 = as.data.table(dt)
ans = dt2[, ff(salary, 70000, 30000, arg1, arg2), by = c("country", "city")]
head(ans, 10)
#    country city  from     to          D
# 1:       1    1     0  70000 -0.2366040
# 2:       1    1 30000 100000  0.4394620
# 3:       1    1 60000 130000  0.2838260
# 4:       1    1 90000 160000        NaN
# 5:       1    2     0  70000  1.8112196
# 6:       1    2 30000 100000  0.6134090
# 7:       1    2 60000 130000  0.5959344
# 8:       1    2 90000 160000        NaN
# 9:       1    3     0  70000  1.3216255
#10:       1    3 30000 100000  1.8812397

I.e. a faster equivalent of

lapply(split(dt[-c(1, 2)], interaction(dt$country, dt$city, drop = TRUE)),
       function(x) as.data.frame(ff(x$salary, 70000, 30000, x$arg1, x$arg2)))
1
votes

Without your expected outcome it is a bit hard to guess whether my result is correct but it should give you a head start for the first step. From a performance point of view the data.table package is very fast. Much faster than loops.

set.seed(84)
country <- rep(1:3, c(30, 22, 51))
city <- c(rep(1:5, c(5,5,5,5,10)), rep(1:5, c(1,1,10,8,2)), rep(c(1,3,4,5), c(20, 7, 3, 21)))
tt <- paste0(city, country)
salary <- c()
for (i in unique(tt)) salary <- append(salary, sort(round(runif(sum(tt==i), 0,100000))))

arg1 <- rnorm(length(country), 1, 1)
arg2 <- rnorm(length(country), 1, 1)
dt <- data.frame(country = country, city = city, salary = salary, arg1 = arg1, arg2 = arg2)
head(dt)

# For data table
require(data.table)
# For rollapply
require(zoo)
setDT(dt)

WindowSize <- 10
WindowStep <- 3
dt[, .(D = (rollapply(arg1, width = WindowSize, FUN = sum, by = WindowStep) / 
            rollapply(arg2, width = WindowSize, FUN = sum, by = WindowStep)), 
       by = list(country = country, city = city))]

You can achieve the latter part of your goal by melting the data and doing and writing a custom summary function that you use to dcast your data together again.

0
votes
Table = NULL
StepNumber = 100
WindowSize = 1000
WindowRange = c(0,WindowSize)
WindowStep = 100
for(x in dt$country){
     #subset of data for that country
     CountrySubset = dt[dt$country == x,,drop=F]
     for(y in CountrySubset$city){
        #subset of data for citys within country
        CitySubset = CountrySubset[CountrySubset$city == y,,drop=F]
        for(z in 1:StepNumber){
            WinRange = WindowRange + (z*WindowStep)
            #subset of salarys within country of city via windowRange
            WindowData = subset(CitySubset, salary > WinRange[1] & salary < WinRange[2])
            CalcD = sum(WindowData$arg1)/sum(WindowData$arg2)
            Output = c(Country = x, City = y, WinStart = WinRange[1], WinEnd = WinRange[2], D = CalcD)
            Table = rbind(Table,Output)

        }
    }
}

Using your example code this should work, its just a series of nested loops that will write to Table. It does however duplicate a line every now and then because the only way I know to keep adding results to a table is rbind.

So if someone can alter this to fix that. Should be good.

WindowStep is the difference between each consecutive WindowSize you want.

StepNumber is how many steps you want to take in total, might be best to find out what the maximum salary is and then adjust for that.