
Currently I have the following data.table:

item   city     dummyvar
A      Austin          1
A      Austin          1
A      Austin        100
B      Austin          2
B      Austin          2
B      Austin        200
A      NY              1
A      NY              1
A      NY            100
B      NY              2
B      NY              2
B      NY            200
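
For reproducibility, the example table can be constructed like this (the exact column types are an assumption based on the display):

library(data.table)

my.data.table <- data.table(
  item     = rep(rep(c("A", "B"), each = 3), times = 2),
  city     = rep(c("Austin", "NY"), each = 6),
  dummyvar = rep(c(1, 1, 100, 2, 2, 200), times = 2)
)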

and I have a user-defined function called ImbalancePoints, which is applied to dummyvar and returns the rows where it detects an abrupt change. The way I am doing this is as follows:

my.data.table[,
 .(item, city , imb.points = list(unique(try(ImbalancePoints(dummyvar), silent = T))) ),
 by = .(city, item)
]

For the NY case, let's say I get a data.table object like the following:

 item   city   imb.points
   A     NY     3,449

where imb.points is a list column whose elements are nested lists; for this example the numbers 3 and 449 denote the rows where there is an abrupt change for city = NY and item = A. The problem I am facing is that I have approx. 3000 different items across 12 different cities, so the calculation takes a long time. I was wondering if you could give me an idea of how to vectorize/speed up this calculation, since the last time I attempted it, it ran for almost 2 hours and did not finish.

I don't know if it's of any help, but I am also attaching the ImbalancePoints function:

library(pracma)

ImbalancePr <- function(eval.column) {
  n <- length(eval.column)
  imbalance <- rep(0, n)
  b_t = rep(0,n)
  elem_diff <- diff(eval.column)
  for(i in 2:n)
  {
    imbalance[i] <- sign(elem_diff[i-1]) * (elem_diff[i-1] != 0)
    + imbalance[i-1]*(elem_diff[i-1] == 0)
  }
  return(imbalance)
}

ImbalancePoints <- function(eval.column, w0 = 100, bkw_T = 10, bkw_b = 10){
  bv_t <- ImbalancePr(eval.column)
  w0 <- min(min(which(cumsum(bv_t) != 0)), w0)
  Tstar <- w0
  E0t <- Tstar
  repeat{
    Tlast <- sum(Tstar)
    nbt <- min(bkw_b, Tlast-1)
    P <- pracma::movavg(bv_t[1:Tlast], n = nbt, type = "e")
    P <- tail(P,1)
    bv_t_expected <- E0t * abs(P)
    bv_t_cumsum <- abs(cumsum(bv_t[-(1:Tlast)]))
    if(max(bv_t_cumsum) < bv_t_expected){break}else{
      Tnew <- min(which(bv_t_cumsum >= bv_t_expected))
    }
    Tlast <- Tlast + Tnew
    if(Tlast > length(eval.column)[1]){break}else{
      Tstar <- c(Tstar,Tnew)
      if(length(Tstar) <= 2){
        E0t <- mean(Tstar)
      }else{
        nt <- min(bkw_T,length(Tstar)-1)
        E0t <- pracma::movavg(Tstar[1:length(Tstar)], n = nt, type = "e")
        E0t <- tail(E0t,1)
      }
    }
  }
  return(sort(unique(Tstar)))
}

EDIT: Thanks to Paul's insight, my problem reduces to vectorizing the repeat loop inside the ImbalancePoints function. However, I am not very proficient at coding and I don't see a straightforward solution to it. If someone could give me a suggestion, or if you know of an auxiliary function/library, I would appreciate it.

Is it OK if you explain in words how an imbalance point is defined? – chinsoon12
Looking at the number of loops in your code, I can see why this may be slow. You call ImbalancePoints for each item/city combination in the data. This works out to at least 3000 times the number of cities, i.e. 36,000 calls. Each of these calls invokes ImbalancePr, which loops n times through the column, so the work is on the order of 36,000 * 36,000 = 1,296,000,000 loop iterations. No wonder it is slow. Your repeat loop will make this worse. – Paul van Oppen
@chinsoon12 Without entering too much into the math details: it performs an exponentially weighted average, and when that average exceeds a threshold it detects the point where the imbalance occurs. – Oliver

1 Answer


This posting consists of several sections addressing different issues:

  • Vectorizing ImbalancePr()
  • Profiling ImbalancePoints()
  • Speeding-up movavg() with Rcpp by a factor of 4

Vectorizing ImbalancePr()

I believe ImbalancePr() can be replaced by

fImbalancePr <- function(x) c(0, sign(diff(x)))
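
(In the original ImbalancePr(), the continuation line starting with + is parsed as a separate expression, so the carry-forward term never contributes; that is why the one-liner gives the same result.) A quick spot check on a toy vector (values chosen arbitrarily):

x <- c(1, 1, 100, 2, 2, 200)
identical(ImbalancePr(x), fImbalancePr(x))
# expected TRUE: both return c(0, 0, 1, -1, 0, 1) for this input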

It also returns the same result when benchmarked with a check of results:

library(bench)
library(ggplot2)
bm <- press(
  n = c(10, 100, 1000, 10000),
  {
    x <- rep(0, n)
    set.seed(123)
    x[sample(n, n/5)] <- 100
    print(table(x))
    mark(
      ImbalancePr(x),
      fImbalancePr(x)
    )
  }
  
)
Running with:
      n
1    10
x
  0 100 
  8   2 
2   100
x
  0 100 
 80  20 
3  1000
x
  0 100 
800 200 
4 10000
x
   0  100 
8000 2000
autoplot(bm)

[Plot: autoplot(bm), benchmark timings of ImbalancePr() vs. fImbalancePr() across vector lengths]

fImbalancePr() is always faster than the OP's original version, and the speed advantage increases with vector length.

Profiling ImbalancePoints()

However, this improvement does not have much impact on the overall performance of ImbalancePoints():

library(bench)
library(ggplot2)
bm <- press(
  n = c(10L, 100L, 1000L),
  {
    x <- replace(rep(0, n), n, 100)
    y <- c(rep(2, n), rep(-3, n), rep(5, n))
    mark(
      original = {
        list(
          ImbalancePoints(x),
          ImbalancePoints(y)
        )
      },
      modified = {
        list(
          fImbalancePoints(x),
          fImbalancePoints(y)
        )
      }
    )
  }
  
)

fImbalancePoints() is a variant of ImbalancePoints() where ImbalancePr() has been replaced by fImbalancePr().

autoplot(bm)

[Plot: autoplot(bm), benchmark timings of ImbalancePoints() vs. fImbalancePoints()]

There is a minor improvement, but it does not significantly cut down the reported execution time of 2 hours.

We can use profvis to identify where the time is spent within ImbalancePoints():

library(profvis)
x <- c(rep(0, 480L), rep(c(0:9, 9:0), 2L), rep(0, 480L))
profvis({
  for (i in 1:100) fImbalancePoints(x)
})

Timings are collected by sampling; therefore, a sufficient number of repetitions is required to get good coverage.

The results from one run are shown in this screenshot from RStudio:

[Screenshot: profvis profile of fImbalancePoints() in RStudio]

  • movavg() consumes 25% of the time spent in ImbalancePoints().
  • According to the profiling, another 20% is spent on the double-colon operator in pracma::movavg(). It might be worthwhile to test whether loading the pracma package beforehand with library(pracma) gives a speed-up.
  • 10% is spent in calls to tail(). tail(x, 1) can be replaced by x[length(x)], which is more than an order of magnitude faster (a micro-benchmark sketch of these last two points follows this list).
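
A rough micro-benchmark sketch of the last two points (the vector and window size are arbitrary; absolute timings will differ per machine):

library(bench)
library(pracma)   # loaded once, so movavg() resolves without the `::` lookup

x <- rnorm(1000)

# tail() vs. direct indexing of the last element
mark(tail(x, 1), x[length(x)])

# movavg() with and without the double-colon lookup
mark(
  pracma::movavg(x, n = 10, type = "e"),
  movavg(x, n = 10, type = "e")
)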

If we look at the code of movavg() by typing pracma::movavg (without parentheses), we see that it contains an iterative loop which cannot be vectorized:

...
else if (type == "e") {
    a <- 2/(n + 1)
    y[1] <- x[1]
    for (k in 2:nx) y[k] <- a * x[k] + (1 - a) * y[k - 1]
}
...

In addition, only the last value of the time series created by the call to movavg() is used. So there are two options for performance improvement here:

  • Choose a different weighted-means function which uses only data points within a limited window (a rough sketch follows this list).
  • Re-implement movavg() in C++ using Rcpp.
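
The rest of this answer pursues the second option. For completeness, a purely illustrative sketch of the first one (a window-limited mean; this is not equivalent to the exponential weighting and would have to be validated against the original logic before use):

# Hypothetical replacement averaging only the last k points;
# k = 10 mirrors the bkw_* defaults but is an arbitrary choice.
last_window_mean <- function(x, k = 10) mean(tail(x, k))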

Speeding-up movavg() with Rcpp

By replacing the call to pracma::movavg() and the subsequent call to tail() with an Rcpp function, we can gain a speed-up of up to a factor of 4 for ImbalancePoints() overall.

EMA_last_cpp(x, n) replaces tail(pracma::movavg(x, n, type = "e"), 1):

library(Rcpp)
cppFunction("
double EMA_last_cpp(const NumericVector& x, const int n) {
  int nx = x.size(); 
  double a = 2.0 / (n + 1.0);
  double b = 1.0 - a;
  double y;

  y = x[0];
  for(int k = 1; k < nx; k++){
    y = a * x[k] + b * y;
  }
  
  return y;
}
")

Now, we can modify ImbalancePoints() accordingly. In addition, the call to ImbalancePr() is replaced and the code is modified in two other places (see comments):

fImbalancePoints <-
  function(eval.column, 
           w0 = 100,
           bkw_T = 10,
           bkw_b = 10) {
    # bv_t <- ImbalancePr(eval.column)
    bv_t <- c(0, sign(diff(eval.column)))
    # w0 <- min(min(which(cumsum(bv_t) != 0)), w0)
    w0 <- min(which(bv_t != 0)[1L], w0) # pick first change point
    Tstar <- w0
    E0t <- Tstar
    repeat {
      Tlast <- sum(Tstar)
      # remove warning: 
      # In max(bv_t_cumsum) : no non-missing arguments to max; returning -Inf
      if (Tlast >= length(bv_t)) break
      nbt <- min(bkw_b, Tlast - 1)
      # P <- movavg(bv_t[1:Tlast], n = nbt, type = "e")
      # P <- tail(P, 1)
      P <- EMA_last_cpp(bv_t[1:Tlast], n = nbt)
      bv_t_expected <- E0t * abs(P)
      bv_t_cumsum <- abs(cumsum(bv_t[-(1:Tlast)]))
      if (max(bv_t_cumsum) < bv_t_expected) {
        break
      } else{
        Tnew <- min(which(bv_t_cumsum >= bv_t_expected))
      }
      Tlast <- Tlast + Tnew
      if (Tlast > length(eval.column)[1]) {
        break
      } else{
        Tstar <- c(Tstar, Tnew)
        if (length(Tstar) <= 2) {
          E0t <- mean(Tstar)
        } else{
          nt <- min(bkw_T, length(Tstar) - 1)
          # E0t <- movavg(Tstar[1:length(Tstar)], n = nt, type = "e")
          # E0t <- tail(E0t, 1)
          E0t <- EMA_last_cpp(Tstar[1:length(Tstar)], n = nt)
        }
      }
    }
    return(sort(unique(Tstar)))
  }
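
As a sketch (not tested on the OP's data), the faster function could then be plugged into the grouped call from the question, with tryCatch() standing in for the original try() guard:

library(data.table)

result <- my.data.table[
  , .(imb.points = list(unique(
      tryCatch(fImbalancePoints(dummyvar), error = function(e) NULL)
    ))),
  by = .(city, item)
]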

The benchmark

library(bench)
library(ggplot2)
bm <- press(
  n = c(10L, 100L, 1000L),
  {
    x <- replace(rep(0, n), n, 100)
    y <- c(rep(2, n), rep(-3, n), rep(5, n))
    mark(
      original = {
        list(
          ImbalancePoints(x),
          ImbalancePoints(y)
        )
      },
      modified = {
        list(
          fImbalancePoints(x),
          fImbalancePoints(y)
        )
      },
      min_time = 1
    )
  }
)
bm
  expression     n     min median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time
  <bch:expr> <int> <bch:t> <bch:>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm>
1 original      10 315.1us  369us   2318.      2.66KB     4.16  2231     4   962.49ms
2 modified      10   120us  136us   6092.    195.11KB     5.31  5733     5   940.99ms
3 original     100 583.4us  671us   1283.     55.09KB     4.16  1234     4   961.78ms
4 modified     100 145.4us  167us   5146.     47.68KB     4.19  4916     4   955.29ms
5 original    1000 438.1ms  469ms      2.17  157.37MB     4.33     3     6      1.38s
6 modified    1000  97.1ms  103ms      9.53  152.09MB    17.1     10    18      1.05s

shows that the modified version is about 3 to 5 times faster than the original. This may help the OP reduce the compute time for their production dataset of 2+ hours by a significant factor.