1
votes

I have a matrix of size 10000 x 100 and a vector of length 100. I'd like to apply a custom function, percentile, which takes in a vector argument and a scalar argument, to each column of the matrix such that on iteration j, the arguments used with percentile are column j of the matrix and entry j of the vector. Is there a way to use one of the apply functions to do this?

Here's my code. It runs, but doesn't return the correct result.

percentile <- function(x, v){
  length(x[x <= v]) / length(x)
}

X <- matrix(runif(10000 * 100), nrow = 10000, ncol = 100)
y <- runif(100)
result <- apply(X, 2, percentile, v = y)

The workaround that I've been using has been to just append y to X, and re-write the percentile function, as shown below.

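X2 <- rbind(X, y)  # keep X intact; the last row of X2 holds the cutoffs
percentile2 <- function(x){
  v <- x[length(x)]   # last entry is the cutoff for this column
  x <- x[-length(x)]  # remaining entries are the data
  length(x[x <= v]) / length(x)
}
result <- apply(X2, 2, percentile2)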

This code does return the correct result, but I would prefer something a bit more elegant.

2
Hi, welcome to SO. Since you are quite new here, you might want to read the about and FAQ sections of the website to help you get the most out of it. If an answer does solve your problem you may want to consider upvoting and/or marking it as accepted to show the question has been answered, by ticking the little green check mark next to the suitable answer. You are not obliged to do this, but it helps keep the site clean of unanswered questions and rewards those who take the time to solve your problem. - Simon O'Hanlon

2 Answers

2
votes

Because R is vectorised, if you know the right functions you can avoid loops entirely and do the whole thing in one relatively simple line:

 colSums(  t( t( X ) <= y ) ) / nrow( X ) 

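R recycles a vector against a matrix in column-major order, so in X <= y the elements of y would be matched with positions running down each column of X, not with the columns themselves. Transposing with t turns each column of X into a row, so in t(X) <= y the j-th element of y is compared with every value from column j of X (this works because length(y) equals nrow(t(X))). We then transpose back again to restore the original orientation.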
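To make the recycling direction concrete, here is a minimal sketch on a tiny made-up matrix. The names M and cutoffs are illustrative only and do not come from the question.

```r
# Small illustrative matrix: 3 rows, 2 columns (filled column by column)
M <- matrix(c(1, 5, 9,
              2, 6, 10), nrow = 3, ncol = 2)
cutoffs <- c(5, 6)   # one cutoff per column of M

# Without transposing, cutoffs is recycled down each column of M,
# so the pairing does not follow the columns
naive <- M <= cutoffs

# After transposing, each original column is a row, and cutoffs (length 2)
# is recycled down the 2 rows: column j of M is compared with cutoffs[j]
by_column <- t(M) <= cutoffs

# Transpose back to the original shape, then count TRUEs per column
hits <- t(by_column)
proportions <- colSums(hits) / nrow(M)
proportions
```

Comparing naive with hits shows that the two comparisons disagree, which is why the transpose is needed.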

Since TRUE and FALSE evaluate to 1 and 0 respectively, colSums gives the number of rows in each column that met the condition. Dividing by the total number of rows (a single value, recycled across all the column counts) turns those counts into proportions. The result is exactly the same as the workaround in the question:

X2 <- rbind(X, y)  # matrix used by the question's workaround
res1 <- apply(X2, 2, percentile2)
res2 <- colSums(  t( t( X ) <= y ) ) / nrow( X )
identical( res1 , res2 )
[1] TRUE

Because this doesn't use any R-level loops, it's a lot quicker (roughly 10 times on this small matrix).

Even better would be to use rowMeans like this (thanks to @flodel):

     rowMeans(  t(X) <= y  ) 
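The rowMeans form works because the mean of a logical vector is the proportion of TRUE values, and each row of t(X) is one column of X. A small sketch, using made-up values (M and cutoffs are hypothetical names), checks that the two forms agree:

```r
M <- matrix(c(1, 5, 9,
              2, 6, 10), nrow = 3, ncol = 2)
cutoffs <- c(5, 6)

# Count-and-divide form from the start of this answer
via_colsums <- colSums(t(t(M) <= cutoffs)) / nrow(M)

# Shorter form: average the logical values in each row of t(M)
via_rowmeans <- rowMeans(t(M) <= cutoffs)

all.equal(via_colsums, via_rowmeans)
```
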
2
votes

I think the easiest and clearest way is to use a for loop:

result2 <- numeric(ncol(X))
for (i in seq_len(ncol(X))) {
  result2[i] <- sum(X[,i] <= y[i])
}
result2 <- result2 / nrow(X)

The fastest and shortest solution I can think of is:

result1 <- rowSums(t(X) <= y) / nrow(X)

SimonO101's answer explains how this works. As I said, it is fast. The disadvantage is that it is less clear what exactly is being calculated, although you could solve this by placing this piece of code in a well-named function.
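As a rough sketch of what such a wrapper might look like (the name column_percentile and the input check are my own assumptions, not something from the question):

```r
# Hypothetical helper: for each column j of `mat`, the proportion of
# values that are <= cutoffs[j]
column_percentile <- function(mat, cutoffs) {
  # The recycling trick is only correct with one cutoff per column
  stopifnot(is.matrix(mat), length(cutoffs) == ncol(mat))
  rowSums(t(mat) <= cutoffs) / nrow(mat)
}

# Small stand-in data for illustration
set.seed(1)
X <- matrix(runif(1000 * 5), nrow = 1000, ncol = 5)
y <- runif(5)
result <- column_percentile(X, y)
```

The length check matters because, with a mismatched cutoff vector, R would silently recycle it and return a plausible-looking but wrong answer.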

flodel also suggested a solution using mapply, which is an apply variant that can work on multiple vectors in parallel. For that to work, however, you first need to put each of the columns of your matrix in a list or data.frame:

result3 <- mapply(percentile, as.data.frame(X), y)
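If converting the matrix to a data frame feels indirect, one possible alternative is to iterate over the column indices with vapply. This is only a sketch on small stand-in data; percentile is the function from the question, and the names via_mapply and via_vapply are made up for the comparison.

```r
percentile <- function(x, v) {
  length(x[x <= v]) / length(x)
}

set.seed(42)
X <- matrix(runif(200 * 4), nrow = 200, ncol = 4)
y <- runif(4)

# mapply as above: the data frame's columns are paired with entries of y
via_mapply <- mapply(percentile, as.data.frame(X), y)

# Alternative: loop over column indices; numeric(1) declares that each
# call must return a single number
via_vapply <- vapply(seq_len(ncol(X)),
                     function(j) percentile(X[, j], y[j]),
                     numeric(1))

# mapply keeps the data frame's column names, so drop them before comparing
all.equal(unname(via_mapply), via_vapply)
```
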

Speed-wise (see the benchmarks below), the for loop does not do badly: it is faster than apply, in this case at least. The rowSums trick with vector recycling is the fastest, over 10 times as fast as the solution using apply.

> X <- matrix(rnorm(10000 * 100), nrow = 10000, ncol = 100)
> y <- runif(100)
> 
> system.time({result1 <- rowSums(t(X) <= y) / nrow(X)})
   user  system elapsed 
  0.020   0.000   0.018 
> 
> system.time({
+   X2 <- rbind(X, y)
+   percentile2 <- function(x){
+     v <- x[length(x)]
+     x <- x[-length(x)]
+     length(x[x <= v]) / length(x)
+   }
+   result <- apply(X2, 2, percentile2)
+ })
   user  system elapsed 
  0.252   0.000   0.249 
> 
> 
> system.time({
+   result2 <- numeric(ncol(X))
+   for (i in seq_len(ncol(X))) {
+     result2[i] <- sum(X[,i] <= y[i])
+   }
+   result2 <- result2 / nrow(X)
+ })
   user  system elapsed 
  0.024   0.000   0.024 
>
> system.time({
+   result3 <- mapply(percentile, as.data.frame(X), y)
+ })
   user  system elapsed 
  0.076   0.000   0.073 
>
> all(result2 == result1)
[1] TRUE
> all(result2 == result)
[1] TRUE
> all(result3 == result)
[1] TRUE
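The == checks above pass here because every method divides the same integer counts by the same number of rows. As a more defensive habit when comparing floating-point results, all.equal compares within a small tolerance. A minimal sketch on smaller stand-in data:

```r
set.seed(7)
X <- matrix(rnorm(500 * 3), nrow = 500, ncol = 3)
y <- runif(3)

# Vectorised version
result1 <- rowSums(t(X) <= y) / nrow(X)

# For-loop version
result2 <- numeric(ncol(X))
for (i in seq_len(ncol(X))) {
  result2[i] <- sum(X[, i] <= y[i])
}
result2 <- result2 / nrow(X)

# TRUE if the two results agree up to numerical tolerance
isTRUE(all.equal(result1, result2))
```
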