sapply on data.frame vs vector (colnames, rownames)

Question

I have the following function:

func <- function(scores, labels, thresholds) {
  labels <- if (is.data.frame(labels)) labels else data.frame(labels)
  sapply(thresholds, function(t) { sapply(labels, function(lbl) { sum(lbl[which(scores >= t)]) }) })
}

I also have the following that I'll pass into func.

> scores
[1] 0.187 0.975 0.566 0.793 0.524 0.481 0.005 0.756 0.062 0.124

> thresholds
[1] 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0

> var1
[1] 1 1 0 0 0 1 0 1 1 1

> df
   var1 var2
1     1    0
2     1    1
3     0    0
4     0    0
5     0    0
6     1    1
7     0    1
8     1    1
9     1    1
10    1    0

Here are two different calls to func, one with labels as a vector, and the other with labels as a data.frame:

> func(scores, var1, thresholds)
labels labels labels labels labels labels labels labels labels labels labels 
 6      5      3      3      3      2      2      2      1      1      0 

> func(scores, df, thresholds)
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11]
var1    6    5    3    3    3    2    2    2    1     1     0
var2    5    3    3    3    3    2    2    2    1     1     0

Why does "labels" get applied as a colname in the vector version, and "var1" and "var2" get applied as a rowname in the data.frame version?

What I'm looking for is the vector version to be more like:

> func(scores, var1, thresholds)
           [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11]
    labels    6    5    3    3    3    2    2    2    1     1     0

To create the variables above:

scores <- sample(seq(0, 1, 0.001), 10, replace = TRUE)
thresholds <- seq(0, 1, 0.1)
var1 <- sample(c(0, 1), 10, replace = TRUE)
var2 <- sample(c(0, 1), 10, replace = TRUE)
df <- data.frame(var1, var2)

If you want to convert labels to a data.frame use as.data.frame instead and see if that helps — Carles Mitjans
Thank you for including your data in this question, but the format is a little hard to read in. When you have your variables, rather than showing the plain console output, show the output of dput(varName) or simply something like scores <- c(0.187, 0.975, 0.566, 0.793, 0.524, 0.481, 0.005, 0.756, 0.062, 0.124). This makes it easier to replicate your problem and find a solution. — Barker
@CarlesMitjans thanks for the suggestion; I tried it, but got the same result. — user451151
@Barker good point -- updating the post now to include :) FYI, don't worry about the set.seed -- the actual values are irrelevant here. — user451151
Also, I believe the which is unnecessary. I can just do sapply(thresholds, function(t) { sapply(labels, function(lbl) { sum(lbl[scores >= t]) }) }) — user451151

Weihuang Wong · Accepted Answer · 2016-12-24T02:38:16

Try switching the order of the nested sapplys:

func <- function(scores, labels, thresholds) {
  labels <- if (is.data.frame(labels)) labels else data.frame(labels)
  t(sapply(labels, function(lbl) { 
    sapply(thresholds, function(t) sum(lbl[which(scores >= t)]))
  }))
}

From ?sapply:

‘sapply’ is a user-friendly version and wrapper of ‘lapply’ by default returning a vector, matrix or, if ‘simplify = "array"’, an array if appropriate, by applying ‘simplify2array()’.

To understand what's going on in your original function, it's perhaps useful to think about each sapply in turn.

The inner sapply(labels, ...) creates a named vector of length k (where k is the number of columns in labels -- so k is 1 in the vector case, and 2 in the dataframe example), where the names of the vector elements are given by the column names (labels in the vector case, and var1/var2 in the dataframe example).

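The outer sapply(thresholds, ...) runs the inner sapply 11 times, each time with a different value of t. So in the vector case, you'll end up with 11 vectors of length 1 where the name of the one and only element in each vector is labels, which sapply returns ("simplifies") as one vector of length 11, with labels repeated as the name of every element. In the dataframe example, you'll end up with 11 vectors of length 2, which sapply simplifies into a matrix with 2 rows and 11 columns, so var1 and var2 become the row names.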
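The simplification step can be seen in isolation with a small sketch. It uses toy values rather than the data from the question, and the names one_each and two_each are made up for illustration:

```r
# Eleven results of length 1, each named "labels": sapply collapses them
# into a plain named vector, so the same name is repeated for every element.
one_each <- sapply(1:11, function(i) c(labels = i))
names(one_each)     # "labels", eleven times
dim(one_each)       # NULL, i.e. a vector, not a matrix

# Eleven results of length 2: sapply binds them as the columns of a matrix,
# so the element names become row names.
two_each <- sapply(1:11, function(i) c(var1 = i, var2 = i * 2))
dim(two_each)       # 2 rows, 11 columns
rownames(two_each)  # "var1" "var2"
```

The only difference between the two calls is the length of each individual result, which is what decides whether sapply hands back a vector or a matrix.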

By switching the order of the sapplys, the inner sapply now returns an unnamed vector of length 11. The outer sapply then does this k times, once per column of labels, and binds the results as the columns of an 11 x k matrix whose column names are the column names of labels: labels in the vector case (k is 1), and var1 and var2 in the dataframe example (k is 2). The surrounding t() transposes that matrix so those names end up as row names.
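As a rough check, here is the swapped version run on the values printed in the question. Two details are my own choices rather than part of the answer: the threshold argument is called th so it does not share a name with t(), and which() is dropped, as the asker suggests in the comments.

```r
# Values copied from the console output in the question.
scores <- c(0.187, 0.975, 0.566, 0.793, 0.524, 0.481, 0.005, 0.756, 0.062, 0.124)
thresholds <- seq(0, 1, 0.1)
var1 <- c(1, 1, 0, 0, 0, 1, 0, 1, 1, 1)
df <- data.frame(var1, var2 = c(0, 1, 0, 0, 0, 1, 1, 1, 1, 0))

func <- function(scores, labels, thresholds) {
  labels <- if (is.data.frame(labels)) labels else data.frame(labels)
  # Outer loop: one column of labels at a time.
  # Inner loop: one sum per threshold for that column.
  per_column <- sapply(labels, function(lbl) {
    sapply(thresholds, function(th) sum(lbl[scores >= th]))
  })
  # per_column has one row per threshold and one column per label column;
  # transpose so each label column becomes a row.
  t(per_column)
}

res_vec <- func(scores, var1, thresholds)
res_df  <- func(scores, df, thresholds)

dim(res_vec)       # 1 row, 11 columns
rownames(res_vec)  # "labels"
rownames(res_df)   # "var1" "var2"
```

Both calls now return a matrix with one row per label column, so the vector case has the same shape as the data.frame case.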

(It might also be a useful exercise to name the elements in the thresholds vector; e.g. thresholds <- setNames(seq(0, 1, 0.1), LETTERS[1:11]) and re-run func to see what happens.)
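For anyone who wants to try that exercise without retyping everything, here is a self-contained sketch using the swapped order and the vector case only. The names named_thresholds, labels_df and swapped are made up for illustration:

```r
# Values copied from the console output in the question.
scores <- c(0.187, 0.975, 0.566, 0.793, 0.524, 0.481, 0.005, 0.756, 0.062, 0.124)
var1 <- c(1, 1, 0, 0, 0, 1, 0, 1, 1, 1)

# Same thresholds as before, but each element now carries a name A..K.
named_thresholds <- setNames(seq(0, 1, 0.1), LETTERS[1:11])

labels_df <- data.frame(labels = var1)
swapped <- t(sapply(labels_df, function(lbl) {
  # sapply keeps the names of the vector it iterates over, so the
  # result of this inner call is named A..K.
  sapply(named_thresholds, function(th) sum(lbl[scores >= th]))
}))

rownames(swapped)  # "labels"
colnames(swapped)  # "A" through "K"
```

With named thresholds the matrix has names on both dimensions: the label columns along the rows and the threshold names along the columns.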
