2
votes

Let's say I have the following:

myseq <- seq(0, 1, by = 0.1)
scores <- sample(seq(0, 1, by = 0.01), 10)
var1 <- sample(c(0,1), 10, replace = T)
var2 <- sample(c(0,1), 10, replace = T)
mydf <- data.frame(scores = scores, var1 = var1, var2 = var2)

myseq
[1] 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0

mydf
  scores var1 var2
1   0.10    1    0
2   0.06    1    0
3   0.74    0    0
4   0.15    1    0
5   0.40    1    1
6   0.96    0    0
7   0.04    1    0
8   0.71    0    1
9   0.94    1    1
10  0.38    0    0

For each value in myseq, I want to sum var1 and var2 for the subset of records where scores is greater than the value in myseq.

I want to do this only using the apply-family functions (apply, lapply, tapply, sapply, mapply, etc.). In other words, no nested for-loops.

So, for example:

The first value in myseq is 0.0. All scores are greater than 0.0, so I want to return var1 = 6 and var2 = 3.

The second value in myseq is 0.1. Only 7 of the 10 scores are greater than 0.1, so I want to return var1 = 3 and var2 = 3.

...so on and so forth...

In the end, I'd like to the final output to be a 11(r) x 2(c) matrix (or data frame or list) containing the sums for each var.

var1 var2
   6    3
   3    3
   ...
   ...

Note: 11(r) is because the length of myseq is 11; 2(c) is because there are two vars, var1 and var2

5
slight note. Please use set.seed when generating data frames via functions such as rnorm or sample so we can double check our results with your expected output.Sotos
Great point; that's my bad. Thank you for pointing out.user451151

5 Answers

2
votes

Something like this ?

res<-t(sapply(myseq,function(x){apply(mydf[scores>x,2:3],2,sum)}))
2
votes

One idea,

t(sapply(lapply(myseq, function(i) mydf[mydf$scores >= i,-1]), function(j) colSums(j)))
 #       var1 var2
 #[1,]    6    7
 #[2,]    6    7
 #[3,]    6    7
 #[4,]    6    6
 #[5,]    3    4
2
votes

tidyverse solution:

myseq <- seq(0, 1, by = 0.1)
scores <- sample(seq(0, 1, by = 0.01), 10)
var1 <- sample(c(0,1), 10, replace = T)
var2 <- sample(c(0,1), 10, replace = T)
mydf <- data.frame(scores = scores, var1 = var1, var2 = var2)

myseq
##  [1] 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0

mydf
##    scores var1 var2
## 1    0.85    0    0
## 2    0.06    1    0
## 3    0.23    1    1
## 4    0.98    1    1
## 5    0.32    0    1
## 6    0.58    0    0
## 7    0.45    0    0
## 8    0.90    1    1
## 9    0.22    1    1
## 10   0.15    0    0

library(purrr)
library(dplyr)

map_df(myseq, ~filter(mydf, scores>.) %>% summarise_each(funs(sum), -scores))
##    var1 var2
## 1     5    5
## 2     4    5
## 3     4    5
## 4     2    3
## 5     2    2
## 6     2    2
## 7     2    2
## 8     2    2
## 9     2    2
## 10    1    1
## 11    0    0
1
votes

Another alternative to avoid excessive computations:

Order the scores and find the index where every element of "myseq" is larger than "scores":

o = order(mydf$scores)

i = findInterval(myseq, mydf$scores[o])
z = rep_len(0L, sum(!i)) #zeroes to add, later on, because x[0] results in 0-length 

Calculate consecutive sums only once:

csv1 = cumsum(mydf$var1[o]) 
csv2 = cumsum(mydf$var2[o])

Subset the summations appropriately (I used set.seed(1821) to generate the data):

csv1[length(csv1)] - c(z, csv1[i])
# [1] 8 7 6 6 6 5 3 3 2 1 0
csv2[length(csv2)] - c(z, csv2[i])
# [1] 6 5 5 5 5 3 2 2 1 1 0

Since you mention >2 variables, the last operations can be substituted by

sapply(mydf[-1], function(x) { cs = cumsum(x[o]); cs[length(cs)] - c(z, cs[i]) })
0
votes

Using data table you could try:

require(data.table)
set.seed(5)
myseq <- seq(0, 1, by = 0.1)
scores <- sample(seq(0, 1, by = 0.01), 10)
var1 <- sample(c(0,1), 10, replace = T)
var2 <- sample(c(0,1), 10, replace = T)
mydf <- data.frame(scores = scores, var1 = var1, var2 = var2)

setDT(mydf)
result <- t(sapply(myseq, function(x){ mydf[scores > x, lapply(.SD[, -1, with = F], sum)]}))

> result
      var1      var2     
 [1,] 4         4        
 [2,] 4         4        
 [3,] 4         3        
 [4,] 3         3        
 [5,] 3         3        
 [6,] 3         3        
 [7,] 3         3        
 [8,] 3         2        
 [9,] 2         1        
[10,] 1         1        
[11,] Numeric,0 Numeric,0