0
votes

I am trying to find the top cumulative amount of rows of a column up to a specified number. So if I have this data set

df <- data.frame(x = rnorm(26, 10, 2))

and I want the number of rows from the top that sums to 100. I have tried using variations of this:

df %>% top_n(6)

But that does not work. Is there a function that can do this easily that I am missing?

The use case for this is if I have a frequency column that gives me the percentage that an observation appears in a dataset after conducting group_by and desc functions and then I want the top, say 15%, of all observations.

Thanks in advance.

2
Check out cumsum.Chirayu Chamoli
df %>% arrange(desc(x)) %>% filter(cumsum(x) < 100).tchakravarty

2 Answers

0
votes

You could do this using cumsum:

sum(cumsum(df$x)<100)
#[1] 10

Or using Reduce:

sum(Reduce('+', df$x, accumulate = T)<100)
#[1] 10

data

set.seed(100)
df <- data.frame(x = rnorm(26, 10, 2))
0
votes

The trick is to use the cumsum function, then narrow the frame to the values of cumsum < 100 and compute nrow on that result:

This will do it:

df <- data.frame(x = rnorm(26, 10, 2))
df$accum <- cumsum(df$x)
df
nrow(df[df$accum<100,])

Result:

          x     accum
1  10.045870  10.04587
2   6.626029  16.67190
3   9.585552  26.25745
4   3.311654  29.56911
5  12.247780  41.81689
6   7.973587  49.79047
7  12.194685  61.98516
8  12.415279  74.40044
9   8.617609  83.01805
10 10.339062  93.35711
11 11.976633 105.33374
12  7.301745 112.63549
13  9.247254 121.88274
14  5.697414 127.58015
15  9.721602 137.30176
16 12.371765 149.67352
17 10.231773 159.90529
18 10.474578 170.37987
19 13.562808 183.94268
20  7.419556 191.36224
21  7.613043 198.97528
22  9.416050 208.39133
23 10.047952 218.43928
24  9.788021 228.22730
25 10.709967 238.93727
26  8.766313 247.70358

10