I have a data frame like the following:

c1 <- c(324, 213, 122, 34)
c2 <- c("SDOIHHFOEKN", "SDIUFONBSD", "DSLIHFEIHDFS", "DOOIUDBD")
c3 <- c("G", "T", "U", "T")

df <- data.frame(count = c1, seq = c2, other = c3)

I want to keep only the top sequences whose counts add up to N. For example, for N = 600, the count column of the final data frame should sum to 600. That means only the top 3 rows of this data frame would remain, and the count of the third row would become 600 - 324 - 213 = 63.

How can I get the output data frame like this?

I would really appreciate it if you could provide a general solution, as the data frame I am working with has over 1000 rows and smaller numbers.

Thanks!

1 Answer

A solution using dplyr. The idea is to arrange the data frame by count in descending order, keep the rows up to and including the first one where the running total of count reaches 600, and then set the count of the last kept row to 600 minus the sum of the counts in the rows before it. df2 is the final output.

library(dplyr)

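df2 <- df %>%
  # Sort by the count column, largest first
  arrange(desc(count)) %>%
  # Keep rows up to the first one where the running total reaches 600
  slice(1:which(cumsum(count) >= 600)[1]) %>%
  # Trim the last kept row so the counts sum to exactly 600
  mutate(count = ifelse(row_number() == n(),
                        600 - sum(head(count, -1)),
                        count))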
df2
# # A tibble: 3 x 3
#   count seq          other
#   <dbl> <fct>        <fct>
# 1 324   SDOIHHFOEKN  G    
# 2 213   SDIUFONBSD   T    
# 3  63.0 DSLIHFEIHDFS U
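
Since the question asks for a general solution, the same idea can also be sketched in base R as a function of N. This is only a sketch under some assumptions: top_n_count is a hypothetical helper name (not from dplyr or any other package), the counts are assumed to be non-negative, and the column is assumed to be called count as in the example. If the counts add up to less than N, all rows are returned unchanged.

```r
# Hypothetical helper, not part of any package: keep the largest counts
# until they add up to n, trimming the last kept row if needed.
top_n_count <- function(df, n) {
  # Sort by count, largest first
  df <- df[order(df$count, decreasing = TRUE), , drop = FALSE]
  # Running total of the rows that come before each row
  before <- cumsum(df$count) - df$count
  # Each row gets at most what is left of the budget n (never below 0)
  df$count <- pmin(df$count, pmax(n - before, 0))
  # Drop rows that received nothing
  out <- df[df$count > 0, , drop = FALSE]
  rownames(out) <- NULL
  out
}

# Example data from the question
c1 <- c(324, 213, 122, 34)
c2 <- c("SDOIHHFOEKN", "SDIUFONBSD", "DSLIHFEIHDFS", "DOOIUDBD")
c3 <- c("G", "T", "U", "T")
df <- data.frame(count = c1, seq = c2, other = c3)

df2 <- top_n_count(df, 600)
```

Because the trimming is done with cumsum, pmin and pmax rather than by indexing the last row, the sketch also covers the edge cases where a single row already exceeds N or where the running total hits N exactly.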