1
votes

I have a large data.frame that looks like this:

   Statistic1    fdr1     Value1   Statistic2  fdr2   Value2
       2        0.0001    Signif      1.8      0.001   Signif 
      0.3        0.13       0          5        0.5      0
      1.5        0.01     Signif      0.4      0.009   Signif

I would like to split the data frame every 3 columns, for example Statistic1, fdr1 and Value1. Then sort each splitted data.frame by Statistic* column in descending order and take the first 20 row names of each sorted data.frames corresponding to the Signif label in column Value* of the sorted data.frame.

Desired output

>       df1         

>        Statistic1    fdr1     Value1   
>            2        0.0001    Signif            
>           1.5        0.01     Signif     

>        Statistic2    fdr2     Value2
>           1.8        0.001    Signif 
>           0.4        0.009    Signif

From each single data.frame I will take the first 20 row names.

Can anyone help me please?

3

3 Answers

3
votes

You can split the data frame by using split.default. Loop over the list and do the required actions. Translating your requirements would give,

lapply(split.default(df, gsub('\\D+', '', names(df))), function(i) 
                                                {i <- i[i[3] != 0,];
                                                 i <- i[order(i[1], decreasing = TRUE),]; 
                                                 i[1:20,]})

However, note that since your example only has 3 rows, then doing the last condition (1:20) will result in NA rows

1
votes

Here another base solution, based on this answers, it's going to split the dataset each three columns, without seeing the names:

lapply(seq(1, ncol(df), by=3), function(i) {
                                           i <- df[i: pmin((i+2), ncol(df))]
                                           i <- i[order(i[1], decreasing = TRUE),]
                                           head(i,2)  # put 2 to see the results, you need 20
                                           })

[[1]]
  Statistic1  fdr1 Value1
1        2.0 0.001 Signif
3        1.5 0.010 Signif

[[2]]
  Statistic2  fdr2 Value2
2        5.0 0.500      0
1        1.8 0.001 Signif

With fake data:

df <- data.frame(Statistic1  = c(2, 0.3, 1.5),
                 fdr1 = c(0.001, 0.13, 0.01),
                 Value1 = c("Signif",0,"Signif"),
                 Statistic2  = c(1.8,5,0.4),
                 fdr2 = c(0.001, 0.5, 0.009),
                 Value2 = c("Signif",0,"Signif"),
                 stringsAsFactors = FALSE)
0
votes

tidyverse answer using base R split.default we split every 3 columns, arrange according to 1st column, filter values from 3rd column and finally select rows.

library(tidyverse)

map(split.default(df, gl(ncol(df)/3, 3)), 
    . %>% arrange_at(1) %>% 
          filter_at(3, ~. != 0) %>%
          slice(n() : (n()- 2)))


#$`1`
#  Statistic1   fdr1 Value1
#1        2.0 0.0001 Signif
#2        1.5 0.0100 Signif

#$`2`
#  Statistic2  fdr2 Value2
#1        1.8 0.001 Signif
#2        0.4 0.009 Signif

For this example this is selecting only 2 rows, for your real data you can change 2 to 20 to get top 20 rows.