Group data by column and apply operations on subsets

Question

I set up the problem as follows:

1) I have n (in this case 2)-matrices in a list

2) each matrix has the same number of rows (508) but different number of columns

3) For each matrix in the list I want to group the data by "Cohort Number". I then want to move to column 3 ("Alpha(t-1)") and within each group that exists identify the row with the highest "Alpha(t-1)".

For example, say i have 3 groups as identified by "Cohort Number" (as occurs in Mat2 below). I group the data into subsets based on this number. Within all rows that correspond to cohort number =1, i look at column "Alpha (t-1)" and select the row with the highest value of "Alpha(t-1)". I store this row of data in Output1[[2]] (let Output1 be a list of output matrices). I store all remaining rows in Output2[[2]] (let Output2 be a list of all "remaining output" matrices). I then repeat the process for Cohort number = 2. I store the row with the highest alpha(t-1) in Output1[[2]], and all remaining rows in Output2... and so on and so forth. So Output1[[2]] would have 3 rows (and x columns depending on length of Mat2). Output2[[2]] would have 508-3 rows and x columns depending on length of Mat2. In this case Mat2 has 142 columns. so ncol(Output1[[2]])==ncol(Output2[[2]])=142. To reaffirm. Output1[[2]] is 3*142 and Output2[[2]] is 505*142.

To take a second example - Mat1. Mat1 is 508 *55. There is only one cohort number (Cohort Number ==1). This implies Output1[[1]] is 1*55 and Output2[[1]] is 507*55.

I need to apply this process for all matrices in the list. In my actual example i have over 100 matrices and would prefer not doing this one-by-one).

Setup:

mat1 <- data.frame(matrix(nrow=508, ncol =55))
colnames(mat1)[c(1:3,55)] <- c("ID", "Alpha(t)", "Alpha(t-1)", "Cohort Number")
mat1[,1] = 1:508
mat1[1:20,2] = rnorm(20, 0,1)
mat1[1:20,3] = rnorm(20,0,1)
mat1[1:20,55] = rep(1,20)
mat1[1:20, 4:54] = rnorm(20*(55-4),0,0.1)


mat2 <- data.frame(matrix(nrow=508, ncol =142))
colnames(mat2)[c(1:3,142)] <- c("ID", "Alpha(t)", "Alpha(t-1)", "Cohort Number")
seq3 <- seq(1,3,1)
mat2[,1] = 1:508
mat2[1:77,2] = rnorm(77, 0,1)
mat2[1:77,3] = rnorm(77,0,1)
mat2[1:77, 142] = c(rep(seq3,25),1,2)
mat2[1:77,4:141] = rnorm(77*(142-4),0,0.1)

list1 <- list(mat1,mat2)
Output1 <- c()
Output2 <- c()

Ronak Shah Ronak Shah · Accepted Answer · 2019-09-05T01:07:28

We can use map and dplyr operations. For each dataframe in list1 we group_by Cohort Number and select the max row. For output2 we remove all the rows which are selected in ouptut1.

library(purrr)
library(dplyr)

output1 <- map(list1,. %>% group_by(`Cohort Number`) %>% slice(which.max(`Alpha(t-1)`)))
output2 <- map2(list1, output1, anti_join)

We can check dimension of output matrix.

map(output1, dim)
#[[1]]
#[1]  1 55

#[[2]]
#[1]   3 142

map(output2, dim)
#[[1]]
#[1] 507  55

#[[2]]
#[1] 505 142

Group data by column and apply operations on subsets

1 Answers