R tibble: aggregating by row across specific columns, by column groups

Question

I have data of biological compounds levels of test patients, who are grouped into different groups depending on being administered certain drugs. That is, we have:

Columns: Drugs(or groups) A, B and C, where each group has 3 patients (individually denoted where the patients in A are denoted A1, A2, A3; patients in B are denoted B1, B2, B3, and so on.)
Rows: we are monitoring biological compounds Coronin, Dystrophin, Tubulin (randomly Googled protein names), and so on.

So we have a tibble like (all values in the tibble are floats):

| compound  | A1 | A2 | A3 | B1 ... C3|
|-----------|----|----|----|---- ... --|
| Coronin   |
| Dystrophin|
| Gloverin  |
| keratin   |
| Tubulin   |

For each compound, I wish to compute the means of each group, as a new column, like so:

| compound  | A1 | A2 | A3 | B1 ...C3| mean_A | mean_B | mean_C |
|-----------|-----|-----|-----|---- ... --|---------|---------|---------|
| Coronin   |  1  |  2  |  3  |     ...   |    2    |  ...              |
| Dystrophin|  4  |  5  |  6  |     ...   |    5    |  ...              |
| Gloverin  |  ...
| keratin   |
| Tubulin   |

The code to do this is:

my_tibble <- my_tibble %>% 
  mutate(mean_A = rowMeans(select(., c("A1", "A2", "A3")))) %>%
  mutate(mean_B = rowMeans(select(., c("B1", "B2", "B3")))) %>%
  mutate(mean_C = rowMeans(select(., c("C1", "C2", "C3"))))

The question is: I'd like to be able to this for a dynamically input number of groups, i.e. C, D, E, etc ...where column-to-group is a separate, user-input tibble in itself, say:

| group_name | name1 | name2 | name3 |
|------------|-------|-------|-------|
|      A     |  A1   |  B2   |  C3   |
|      B     |  B1   |  B2   |  C3   |
...
and so on

How might I iteratively add mutate verbs, according to a user-specified number of groups (and associated sample-to-group names)?

Note: the group names "C", "B" ...etc are arbitrary (the groups are, for instance, likely to be assigned the name of the drug that that group was given), so I wouldn't use an iterative operation that relies on the fact that they are literally named "A", "B", etc.

Rearrange your data into a long tidy format then you can just do dat %>% group_by(compound, grp) %>% summarise(meanval=mean(value)) . Otherwise you'll end up with gibberish code trying to subset groups of columns when you essentially have one value column split by multiple grouping columns. — thelatemail
Sorry, so are the g11 group-patient column names already provided? or do you have a way to create them? I'm unclear if just the group_names G1 etc. are given or whether the g11 names are as well. In particular, if there is no separator between the group_name and the patient number that's going to be difficult to resolve (is column g123 patient 3 of group G12 or patient 23 of group G1?) — Calum You
There's no patient "g(<3 digit number>)" for two reasons: in practice, there's not likely to be 10 drugs being tested at once, nor 10 patients for each drug (for economic reasons). Also, the names are arbitrary in the sense that the group and patient names are likely to be pronouns. i.e. the groups can be assigned the names of drugs (e.g. "zyprexa", "tamiflu", etc...you can just name them "red", "green", "blue", etc.), and patients are each given a unique serial number to uniquely identify them in some black-boxed database schema. — AndreyIto
Also, the group-patient column names are meant to be user input, so you can arbitrarily generate your own to answer the question. — AndreyIto
There's a second tibble that assigns names to groups in the question; if that could be used as a generalizable solution, then I'll be able to accept your answer. This is because there may be no logic that links each column name to their group name simply because of poor record keeping consistency, but there definitely will be a table that lists all the column names that belong to each group. — AndreyIto

akrun akrun · Accepted Answer · 2019-06-06T04:58:55

An option would be to split by the column names, loop through the list with sapply, get the rowMeans and assign it to 3 new columns

nm1 <- substr(names(df1)[-1], 1, nchar(names(df1)[-1])-1)  
df1[paste0("mean_", toupper(unique(nm1)))] <- 
            sapply(split.default(df1[-1], nm1), rowMeans)

df1
#  compound g11 g12 g13 g21 g22 g23 g31 g32 g33  mean_G1  mean_G2  mean_G3
#1        A   7   3   9   8   8   1   3   7   2 6.333333 5.666667 4.000000
#2        B   3   8   8   1   2   5   1   1   4 6.333333 2.666667 2.000000
#3        C   8   6   7   5   1   4   3   6   3 7.000000 3.333333 4.000000
#4        D   7   9   8   5   5   6   8   7   6 8.000000 5.333333 7.000000
#5        E   2   4   1   5   2   6   6   1   3 2.333333 4.333333 3.333333

NOTE: This can be extended to any number of groups. Only thing to change is the 1:3 in the current example for creating the column names

data

set.seed(24)
df1 <- cbind(compound = LETTERS[1:5], as.data.frame(matrix(sample(1:9, 5 * 9,
      replace = TRUE), nrow = 5, ncol = 9, dimnames = list(NULL,
        paste0(rep(paste0("g", 1:3), each = 3), 1:3)))))

R tibble: aggregating by row across specific columns, by column groups

1 Answers

data