Sum or group specific columns based on clusters in R

Question

I have a data set of species and abundances. Here is a sample of it:

  aca.qua aca.bah aca.chi achi.lin alb.vul alu.mon ani.vir arc.rho asp.lun aux.roc bag.bag bag.mar bal.cap cal.cal cal.pen
1       0       0       0        0       5       0      57       0       0       0       0       0       0       0      16
2       0       0       1        0       2       0       3       0       0       0       0       8       0       0       0
3       0       0       0        0       1       0       3       0       0       0       0       0       0       0       3
4       0       0       0        0       5       0       0       0      22       0       0      94       0       0       0
5       0       0       0        0       1       0       0       0       0       2       3       2       0       0       1
6       0       0       0        0       0       0       0       1       0       0       2       2       0       0       0

I ran a cluster analysis on some of the species traits and came up with the cluster each species belongs to:

aca.qua  aca.bah  aca.chi achi.lin  alb.vul  alu.mon  ani.vir  arc.rho  asp.lun  aux.roc  bag.bag  bag.mar  bal.cap cal.cal  cal.pen
   1        1        1        2        3        1        4        4        1        5        4        4        1       1        1

"aca.qua" should be in cluster 1, as well as "aca.bah", "aca.chi" and "alu.mon", etc. "achi.lin" in cluster two and so on.

I was trying to write code that uses the references in the second data frame to group the columns by cluster and sum them. I tried dplyr, mutate and some loops, but I never found a good way of doing that. I also tried adding the clusters as a row, then using t() to transpose and select(), then transposing back, etc., but it was getting way too complicated.

Is there any way I can use the vector containing the species names and their clusters as a reference to sum the respective columns of each cluster?

The idea is to end up with something like this, but for all the clusters:

   V34 V35 V36 V37 V38 V39 V40 V41 V42 V43 cluster1
1   1   0   0   0   0   0   0   0   0   0        0
2   0   0   0   0   0   0   0   0   0   0        0
3   0   0   0   0   0   0   0   0   0   0        1
4   1   0   0   0   0   0   0   0   0   0        0
5   0   0   1   0   0   0   0   1   0   0       22
6   0   1   0   0   0   0   0   0   0   0        0

Here I used the following code:

teste4 <- teste3 %>%
        filter(V1 == 1) %>%
        select(-1)
teste5 <- teste4 %>%
        mutate(cluster1 = rowSums(teste4[, 1:rowSums(teste4)]))

The point here is that I will also try several different cluster methods and models, so I need to make this more automatic when I come up with new cluster combinations, instead of manually selecting each column (the original dataset is much larger).

Pierre L · Accepted Answer · 2016-01-21T15:47:08

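For each cluster, select the columns assigned to it and add them up row by row with rowSums. We can wrap it in an lapply call to cycle through each cluster number. Here df1 is the abundance data frame and df2 is the one-row data frame of cluster numbers, with columns in the same order:

# drop = FALSE keeps a data frame even when a cluster has a single species
lst <- lapply(1:max(df2[1, ]), function(x) rowSums(df1[, df2[1, ] == x, drop = FALSE]))
setNames(data.frame(lst), paste0("clust", seq_along(lst)))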
#   clust1 clust2 clust3 clust4 clust5
# 1     16      0      5     57      0
# 2      1      0      2     11      0
# 3      3      0      1      3      0
# 4     22      0      5     94      0
# 5      1      0      1      5      2
# 6      0      0      0      5      0
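
Since the question mentions trying several clustering methods, the same idea can be wrapped in a small function that matches species to clusters by name instead of by column position. The following is only a sketch under some assumptions: the data frames are rebuilt by hand from the sample in the question, the cluster assignments are stored as a named vector (called clusters here), and sum_by_cluster is a hypothetical helper name, not something from the original post. It uses base R's rowsum on the transposed matrix as an alternative to the lapply loop.

```r
# Sample data from the question: rows are samples, columns are species.
df1 <- data.frame(
  aca.qua  = c(0, 0, 0, 0, 0, 0),
  aca.bah  = c(0, 0, 0, 0, 0, 0),
  aca.chi  = c(0, 1, 0, 0, 0, 0),
  achi.lin = c(0, 0, 0, 0, 0, 0),
  alb.vul  = c(5, 2, 1, 5, 1, 0),
  alu.mon  = c(0, 0, 0, 0, 0, 0),
  ani.vir  = c(57, 3, 3, 0, 0, 0),
  arc.rho  = c(0, 0, 0, 0, 0, 1),
  asp.lun  = c(0, 0, 0, 22, 0, 0),
  aux.roc  = c(0, 0, 0, 0, 2, 0),
  bag.bag  = c(0, 0, 0, 0, 3, 2),
  bag.mar  = c(0, 8, 0, 94, 2, 2),
  bal.cap  = c(0, 0, 0, 0, 0, 0),
  cal.cal  = c(0, 0, 0, 0, 0, 0),
  cal.pen  = c(16, 0, 3, 0, 1, 0)
)

# Cluster assignment per species, as a named vector (assumed storage format).
clusters <- c(
  aca.qua = 1, aca.bah = 1, aca.chi = 1, achi.lin = 2, alb.vul = 3,
  alu.mon = 1, ani.vir = 4, arc.rho = 4, asp.lun = 1, aux.roc = 5,
  bag.bag = 4, bag.mar = 4, bal.cap = 1, cal.cal = 1, cal.pen = 1
)

# Hypothetical helper: sum the species columns of `abund` within each cluster.
sum_by_cluster <- function(abund, clusters) {
  # Look clusters up by species name, so column order does not matter.
  grp <- clusters[colnames(abund)]
  stopifnot(!anyNA(grp))  # every species needs a cluster
  # rowsum() adds up rows by group, so transpose first and transpose back.
  out <- t(rowsum(t(as.matrix(abund)), group = grp))
  colnames(out) <- paste0("clust", colnames(out))
  as.data.frame(out)
}

res <- sum_by_cluster(df1, clusters)
print(res)
```

For the sample data this should give the same five columns as the table above. To compare another clustering, only the clusters vector has to be swapped, for example sum_by_cluster(df1, other_clusters), where other_clusters is a placeholder for whatever named vector a different method produces.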
