9
votes

I have a situation where my data frame contains the results of image analysis where the columns are the proportion of a particular class present in the image, such that an example dataframe class_df would look like:

id    A    B    C    D    E    F
 1 0.20 0.30 0.10 0.15 0.25 0.00 
 2 0.05 0.10 0.05 0.30 0.10 0.40
 3 0.10 0.10 0.10 0.20 0.20 0.30

Each of these classes belongs to a functional group and I want to create new columns where the proportions of each functional group are calculated from the classes. An example mapping class_fg

class         fg
    A          Z
    B          Z
    C          Z
    D          Y
    E          Y
    F          X

and the desired result would be (line added to show the desired new columns):

id    A    B    C    D    E    F |    X    Y    Z
 1 0.20 0.30 0.10 0.15 0.25 0.00 | 0.00 0.40 0.60
 2 0.05 0.10 0.05 0.30 0.10 0.40 | 0.40 0.40 0.20
 3 0.10 0.10 0.10 0.20 0.20 0.30 | 0.30 0.40 0.30

And I can do it one functional group at a time using

first_fg <- class_fg %>%
  filter(fg == "Z") %>%
  select(class) %>%
  unlist()

class_df <- class_df %>%
  mutate(Z = rowSums(select(., one_of(first_fg))))

Surely there is a better way to do this where I can calculate the row sum for each functional group without having to just repeat this code for each group? Maybe using purrr?

5
Yes that's called aggregation then summarization. Do class_fg %>% group_by(fg) %>% summarize(...your summary code goes here...) - smci
Sorry @Ronak, updated correctly. "label" should have been "class" - Syzorr
@smci - I don't see how that would allow me to create a summary for the class_df which is what I'm actually wanting to summarize? - Syzorr
Your code was confusing because you never named your df. (Is it called class_fg or class_df? What is class_df?) Either way, the solution you want is whatever_your_df_is_actually_called %>% group_by(fg) %>% summarize(...your summary code goes here...) - smci
I've updated to make the examples clearer. I'm working with a very large data set where I'm unsure if I can share (and using it as an example would be too large here), so hopefully the improved examples help. - Syzorr

5 Answers

7
votes

We could split the 'class_df' by 'class', loop through the list elements with map, select the columns of 'class_df' and get the rowSums

library(tidyverse)
class_fg %>%
    split(.$fg) %>% 
    map_df(~ class_df %>%
                select(one_of(.x$class)) %>% 
                rowSums) %>%
    bind_cols(class_df, .)
#  id    A   B    C    D    E   F   X   Y   Z
#1  1 0.20 0.3 0.10 0.15 0.25 0.0 0.0 0.4 0.6
#2  2 0.05 0.1 0.05 0.30 0.10 0.4 0.4 0.4 0.2
#3  3 0.10 0.1 0.10 0.20 0.20 0.3 0.3 0.4 0.3

Or do a group by nesting, and then do the rowSums by mapping over the list elements

class_fg %>% 
   group_by(fg) %>%
   nest %>%
   mutate(out = map(data, ~  class_df %>%
                               select(one_of(.x$class)) %>% 
                               rowSums)) %>% 
   select(-data)  %>%
   unnest %>% 
   unstack(., out ~ fg) %>% 
   bind_cols(class_df, .)
6
votes

Always it is easier to work on data in long format. Hence, change class_df to long format using tidyr:gather and join with class_fg. Perform analysis in long format on your data. Finally, spread in wide-format to match expected result.

library(tidyverse)

class_df %>% gather(key, value, -id) %>% 
  inner_join(class_fg, by=c("key" = "class")) %>%
  group_by(id, fg) %>%
  summarise(value = sum(value)) %>%
  spread(fg, value) %>%
  inner_join(class_df, by="id") %>% as.data.frame()

#   id   X   Y   Z    A   B    C    D    E   F
# 1  1 0.0 0.4 0.6 0.20 0.3 0.10 0.15 0.25 0.0
# 2  2 0.4 0.4 0.2 0.05 0.1 0.05 0.30 0.10 0.4
# 3  3 0.3 0.4 0.3 0.10 0.1 0.10 0.20 0.20 0.3

Data:

class_fg <- read.table(text = 
"class         fg
                 A          Z
                 B          Z
                 C          Z
                 D          Y
                 E          Y
                 F          X",
header = TRUE, stringsAsFactors = FALSE)

class_df  <- read.table(text = 
"id    A    B    C    D    E    F
1 0.20 0.30 0.10 0.15 0.25 0.00 
2 0.05 0.10 0.05 0.30 0.10 0.40
3 0.10 0.10 0.10 0.20 0.20 0.30",
header = TRUE, stringsAsFactors = FALSE)
5
votes

Yet another option, along with the already contributed working solutions, would be to use quasiquotation with the rlang package to build expressions to calculate the sums in each group.

library(tidyverse)

First, define a helper function for doing an elementwise sum of vectors:

psum <- function(...) reduce(list(...), `+`)

Extracting the groupings into a list from class_fg we can then construct a list of expressions to calculate the sum in each group:

sum_exprs <- with(class_fg, split(class, fg)) %>% 
  map(~ rlang::expr(psum(!!!rlang::syms(.x))))

sum_exprs
#> $X
#> psum(F)
#> 
#> $Y
#> psum(D, E)
#> 
#> $Z
#> psum(A, B, C)

With the list of expressions ready, we can just "bang-bang-bang" (!!!) them into the data with mutate:

class_df %>%
  mutate(!!!sum_exprs)
#>   id    A   B    C    D    E   F   X   Y   Z
#> 1  1 0.20 0.3 0.10 0.15 0.25 0.0 0.0 0.4 0.6
#> 2  2 0.05 0.1 0.05 0.30 0.10 0.4 0.4 0.4 0.2
#> 3  3 0.10 0.1 0.10 0.20 0.20 0.3 0.3 0.4 0.3

(I used the code provided by @MKR in his answer to create the data).

Created on 2018-05-22 by the reprex package (v0.2.0).

1
votes

My usual approach is to stick to base R as long as the data sets don't get too large. In your case, a base R solution would be:

class_df=as.data.frame(
  c(class_df,
    lapply(split(class_fg,class_fg$fg),
           function(x) rowSums(class_df[,x$class,drop=FALSE]))))
class_df
#  id    A   B    C    D    E   F   X   Y   Z
#1  1 0.20 0.3 0.10 0.15 0.25 0.0 0.0 0.4 0.6
#2  2 0.05 0.1 0.05 0.30 0.10 0.4 0.4 0.4 0.2
#3  3 0.10 0.1 0.10 0.20 0.20 0.3 0.3 0.4 0.3

If the data sets get too large, I use data.table. A data.table solution for your problem:

library(data.table)

class_dt=data.table(class_df)
grps=split(class_fg,class_fg$fg)

for (g in grps) class_dt[,c(g$fg[1]):=rowSums(.SD),.SDcols=g$class,]
class_dt
#   id    A   B    C    D    E   F   X   Y   Z
#1:  1 0.20 0.3 0.10 0.15 0.25 0.0 0.0 0.4 0.6
#2:  2 0.05 0.1 0.05 0.30 0.10 0.4 0.4 0.4 0.2
#3:  3 0.10 0.1 0.10 0.20 0.20 0.3 0.3 0.4 0.3
0
votes

Another tidyverse solution using rowSums on column subsets :

library(tidyverse)
class_fg %>%
  group_by(fg) %>% 
  summarize(list(rowSums(class_df[class]))) %>%
  spread(1,2) %>%
  unnest() %>%
  bind_cols(class_df, .)

#>   id    A   B    C    D    E   F   X   Y   Z
#> 1  1 0.20 0.3 0.10 0.15 0.25 0.0 0.0 0.4 0.6
#> 2  2 0.05 0.1 0.05 0.30 0.10 0.4 0.4 0.4 0.2
#> 3  3 0.10 0.1 0.10 0.20 0.20 0.3 0.3 0.4 0.3

Or for the glory of code golf :

x <- with(class_fg, tapply(class, fg, reformulate))
mutate(class_df, !!!map(x, ~as.list(.)[[2]]))
#>   id    A   B    C    D    E   F   X   Y   Z
#> 1  1 0.20 0.3 0.10 0.15 0.25 0.0 0.0 0.4 0.6
#> 2  2 0.05 0.1 0.05 0.30 0.10 0.4 0.4 0.4 0.2
#> 3  3 0.10 0.1 0.10 0.20 0.20 0.3 0.3 0.4 0.3