7
votes

I have the following Problem: In a data frame I have a lot of rows and columns with the first row being the date. For each date I have more than 1 observation and I want to summarize them.

My df looks like that (date replaced by ID for ease of use):

df:
ID     Cash    Price    Weight   ...
1      0.4     0        0
1      0.2     0        82       ...
1      0       1        0        ...
1      0       3.2      80       ...
2      0.3     1        70       ...
...    ...     ...      ...      ...

I want to group them by the first column and then summarize all rows BUT with different functions:

The function Cash and Price should be sum so I get the sum of Cash and Price for each ID. The function on Weight should be max so I only get the maximum weight for the ID.

Because I have so many columns I can not write a all functions by hand, but I have only 2 columns which should be summarized by max the rest should be summarized by sum.

So I am looking for a function to group by ID, summarize all with sum except 2 different columns which I need the max value.

I tried to use the dplyr package with:

df %>% group_by(ID = tolower(ID)) %>% summarise_each(funs(sum))

But I need the addition to not sum but max the 2 specified columns, any Ideas?

To be clear, the output of the example df should be:

ID     Cash     Price    Weight
1       0.6        4.2       82     
2       0.3        1          70
5

5 Answers

4
votes

We can use

 df %>%
    group_by(ID) %>%
    summarise(Cash = sum(Cash), Price = sum(Price), Weight = max(Weight))

If we have many columns, one way would be to do this separately and then join the output together.

 df1 <- df %>% 
          group_by(ID) %>% 
          summarise_each(funs(sum), Cash:Price)
 df2 <- df %>%
          group_by(ID) %>% 
          summarise_each(funs(max), Weight)
 inner_join(df1, df2, by = "ID")
 #      ID  Cash Price Weight
 #   (int) (dbl) (dbl)  (int)
 #1     1   0.6   4.2     82
 #2     2   0.3   1.0     70
4
votes

Or do it w/o the double groups:

library(dplyr)

set.seed(1492)
df <- data.frame(id=rep(c(1,2), 3),
                 cash=rnorm(6, 0.5, 0.1),
                 price=rnorm(6, 0.5, 0.1)*6,
                 weight=sample(100, 6))

df

##   id      cash    price weight
## 1  1 0.4410152 2.484082     10
## 2  2 0.4101343 3.032529     93
## 3  1 0.3375889 2.305076     58
## 4  2 0.6047922 3.248851     55
## 5  1 0.4721711 3.209930     34
## 6  2 0.5362493 2.331530     99

custom_summarise <- function(do_df) {

  return(bind_cols(
    summarise_each(select(do_df, -weight), funs(sum)),
    summarise_each(select(do_df, weight), funs(max))
  ))

}

group_by(df, id) %>% do(custom_summarise(.))

## Source: local data frame [2 x 4]
## Groups: id [2]
## 
##      id     cash    price weight
##   (dbl)    (dbl)    (dbl)  (int)
## 1     3 1.250775 7.999089     58
## 2     6 1.551176 8.612910     99
3
votes
library(data.table)

setDT(df)

df[,.(Cash = sum(Cash),Price = sum(Price),Weight = max(Weight)),by=ID]

One way of doing this for +90 columns can be:

max_col <- 'Weight'

sum_col <- setdiff(colnames(df),max_col)

query_1 <- paste0(sum_col,' = sum(',sum_col,')')

query_2 <- paste0(max_col,' = max(',max_col,')')

query_3 <- paste(query_1,collapse=',')

query_4 <- paste(query_2,collapse=',')

query_5 <- paste(query_3,query_4,sep=',')

final_query <- paste0('df[,.(',query_5,'),by = ID]')

eval(parse(text = final_query))
3
votes

Here is a solution based on this comment on an issue on dplyr repo. I think it's very general to be applied to more complicated cases.

library(tidyverse)
df <- tribble(
      ~ID,   ~Cash,  ~Price,  ~Weight,
      #----------------------
          'a',      4,   6,   8,
          'a',      7,   3,   0,
          'a',      7,   9,   0,
          'b',      2,   8,   8,
          'b',      5,   1,   8,
          'b',      8,   0,   1,
          'c',      2,   1,   1,
          'c',      3,   8,   0,
          'c',      1,   9,   1
     )

out <- list(.vars=lst(vars(-Weight), vars(Weight)),
            .funs=lst(sum, max))%>% 
  pmap(~df%>%group_by(ID)%>%summarise_at(.x, .y)) %>% 
  reduce(inner_join)
out
# A tibble: 3 x 4
#   ID     Cash Price Weight
#   <chr> <dbl> <dbl>  <dbl>
# 1 a        18    18      8
# 2 b        15     9      8
# 3 c         6    18      1

You should specify the vars in the first lst (e.g. vars(-Weight), vars(Weight)) and respective function to be applied in the lst (sum, max). The .x in the summarise_at argument refers to elements in the variable lst, and .y refers to the elements in the function lst.

2
votes

As of dplyr 1.0.0 you can use across():

tribble(
  ~ID, ~max1, ~max2, ~sum1, ~sum2, ~sum3,
  1, 1, 1, 1, 2, 3,
  1, 2, 3, 1, 2, 3,
  2, 1, 1, 1, 2, 3,
  2, 3, 4, 2, 3, 4,
  3, 1, 1, 1, 2, 3,
  3, 4, 5, 3, 4, 5,
  3, NA, NA, NA, NA, NA
) %>%
  group_by(ID) %>%
  summarize(
    across(matches("max1|max2"), max, na.rm = T),
    across(!matches("max1|max2"), sum, na.rm = T)
  )

# ID  max1  max2  sum1  sum2  sum3
#  1     2     3     2     4     6
#  2     3     4     3     5     7
#  3     4     5     4     6     8