3
votes

I would like to group my data frame by a variable, summarize another variable, but keep all other associated columns.

In Applying group_by and summarise on data while keeping all the columns' info the accepted answer is to use filter() or slice(), which works fine if the answer exists in the data already (i.e. min, max) but this doesn't work if you would like to use a function that generates a new answer (i.e. sum, mean).

In Applying group_by and summarise(sum) but keep columns with non-relevant conflicting data? the accepted answer is to use all the the columns you would like to keep as part of the grouping variable. But this seems like an ineffective solution if you have many columns you would like to keep. For example, the data I'm working with has 26 additional columns.

The best solution I've come up with is to split-apply-combine. But this seems clunky - surely there must be a solution that can be done in a single pipeline.

Example:

location <- c("A", "A", "B", "B", "C", "C")
date <- c("1", "2", "1", "2", "1", "2")
count <- c(3, 6, 4, 2, 7, 5)
important_1 <- c(1,1,2,2,3,3)
important_30 <- c(4,4,5,5,6,6)

df <- data.frame(location = location, date = date, count = count, important_1 = important_1, important_30 = important_30)

I want to summarize the counts that happened on different dates at the same location. I want to keep all the important (imagine there are 30 instead of 2).

My solution so far:

check <- df %>%
  group_by(location) %>%
  summarise(count = sum(count))

add2 <- df %>%
  select(-count, -date) %>%
  distinct()

results <- merge(check, add2)

Is there a way I could accomplish this in a single pipeline? I'd rather keep it organized and avoid creating new objects if possible.

2
important_2 is not definedakrun

2 Answers

3
votes

We can create a column with mutate and then apply distinct

library(dplyr)
df %>% 
   group_by(location) %>% 
   mutate(count = sum(count)) %>% select(-date) %>% 
   distinct(location,  important_1, important_30, .keep_all = TRUE)

If there are multiple column names, we can also use syms to convert to symbol and evaluate (!!!)

df %>% 
     group_by(location) %>% 
     mutate(count = sum(count)) %>% select(-date) %>% 
    distinct(location, !!! rlang::syms(names(.)[startsWith(names(.), 'important')]), .keep_all = TRUE)
-1
votes

You can group_by all the variables that you want to keep and sum count.

library(dplyr)

df %>% 
  group_by(location, important_1, important_30) %>% 
  summarise(count = sum(count))


#  location important_1 important_30 count
#  <chr>          <dbl>        <dbl> <dbl>
#1 A                  1            4     9
#2 B                  2            5     6
#3 C                  3            6    12