33
votes

I am grouping data and then summarizing it, but would also like to retain another column. I do not need to do any evaluations of that column's content as it will always be the same as the group_by column. I can add it to the group_by statement but that does not seem "right". I want to retain State.Full.Name after grouping by State. Thanks

TDAAtest <- data.frame(State=sample(state.abb,1000,replace=TRUE))
TDAAtest$State.Full.Name <- state.name[match(TDAAtest$State,state.abb)]


TDAA.states <- TDAAtest %>%
  filter(!is.na(State)) %>%
  group_by(State) %>%
  summarize(n=n()) %>%
  ungroup() %>%
  arrange(State)
3
Which column you want to summarise? Based on the code, you are doing it by both the columnsakrun
It's probably the best just grouping by both. The alternative is summarize(State.Full.Name = unique(State.Full.Name), n=n()), which is less concise.alistaire
@akrun I clarified. I had left in my not so "right" approach.atclaus
Also, count(...) is equivalent to group_by(...) %>% summarise(n = n())alistaire

3 Answers

35
votes

Perhaps we need

TDAAtest %>% 
     filter(!is.na(State)) %>%
     group_by(State) %>% 
     summarise(State.Full.Name = first(State.Full.Name), n = n())

Or use mutate to create the column and then do the distinct

TDAAtest %>% f
     filter(!is.na(State)) %>%
     group_by(State) %>% 
     mutate(n= n()) %>% 
     distinct(State, .keep_all=TRUE)
2
votes

I believe there are more accurate answers than the accepted answer specially when you don't have unique data for other columns in each group (e.g. max or min or top n items based on one particular column ).

Although the accepted answer works for this question, for instance, you would like to find the county with the max population for each state. (You need to have county and population columns).

We have the following options:

1. dplyr version

From this link, you have three extra operations (mutate, ungroup and filter) to achieve that:

TDAAtest %>% 
     filter(!is.na(State)) %>%
     group_by(State) %>% 
     mutate(maxPopulation = max(Population)) %>% 
     ungroup() %>%
     filter(maxPopulation == Population)

2. Function version

This one gives you as much flexibility as you want and you can apply any kind of operation to each group:

maxFUN = function(x) {
  # order population in a descending order
  x = x[with(x, order(-Population)), ]
  x[1, ]
}

TDAAtest %>% 
     filter(!is.na(State)) %>%
     group_by(State) %>%
     do(maxFUN(.)) 

This one is highly recommended for more complex operations. For instance, you can return top n (topN) counties per state by having x[1:topN] for the returned dataframe in maxFUN.

0
votes

To retain all columns, you can include across() as a summarize argument, as explained in the documentation for dplyr::do().

by_cyl <- head(mtcars) %>%
  group_by(cyl)
by_cyl %>%
  summarise(m_mpg = mean(mpg), across())

    cyl m_mpg   mpg  disp    hp  drat    wt  qsec    vs    am  gear  carb
  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1     4  22.8  22.8   108    93  3.85  2.32  18.6     1     1     4     1
2     6  20.4  21     160   110  3.9   2.62  16.5     0     1     4     4
3     6  20.4  21     160   110  3.9   2.88  17.0     0     1     4     4
4     6  20.4  21.4   258   110  3.08  3.22  19.4     1     0     3     1
5     6  20.4  18.1   225   105  2.76  3.46  20.2     1     0     3     1
6     8  18.7  18.7   360   175  3.15  3.44  17.0     0     0     3     2

To retain only a subset of unaltered columns, you can select them within across using tidyselect semantics.