3
votes

Using the iris data set with the addition of a Year column as created below, I want to group by Species and Year to select the Year with the maximum number of observations for each species.

iris$Year <- c(rep(c("2007", "2008"), each = 25),
               rep(c("2007", "2008"), times = c(10, 40)),
               rep(c("2007", "2008"), times = c(40, 10)))
head(iris)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species Year
1          5.1         3.5          1.4         0.2  setosa 2007
2          4.9         3.0          1.4         0.2  setosa 2007
3          4.7         3.2          1.3         0.2  setosa 2007
4          4.6         3.1          1.5         0.2  setosa 2007
5          5.0         3.6          1.4         0.2  setosa 2007
6          5.4         3.9          1.7         0.4  setosa 2007

As seen in the summary table below, both 2007 and 2008 have 25 observations (rows) for setosa. When there are ties in the number of observations, I want to select the minimum Year.

In short, I am trying to select only setosa 2007, versicolor 2008, and virginica 2007 from the reproducible data set above, prefably using dplyr

EDITNote, rather than create a summary table, I want to select the respective species-year combinations from the iris data created above. In other words, I want to subset iris to include only setosa 2007, versicolor 2008, and virginica 2007 (i.e. the minimum year with the max number of observations). Sorry for the confusion. End Edit

library(dplyr)
iris %>%
  group_by(Species, Year) %>%
  summarise(N = n()) %>%
  as.data.frame()

     Species Year  N
1     setosa 2007 25
2     setosa 2008 25
3 versicolor 2007 10
4 versicolor 2008 40
5  virginica 2007 40
6  virginica 2008 10
1

1 Answers

4
votes

You are halfway there. You need to group by just Species, sort according to what you want to keep, and take the top row from each species.

iris %>%
  group_by(Species, Year) %>%
  summarise(N = n()) %>%
  group_by(Species) %>%
  arrange(desc(N), Year) %>%
  slice(1)
# # A tibble: 3 x 3
# # Groups:   Species [3]
#      Species  Year     N
#       <fctr> <chr> <int>
# 1     setosa  2007    25
# 2 versicolor  2008    40
# 3  virginica  2007    40

If you wanted to keep the ties rather than break them, you could replace the arrange %>% slice with filter(N == max(N)).