44
votes

Part of my script used to run fine, but recently it has started producing an odd message, after which many of my other functions do not work properly. I am trying to select the 8th and 23rd positions in a ranked list of values for each site, to find the 25th and 75th percentile values for each day of the year at each site over 30 years. My approach was as follows (adapted for the four-line dataset below; slice(3) would normally be slice(23) for my full 30-year dataset):

library("dplyr")

mydata <- structure(list(station_number = structure(c(1L, 1L, 1L, 1L), .Label = "01AD002", class = "factor"), 
year = 1981:1984, month = c(1L, 1L, 1L, 1L), day = c(1L, 
1L, 1L, 1L), value = c(113, 8.329999924, 15.60000038, 149
)), .Names = c("station_number", "year", "month", "day", "value"), class = "data.frame", row.names = c(NA, -4L))

  value <- mydata$value
  qu25 <- mydata %>% 
          group_by(month, day, station_number) %>% 
          arrange(desc(value)) %>% 
          slice(3) %>% 
          select(value)

Before, I would be left with a table that had one value per site to describe the 25th percentile (since the arrange function seems to order them highest to lowest). However, now when I run these lines, I get a message:

Adding missing grouping variables: `month`, `day`, `station_number`

This message doesn’t make sense to me, as the grouping variables are clearly present in my table. Also, again, this was working fine until recently. I have tried:

  • detach("plyr") – since I have it loaded before dplyr
  • dplyr::group_by – placing this directly in the group_by line
  • uninstalling and reinstalling dplyr, although this was for another issue I was having

Any idea why I might be receiving this message and why it may have stopped working?

Thanks for any help.

Update: Added dput example with one site, but values for January 1st for multiple years. The hope would be that the positional value is returned once grouped, for instance slice(3) would hopefully return the 15.6 value for this smaller subset.

3
That's weird. When I run your code it says Error: corrupt 'grouped_df', contains 0 rows, and 4 rows in groups. You don't get that message? Maybe you need to give us more of your example data. BTW, it's highly preferred that you dput the data. – Hack-R
I was receiving a corrupt message before, which was why I uninstalled and reinstalled dplyr, but I suspect the code won't work on the little bit I have provided there, as it would need multiple sites, months and days to group. It would be a very large chunk, so I was hoping maybe it was just a package issue. Sorry, I'm new to posting here and not sure what dput is, but I will look into it. – acersaccharum
Sure, no problem. So dput (?dput) is the core R command to facilitate sharing data. On StackOverflow you're required to provide a reproducible example of your problem when you're troubleshooting an error or warning. So, if your dataset has a million rows and it's called mydata, go into R and do something like dput(mydata[1:1000,]), then paste the results to pastebin.com and give us the link so that we can help you. This assumes that there's enough data in the first 1,000 rows to reproduce your problem. – Hack-R
Thank you for your explanation. I have updated the question with a subset, and on this smaller portion I am still receiving the error. – acersaccharum
The error comes from slice(23): since there are only 4 elements, the data.frame returned will be empty (so there is no grouping in the first place), but because you did a grouped operation, the groups are added afterwards to an empty data frame. – Drey

3 Answers

72
votes

For consistency's sake, grouping variables that were defined earlier are always kept in the result, so they are added back when select(value) is executed. Calling ungroup() before select() should resolve it:

qu25 <- mydata %>% 
  group_by(month, day, station_number) %>%
  arrange(desc(value)) %>% 
  slice(2) %>% 
  ungroup() %>%
  select(value)

This returns the requested result without the message:

> mydata %>% 
+   group_by(month, day, station_number) %>%
+   arrange(desc(value)) %>% 
+   slice(2) %>% 
+   ungroup() %>%
+   select(value)
# A tibble: 1 x 1
  value
  <dbl>
1   113
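
To make the difference visible, here is a minimal sketch (assuming a reasonably recent dplyr; the data frame is rebuilt by hand from the question's dput output, and the names grouped, kept and only_value are just for illustration). It compares the columns you get with and without ungroup():

```r
library(dplyr)

# Rebuilt from the dput() output in the question
mydata <- data.frame(
  station_number = factor(rep("01AD002", 4)),
  year = 1981:1984,
  month = 1L,
  day = 1L,
  value = c(113, 8.329999924, 15.60000038, 149)
)

grouped <- mydata %>%
  group_by(month, day, station_number) %>%
  arrange(desc(value)) %>%
  slice(3)

# select() on a grouped tibble keeps the grouping columns
# (this is the step that prints the "Adding missing grouping variables" message)
kept <- grouped %>% select(value)
names(kept)

# After ungroup(), select() returns only the column that was asked for
only_value <- grouped %>% ungroup() %>% select(value)
names(only_value)
```

If you need the grouping columns later (for example to know which site each percentile belongs to), keeping them and ignoring the message is also a reasonable choice.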

4
votes

Did you update dplyr recently, by chance? I wonder if your dplyr::arrange call has been adversely affected by the change described at https://blog.rstudio.org/2016/06/27/dplyr-0-5-0/

Breaking changes arrange() once again ignores grouping, reverting back to the behaviour of dplyr 0.3 and earlier. This makes arrange() inconsistent with other dplyr verbs, but I think this behaviour is generally more useful. Regardless, it’s not going to change again, as more changes will just cause more confusion.
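
As a rough check of what that change means in practice, here is a small sketch on made-up data (the toy data frame and the station labels "A" and "B" are hypothetical, not from the question). As far as I can tell, slice() still works per group after a global arrange(), and newer dplyr releases also accept .by_group = TRUE in arrange() if you want the rows sorted within groups:

```r
library(dplyr)

# Hypothetical two-station data, interleaved on purpose
toy <- data.frame(
  station_number = c("A", "B", "A", "B", "A", "B"),
  value = c(5, 20, 1, 60, 3, 40)
)

# arrange() sorts the whole table and ignores the grouping,
# but slice() still picks the 2nd row within each group
second_highest <- toy %>%
  group_by(station_number) %>%
  arrange(desc(value)) %>%
  slice(2) %>%
  ungroup()

# Sorting by group first, then by descending value within each group
by_group <- toy %>%
  group_by(station_number) %>%
  arrange(desc(value), .by_group = TRUE) %>%
  ungroup()
```

So the arrange() change alone should not alter which value slice() returns for each site; it mostly affects the row order of the intermediate table.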

2
votes

You can also convert your tibble to a data frame before your select statement using data.frame(). Then dplyr loses track of your grouping variables and isn't worried about them anymore.

qu25 <- mydata %>% 
      group_by(month, day, station_number) %>% 
      arrange(desc(value)) %>% 
      slice(3) %>% 
      data.frame() %>%
      select(value)
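
A quick way to confirm that the grouping is really gone is dplyr's is_grouped_df(). The sketch below uses a made-up toy data frame (the names toy, g and grouped are illustrative only) and compares this approach with ungroup():

```r
library(dplyr)

# Hypothetical data, only to inspect the grouping state
toy <- data.frame(g = c("a", "a", "b"), value = c(1, 2, 3))
grouped <- toy %>% group_by(g)

# Grouped tibble: select() would keep g here
is_grouped_df(grouped)

# Converting to a base data frame drops the grouping (and the tibble class)
plain <- grouped %>% data.frame()
is_grouped_df(plain)

# ungroup() also drops the grouping, but the result stays a tibble
flat <- grouped %>% ungroup()
is_grouped_df(flat)
```

The practical difference is only the class of the result: data.frame() gives a base data frame, while ungroup() keeps a tibble.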