Group_by then filter with dplyr

Question

First of all, please let me know if what I'm doing is poor use of dplyr because I'm not sure if I'm approaching this in the best way. I have the following dataframe:

mydf = data.frame(user = c(7,7,7,7,7,7,7,8,8,8,8,8,8),
                  col1 = c('0','0','1','1','0','3','NULL','3','3','0','1','0','0'),
                  col2 = runif(n=13),
                  col3 = letters[1:13], 
                  stringsAsFactors = FALSE)

> mydf
   user col1      col2 col3
1     7    0 0.7607907    a
2     7    0 0.1580448    b
3     7    1 0.8063540    c
4     7    1 0.7331512    d
5     7    0 0.2433631    e
6     7    3 0.2357065    f
7     7 NULL 0.4864172    g
8     8    3 0.6806089    h
9     8    3 0.2229874    i
10    8    0 0.6187911    j
11    8    1 0.7617177    k
12    8    0 0.5884821    l
13    8    0 0.4985750    m

The filtering I would like to do is a bit wordy, but I'll try - I would like to filter the dataframe by removing all rows where col1 == '0' if that row occurs AFTER the first row for that user where col1 == '1'. (bolded indicates i messed up the original question, and switched the 0 and 1).

For example, for user 7, the 3rd row has col1 == '1', so I'd like to filter all rows after row 3 where col1 == '0' (in this case, only row 5). Then, for user 8, the 11th row is the first row for this user where col1 == '1', so I'd then like to filter rows 12 and 13, since col1 == '0'.

My final output should be something like this:

> mydf
   user col1      col2 col3
1     7    0 0.7607907    a
2     7    0 0.1580448    b
3     7    1 0.8063540    c
4     7    1 0.7331512    d
6     7    3 0.2357065    f
7     7 NULL 0.4864172    g
8     8    3 0.6806089    h
9     8    3 0.2229874    i
10    8    0 0.6187911    j
11    8    1 0.7617177    k

I've tried the following, but it didn't work. I thought adding a rownums column, then grouping by user, and then filtering how I've described would work. My thoughts are that there is something wrong with my filter call:

mydf %>%
  mutate(rownums = 1:nrow(mydf)) %>%
  group_by(user) %>%
  filter(!(col1 == "0" & rownums > min(which(col1 == "1"))))

# A tibble: 9 x 5
# Groups:   col0 [2]
   user  col1      col2  col3 rownums
  <dbl> <chr>     <dbl> <chr>   <int>
1     7     0 0.2088034     a       1
2     7     0 0.2081894     b       2
3     7     1 0.1825428     c       3
4     7     1 0.2143353     d       4
5     7     3 0.1979774     f       6
6     7  NULL 0.2990799     g       7
7     8     3 0.7808038     h       8
8     8     3 0.1694272     i       9
9     8     1 0.1526450     k      11

The difference between this output, and the correct output, is that this output incorrectly also filtered row 10 of the original dataframe.

Any help with this is appreciated!

EDIT - I'm particularly curious if group_by() %>% filter() is bad practice in R with regards to dplyr. 99% of my group_by() are followed by summarize(), which obviously makes more sense.

EDIT2 - I think I've got it!

mydf %>%
  group_by(col0) %>%
  mutate(rownums = 1:length(col0)) %>%
  filter(!(col1 == "0" & rownums > min(which(col1 == "1"))))

Simply flipping the order of the mutate() and group_by() calls, and tweaking the mutate() call a bit, looks to have gotten it done. I'm open to hearing a better approach though.

liborm liborm · Accepted Answer · 2017-10-10T07:18:12

There is a cumany function, which is useful for these sequential conditions, like this:

mydf %>%
  group_by(user) %>% 
  mutate(seen_one = cumany(col1 == "1")) %>%
  filter(!seen_one | col1 != "0")

That is mark all rows after "1" has been in the 'stream' with seen_one, and then keep the rows where one of the conditions is not met. (Semantics of filter require to invert the condition to 'get rid' of the rows, !(A & B) == !A | !B.)

Group_by then filter with dplyr

4 Answers