mutate or filter according to conditions from tibble

Question

I'm currently working through R for Data Science, and specifically working on exercise 5.7.1 #8 which is analyzing the library(nycflights13) package data.

The question reads:

For each plane, count the number of flights before the first delay of greater than 1 hour.

My attempt was to create a table that finds the first "over 60 minute" delay by using the first() function:

first_del <- flights %>%
  select(month, day, flight, dep_time, tailnum, dep_delay) %>%
  filter(dep_delay > 60) %>%
  group_by(month,day) %>%
  arrange(month, day, dep_time) %>%
  summarise(flight = first(flight), first_time = first(dep_time))

first_del

# A tibble: 365 x 4
# Groups:   month [?]
    month   day flight first_time
    <int> <int>  <int>      <int>
 1     1     1   4576        811
 2     1     2     22        126
 3     1     3    104         50
 4     1     4    608        106
 5     1     5     11         37
 6     1     6     27        746
 7     1     7    145        756
 8     1     8   4334        740
 9     1     9     51        641
10     1    10    905        743
# ... with 355 more rows

My idea is to tag each row in the flights tibble 1 if if it matches the month, day, and is less than the flight number of the first delayed flight on that day (for example, from the first_del tibble above, flight 4576 is the first "over 60 minute delayed" flight on Jan 1, and every other flight before it will count). The desired output would be something like:

  flights %>%
  filter(dep_time > 805) %>%
  select(month, day, flight, dep_time, tag)

# A tibble: 272,933 x 4
   month   day flight dep_time   tag
   <int> <int>  <int>    <int>  <int>
 1     1     1    269      807    1
 2     1     1   4388      809    1
 3     1     1   3538      810    1
 4     1     1   2395      810    1
 5     1     1   4260      811    1
 6     1     1   4576      811    1
 7     1     1    675      811    0
 8     1     1   4537      812    0
 9     1     1    914      813    0
10     1     1    346      814    0

Ideally, it would be great to tally all rows less than or equal to the flight number on each day according to the first_del tibble. I've tried to use many combinations of filter, %in%, mutate, but have not yet been successful. Should I create a custom function?

My ultimate desired output is (with fictitious $count values):

 first_del

# A tibble: 365 x 4
# Groups:   month [?]
    month   day flight first_time  count
    <int> <int>  <int>      <int>  <int>
 1     1     1   4576        811    212
 2     1     2     22        126    216
 3     1     3    104         50    298
 4     1     4    608        106    220
 5     1     5     11         37    168
 6     1     6     27        746    287
 7     1     7    145        756    302
 8     1     8   4334        740    246
 9     1     9     51        641    235
10     1    10    905        743    313

where $count is the number of flights that preceded the first delayed flight on that day (as wanted by the question in the links above).

Hugh Hugh · Accepted Answer · 2018-07-21T05:22:24

You can use which.max on a logical vector to determine the first instance satisfying a condition. You also need to check the condition actually occurs.

library(dplyr)
library(nycflights13)

flights %>%
  mutate(dep_delay = coalesce(dep_delay, 0)) %>%
  arrange(month, day, dep_time) %>%
  group_by(tailnum) %>%
  summarise(max_delay = max(dep_delay), 
            which_first_geq_1hr = which.max(dep_delay > 60)) %>%
  ungroup %>%
  filter(max_delay > 60)

I'm assuming that delay means departure delay, NA delay means 0 or at least less than an hour, and I'm ignoring planes that 'failed' to be delayed by more than hour. The coalesce is necessary to avoid which.max(NA).

mutate or filter according to conditions from tibble

2 Answers