1
votes

I'm currently working through R for Data Science, and specifically working on exercise 5.7.1 #8 which is analyzing the library(nycflights13) package data.

The question reads:

  1. For each plane, count the number of flights before the first delay of greater than 1 hour.

My attempt was to create a table that finds the first "over 60 minute" delay by using the first() function:

first_del <- flights %>%
  select(month, day, flight, dep_time, tailnum, dep_delay) %>%
  filter(dep_delay > 60) %>%
  group_by(month,day) %>%
  arrange(month, day, dep_time) %>%
  summarise(flight = first(flight), first_time = first(dep_time))

first_del

# A tibble: 365 x 4
# Groups:   month [?]
    month   day flight first_time
    <int> <int>  <int>      <int>
 1     1     1   4576        811
 2     1     2     22        126
 3     1     3    104         50
 4     1     4    608        106
 5     1     5     11         37
 6     1     6     27        746
 7     1     7    145        756
 8     1     8   4334        740
 9     1     9     51        641
10     1    10    905        743
# ... with 355 more rows

My idea is to tag each row in the flights tibble 1 if if it matches the month, day, and is less than the flight number of the first delayed flight on that day (for example, from the first_del tibble above, flight 4576 is the first "over 60 minute delayed" flight on Jan 1, and every other flight before it will count). The desired output would be something like:

  flights %>%
  filter(dep_time > 805) %>%
  select(month, day, flight, dep_time, tag)

# A tibble: 272,933 x 4
   month   day flight dep_time   tag
   <int> <int>  <int>    <int>  <int>
 1     1     1    269      807    1
 2     1     1   4388      809    1
 3     1     1   3538      810    1
 4     1     1   2395      810    1
 5     1     1   4260      811    1
 6     1     1   4576      811    1
 7     1     1    675      811    0
 8     1     1   4537      812    0
 9     1     1    914      813    0
10     1     1    346      814    0

Ideally, it would be great to tally all rows less than or equal to the flight number on each day according to the first_del tibble. I've tried to use many combinations of filter, %in%, mutate, but have not yet been successful. Should I create a custom function?

My ultimate desired output is (with fictitious $count values):

 first_del

# A tibble: 365 x 4
# Groups:   month [?]
    month   day flight first_time  count
    <int> <int>  <int>      <int>  <int>
 1     1     1   4576        811    212
 2     1     2     22        126    216
 3     1     3    104         50    298
 4     1     4    608        106    220
 5     1     5     11         37    168
 6     1     6     27        746    287
 7     1     7    145        756    302
 8     1     8   4334        740    246
 9     1     9     51        641    235
10     1    10    905        743    313

where $count is the number of flights that preceded the first delayed flight on that day (as wanted by the question in the links above).

2

2 Answers

0
votes

You can use which.max on a logical vector to determine the first instance satisfying a condition. You also need to check the condition actually occurs.

library(dplyr)
library(nycflights13)

flights %>%
  mutate(dep_delay = coalesce(dep_delay, 0)) %>%
  arrange(month, day, dep_time) %>%
  group_by(tailnum) %>%
  summarise(max_delay = max(dep_delay), 
            which_first_geq_1hr = which.max(dep_delay > 60)) %>%
  ungroup %>%
  filter(max_delay > 60)

I'm assuming that delay means departure delay, NA delay means 0 or at least less than an hour, and I'm ignoring planes that 'failed' to be delayed by more than hour. The coalesce is necessary to avoid which.max(NA).

0
votes

The question is per plane, so you really want to operate grouped by tailnum. You can add a flag column, but really you need to end up with something you can pass to filter (a logical vector) or slice (a vector of row indices). There are various ways to do this, e.g. slice(seq(c(which(dep_delay > 60) - 1, n())[1])), but a nice approach is to use dplyr's cumall (a cumulative version of all, like cumsum is to sum) to generate a logical vector for filter:

library(dplyr)

nycflights13::flights %>% 
    group_by(tailnum) %>% 
    arrange(year, month, day, dep_time) %>%    # ensure order before cumany
    filter(cumall(pmax(dep_delay, arr_delay) < 60)) %>% 
    tally()    # count number of observations per group (tailnum)
#> # A tibble: 3,709 x 2
#> # Groups:   tailnum [3,709]
#>    tailnum     n
#>    <chr>   <int>
#>  1 N10156      9
#>  2 N102UW     25
#>  3 N103US     46
#>  4 N104UW      3
#>  5 N105UW     22
#>  6 N107US     20
#>  7 N108UW     36
#>  8 N109UW     28
#>  9 N110UW     15
#> 10 N11107      7
#> # ... with 3,699 more rows

It's possible to make a intermediate table of first big delays and do a self-join to ID where they are, or add a flag value to some observations with if_else, but regardless, subsetting up to those rows will still require similar logic to the cumall above, so they're really just longer, slower approaches.