I'm currently working through R for Data Science, and specifically working on exercise 5.7.1 #8 which is analyzing the library(nycflights13)
package data.
The question reads:
- For each plane, count the number of flights before the first delay of greater than 1 hour.
My attempt was to create a table that finds the first "over 60 minute" delay by using the first()
function:
first_del <- flights %>%
select(month, day, flight, dep_time, tailnum, dep_delay) %>%
filter(dep_delay > 60) %>%
group_by(month,day) %>%
arrange(month, day, dep_time) %>%
summarise(flight = first(flight), first_time = first(dep_time))
first_del
# A tibble: 365 x 4
# Groups: month [?]
month day flight first_time
<int> <int> <int> <int>
1 1 1 4576 811
2 1 2 22 126
3 1 3 104 50
4 1 4 608 106
5 1 5 11 37
6 1 6 27 746
7 1 7 145 756
8 1 8 4334 740
9 1 9 51 641
10 1 10 905 743
# ... with 355 more rows
My idea is to tag each row in the flights
tibble 1 if if it matches the month, day, and is less than the flight number of the first delayed flight on that day (for example, from the first_del
tibble above, flight 4576 is the first "over 60 minute delayed" flight on Jan 1, and every other flight before it will count). The desired output would be something like:
flights %>%
filter(dep_time > 805) %>%
select(month, day, flight, dep_time, tag)
# A tibble: 272,933 x 4
month day flight dep_time tag
<int> <int> <int> <int> <int>
1 1 1 269 807 1
2 1 1 4388 809 1
3 1 1 3538 810 1
4 1 1 2395 810 1
5 1 1 4260 811 1
6 1 1 4576 811 1
7 1 1 675 811 0
8 1 1 4537 812 0
9 1 1 914 813 0
10 1 1 346 814 0
Ideally, it would be great to tally all rows less than or equal to the flight number on each day according to the first_del
tibble. I've tried to use many combinations of filter, %in%, mutate, but have not yet been successful. Should I create a custom function?
My ultimate desired output is (with fictitious $count
values):
first_del
# A tibble: 365 x 4
# Groups: month [?]
month day flight first_time count
<int> <int> <int> <int> <int>
1 1 1 4576 811 212
2 1 2 22 126 216
3 1 3 104 50 298
4 1 4 608 106 220
5 1 5 11 37 168
6 1 6 27 746 287
7 1 7 145 756 302
8 1 8 4334 740 246
9 1 9 51 641 235
10 1 10 905 743 313
where $count
is the number of flights that preceded the first delayed flight on that day (as wanted by the question in the links above).