3
votes

I'm learning data.table and am trying to subset and re-order grouped data while calculating averages, all in one data.table statement.

I got the flight data from here while following along in this tutorial.

From the tutorial,

How can we get the average arrival and departure delay for each orig,dest pair for each month for carrier code “AA”?

ans <- flights[carrier == "AA",
        .(mean(arr_delay), mean(dep_delay)),
        by = .(origin, dest, month)]
ans
#      origin dest month         V1         V2
#   1:    JFK  LAX     1   6.590361 14.2289157
#   2:    LGA  PBI     1  -7.758621  0.3103448
#   3:    EWR  LAX     1   1.366667  7.5000000
#   4:    JFK  MIA     1  15.720670 18.7430168
#   5:    JFK  SEA     1  14.357143 30.7500000
#  ---                                        
# 196:    LGA  MIA    10  -6.251799 -1.4208633
# 197:    JFK  MIA    10  -1.880184  6.6774194
# 198:    EWR  PHX    10  -3.032258 -4.2903226
# 199:    JFK  MCO    10 -10.048387 -1.6129032
# 200:    JFK  DCA    10  16.483871 15.5161290

This makes sense to me. However, I want to improve on this by organizing the flights which departed from the same origin together. Following the schema of...

data.table(subset, select/compute, group by)

I came up with this:

flights[(carrier=="AA")[order(origin, dest)], 
       .(Arrival_Delay=mean(arr_delay), Depart_Delay=mean(dep_delay)), 
       by=.(origin, dest, month)]

But my call to order doesn't seem to have done anything. Why?

I can achieve this in dplyr with:

 flights %>%
  filter(carrier=="AA") %>%
  group_by(origin, dest, month) %>%
  summarize(Arrival_Delay=mean(arr_delay), Depart_Delay=mean(dep_delay)) %>%
  as.data.table()

I'm curious how to do this in data.table and why my approach did not work.

1
put the [order()] call at the end of the chain: flights[(carrier=="AA"), .(Arrival_Delay=mean(arr_delay), Depart_Delay=mean(dep_delay)), by=.(origin, dest, month)][order(origin, dest)] - SymbolixAU
That worked! Although I'm wondering what's wrong with (carrier=="AA")[order(origin, dest)] as a re-ordering of those logicals... - Wassadamo
@PanFrancisco: read the comment in this to understand why it won't work, stackoverflow.com/questions/46462054/… - Hardik Gupta

1 Answers

4
votes

With data.table, you can use keyby when you want to sort a result after a by operation. From help("data.table", package = "data.table")

keyby
Same as by, but with an additional setkey() run on the by columns of the result, for convenience. It is common practice to use 'keyby=' routinely when you wish the result to be sorted.

You could then use keyby instead of by in your code


library(data.table)
# get the data from the web
flights <- fread("https://github.com/arunsrinivasan/flights/wiki/NYCflights14/flights14.csv")
# keyby will call setkey and then sort your result
ans <- flights[carrier == "AA",
               .(mean(arr_delay), mean(dep_delay)),
               keyby = .(origin, dest, month)]
ans
#>      origin dest month         V1         V2
#>   1:    EWR  DFW     1   6.427673 10.0125786
#>   2:    EWR  DFW     2  10.536765 11.3455882
#>   3:    EWR  DFW     3  12.865031  8.0797546
#>   4:    EWR  DFW     4  17.792683 12.9207317
#>   5:    EWR  DFW     5  18.487805 18.6829268
#>  ---                                        
#> 196:    LGA  PBI     1  -7.758621  0.3103448
#> 197:    LGA  PBI     2  -7.865385  2.4038462
#> 198:    LGA  PBI     3  -5.754098  3.0327869
#> 199:    LGA  PBI     4 -13.966667 -4.7333333
#> 200:    LGA  PBI     5 -10.357143 -6.8571429