0
votes

I have two different data frames called df and cr each with different columns and row content as shown below. I am trying to find the number of flights on a given route that occur for a given fleet type between the install and removal dates shown in the cr dataframe and create a new column with this sum. The df dataframe contains the fleet, date, route, and count information needed. I envision needing to filter df from the information contained in each row of cr. The filtering would be by fleet, route, and inclusive date range. After df is filter, route count would be summed and put in the new column for the given row in cr then move to the next row. I would typically do this in python using a for loop but my dataframes are quite large and looping is tedious in Pyspark.

My current attempt:

def component_route_normalized(component_route_count, fleet_month_year_count):
    cr=component_route_count
    df=fleet_month_year_count

    temp=df.filter(
         F.col('month-year').between(pd.to_datetime(cr.date_installed),pd.to_datetime(cr.date_removed)) &
         F.col('route') == cr.route &
         F.col('fleet_type') == cr.fmis_fleet_type_code
    )

    cr=cr.withColumn('fleet_route_count', F.sum(temp.route_count))

return cr

example of cr dataframe content:

+-----------+----------------------+------------------------+------------+-------+-------------+---------+-------+--------------------+--------------+------------+
|aircraft_id|nca_part_number_sse001|nca_serial_number_sse001|repair_cycle|  route|part__routing|departure|arrival|fmis_fleet_type_code|date_installed|date_removed|
+-----------+----------------------+------------------------+------------+-------+-------------+---------+-------+--------------------+--------------+------------+
|          1|        25-3246-9-0001|                     341|           8|PVG-EWR|           13|      PVG|    EWR|                 777|    2014-12-16|  2015-12-10|
|          1|        25-3246-9-0001|                     341|           8|EWR-TLV|           34|      EWR|    TLV|                 777|    2014-12-16|  2015-12-10|
|          1|        25-3246-9-0001|                     341|           8|CDG-EWR|            4|      CDG|    EWR|                 777|    2014-12-16|  2015-12-10|
|          1|        25-3246-9-0001|                     341|           8|DEL-EWR|           16|      DEL|    EWR|                 777|    2014-12-16|  2015-12-10|
|          1|        25-3246-9-0001|                     341|           8|EWR-MXP|            3|      EWR|    MXP|                 777|    2014-12-16|  2015-12-10|
|          1|        25-3246-9-0001|                     341|           8|EWR-LHR|            7|      EWR|    LHR|                 777|    2014-12-16|  2015-12-10|
|          1|        25-3246-9-0001|                     341|           8|TLV-EWR|           34|      TLV|    EWR|                 777|    2014-12-16|  2015-12-10|
|          1|        25-3246-9-0001|                     341|           8|NRT-IAH|           15|      NRT|    IAH|                 777|    2014-12-16|  2015-12-10|
|          1|        25-3246-9-0001|                     341|           8|IAH-FRA|            5|      IAH|    FRA|                 777|    2014-12-16|  2015-12-10|
|          1|        25-3246-9-0001|                     341|           8|EWR-CDG|            4|      EWR|    CDG|                 777|    2014-12-16|  2015-12-10|
|          1|        25-3246-9-0001|                     341|           8|BRU-EWR|            8|      BRU|    EWR|                 777|    2014-12-16|  2015-12-10|
|          1|        25-3246-9-0001|                     341|           8|NRT-EWR|           17|      NRT|    EWR|                 777|    2014-12-16|  2015-12-10|
|          1|        25-3246-9-0001|                     341|           8|FRA-EWR|           11|      FRA|    EWR|                 777|    2014-12-16|  2015-12-10|
|          1|        25-3246-9-0001|                     341|           8|EWR-PVG|           14|      EWR|    PVG|                 777|    2014-12-16|  2015-12-10|
+-----------+----------------------+------------------------+------------+-------+-------------+---------+-------+--------------------+--------------+------------+ 

example of df dataframe content:

+----------+-------+----------+-----------+------------+-----------+
|month-year|  route|fleet_type|route_count|flight_month|flight_year|
+----------+-------+----------+-----------+------------+-----------+
|  6/1/2014|PHL-ORD|       737|         92|           6|       2014|
|  4/1/2014|IAH-TUL|       787|         23|           4|       2014|
|  4/1/2014|DFW-ORD|       737|         86|           4|       2014|
|  5/1/2014|BRO-IAH|       737|         33|           5|       2014|
|  4/1/2014|YQR-ORD|       787|          9|           4|       2014|
|  3/1/2014|SFO-IAH|       757|         58|           3|       2014|
|  4/1/2014|AUS-IAH|       BUS|         55|           4|       2014|
|  5/1/2014|AGU-IAH|       787|          1|           5|       2014|
+----------+-------+----------+-----------+------------+-----------+
1

1 Answers

0
votes

A join on fleet type, date range and route followed by groupBy and count should do the job. You could try something like this (tweaks to deal with column types will probably be needed):

cond =  (F.col('cr.month-year').between(F.col('df.date_installed',F.col('df.date_removed')) &
         (F.col('df.route') == F.col('cr.route')) &
         (F.col('df.fleet_type') == F.col('cr.fmis_fleet_type_code')))

joined = cr.alias('cr').join(df.alias('df'), on=cond, how='inner')
counts = joined.groupBy(*cr.columns).agg(F.sum('route_count').alias('total_route_count'))