I have two different data frames called df and cr each with different columns and row content as shown below. I am trying to find the number of flights on a given route that occur for a given fleet type between the install and removal dates shown in the cr dataframe and create a new column with this sum. The df dataframe contains the fleet, date, route, and count information needed. I envision needing to filter df from the information contained in each row of cr. The filtering would be by fleet, route, and inclusive date range. After df is filter, route count would be summed and put in the new column for the given row in cr then move to the next row. I would typically do this in python using a for loop but my dataframes are quite large and looping is tedious in Pyspark.
My current attempt:
def component_route_normalized(component_route_count, fleet_month_year_count):
cr=component_route_count
df=fleet_month_year_count
temp=df.filter(
F.col('month-year').between(pd.to_datetime(cr.date_installed),pd.to_datetime(cr.date_removed)) &
F.col('route') == cr.route &
F.col('fleet_type') == cr.fmis_fleet_type_code
)
cr=cr.withColumn('fleet_route_count', F.sum(temp.route_count))
return cr
example of cr dataframe content:
+-----------+----------------------+------------------------+------------+-------+-------------+---------+-------+--------------------+--------------+------------+
|aircraft_id|nca_part_number_sse001|nca_serial_number_sse001|repair_cycle| route|part__routing|departure|arrival|fmis_fleet_type_code|date_installed|date_removed|
+-----------+----------------------+------------------------+------------+-------+-------------+---------+-------+--------------------+--------------+------------+
| 1| 25-3246-9-0001| 341| 8|PVG-EWR| 13| PVG| EWR| 777| 2014-12-16| 2015-12-10|
| 1| 25-3246-9-0001| 341| 8|EWR-TLV| 34| EWR| TLV| 777| 2014-12-16| 2015-12-10|
| 1| 25-3246-9-0001| 341| 8|CDG-EWR| 4| CDG| EWR| 777| 2014-12-16| 2015-12-10|
| 1| 25-3246-9-0001| 341| 8|DEL-EWR| 16| DEL| EWR| 777| 2014-12-16| 2015-12-10|
| 1| 25-3246-9-0001| 341| 8|EWR-MXP| 3| EWR| MXP| 777| 2014-12-16| 2015-12-10|
| 1| 25-3246-9-0001| 341| 8|EWR-LHR| 7| EWR| LHR| 777| 2014-12-16| 2015-12-10|
| 1| 25-3246-9-0001| 341| 8|TLV-EWR| 34| TLV| EWR| 777| 2014-12-16| 2015-12-10|
| 1| 25-3246-9-0001| 341| 8|NRT-IAH| 15| NRT| IAH| 777| 2014-12-16| 2015-12-10|
| 1| 25-3246-9-0001| 341| 8|IAH-FRA| 5| IAH| FRA| 777| 2014-12-16| 2015-12-10|
| 1| 25-3246-9-0001| 341| 8|EWR-CDG| 4| EWR| CDG| 777| 2014-12-16| 2015-12-10|
| 1| 25-3246-9-0001| 341| 8|BRU-EWR| 8| BRU| EWR| 777| 2014-12-16| 2015-12-10|
| 1| 25-3246-9-0001| 341| 8|NRT-EWR| 17| NRT| EWR| 777| 2014-12-16| 2015-12-10|
| 1| 25-3246-9-0001| 341| 8|FRA-EWR| 11| FRA| EWR| 777| 2014-12-16| 2015-12-10|
| 1| 25-3246-9-0001| 341| 8|EWR-PVG| 14| EWR| PVG| 777| 2014-12-16| 2015-12-10|
+-----------+----------------------+------------------------+------------+-------+-------------+---------+-------+--------------------+--------------+------------+
example of df dataframe content:
+----------+-------+----------+-----------+------------+-----------+
|month-year| route|fleet_type|route_count|flight_month|flight_year|
+----------+-------+----------+-----------+------------+-----------+
| 6/1/2014|PHL-ORD| 737| 92| 6| 2014|
| 4/1/2014|IAH-TUL| 787| 23| 4| 2014|
| 4/1/2014|DFW-ORD| 737| 86| 4| 2014|
| 5/1/2014|BRO-IAH| 737| 33| 5| 2014|
| 4/1/2014|YQR-ORD| 787| 9| 4| 2014|
| 3/1/2014|SFO-IAH| 757| 58| 3| 2014|
| 4/1/2014|AUS-IAH| BUS| 55| 4| 2014|
| 5/1/2014|AGU-IAH| 787| 1| 5| 2014|
+----------+-------+----------+-----------+------------+-----------+