Rolling join using dates in PySpark?

Question

I'm trying to join two PySpark dataframes on a key, with the extra condition that the date in the first table must always come after the date in the second table. As an example, we have two tables that we're trying to join:

Table 1:

    Date1    value1   key
13 Feb 2020    1       a
01 Mar 2020    2       a
31 Mar 2020    3       a
15 Apr 2020    4       a

Table 2:

    Date2    value2  key
10 Feb 2020    11     a
15 Mar 2020    22     a

After the join, the result should be something like this:

    Date1    value1 value2  key
13 Feb 2020    1      11     a
01 Mar 2020    2     null    a
31 Mar 2020    3      22     a
15 Apr 2020    4     null    a

Any ideas?

mck · Accepted Answer · 2021-01-14T15:27:22

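This is an interesting join. My approach is to join on the key first, keep only the pairs where Date1 is later than Date2, pick the earliest Date1 for each Date2, and then join the result back to the first table so that unmatched rows are kept with a null value2.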

from pyspark.sql import functions as F, Window

# Clean up date format first
df3 = df1.withColumn('Date1', F.to_date('Date1', 'dd MMM yyyy'))
df4 = df2.withColumn('Date2', F.to_date('Date2', 'dd MMM yyyy'))

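# Partition by key as well as Date2 so that rows from different keys never compete
w = Window.partitionBy('key', 'Date2').orderBy('Date1')

result = (df3.join(df4, 'key')
             .filter('Date1 > Date2')                   # keep only Date1 after Date2
             .withColumn('rn', F.row_number().over(w))  # rank candidate Date1 values per Date2
             .filter('rn = 1')                          # earliest Date1 for each Date2
             .drop('rn', 'Date2')
             .join(df3, ['key', 'Date1', 'value1'], 'right')  # bring back unmatched rows
             .select('Date1', 'value1', 'value2', 'key')
         )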

result.show()
+----------+------+------+---+
|Date1     |value1|value2|key|
+----------+------+------+---+
|2020-02-13|1     |11    |a  |
|2020-03-01|2     |null  |a  |
|2020-03-31|3     |22    |a  |
|2020-04-15|4     |null  |a  |
+----------+------+------+---+
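
To make the matching rule explicit, here is a minimal plain-Python sketch (no Spark involved) of what the query above computes, assuming the intended rule is "each Table 2 row attaches to the earliest Table 1 row with the same key and a strictly later date". The function name `rolling_match` and the tuple layout are hypothetical, chosen only for illustration; this is a way to check the logic on small data, not a replacement for the Spark code.

```python
from datetime import date


def rolling_match(table1, table2):
    """table1 rows are (date1, value1, key); table2 rows are (date2, value2, key).

    Returns (date1, value1, value2, key) for every table1 row, in table1 order,
    with value2 set to None where no table2 row attaches to it.
    """
    matched = {}  # (key, date1, value1) -> value2
    for date2, value2, key2 in table2:
        # Candidate table1 rows: same key, date strictly after date2
        candidates = [row for row in table1 if row[2] == key2 and row[0] > date2]
        if candidates:
            first = min(candidates, key=lambda row: row[0])  # earliest Date1
            # Simplification: if two table2 rows land on the same table1 row,
            # the later one in the list wins here, whereas the Spark query
            # would emit one output row per match.
            matched[(first[2], first[0], first[1])] = value2
    return [(d, v, matched.get((k, d, v)), k) for d, v, k in table1]


# Sample data copied from the question
table1 = [
    (date(2020, 2, 13), 1, "a"),
    (date(2020, 3, 1), 2, "a"),
    (date(2020, 3, 31), 3, "a"),
    (date(2020, 4, 15), 4, "a"),
]
table2 = [
    (date(2020, 2, 10), 11, "a"),
    (date(2020, 3, 15), 22, "a"),
]

result = rolling_match(table1, table2)
for row in result:
    print(row)
```

On the sample data this gives value2 = 11 for 13 Feb, 22 for 31 Mar, and None for the other two rows, which matches the table in the question.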
