2 votes

I'm having a world of issues performing a rolling join of two dataframes in PySpark (and Python in general). I am looking to join two PySpark dataframes by their ID and the closest date backwards, meaning the date in the second dataframe cannot be greater than the date in the first.

Table_1:

+-----+------------+-------+
| ID  |    Date    | Value |
+-----+------------+-------+
| A1  | 01-15-2020 |     5 |
| A2  | 01-20-2020 |    10 |
| A3  | 02-21-2020 |    12 |
| A1  | 01-21-2020 |     6 |
+-----+------------+-------+

Table_2:

+-----+------------+---------+
| ID  |    Date    | Value 2 |
+-----+------------+---------+
| A1  | 01-10-2020 |       1 |
| A1  | 01-12-2020 |       5 |
| A1  | 01-16-2020 |       3 |
| A2  | 01-25-2020 |      20 |
| A2  | 01-01-2020 |      12 |
| A3  | 01-31-2020 |      14 |
| A3  | 01-30-2020 |      12 |
+-----+------------+---------+

Desired Result:

+-----+------------+-------+---------+
| ID  |    Date    | Value | Value 2 |
+-----+------------+-------+---------+
| A1  | 01-15-2020 |     5 |       5 |
| A2  | 01-20-2020 |    10 |      12 |
| A3  | 02-21-2020 |    12 |      14 |
| A1  | 01-21-2020 |     6 |       3 |
+-----+------------+-------+---------+

In essence, I understand that a SQL query could do the trick, since I can run spark.sql("query"), but I'm open to anything else. I've tried several approaches that don't work in a Spark context. Thanks!
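For reference, this is roughly the shape of SQL I have in mind, assuming the two dataframes are registered as temp views named table_1 and table_2, that Table_2's value column is called Value_2, and that the dates are MM-dd-yyyy strings (the view names, column names, date format, and the assumption that (ID, Date) uniquely identifies a row of Table_1 are all placeholders):

# Sketch only: view/column names, the date format, and the uniqueness of
# (ID, Date) in table_1 are assumptions, not part of my actual data.
table_1.createOrReplaceTempView("table_1")
table_2.createOrReplaceTempView("table_2")

result = spark.sql("""
    SELECT ID, Date, Value, Value_2
    FROM (
        SELECT t1.ID, t1.Date, t1.Value, t2.Value_2,
               ROW_NUMBER() OVER (
                   PARTITION BY t1.ID, t1.Date
                   ORDER BY to_date(t2.Date, 'MM-dd-yyyy') DESC
               ) AS rn
        FROM table_1 t1
        LEFT JOIN table_2 t2
          ON t1.ID = t2.ID
         AND to_date(t2.Date, 'MM-dd-yyyy') <= to_date(t1.Date, 'MM-dd-yyyy')
    ) ranked
    WHERE rn = 1
""")
result.show()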


2 Answers

1 vote

Here is my attempt.

First, I determine the Date_2 that meets your condition. After that, I join the second dataframe again to get Value_2.

from pyspark.sql.functions import monotonically_increasing_id, unix_timestamp, max

# Tag each row of df1 with a unique id, join to df2 on ID, keep only rows where
# Date_2 is not after Date, then take the latest such Date_2 per original row.
df3 = df1.withColumn('newId', monotonically_increasing_id()) \
  .join(df2, 'ID', 'left') \
  .where(unix_timestamp('Date', 'M/dd/yy') >= unix_timestamp('Date_2', 'M/dd/yy')) \
  .groupBy(*df1.columns, 'newId') \
  .agg(max('Date_2').alias('Date_2'))
df3.orderBy('newId').show(20, False)

+---+-------+-----+-----+-------+
|ID |Date   |Value|newId|Date_2 |
+---+-------+-----+-----+-------+
|A1 |1/15/20|5    |0    |1/12/20|
|A2 |1/20/20|10   |1    |1/11/20|
|A3 |2/21/20|12   |2    |1/31/20|
|A1 |1/21/20|6    |3    |1/16/20|
+---+-------+-----+-----+-------+

# Join back to df2 on (ID, Date_2) to pick up the matching Value_2,
# then restore the original row order and drop the helper columns.
df3.join(df2, ['ID', 'Date_2'], 'left') \
  .orderBy('newId') \
  .drop('Date_2', 'newId') \
  .show(20, False)

+---+-------+-----+-------+
|ID |Date   |Value|Value_2|
+---+-------+-----+-------+
|A1 |1/15/20|5    |5      |
|A2 |1/20/20|10   |12     |
|A3 |2/21/20|12   |14     |
|A1 |1/21/20|6    |3      |
+---+-------+-----+-------+
1 vote
from pyspark.sql.functions import datediff, to_date, min

df1 = spark.createDataFrame([('A1', '1/15/2020', 5),
                             ('A2', '1/20/2020', 10),
                             ('A3', '2/21/2020', 12),
                             ('A1', '1/21/2020', 6)],
                            ['ID1', 'Date1', 'Value1'])

df2 = spark.createDataFrame([('A1', '1/10/2020', 1),
                             ('A1', '1/12/2020', 5),
                             ('A1', '1/16/2020', 3),
                             ('A2', '1/25/2020', 20),
                             ('A2', '1/1/2020', 12),
                             ('A3', '1/31/2020', 14),
                             ('A3', '1/30/2020', 12)],
                            ['ID2', 'Date2', 'Value2'])

# Join on ID and compute how many days Date2 lies before Date1; keep only rows
# where Date2 is not after Date1 (distance >= 0).
df2 = df1.join(df2, df1.ID1 == df2.ID2) \
    .withColumn("distance", datediff(to_date(df1.Date1, 'MM/dd/yyyy'),
                                     to_date(df2.Date2, 'MM/dd/yyyy'))) \
    .filter("distance >= 0")

# For each row of df1, keep the candidate with the smallest distance
# (the closest earlier date) and pull in its Value2.
df2.groupBy(df2.ID1, df2.Date1, df2.Value1) \
   .agg(min(df2.distance).alias('distance')).join(df2, ['ID1', 'Date1', 'distance']) \
   .select(df2.ID1, df2.Date1, df2.Value1, df2.Value2).orderBy('ID1', 'Date1').show()