I have the following pySpark dataframe:
+------------------+------------------+--------------------+--------------+-------+
| col1| col2| col3| X| Y|
+------------------+------------------+--------------------+--------------+-------+
|2.1729247374294496| 3.558069532647046| 6.607603368496324| 1| null|
|0.2654841575294071|1.2633077949463256|0.023578679968183733| 0| null|
|0.4253301781296708|3.4566490739823483| 0.11711202266039554| 3| null|
| 2.608497168338446| 3.529397129549324| 0.373034222141551| 2| null|
+------------------+------------------+--------------------+--------------+-------+
It is a rather simple operation and I can easily do it with pandas. However, I need to do it using only pySpark.
I want to do the following (I`ll write in sort of pseudocode):
In row where col3 == max(col3), change Y from null to 'K'
In the remaining rows, in the row where col1 == max(col1), change Y from null to 'Z'
In the remaining rows, in the row where col1 == min(col1), change Y from null to 'U'
In the remaining row: change Y from null to 'I'.
Therefore, the expected output is:
+------------------+------------------+--------------------+--------------+-------+
| col1| col2| col3| X| Y|
+------------------+------------------+--------------------+--------------+-------+
|2.1729247374294496| 3.558069532647046| 6.607603368496324| 1| K|
|0.2654841575294071|1.2633077949463256|0.023578679968183733| 0| U|
|0.4253301781296708|3.4566490739823483| 0.11711202266039554| 3| I|
| 2.608497168338446| 3.529397129549324| 0.373034222141551| 2| Z|
+------------------+------------------+--------------------+--------------+-------+
Having that done, I need to use this table as lookup for another table:
+--------------------+--------+-----+------------------+--------------+------------+
| x1| x2| x3| x4| X| d|
+--------------------+--------+-----+------------------+--------------+------------+
|0057f68a-6330-42a...| 2876| 30| 5.989999771118164| 0| 20171219|
|05cc0191-4ee4-412...| 108381| 34|24.979999542236328| 3| 20171219|
|06f353af-e9d3-4d0...| 118798| 34| 0.0| 3| 20171219|
|0c69b607-112b-4f3...| 20993| 34| 0.0| 0| 20171219|
|0d1b52ba-1502-4ff...| 23817| 34| 0.0| 0| 20171219|
I want to use the first table as lookup to create a new column in second table. The values for the new column should be looked up in column Y in first table using X column in second table as key (so we lookup values in column Y in first table corresponding to values in column X, and those values come from column X in second table).
UPD: I need a solution robust to one row satisfying two conditions, for example:
+------------------+------------------+--------------------+--------------+-------+
| col1| col2| col3| X| Y|
+------------------+------------------+--------------------+--------------+-------+
| 2.608497168338446| 3.558069532647046| 6.607603368496324| 1| null|
|0.2654841575294071|1.2633077949463256|0.023578679968183733| 0| null|
|0.4253301781296708|3.4566490739823483| 0.11711202266039554| 3| null|
|2.1729247374294496| 3.529397129549324| 0.373034222141551| 2| null|
+------------------+------------------+--------------------+--------------+-------+
In this case row 0 satisfies both max('col3') and max('col1') conditions.
So what needs to happen is this:
Row 0 becomes 'K'
Row 3 becomes 'Z' (because out of remaining rows (0 already has 'K' row 3 satisfies max('col1') condition
Row 1 becomes 'U'
Row 2 becomes 'I'
I cannot cannot have multiple rows in table 1 with 'I' in them.