I have a DataFrame df that looks like this:
from pyspark.sql.functions import lit, col, create_map

df = spark.createDataFrame(
    [
        ("1", "A", "B", "2020-01-01", 6),
        ("2", "A", "B", "2020-01-01", 6),
        ("3", "A", "C", "2020-01-01", 6),
        ("4", "A", "C", "2020-01-01", 6),
        ("5", "B", "D", "2020-01-01", 10),
        ("6", "B", "D", "2020-01-01", 10),
    ],
    ["id", "map1", "map2", "date", "var"],
)
+---+----+----+----------+---+
| id|map1|map2| date|var|
+---+----+----+----------+---+
| 1| A| B|2020-01-01| 6|
| 2| A| B|2020-01-01| 6|
| 3| A| C|2020-01-01| 6|
| 4| A| C|2020-01-01| 6|
| 5| B| D|2020-01-01| 10|
| 6| B| D|2020-01-01| 10|
+---+----+----+----------+---+
Now I would like to map using the map1 and map2 columns such that ... as shown in the screenshot below.
Note that for each distinct map1 value (A, B) the var values are the same (6, 10); map1 cannot be null, but map2 can be null.
I want to do this without using join/RDD/UDF as much as possible, relying only on pure PySpark functions for performance.
First, I create a map column of key : value pairs:
df = df.withColumn("mapp", create_map('map1', 'var'))
I tried something like the following, but it obviously does not work dynamically, since each row's mapp only contains that row's own key:
df = df.withColumn('var_mapped', df["mapp"].getItem(df['map1']))
What are some solutions/functions to use in this case? Any help would be appreciated.
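For concreteness, this is roughly the direction I have been thinking about: build one global lookup map from all distinct (map1, var) pairs with a window, then index it with map2. This is just a sketch (it assumes Spark 2.4+ for map_from_entries, and the empty partitionBy pulls all rows into a single partition, which should be fine here because the key set is tiny), and I am not sure it is the idiomatic or most performant way:

from pyspark.sql import functions as F
from pyspark.sql import Window

# one window over the whole DataFrame (no partition key)
w = Window.partitionBy()

df = (
    df
    # collect the distinct (map1, var) pairs visible to every row
    .withColumn("pairs", F.collect_set(F.struct("map1", "var")).over(w))
    # turn the array of structs into a single map, e.g. {A -> 6, B -> 10}
    .withColumn("lookup", F.map_from_entries("pairs"))
    # look up map2 in that map; keys with no match come back as null
    .withColumn("var_mapped", F.col("lookup")[F.col("map2")])
    .drop("pairs", "lookup")
)

Is there a cleaner or cheaper built-in way to do this kind of column-to-column lookup?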