0
votes

I have two DataFrames as follows,

DataFrame A:

DEPT_ID DEPT_NAME
10      Finance
20      Marketing

DataFrame B:

EMP_ID DEPT_ID EMP_NAME EMP_SALARY
101    10      AAAA     1000
102    20      BBBB     2000
103    10      CCCC     1500
104    20      DDDD     3000

Expected Result: in Pyspark, I need to join Data Frame A & B on DEPT_ID and convert the non key columns in Dataframe B to_json string and store it in Json_Data column in Dataframe C

DataFrame C:

DEPT_ID DEPT_NAME Json_Data
10      Finance   [{"_status": "normal","EmpDetails":{"EMP_ID":"101","EMP_NAME":"AAAA","EMP_SALARY":"1000"},{"EMP_ID":"103","EMP_NAME":"CCCC","EMP_SALARY":"1500" }]
20      Marketing [{"_status": "normal","EmpDetails":{"EMP_ID":"102","EMP_NAME":"BBBB","EMP_SALARY":"2000"},{"EMP_ID":"104","EMP_NAME":"DDDD","EMP_SALARY":"3000" }]
1

1 Answers

0
votes

You can join and group by DEPT_ID and DEPT_NAME to collect list of employee details into struct. And use to_json to get a json string :

from pyspark.sql import functions as F

df_c = df_a.join(df_b, ["DEPT_ID"]).groupBy("DEPT_ID", "DEPT_NAME").agg(
    F.to_json(
        F.struct(
            F.lit("normal").alias("_status"),
            F.collect_list(
                F.struct(
                    F.col("EMP_ID"),
                    F.col("EMP_NAME"),
                    F.col("EMP_SALARY")
                )
            ).alias("EmpDetails")
        )
    ).alias("Json_Data")
)

df_c.show(truncate=False)
#+-------+---------+-----------------------------------------------------------------------------------------------------------------------------------------+
#|DEPT_ID|DEPT_NAME|Json_Data                                                                                                                                |
#+-------+---------+-----------------------------------------------------------------------------------------------------------------------------------------+
#|10     |Finance  |{"_status":"normal","EmpDetails":[{"EMP_ID":101,"EMP_NAME":"AAAA","EMP_SALARY":1000},{"EMP_ID":103,"EMP_NAME":"CCCC","EMP_SALARY":1500}]}|
#|20     |Marketing|{"_status":"normal","EmpDetails":[{"EMP_ID":102,"EMP_NAME":"BBBB","EMP_SALARY":2000},{"EMP_ID":104,"EMP_NAME":"DDDD","EMP_SALARY":3000}]}|
#+-------+---------+-----------------------------------------------------------------------------------------------------------------------------------------+