I am trying to group data from df and generate json object out of the grouped data. I thought I was on the right track but looking at the result I do not think I am generating a proper json object.
Source data
df1 = sqlContext.createDataFrame([
("20210924155828",'2021-09-17 13:55:35','2021-08-15 00:00:00','PRODUCT_04_20210815.DAT', '0.057','PRODUCT'),
("20210924155828",'2021-09-17 13:55:35','2021-08-15 00:00:00','PRODUCT_00_20210815.DAT', '0.068','PRODUCT'),
("20210924155828",'2021-09-17 13:55:35','2021-08-15 00:00:00','PRODUCT_01_20210815.DAT', '0.089','PRODUCT'),
("20210924155828",'2021-09-17 13:55:35','2021-08-15 00:00:00','PRODUCT_03_20210815.DAT', '0.100','PRODUCT')
], ["ID", "FileReceivedTimestamp", "SourceDataTimestamp","SourceFileName", "FileSize","FileName"])
df1.show()
+--------------+---------------------+-------------------+--------------------+--------+--------+
| ID|FileReceivedTimestamp|SourceDataTimestamp| SourceFileName|FileSize|FileName|
+--------------+---------------------+-------------------+--------------------+--------+--------+
|20210924155828| 2021-09-17 13:55:35|2021-08-15 00:00:00|PRODUCT_04_202108...| 0.057| PRODUCT|
|20210924155828| 2021-09-17 13:55:35|2021-08-15 00:00:00|PRODUCT_00_202108...| 0.068| PRODUCT|
|20210924155828| 2021-09-17 13:55:35|2021-08-15 00:00:00|PRODUCT_01_202108...| 0.089| PRODUCT|
|20210924155828| 2021-09-17 13:55:35|2021-08-15 00:00:00|PRODUCT_03_202108...| 0.100| PRODUCT|
+--------------+---------------------+-------------------+--------------------+--------+--------+
utilized to_json, groupBy, agg and collect_list to generate json object.
df2 = (df1.select("FileName","ID",to_json(struct("SourceFileName","FileReceivedTimestamp","FileSize")).alias("metadata"))
.groupby("FileName","ID").agg(collect_list(col("metadata")).alias("jsonmetadata")))
I see below using display in databricks
to query and test my output i did a explode on the generated json. I get this error when trying to view SourceFileName "Can't extract value from MetadataArray#779: need struct type but got string;"
display(df2.select(explode(col("jsonmetadata")).alias("MetadataArray")
,col("MetadataArray.SourceFileName").alias("SourceFileName")
))
Am I missing something here....