2
votes

I have pyspark dataframe and i want to convert it into list which contain JSON object. For that i have done like below..

df.toJSON().collect()

But this operation send data to driver which is costly and take to much time to perform.And my dataframe contain millions of records.So is there any another way to do it without collect() operation which is optimized than collect().

Below is my dataframe df:-

      product cost
      pen      10
      book     40
      bottle   80
      glass    55

and output is like below :-

df2 = [{product:'pen',cost:40},{product:'book',cost:40},{product:'bottle',cost:80},{product:'glass',cost:55}]

when i print the datatype of df2 it will be list.

1

1 Answers

1
votes

If you want to create json object in dataframe then use collect_list + create_map + to_json functions.

(or)

To write as json document to the file then won't use to_json instead use .write.json()


Create JSON object:

df.agg(collect_list(create_map(lit("product"),"product",lit("cost"),"cost")).alias("stru")).\
selectExpr("to_json(stru) as json").\
show(10,False)

#+-------------------------------------------------------------------------------------------------------------------------------+
#|json                                                                                                                           |
#+-------------------------------------------------------------------------------------------------------------------------------+
#|[{"product":"pen","cost":"10"},{"product":"book","cost":"40"},{"product":"bottle","cost":"80"},{"product":"glass","cost":"55"}]|
#+-------------------------------------------------------------------------------------------------------------------------------+


#write to hdfs use .saveAsTextFile
df.agg(collect_list(create_map(lit("product"),"product",lit("cost"),"cost")).alias("stru")).selectExpr("to_json(stru) as json").rdd.map(lambda x:x['json']).saveAsTextFile("<path>")

#cat part-00000
#[{"product":"pen","cost":"10"},{"product":"book","cost":"40"},{"product":"bottle","cost":"80"},{"product":"glass","cost":"55"}]

Create JSON file:

df.agg(collect_list(create_map(lit("product"),"product",lit("cost"),"cost")).alias("stru")).write.mode("overwrite").json("<path>")

#cat part-00000-3a19165e-219e-4485-adb8-ef91589d6e31-c000.json
#{"stru":[{"product":"pen","cost":"10"},{"product":"book","cost":"40"},{"product":"bottle","cost":"80"},{"product":"glass","cost":"55"}]}