0 votes

I am trying to save a DataFrame in JSON format on S3. It is saved as a file of JSON objects, but I want a JSON array file.

I have a CSV file on S3, which I am loading into a DataFrame in AWS Glue. After performing some transformations, I am writing the DataFrame back to S3 as JSON. But it creates a file of newline-delimited JSON objects like:

{obj1}
{obj2}

However, I want to save it as a JSON array file like:

[{obj1},{obj2}]

datasource0 = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": [s3_path], "useS3ListImplementation": True, "recurse": True},
    format="csv",
    format_options={"withHeader": True, "separator": "|"}
)

applymapping1 = ApplyMapping.apply(
    frame=datasource0,
    mappings=[
        ("cdw_zip_id", "string", "cdw_zip_id", "string"),
        ("zip_code", "string", "zip_code", "string"),
        ("cdw_terr_id", "string", "cdw_terr_id", "string")
    ],
    transformation_ctx="applymapping1"
)

applymapping2 = applymapping1.toDF()
applymapping2.coalesce(1).write.format("org.apache.spark.sql.json").mode("overwrite").save(args['DEST_PATH'])

Actual:

{obj1}
{obj2}

Expected:

[{obj1},{obj2}]


1 Answer

0 votes

When the df.write action is called, Spark evaluates the lazily built plan: every transformation is applied to the records of each partition in parallel, across all the executor nodes configured to run the workload.

Because each task writes the output of its own partition independently, Spark's JSON data source emits one JSON object per record (newline-delimited JSON), not a single JSON array covering the whole dataset.
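For illustration, a minimal sketch of that default behaviour, assuming a SparkSession named spark and a placeholder output path:

# Spark's JSON data source writes newline-delimited JSON: one object per line,
# and one part-* file per partition.
df = spark.createDataFrame(
    [("1", "10001", "A"), ("2", "10002", "B")],
    ["cdw_zip_id", "zip_code", "cdw_terr_id"],
)
df.write.mode("overwrite").json("s3://my-bucket/json-lines-output/")

# Each resulting part file contains lines such as:
# {"cdw_zip_id":"1","zip_code":"10001","cdw_terr_id":"A"}
# {"cdw_zip_id":"2","zip_code":"10002","cdw_terr_id":"B"}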

Performing a coalesce(1) only merges the data into a single partition, and therefore a single output file; it does not change how Spark's JSON writer formats the records. If you need a real JSON array, you have to build it yourself, for example as in the sketch below.
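If the result is small enough to collect on the driver, one possible workaround is to assemble the JSON array yourself and upload it with boto3. The bucket and key below are placeholders, and this assumes every column is JSON-serializable (here they are all strings):

import json
import boto3

# Collect the rows to the driver and serialize them as a single JSON array.
# Only do this when the data comfortably fits in driver memory.
records = [row.asDict() for row in applymapping2.collect()]
payload = json.dumps(records)

# Upload the array as one S3 object; bucket and key are placeholders.
s3 = boto3.client("s3")
s3.put_object(Bucket="my-output-bucket", Key="output/data.json", Body=payload)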