0 votes

I am trying to save a DataFrame in JSON format on S3. It is saved as a file of JSON objects, but I want a JSON array file.

I have a CSV file on S3, which I am loading into a DataFrame in AWS Glue. After performing some transformations, I am writing the DataFrame back to S3 as JSON. But it creates a file of newline-delimited JSON objects like:

{obj1}
{obj2}

However, I want to save it as a JSON array file like:

[{obj1},{obj2}]

datasource0 = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": [s3_path], "useS3ListImplementation": True, "recurse": True},
    format="csv",
    format_options={"withHeader": True, "separator": "|"}
)

applymapping1 = ApplyMapping.apply(
    frame=datasource0,
    mappings=[
        ("cdw_zip_id", "string", "cdw_zip_id", "string"),
        ("zip_code", "string", "zip_code", "string"),
        ("cdw_terr_id", "string", "cdw_terr_id", "string")
    ],
    transformation_ctx="applymapping1"
)

applymapping2 = applymapping1.toDF()
applymapping2.coalesce(1).write.format("org.apache.spark.sql.json").mode("overwrite").save(args['DEST_PATH'])

Actual:

{obj1}
{obj2}

Expected:

[{obj1},{obj2}]


1 Answer

0 votes

When the df.write action is called, Spark evaluates the lazily built plan: every transformation is applied to the records of each partition in parallel, across all the executor nodes configured to run the workload.

Because each task writes the output of its own partition independently, Spark's JSON data source emits one JSON object per record (newline-delimited JSON), not a single JSON array covering the whole dataset.
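For illustration, a minimal sketch of that default behaviour, assuming a SparkSession named spark and a placeholder output path:

# Spark's JSON data source writes newline-delimited JSON: one object per line,
# and one part-* file per partition.
df = spark.createDataFrame(
    [("1", "10001", "A"), ("2", "10002", "B")],
    ["cdw_zip_id", "zip_code", "cdw_terr_id"],
)
df.write.mode("overwrite").json("s3://my-bucket/json-lines-output/")

# Each resulting part file contains lines such as:
# {"cdw_zip_id":"1","zip_code":"10001","cdw_terr_id":"A"}
# {"cdw_zip_id":"2","zip_code":"10002","cdw_terr_id":"B"}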

Performing a coalesce(1) only merges the data into a single partition, and therefore a single output file; it does not change how Spark's JSON writer formats the records. If you need a real JSON array, you have to build it yourself, for example as in the sketch below.
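If the result is small enough to collect on the driver, one possible workaround is to assemble the JSON array yourself and upload it with boto3. The bucket and key below are placeholders, and this assumes every column is JSON-serializable (here they are all strings):

import json
import boto3

# Collect the rows to the driver and serialize them as a single JSON array.
# Only do this when the data comfortably fits in driver memory.
records = [row.asDict() for row in applymapping2.collect()]
payload = json.dumps(records)

# Upload the array as one S3 object; bucket and key are placeholders.
s3 = boto3.client("s3")
s3.put_object(Bucket="my-output-bucket", Key="output/data.json", Body=payload)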