Pyspark dataframe write to single json file with specific name

20

votes

I have a dataframe that I want to write as a single JSON file with a specific name. I tried the following:

df2 = df1.select(df1.col1,df1.col2)
df2.write.format('json').save('/path/file_name.json') # didn't work: creates a folder 'file_name.json' containing part-XXX files
df2.toJSON().saveAsTextFile('/path/file_name.json')  # didn't work: creates a folder 'file_name.json' containing part-XXX files

I would appreciate it if someone could provide a solution.

apache-spark pyspark

24

votes

To get a single output file, coalesce the dataframe to one partition before writing:

df2 = df1.select(df1.col1,df1.col2)
df2.coalesce(1).write.format('json').save('/path/file_name.json')

This creates a folder named file_name.json. Inside that folder you will find a single file whose name starts with part-000 and that contains all the data.
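
If the part file then needs the specific name asked for in the question, it has to be moved out of the folder afterwards. Below is a minimal sketch of that step. It assumes the output folder is on a local filesystem (not HDFS or S3), that the write used coalesce(1) so exactly one part file exists, and that the folder was given a name different from the final file path. The helper name promote_single_part_file is hypothetical, not part of Spark.

```python
import glob
import os
import shutil


def promote_single_part_file(spark_output_dir, target_path):
    """Move the lone part file out of a Spark output folder to target_path.

    Hypothetical helper: assumes a local filesystem and a coalesce(1) write.
    """
    if os.path.abspath(spark_output_dir) == os.path.abspath(target_path):
        raise ValueError("target_path must differ from the Spark output folder")

    # "part-*" skips _SUCCESS and the hidden .crc checksum files.
    part_files = glob.glob(os.path.join(spark_output_dir, "part-*"))
    if len(part_files) != 1:
        raise ValueError(
            "expected exactly one part file, found %d" % len(part_files)
        )

    shutil.move(part_files[0], target_path)
    # Remove the folder together with its leftover marker files.
    shutil.rmtree(spark_output_dir)
    return target_path
```

For example, write with df2.coalesce(1).write.format('json').save('/path/tmp_out') and then call promote_single_part_file('/path/tmp_out', '/path/file_name.json').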

4

votes

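You can do it by converting to a pandas dataframe first (toPandas() collects all the data onto the driver, so this only suits dataframes that fit in driver memory):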

df.toPandas().to_json('path/file_name.json', orient='records', force_ascii=False, lines=True)

0

votes

PySpark writes its output as several smaller part files and, as far as I know, cannot write JSON directly to a single file with a given name. I think this small Python function will help with what you're trying to achieve.

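def saveResult(data_frame, temp_location, file_path):
    # dbutils is only available on Databricks
    # coalesce(1) makes Spark write exactly one part file into temp_location
    data_frame.coalesce(1).write.mode('overwrite').json(temp_location)
    # pick the part file by name rather than relying on listing order
    part_file = [f.path for f in dbutils.fs.ls(temp_location) if f.name.startswith('part-')][0]
    dbutils.fs.cp(part_file, file_path)
    dbutils.fs.rm(temp_location, recurse=True)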

Basically, you pass in the data frame, the temp_location where the part files are written, and the full output path (directory plus file name) that you want for the final file. The function writes the part files, copies the JSON part file to the desired location under the desired name, and then deletes the temporary folder.

-2

votes

Here's another approach:

import os
df2 = df1.select(df1.col1,df1.col2)
df2.write.format('json').save('/path/folder_name')

os.system("cat /path/folder_name/*.json > /path/df.json")
os.system("rm -rf /path/folder_name")

This assumes the export is done during the analysis phase and that writing a single JSON file does not get carried into production.
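
The same merge can be sketched in pure Python rather than shelling out to cat and rm, which removes the dependence on a Unix shell. This is only a sketch, assuming (as the cat command does) that the folder is on a local filesystem. The function name merge_part_files is hypothetical.

```python
import glob
import os
import shutil


def merge_part_files(spark_output_dir, target_path):
    """Concatenate every part-*.json file in the folder into one file.

    Hypothetical helper: assumes a local filesystem. Spark writes one JSON
    object per line, so the merged file is in JSON Lines format.
    Returns the number of part files that were merged.
    """
    pattern = os.path.join(spark_output_dir, "part-*.json")
    part_files = sorted(glob.glob(pattern))

    with open(target_path, "wb") as merged:
        for part in part_files:
            with open(part, "rb") as chunk:
                shutil.copyfileobj(chunk, merged)

    # Equivalent of `rm -rf` on the temporary folder.
    shutil.rmtree(spark_output_dir)
    return len(part_files)
```

Called as merge_part_files('/path/folder_name', '/path/df.json'), it mirrors the two os.system calls above.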
