5
votes

I have a dataframe that I am trying to save as a JSON file using pyspark 1.4, but it doesn't seem to be working. When I give it the path to the directory, it returns an error stating that it already exists. My assumption, based on the documentation, was that it would save a JSON file at the path you give it.

df.write.json("C:\Users\username")

Specifying a directory with a name doesn't produce any file and gives an error of "java.io.IOException: Mkdirs failed to create file:/C:Users/username/test/_temporary/....etc". It does, however, create a directory named test which contains several sub-directories with blank crc files.

df.write.json("C:\Users\username\test")

And adding a file extension of JSON produces the same error:

df.write.json("C:\Users\username\test.JSON")
I think you need to give it a complete file name, not just the directory. – Brobin
Yes, I verified the permissions on that directory and used getpass.getuser() from Python to verify that I was logged in as that user via the console. – Jared
Try an alternate approach such as df.toJSON().saveAsTextFile(path) – urug
I too faced such a problem when using Windows, so I switched to Linux where the same code worked perfectly. – Kavindu Dodanduwa
Thanks for giving it a try. I figured it had something to do with Windows, ughhh... – Jared

3 Answers

4
votes

Could you not just use

df.toJSON()

as shown here? If not, then first convert it into a pandas DataFrame and then write it to JSON.

pandas_df = df.toPandas()
# use a raw string so backslashes in the Windows path are not treated as escapes
pandas_df.to_json(r"C:\Users\username\test.JSON")
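
If you would rather stay in Spark, df.toJSON() returns an RDD of JSON strings (one per row) that can be written out with saveAsTextFile, as urug suggests in the comments. A minimal sketch (the output path is hypothetical, must not already exist, and may still hit the same Windows/Hadoop path issues):

# toJSON() yields one JSON string per row; saveAsTextFile writes them
# as part files into a new directory (hypothetical path shown)
json_rdd = df.toJSON()
json_rdd.saveAsTextFile(r"C:\Users\username\test_json_output")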
3
votes

When working with large data, converting a pyspark dataframe to pandas is not advisable. You can use the command below to save a JSON file in the output directory. Here df is a pyspark.sql.dataframe.DataFrame. A part file will be generated inside the output directory by the cluster.

df.coalesce(1).write.format('json').save('/your_path/output_directory')
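
To sanity-check the result, Spark can read the whole output directory back, part files and all. A quick sketch, assuming a Spark 1.x SQLContext named sqlContext:

# read.json accepts the directory path and picks up every part file inside it
check_df = sqlContext.read.json('/your_path/output_directory')
check_df.show()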
0
votes

I would avoid using write.json since it's causing problems on Windows. Using Python's file writing should skip creating the temp directories that are giving you issues.

with open("C:\\Users\\username\\test.json", "w+") as output_file:
    output_file.write(df.toJSON())