1 vote

When Spark writes DataFrame data to Parquet, it creates a directory containing several separate parquet part files. Code for saving:

(term_freq_df.write
    .mode("overwrite")
    .option("header", "true")
    .parquet("dir/to/save/to"))

I need to read data from this directory with pandas:

term_freq_df = pd.read_parquet("dir/to/save/to") 

The error:

IsADirectoryError: [Errno 21] Is a directory: 

Is there a simple way to resolve this so that both code samples can keep using the same path?

What version of pandas are you using? Can you show the full error traceback? - joris

2 Answers

1 vote

As you noted, when saving, Spark creates multiple parquet files in a directory. To read these files with pandas, you can read them individually and then concatenate the results:

import glob
import os
import pandas as pd

path = "dir/to/save/to"
# Collect every part file Spark wrote into the output directory
parquet_files = glob.glob(os.path.join(path, "*.parquet"))
# Read each part file and stack them into a single DataFrame
df = pd.concat((pd.read_parquet(f) for f in parquet_files))
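
Note that the concatenated frame keeps each part file's original index; pass ignore_index=True to pd.concat if you want a continuous RangeIndex. Alternatively, if pyarrow is installed, it can read the whole directory in a single call (a minimal sketch, using the same path as above):

import pyarrow.parquet as pq

# pyarrow treats a directory of parquet part files as one dataset
df = pq.read_table("dir/to/save/to").to_pandas()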
1 vote

Normally, pandas.read_parquet can read a directory of multiple (partitioned) parquet files without a problem, so I am curious to see the full error traceback you get.

To demonstrate that this works:

In [82]: pd.__version__ 
Out[82]: '0.25.0'

In [83]: df = pd.DataFrame({'A': ['a', 'b']*2, 'B':[1, 2, 3, 4]})

In [85]: df.to_parquet("test_directory", partition_cols=['A'])

This created a "test_directory" folder with multiple parquet files. I can read those back in using pandas:

In [87]: pd.read_parquet("test_directory/")
Out[87]: 
   B  A
0  1  a
1  3  a
2  2  b
3  4  b
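
Without the traceback this is a guess, but one common cause of IsADirectoryError here is the fastparquet engine, which (in older versions) cannot open a plain directory of part files, while pyarrow can. Forcing the engine is a quick way to test that (a minimal sketch, assuming pyarrow is installed):

import pandas as pd

# Explicitly select the pyarrow engine; the default "auto" may pick
# fastparquet, which raises IsADirectoryError on a directory path
term_freq_df = pd.read_parquet("dir/to/save/to", engine="pyarrow")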