1 vote

When Spark writes DataFrame data to Parquet, it creates a directory containing several separate parquet part files. Code for saving:

(term_freq_df.write
    .mode("overwrite")
    .option("header", "true")
    .parquet("dir/to/save/to"))

I need to read data from this directory with pandas:

term_freq_df = pd.read_parquet("dir/to/save/to") 

The error:

IsADirectoryError: [Errno 21] Is a directory: 

Is there a simple way to resolve this so that both code samples can keep using the same path?

What version of pandas are you using? Can you show the full error traceback? - joris

2 Answers

1 vote

As you noted, when saving, Spark creates multiple parquet files in a directory. To read these files with pandas, you can read them individually and then concatenate the results:

import glob
import os
import pandas as pd

path = "dir/to/save/to"
# Collect every part file Spark wrote into the output directory
parquet_files = glob.glob(os.path.join(path, "*.parquet"))
# Read each part file and stack them into a single DataFrame
df = pd.concat((pd.read_parquet(f) for f in parquet_files))
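
Note that the concatenated frame keeps each part file's original index; pass ignore_index=True to pd.concat if you want a continuous RangeIndex. Alternatively, if pyarrow is installed, it can read the whole directory in a single call (a minimal sketch, using the same path as above):

import pyarrow.parquet as pq

# pyarrow treats a directory of parquet part files as one dataset
df = pq.read_table("dir/to/save/to").to_pandas()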
1 vote

Normally, pandas.read_parquet can read a directory of multiple (partitioned) parquet files without a problem, so I am curious to see the full error traceback you get.

To demonstrate that this works:

In [82]: pd.__version__ 
Out[82]: '0.25.0'

In [83]: df = pd.DataFrame({'A': ['a', 'b']*2, 'B':[1, 2, 3, 4]})

In [85]: df.to_parquet("test_directory", partition_cols=['A'])

This created a "test_directory" folder with multiple parquet files. I can read those back in using pandas:

In [87]: pd.read_parquet("test_directory/")
Out[87]: 
   B  A
0  1  a
1  3  a
2  2  b
3  4  b
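
Without the traceback this is a guess, but one common cause of IsADirectoryError here is the fastparquet engine, which (in older versions) cannot open a plain directory of part files, while pyarrow can. Forcing the engine is a quick way to test that (a minimal sketch, assuming pyarrow is installed):

import pandas as pd

# Explicitly select the pyarrow engine; the default "auto" may pick
# fastparquet, which raises IsADirectoryError on a directory path
term_freq_df = pd.read_parquet("dir/to/save/to", engine="pyarrow")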