
For a parquet file written from Spark (without any partitioning), its directory looks like this:

%ls foo.parquet
part-00017-c17ab661-2564-428e-8233-e7a9951fb012-c000.gz.parquet
part-00018-c17ab661-2564-428e-8233-e7a9951fb012-c000.gz.parquet
part-00019-c17ab661-2564-428e-8233-e7a9951fb012-c000.gz.parquet
_SUCCESS

When trying to read it via pandas:

pd.read_parquet('foo.parquet')

everything works as expected.

However, when using dask it fails:

dd.read_parquet('foo.parquet')
 [Errno 17] File exists: 'foo.parquet/_SUCCESS'

What do I need to change so that dask is able to read the data successfully?

Comment from mdurant: I believe it would have worked with fastparquet too, with the longer dd.read_parquet('foo.parquet/*parquet').
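
For reference, a minimal sketch of that glob-based workaround (assuming the fastparquet engine is installed; the glob matches the .gz.parquet part files from the listing above):

import dask.dataframe as dd

# Point dask at the part files directly so the _SUCCESS marker is
# never picked up by the fastparquet engine.
df = dd.read_parquet('foo.parquet/*parquet', engine='fastparquet')
print(df.head())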

1 Answer


It turns out that pandas was using pyarrow under the hood. Switching dask to the same engine:

dd.read_parquet('foo.parquet', engine='pyarrow')

it works as expected.
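
Put together, a minimal runnable sketch (assuming pyarrow is installed and foo.parquet is the example directory from the question):

import dask.dataframe as dd

# The pyarrow engine skips non-data files such as Spark's _SUCCESS
# marker, so the directory can be passed in as a whole.
df = dd.read_parquet('foo.parquet', engine='pyarrow')
print(df.head())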