
For a parquet file written from Spark (without any partitioning), its directory looks like this:

%ls foo.parquet
part-00017-c17ab661-2564-428e-8233-e7a9951fb012-c000.gz.parquet
part-00018-c17ab661-2564-428e-8233-e7a9951fb012-c000.gz.parquet
part-00019-c17ab661-2564-428e-8233-e7a9951fb012-c000.gz.parquet
_SUCCESS

When trying to read it via pandas:

pd.read_parquet('foo.parquet')

everything works as expected.

However, when using dask it fails:

dd.read_parquet('foo.parquet')
 [Errno 17] File exists: 'foo.parquet/_SUCCESS'

What do I need to change so that dask is able to read the data successfully?

Comment from mdurant: I believe it would have worked with fastparquet too, with the longer dd.read_parquet('foo.parquet/*parquet').
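
For reference, a minimal sketch of that glob-based workaround (assuming the fastparquet engine is installed; the glob matches the .gz.parquet part files from the listing above):

import dask.dataframe as dd

# Point dask at the part files directly so the _SUCCESS marker is
# never picked up by the fastparquet engine.
df = dd.read_parquet('foo.parquet/*parquet', engine='fastparquet')
print(df.head())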

1 Answer


It turns out that pandas was using pyarrow under the hood. Switching dask to the same engine:

dd.read_parquet('foo.parquet', engine='pyarrow')

it works as expected.
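
Put together, a minimal runnable sketch (assuming pyarrow is installed and foo.parquet is the example directory from the question):

import dask.dataframe as dd

# The pyarrow engine skips non-data files such as Spark's _SUCCESS
# marker, so the directory can be passed in as a whole.
df = dd.read_parquet('foo.parquet', engine='pyarrow')
print(df.head())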