For a parquet file written from Spark (without any partitioning), the directory looks like this:
%ls foo.parquet
part-00017-c17ab661-2564-428e-8233-e7a9951fb012-c000.gz.parquet
part-00018-c17ab661-2564-428e-8233-e7a9951fb012-c000.gz.parquet
part-00019-c17ab661-2564-428e-8233-e7a9951fb012-c000.gz.parquet
_SUCCESS
When trying to read via pandas:
pd.read_parquet('foo.parquet')
everything works as expected.
However, reading it with dask fails:
dd.read_parquet('foo.parquet')
[Errno 17] File exists: 'foo.parquet/_SUCCESS'
What do I need to change so that dask is able to read the data successfully?
dd.read_parquet('foo.parquet/*parquet')
– mdurant
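Why the glob helps: Spark writes a `_SUCCESS` marker file alongside the `part-*.parquet` data files, and passing the pattern `foo.parquet/*parquet` makes dask pick up only the data files. A small standard-library sketch (the directory and file names below are illustrative, recreating the Spark layout in a temp directory) showing which files the pattern matches:

```python
import glob
import os
import tempfile

# Recreate the layout Spark produces (names are illustrative)
d = tempfile.mkdtemp()
foo = os.path.join(d, "foo.parquet")
os.mkdir(foo)
for name in (
    "part-00017-c17ab661-2564-428e-8233-e7a9951fb012-c000.gz.parquet",
    "part-00018-c17ab661-2564-428e-8233-e7a9951fb012-c000.gz.parquet",
    "_SUCCESS",
):
    open(os.path.join(foo, name), "w").close()

# The pattern the answer suggests: only the data files match,
# the _SUCCESS marker is excluded
matches = sorted(glob.glob(os.path.join(foo, "*parquet")))
print([os.path.basename(m) for m in matches])
```

This prints only the two `part-*.gz.parquet` names, which is exactly the file list dask ends up reading when given the glob pattern instead of the bare directory.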