4 votes

I have a parquet file with 10 row groups:

In [30]: print(pyarrow.parquet.ParquetFile("/tmp/test2.parquet").num_row_groups)
10

But when I load it using Dask DataFrame, it is read into a single partition:

In [31]: print(dask.dataframe.read_parquet("/tmp/test2.parquet").npartitions)
1

This appears to contradict this answer, which states that Dask DataFrame reads each Parquet row group into a separate partition.

How can I read each Parquet row group into a separate partition with Dask DataFrame? Or does the data have to be split across separate files for this to work?


2 Answers

2 votes

I believe that fastparquet will read each row group separately, and the fact that pyarrow apparently doesn't could be considered a bug, or at least a feature enhancement you could request on the Dask issue tracker. I would tend to agree that a set of files containing one row group each and a single file containing the same row groups should result in the same partition structure.
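A minimal sketch of both routes, assuming fastparquet is installed; engine is an existing read_parquet keyword, and split_row_groups is available in newer Dask versions (treat the exact version support as an assumption):

import dask.dataframe as dd

# Route 1 (assumes fastparquet is installed): the fastparquet engine
# maps each row group to its own partition.
df = dd.read_parquet("/tmp/test2.parquet", engine="fastparquet")
print(df.npartitions)  # expected: 10

# Route 2 (newer Dask versions): ask the pyarrow engine to split on
# row-group boundaries explicitly.
df = dd.read_parquet("/tmp/test2.parquet", engine="pyarrow", split_row_groups=True)
print(df.npartitions)  # expected: 10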

1 vote

You can read the file in batches with pyarrow:

import pyarrow.parquet as pq

batch_size = 1  # rows per batch; tune this for your workload
parquet_file = pq.ParquetFile("file.parquet")
batches = parquet_file.iter_batches(batch_size)  # a generator of RecordBatch objects

for batch in batches:
    process(batch)  # process() is your own handler for each RecordBatch
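If the end goal is the partitioning asked about in the question, here is a sketch built on the same ParquetFile API, assuming Dask is installed; read_row_group is part of pyarrow's ParquetFile, and dask.dataframe.from_delayed infers metadata from the first partition when meta isn't supplied:

import dask
import dask.dataframe as dd
import pyarrow.parquet as pq

PATH = "file.parquet"

@dask.delayed
def load_row_group(path, index):
    # Read one row group and convert it to a pandas DataFrame.
    return pq.ParquetFile(path).read_row_group(index).to_pandas()

# One delayed task per row group; the file is reopened inside each task
# because open ParquetFile handles don't serialize well across workers.
num_row_groups = pq.ParquetFile(PATH).num_row_groups
parts = [load_row_group(PATH, i) for i in range(num_row_groups)]

df = dd.from_delayed(parts)
print(df.npartitions)  # one partition per row group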