4 votes

I have a parquet file with 10 row groups:

In [30]: print(pyarrow.parquet.ParquetFile("/tmp/test2.parquet").num_row_groups)
10

But when I load it using Dask DataFrame, it is read into a single partition:

In [31]: print(dask.dataframe.read_parquet("/tmp/test2.parquet").npartitions)
1

This appears to contradict this answer, which states that Dask DataFrame reads each Parquet row group into a separate partition.

How can I read each Parquet row group into a separate partition with Dask DataFrame? Or does the data have to be split across separate files for this to work?


2 Answers

2 votes

I believe that fastparquet will read each row group separately, and the fact that pyarrow apparently doesn't could be considered a bug, or at least a feature enhancement you could request on the Dask issue tracker. I would tend to agree that a set of files containing one row group each and a single file containing the same row groups should result in the same partition structure.
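A minimal sketch of both routes, assuming fastparquet is installed; engine is an existing read_parquet keyword, and split_row_groups is available in newer Dask versions (treat the exact version support as an assumption):

import dask.dataframe as dd

# Route 1 (assumes fastparquet is installed): the fastparquet engine
# maps each row group to its own partition.
df = dd.read_parquet("/tmp/test2.parquet", engine="fastparquet")
print(df.npartitions)  # expected: 10

# Route 2 (newer Dask versions): ask the pyarrow engine to split on
# row-group boundaries explicitly.
df = dd.read_parquet("/tmp/test2.parquet", engine="pyarrow", split_row_groups=True)
print(df.npartitions)  # expected: 10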

1 vote

You can read the file in batches with pyarrow:

import pyarrow.parquet as pq

batch_size = 1  # rows per batch; tune this for your workload
parquet_file = pq.ParquetFile("file.parquet")
batches = parquet_file.iter_batches(batch_size)  # a generator of RecordBatch objects

for batch in batches:
    process(batch)  # process() is your own handler for each RecordBatch
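If the end goal is the partitioning asked about in the question, here is a sketch built on the same ParquetFile API, assuming Dask is installed; read_row_group is part of pyarrow's ParquetFile, and dask.dataframe.from_delayed infers metadata from the first partition when meta isn't supplied:

import dask
import dask.dataframe as dd
import pyarrow.parquet as pq

PATH = "file.parquet"

@dask.delayed
def load_row_group(path, index):
    # Read one row group and convert it to a pandas DataFrame.
    return pq.ParquetFile(path).read_row_group(index).to_pandas()

# One delayed task per row group; the file is reopened inside each task
# because open ParquetFile handles don't serialize well across workers.
num_row_groups = pq.ParquetFile(PATH).num_row_groups
parts = [load_row_group(PATH, i) for i in range(num_row_groups)]

df = dd.from_delayed(parts)
print(df.npartitions)  # one partition per row group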