5 votes

To read a parquet file into multiple partitions, it should be stored using row groups (see How to read a single large parquet file into multiple partitions using dask/dask-cudf?). The pandas documentation describes partitioning by columns, and the pyarrow documentation describes how to write multiple row groups. When writing with the pandas DataFrame .to_parquet method, can I access the engine's ability to write multiple row groups, or will it always write to a single partition? If it is possible, how?

Although the dataset is small (currently only 3 GB), I want to read it into multiple partitions so that subsequent processing with dask uses multiple cores. I could repartition after reading, but that adds overhead. Later I might also work with datasets of some tens of GB: still small, but too large for RAM.

1
I'm also looking for this exact feature. The closest I could find was to use the dask to_parquet() writer with the partition_on option to create partitions according to some field values. However, that creates the partitions as separate, smaller parquet files together with metadata used by the dask parquet reader to get back a partitioned dataframe, which can make sharing or storing the files a bit messier. – Wall-E
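For completeness, a minimal sketch of the partition_on approach the comment describes, assuming a dask DataFrame with a "year" column (the column name and data here are made up for illustration):

```python
import pandas as pd
import dask.dataframe as dd

# Hypothetical data with a "year" column to partition on.
pdf = pd.DataFrame({"year": [2019, 2019, 2020, 2020], "value": [1.0, 2.0, 3.0, 4.0]})
ddf = dd.from_pandas(pdf, npartitions=1)

# partition_on writes one subdirectory per distinct "year" value,
# each holding smaller parquet files plus metadata that dask's
# parquet reader uses to reconstruct a partitioned DataFrame.
ddf.to_parquet("out_dir/", partition_on=["year"], engine="pyarrow")

# Reading the directory back gives a partitioned dask DataFrame.
ddf2 = dd.read_parquet("out_dir/", engine="pyarrow")
```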

1 Answer

4 votes

You can simply pass the keyword argument row_group_size when using the pyarrow engine; pandas' to_parquet forwards additional keyword arguments to the underlying engine. Note that pyarrow is the default engine.

df.to_parquet("filename.parquet", row_group_size=500, engine="pyarrow")
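As a rough end-to-end sketch (assuming a 10,000-row DataFrame and pyarrow installed), this writes 20 row groups and then checks how many partitions dask creates when reading; whether the split_row_groups argument is needed, and its default, can vary with the dask version:

```python
import numpy as np
import pandas as pd
import dask.dataframe as dd

# Hypothetical example data: 10,000 rows.
df = pd.DataFrame({"x": np.random.random(10_000), "y": np.arange(10_000)})

# Write 20 row groups of 500 rows each via the pyarrow engine.
df.to_parquet("filename.parquet", row_group_size=500, engine="pyarrow")

# Read back with dask; split_row_groups asks dask to map row groups
# to partitions (argument behavior depends on the dask version).
ddf = dd.read_parquet("filename.parquet", split_row_groups=True)
print(ddf.npartitions)  # expected to be greater than 1
```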