How to specify number of partitions when writing a Parquet file?

Question

parquet_writer.write_table(table)

This line writes a single file. The documentation says: "This creates a single Parquet file. In practice, a Parquet dataset may consist of many files in many directories. We can read a single file back with read_table."

Is there a way for PyArrow to create a Parquet file in the form of a directory with multiple part files in it, such as:

ls -lrt permit-inspections-recent.parquet  
...  14:53 part-00001-bd5d902d-fac9-4e03-b63e-6a8dfc4060b6.snappy.parquet  
...  14:53 part-00000-bd5d902d-fac9-4e03-b63e-6a8dfc4060b6.snappy.parquet

Regards,
Yash

chris chris · Accepted Answer · 2020-07-22T09:48:05

You need to tell Arrow how to partition the data. This is done with the partition_cols argument of write_to_dataset. See here: https://arrow.apache.org/docs/python/generated/pyarrow.parquet.write_to_dataset.html
