0
votes

This is more of a theoretical and intuitive question. When I provided a list of columns to the partition_on argument of dask.dataframe.to_parquet(), it created a nested, directory-like structure, with one level per column in the order the columns were provided.

However, the actual documentation of parquet says that it is a column-store data structure, and that if we provide a list of columns, it creates partitions based on those columns, i.e. all the rows (if no row size is provided) of the specified columns go into one partition. Is dask's to_parquet doing it the right way? A minimal sketch of what I did is below (the output path and column names are made up for illustration):
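    import pandas as pd
    import dask.dataframe as dd

    # Illustrative data; column names and the output path are hypothetical.
    df = pd.DataFrame({
        "country": ["US", "US", "DE", "DE"],
        "year": [2020, 2021, 2020, 2021],
        "value": [1.0, 2.0, 3.0, 4.0],
    })
    ddf = dd.from_pandas(df, npartitions=1)

    # Partitioning on two columns produces a nested directory layout,
    # one level per column (exact file names depend on the engine), e.g.:
    #   out/country=US/year=2020/part.0.parquet
    #   out/country=US/year=2021/part.0.parquet
    #   out/country=DE/year=2020/part.0.parquet
    #   out/country=DE/year=2021/part.0.parquet
    ddf.to_parquet("out", partition_on=["country", "year"])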

1

1 Answer

1
votes

"actual documentation of parquet says"

The canonical parquet documentation does not address splitting up a dataset into multiple files. The directory structure and the optional special _metadata file are conventions that were, I believe, first devised by Hive. They are additional to the standard parquet spec, but not in contravention of it.

Each of the data files contains a number of rows and is a valid parquet dataset in itself, containing one or more "row groups" (parquet's logical partition), with each column written in a separate part of the file and encoded as a number of "pages". Parquet allows for dictionary encoding, but this is a per-page feature and there is no global categorical labelling scheme, so encoding values into the path names is very useful; it also allows for pre-filtering which of the files we want to access when only some values are needed. A rough sketch of that pre-filtering, assuming a dataset written with partition_on as in the question (the path and column names are again illustrative):
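    import dask.dataframe as dd

    # Filters on partitioned columns are resolved against the directory
    # names, so whole files can be skipped before any data is scanned.
    ddf = dd.read_parquet(
        "out",
        filters=[("country", "==", "US"), ("year", ">=", 2021)],
    )
    print(ddf.compute())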

Short answer: yes, dask is doing the right thing!