4 votes

I am learning about Parquet files using Python and pyarrow. Parquet is great at compression and at minimizing disk space. My dataset is a 190 MB CSV file which ends up as a single 3 MB file when saved as a snappy-compressed Parquet file.

However, when I save my dataset as partitioned files, they result in a much larger combined size (61 MB).

Here is an example of the dataset that I am trying to save:

listing_id |     date     | gender | price
-------------------------------------------
     a     |  2019-01-01  |   M    |   100
     b     |  2019-01-02  |   M    |   100
     c     |  2019-01-03  |   F    |   200
     d     |  2019-01-04  |   F    |   200

When I partition by date (300+ unique values), the partitioned files add up to 61 MB, and each file is 168.2 kB in size. When I partition by gender (2 unique values), the partitioned files add up to just 3 MB.

I am wondering: is there a minimum file size for Parquet, such that many small files combined consume more disk space than a single one?

My env:

- OS: Ubuntu 18.04
- Language: Python
- Library: pyarrow, pandas

My dataset source:

https://www.kaggle.com/brittabettendorf/berlin-airbnb-data

# I am using calendar_summary.csv from the group of datasets at the link above

My code to save as parquet file:

# write to a single parquet file
import os
import pandas as pd
import pyarrow
import pyarrow.parquet

df = pd.read_csv('./calendar_summary.csv')
table = pyarrow.Table.from_pandas(df)
pyarrow.parquet.write_table(table=table, where='./calendar_summary_write_table.parquet')

# parquet file size in kB
parquet_method1_filesize = os.path.getsize('./calendar_summary_write_table.parquet') / 1000
print('parquet_method1_filesize: %i kB' % parquet_method1_filesize)

My code to save as partitioned parquet file:

# write to a partitioned dataset using parquet
import os
import pandas as pd
import pyarrow
import pyarrow.parquet

df = pd.read_csv('./calendar_summary.csv')
table = pyarrow.Table.from_pandas(df)
pyarrow.parquet.write_to_dataset(
    table=table,
    root_path='./calendar_summary/',
    partition_cols=['date'])

# combined size of the partitioned dataset
print(os.popen('du -sh ./calendar_summary/').read())
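
If it helps, the per-partition file sizes can also be listed directly from Python instead of via du. A small sketch, assuming the ./calendar_summary/ layout written above:

# walk the partitioned dataset directory and sum every file it contains
import os

total = 0
for dirpath, dirnames, filenames in os.walk('./calendar_summary/'):
    for name in filenames:
        path = os.path.join(dirpath, name)
        size = os.path.getsize(path)
        total += size
        print('%s: %.1f kB' % (path, size / 1000))
print('total: %.1f kB' % (total / 1000))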
Could you post the code that you are using to load the CSV and save it as Parquet, and a sample of the CSV (in CSV format, not as a table)? It could be a serialization issue where data is not being serialized correctly (e.g. using pickle instead of a native format). - Jorge Leitao
There you go. I am not sure if it is a serialization issue, or whether this is just normal Parquet behavior? Basically I see that each individual partitioned file has a similar size of 168.2 kB. - addicted
You can use parquet-tools to investigate further. There is some overhead for metadata, but I'd be surprised if it is meaningful relative to the amount of data. My guess is that for some reason one or more columns is seeing a much lower compression ratio due to a different encoding (e.g. not using run-length encoding). - Micah Kornfield
Any quick snippet of code I can paste to investigate and check? Otherwise let me read up on parquet-tools first and understand what you mean by that. - addicted
Parquet files do have quite a bit of metadata. In data warehousing, partitioning is often performed when partitions contain data on the order of gigabytes, not less than 1 megabyte. - Wes McKinney

1 Answer

3 votes

There is no minimum file size, but there is overhead for storing the footer, and there is a wasted opportunity for optimizations via encodings and compression. The various encodings and compression schemes build on the idea that the data has some amount of self-similarity which can be exploited by referencing back to earlier similar occurrences. When you split the data into multiple files, each of them needs a separate "initial data point" that the following ones can refer back to, so disk usage goes up. (Please note that this wording is a huge oversimplification, made to avoid going through the various techniques employed to save space, but see this answer for a few examples.)
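
To make the per-file overhead concrete, here is a minimal sketch with synthetic data (not the Airbnb set) that writes the same column once as a single file and once as 300 tiny files, then compares the totals; the exact numbers will differ for your data:

# same data written as one file vs. 300 small files
import os
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({'price': [100.0] * 300000,
                   'day': list(range(300)) * 1000})

# one file holding everything
pq.write_table(pa.Table.from_pandas(df, preserve_index=False), 'single.parquet')
print('single file: %.1f kB' % (os.path.getsize('single.parquet') / 1000))

# one file per day, mimicking a partition per unique value
os.makedirs('split', exist_ok=True)
total = 0
for day, chunk in df.groupby('day'):
    path = 'split/day=%d.parquet' % day
    pq.write_table(pa.Table.from_pandas(chunk, preserve_index=False), path)
    total += os.path.getsize(path)
print('300 files:   %.1f kB' % (total / 1000))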

Another thing that can have a huge impact on the size of Parquet files is the order in which data is inserted. A sorted column can be stored much more efficiently than a randomly ordered one. It is possible that by partitioning the data you inadvertently alter its sort order. Another possibility is that you partition the data by the very attribute it was ordered by, which allowed a huge space saving when everything was stored in a single file, and this opportunity is lost by splitting the data into multiple files. Finally, you have to keep in mind that Parquet is not optimized for storing a few kilobytes of data but for several megabytes or gigabytes (in a single file) or several petabytes (in multiple files).
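
As a quick illustration of the ordering effect, the following sketch (again with synthetic data) writes the same million values once sorted and once shuffled; the sorted file typically comes out far smaller because long runs of repeated values encode and compress much better:

# how row order alone changes the file size
import os
import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

values = np.repeat(np.arange(1000), 1000)    # 1,000,000 rows with long sorted runs
shuffled = np.random.permutation(values)     # same values, random order

pq.write_table(pa.Table.from_pandas(pd.DataFrame({'v': values}), preserve_index=False),
               'sorted.parquet')
pq.write_table(pa.Table.from_pandas(pd.DataFrame({'v': shuffled}), preserve_index=False),
               'shuffled.parquet')

print('sorted:   %.1f kB' % (os.path.getsize('sorted.parquet') / 1000))
print('shuffled: %.1f kB' % (os.path.getsize('shuffled.parquet') / 1000))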

If you would like to inspect how your data is stored in your Parquet files, the Java implementation of Parquet includes the parquet-tools utility, which provides several commands. See its documentation page for building and getting started; more detailed descriptions of the individual commands are printed by parquet-tools itself. The commands most interesting to you are probably meta and dump.
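
If installing the Java tooling is inconvenient, pyarrow itself exposes the same footer metadata from Python, which is roughly what the meta command shows. A short sketch, using the single file written in the question:

# inspect parquet footer metadata with pyarrow
import pyarrow.parquet as pq

pf = pq.ParquetFile('./calendar_summary_write_table.parquet')
print(pf.metadata)    # number of rows, row groups, created-by, ...
print(pf.schema)      # physical and logical types per column

# per-column-chunk details: encodings, compression, compressed/uncompressed sizes
row_group = pf.metadata.row_group(0)
for i in range(row_group.num_columns):
    print(row_group.column(i))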