I am learning about Parquet files using Python and pyarrow. Parquet is great at compression and minimizing disk space: my dataset is a 190MB CSV file which ends up as a single 3MB file when saved as a snappy-compressed parquet file.
However, when I save my dataset as partitioned files, they add up to a much larger combined size (61MB).
Here is an example of the dataset I am trying to save:
listing_id | date       | gender | price
-----------|------------|--------|------
a          | 2019-01-01 | M      | 100
b          | 2019-01-02 | M      | 100
c          | 2019-01-03 | F      | 200
d          | 2019-01-04 | F      | 200
When I partition by date (300+ unique values), the partitioned files total 61MB combined; each file is about 168.2kB.
When I partition by gender (2 unique values), the partitioned files total just 3MB combined.
Is there some minimum per-file overhead in parquet, such that many small files combined take up far more disk space than a single file holding the same data?
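To show the kind of per-file overhead I am wondering about, a quick test like the one below might help; the one-row DataFrame and the './tiny.parquet' path are just made up for this check.

# write a nearly empty table and look at its on-disk size
import os
import pandas as pd
import pyarrow
import pyarrow.parquet

tiny = pyarrow.Table.from_pandas(pd.DataFrame({'price': [100]}))
pyarrow.parquet.write_table(table=tiny, where='./tiny.parquet')
print('single-row parquet file: %i bytes' % os.path.getsize('./tiny.parquet'))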
My env:
- OS: Ubuntu 18.04
- Language: Python
- Library: pyarrow, pandas
My dataset source:
https://www.kaggle.com/brittabettendorf/berlin-airbnb-data
# I am using calendar_summary.csv from the group of datasets at the link above
My code to save as a single parquet file:
# write to a single parquet file
import os
import pandas as pd
import pyarrow
import pyarrow.parquet

df = pd.read_csv('./calendar_summary.csv')
table = pyarrow.Table.from_pandas(df)
pyarrow.parquet.write_table(table=table, where='./calendar_summary_write_table.parquet')

# parquet filesize
parquet_method1_filesize = os.path.getsize('./calendar_summary_write_table.parquet') / 1000
print('parquet_method1_filesize: %i kB' % parquet_method1_filesize)
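For context, this is how I can also look at the layout of that single file (number of row groups and size of the metadata footer), using pyarrow.parquet.ParquetFile on the file written above:

# inspect row groups and metadata footer of the single parquet file
import pyarrow.parquet

meta = pyarrow.parquet.ParquetFile('./calendar_summary_write_table.parquet').metadata
print('num_row_groups:', meta.num_row_groups)
print('num_rows:', meta.num_rows)
print('footer size:', meta.serialized_size, 'bytes')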
My code to save as a partitioned parquet dataset:
# write to a partitioned parquet dataset
import os
import pandas as pd
import pyarrow
import pyarrow.parquet

df = pd.read_csv('./calendar_summary.csv')
table = pyarrow.Table.from_pandas(df)
pyarrow.parquet.write_to_dataset(
    table=table,
    root_path='./calendar_summary/',
    partition_cols=['date'])

# combined size of the partitioned dataset
print(os.popen('du -sh ./calendar_summary/').read())
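To double-check the combined size without relying on du, something like this should give the exact byte total and the number of part files under the same root path:

# sum the exact sizes of all part files under the dataset root
import os

sizes = []
for dirpath, dirnames, filenames in os.walk('./calendar_summary/'):
    for name in filenames:
        sizes.append(os.path.getsize(os.path.join(dirpath, name)))
print('%i files, %i kB combined' % (len(sizes), sum(sizes) / 1000))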