Reading a single day from a large Parquet dataset is very slow despite using the filters argument. Is there something wrong with how I am using the filters? If so, what can I do to rectify the issue?
I have a large Parquet dataset in S3, consisting of millions of files, generated by a Spark job and Hive-partitioned. The partitioning scheme looks like this:
s3://bucket-name/dataset-name/year=2019/month=5/day=23
The code is as follows:
import s3fs
import pyarrow.parquet as pq

s3_path = 's3://bucket-name/dataset-name'
s3 = s3fs.S3FileSystem()

# Select a single day via partition-key filters rather than by path.
hive_filter = [('year', '=', 2019), ('month', '=', 11), ('day', '=', 13)]
pyarrow_dataset = pq.ParquetDataset(s3_path, filesystem=s3, filters=hive_filter)
pyarrow_table = pyarrow_dataset.read()
If I specify the S3 path down to the individual day, this runs very fast. Unfortunately, that won't work in general: I will often want to select several days' worth of data at a time, and I don't want to scan the whole dataset. My current workaround is a function that effectively implements the filter myself, using an S3 client to list the matching directories; see the sketch below.
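For reference, a minimal sketch of that workaround, assuming the day-level prefixes can be constructed directly from the filter values (the helper name is mine, and I use s3fs's glob for the listing rather than a raw boto3 client):

import s3fs
import pyarrow.parquet as pq

s3 = s3fs.S3FileSystem()

def list_day_files(prefix, year, month, day):
    # Build the partition prefix directly from the filter values, so
    # only the requested day is listed instead of the whole dataset.
    return s3.glob(f'{prefix}/year={year}/month={month}/day={day}/*.parquet')

# Gather a few specific days (the dates here are just an illustration).
files = []
for y, m, d in [(2019, 11, 13), (2019, 11, 14)]:
    files.extend(list_day_files('bucket-name/dataset-name', y, m, d))

# Note: when given an explicit file list, ParquetDataset may not
# reconstruct the year/month/day partition columns in the result.
pyarrow_table = pq.ParquetDataset(files, filesystem=s3).read()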
Additional information:
- When testing in a Jupyter notebook, the first run is very slow, but subsequent runs are as fast as when I specify the S3 path down to the day (see the timing sketch below). I suspect some kind of caching explains this. Unfortunately, I need the first run to be performant.
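To illustrate, this is roughly how I measured it, a plain timing sketch around the same call as above (nothing here is pyarrow-specific):

import time
import s3fs
import pyarrow.parquet as pq

s3 = s3fs.S3FileSystem()
hive_filter = [('year', '=', 2019), ('month', '=', 11), ('day', '=', 13)]

# Cold run: slow, apparently dominated by dataset discovery.
start = time.perf_counter()
pq.ParquetDataset('s3://bucket-name/dataset-name',
                  filesystem=s3, filters=hive_filter).read()
print(f'cold run: {time.perf_counter() - start:.1f}s')

# Repeating the identical call in the same process is fast, which is
# why I suspect listing/metadata results are being cached somewhere.
start = time.perf_counter()
pq.ParquetDataset('s3://bucket-name/dataset-name',
                  filesystem=s3, filters=hive_filter).read()
print(f'warm run: {time.perf_counter() - start:.1f}s')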