Reading a single day from a large Parquet dataset is very slow despite using the filters argument. Is there something wrong with how I am using the filters? If so, what can I do to rectify the issue?
I have a large Parquet dataset in S3, consisting of millions of files, generated by a Spark job and Hive-partitioned. The partitioning scheme looks like this:
s3://bucket-name/dataset-name/year=2019/month=5/day=23
The code is as follows:
import s3fs
import pyarrow.parquet as pq

s3_path = 's3://bucket-name/dataset-name'
s3 = s3fs.S3FileSystem()

# Select a single day via partition-key filters rather than by path.
hive_filter = [('year', '=', 2019), ('month', '=', 11), ('day', '=', 13)]
pyarrow_dataset = pq.ParquetDataset(s3_path, filesystem=s3, filters=hive_filter)
pyarrow_table = pyarrow_dataset.read()
If I specify the S3 path down to the individual day, this runs very fast. Unfortunately, that won't work in general: I will often want to select several days' worth of data at a time, and I don't want to scan the whole dataset. My current workaround is a function that effectively implements the filter myself, using an S3 client to list the matching directories; see the sketch below.
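For reference, a minimal sketch of that workaround, assuming the day-level prefixes can be constructed directly from the filter values (the helper name is mine, and I use s3fs's glob for the listing rather than a raw boto3 client):

import s3fs
import pyarrow.parquet as pq

s3 = s3fs.S3FileSystem()

def list_day_files(prefix, year, month, day):
    # Build the partition prefix directly from the filter values, so
    # only the requested day is listed instead of the whole dataset.
    return s3.glob(f'{prefix}/year={year}/month={month}/day={day}/*.parquet')

# Gather a few specific days (the dates here are just an illustration).
files = []
for y, m, d in [(2019, 11, 13), (2019, 11, 14)]:
    files.extend(list_day_files('bucket-name/dataset-name', y, m, d))

# Note: when given an explicit file list, ParquetDataset may not
# reconstruct the year/month/day partition columns in the result.
pyarrow_table = pq.ParquetDataset(files, filesystem=s3).read()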
Additional information:
- When testing in a Jupyter notebook, the first run is very slow, but subsequent runs are as fast as when I specify the S3 path down to the day (see the timing sketch below). I suspect some kind of caching explains this. Unfortunately, I need the first run to be performant.
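To illustrate, this is roughly how I measured it, a plain timing sketch around the same call as above (nothing here is pyarrow-specific):

import time
import s3fs
import pyarrow.parquet as pq

s3 = s3fs.S3FileSystem()
hive_filter = [('year', '=', 2019), ('month', '=', 11), ('day', '=', 13)]

# Cold run: slow, apparently dominated by dataset discovery.
start = time.perf_counter()
pq.ParquetDataset('s3://bucket-name/dataset-name',
                  filesystem=s3, filters=hive_filter).read()
print(f'cold run: {time.perf_counter() - start:.1f}s')

# Repeating the identical call in the same process is fast, which is
# why I suspect listing/metadata results are being cached somewhere.
start = time.perf_counter()
pq.ParquetDataset('s3://bucket-name/dataset-name',
                  filesystem=s3, filters=hive_filter).read()
print(f'warm run: {time.perf_counter() - start:.1f}s')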