I'm experimenting with the Parquet format. I have a Parquet file of events, each consisting of a timestamp, a topic, and tags. The file is sorted by topic and then by timestamp (a rough sketch of how such a file can be written is included after the queries below). I run a query that can be described as:
select topic from T where topic = 404;
it runs fast and returns very few rows. In particular, it runs much faster than:
select topic from T;
When I change it to something like:
select tags from T where topic = 404;
it runs as slowly as running
select tags from T;
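
For context, here is a minimal sketch (Spark/Scala) of how a file with the layout described above can be written; the sample data, column names, and output path are placeholders for illustration only:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("sorted-parquet-example").getOrCreate()
import spark.implicits._

// Hypothetical events with the layout described above: timestamp, topic, tags.
val events = Seq(
  (1000L, 404L, Seq("a", "b")),
  (1001L, 404L, Seq("c")),
  (1002L, 500L, Seq("d"))
).toDF("timestamp", "topic", "tags")

// Sorting by topic and then timestamp keeps each row group's topic range narrow,
// so Parquet min/max statistics can be used to skip row groups on topic.
events
  .sort($"topic", $"timestamp")
  .write
  .parquet("file:/example-path")  // placeholder path
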
Analyzing the plan (in Spark), it seems that predicate pushdown is applied, but judging by the performance I assume it does not apply to the tags column.
I tested with Hive, Spark, and Presto. Is there anything I can do about this, or is there another technology that handles Parquet nested arrays better?
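
For reference, the slow query expressed through the DataFrame API looks roughly like this (continuing the sketch above, same placeholder path); calling explain() prints the physical plan shown below:

// Equivalent of: select tags from T where topic = 404;
val df = spark.read.parquet("file:/example-path")  // placeholder path

df.filter($"topic" === 404L)
  .select("tags")
  .explain()  // prints the physical plan, including PushedFilters
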
- Execution plan in Spark:
== Physical Plan ==
*Project [tags#4]
+- *Filter (isnotnull(topic#3L) && (topic#3L = 404))
+- *FileScan parquet [topic#3L,tags#4] Batched: false, Format: Parquet, Location: InMemoryFileIndex[file:/example-path], PartitionFilters: [], PushedFilters: [IsNotNull(topic), EqualTo(topic,404)], ReadSchema: struct>
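
In case it is relevant, these are the Spark settings that, as far as I know, control Parquet predicate pushdown and the batched (vectorized) reader, which I believe is what the Batched flag in the plan refers to; a quick way to check their current values:

// Both are standard Spark SQL config keys; as far as I know they default to true in recent versions.
println(spark.conf.get("spark.sql.parquet.filterPushdown"))         // row-group level predicate pushdown
println(spark.conf.get("spark.sql.parquet.enableVectorizedReader")) // batched columnar reads
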
Thanks, Roee