I'm experimenting with the Parquet format. I have a Parquet file of events, each consisting of a timestamp, a topic, and tags. The file is sorted by topic and then by timestamp (a rough sketch of how such a file can be written is included after the queries below). I run a query that can be described as:
select topic from T where topic = 404;
it runs fast and returns very few rows. In particular, it runs much faster than:
select topic from T;
When I change it to something like:
select tags from T where topic = 404;
it runs as slowly as running
select tags from T;
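
For context, here is a minimal sketch (Spark/Scala) of how a file with the layout described above can be written; the sample data, column names, and output path are placeholders for illustration only:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("sorted-parquet-example").getOrCreate()
import spark.implicits._

// Hypothetical events with the layout described above: timestamp, topic, tags.
val events = Seq(
  (1000L, 404L, Seq("a", "b")),
  (1001L, 404L, Seq("c")),
  (1002L, 500L, Seq("d"))
).toDF("timestamp", "topic", "tags")

// Sorting by topic and then timestamp keeps each row group's topic range narrow,
// so Parquet min/max statistics can be used to skip row groups on topic.
events
  .sort($"topic", $"timestamp")
  .write
  .parquet("file:/example-path")  // placeholder path
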
Analyzing the plan (in Spark), it seems that predicate pushdown is applied, but judging by the performance I assume it does not apply to the tags column.
I tested with Hive, Spark, and Presto. Is there anything I can do about this, or is there another technology that handles Parquet nested arrays better?
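
For reference, the slow query expressed through the DataFrame API looks roughly like this (continuing the sketch above, same placeholder path); calling explain() prints the physical plan shown below:

// Equivalent of: select tags from T where topic = 404;
val df = spark.read.parquet("file:/example-path")  // placeholder path

df.filter($"topic" === 404L)
  .select("tags")
  .explain()  // prints the physical plan, including PushedFilters
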
- Execution plan in Spark:
== Physical Plan ==
*Project [tags#4]
+- *Filter (isnotnull(topic#3L) && (topic#3L = 404))
+- *FileScan parquet [topic#3L,tags#4] Batched: false, Format: Parquet, Location: InMemoryFileIndex[file:/example-path], PartitionFilters: [], PushedFilters: [IsNotNull(topic), EqualTo(topic,404)], ReadSchema: struct>
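
In case it is relevant, these are the Spark settings that, as far as I know, control Parquet predicate pushdown and the batched (vectorized) reader, which I believe is what the Batched flag in the plan refers to; a quick way to check their current values:

// Both are standard Spark SQL config keys; as far as I know they default to true in recent versions.
println(spark.conf.get("spark.sql.parquet.filterPushdown"))         // row-group level predicate pushdown
println(spark.conf.get("spark.sql.parquet.enableVectorizedReader")) // batched columnar reads
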
Thanks, Roee