Tez VS Spark - huge performance diffs

Question

I'm using HDP 2.6.4 and am seeing huge differences in Spark SQL vs Hive on TeZ. Here's a simple query on a table of ~95 M rows

SELECT DT, Sum(1) from mydata GROUP BY DT

DT is partition column, a string that marks date.

In spark shell, with 15 executors, 10G memory for driver and 15G for executor, query runs for 10-15 seconds.

When running on Hive (from beeline), the query runs (actually is still running) for 500+ seconds. (!!!) To make things worse, this application takes even more resources (significantly) than the spark shell session I ran the job in.

UPDATE: It finished 1 row selected (672.152 seconds)

More information about the environment:

Only one queue used, with capacity scheduler
User under which the job is running is my own user. We have Kerberos used with LDAP
AM Resource: 4096 MB
using tez.runtime.compress with Snappy
data is in Parquet format, no compression applied
tez.task.resource.memory 6134 MB
tez.counters.max 10000
tez.counters.max.groups 3000
tez.runtime.io.sort.mb 8110 MB
tez.runtime.pipelined.sorter.sort.threads 2
tez.runtime.shuffle.fetch.buffer.percent 0.6
tez.runtime.shuffle.memory.limit.percent 0.25
tez.runtime.unordered.output.buffer.size-mb 460 MB
Enable Vectorization and Map Vectorization true
Enable Reduce Vectorization false
hive.vectorized.groupby.checkinterval 4096
hive.vectorized.groupby.flush.percent 0.1
hive.tez.container.size 682

More Updates:

When checking about vectorization on this link, I noticed I don't see Vectorized execution: true anywhere when I used explain. Another thing that caught my attention is the following: table:{"input format:":"org.apache.hadoop.mapred.TextInputFormat","output format:":"org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat","serde:":"org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe"}

Namely, when checking table itself: STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' and OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'

Any comparisons between spark and tez usually come to relatively same terms, but I'm seeing dramatic differences.

What shd be the first thing to check?

Thx

Try to find what exactly is running slow: mappers, reducers, check slow containers logs. How many mappers and reducers are running. Also Tez configuration is important. Now the question is too broad. Also it seems you are not using partition statistics for query calculation. For such simple query it should work fast. Better use count(*) instead of sum(1) — leftjoin
thank you, but this is just a sample query... it's pretty much like this for other types, too. i'll go into other things and update as I find things out — hummingBird
What is the file format -- CSV, AVRO, ORC, Parquet? Compressed? Does Hive run the query as hive on a different queue than your personal Spark session uses? Tez container size? Etc... — Samson Scharfrichter
hive.tez.container.size? if not set, mapreduce.map.memory.mb? — Samson Scharfrichter

hummingBird hummingBird · Accepted Answer · 2020-04-07T10:28:29

In the end, we gave up and installed LLAP. I'm going to accept it as an answer, as I have sort of an OCD and this unanswered question has been poking my eyes for long enough.

Tez VS Spark - huge performance diffs

1 Answers