1
votes

I'm using HDP 2.6.4 and am seeing huge differences in Spark SQL vs Hive on TeZ. Here's a simple query on a table of ~95 M rows

SELECT DT, Sum(1) from mydata GROUP BY DT

DT is partition column, a string that marks date.

In spark shell, with 15 executors, 10G memory for driver and 15G for executor, query runs for 10-15 seconds.

When running on Hive (from beeline), the query runs (actually is still running) for 500+ seconds. (!!!) To make things worse, this application takes even more resources (significantly) than the spark shell session I ran the job in.

UPDATE: It finished 1 row selected (672.152 seconds)

More information about the environment:

  • Only one queue used, with capacity scheduler

  • User under which the job is running is my own user. We have Kerberos used with LDAP

  • AM Resource: 4096 MB

  • using tez.runtime.compress with Snappy

  • data is in Parquet format, no compression applied

  • tez.task.resource.memory 6134 MB

  • tez.counters.max 10000

  • tez.counters.max.groups 3000

  • tez.runtime.io.sort.mb 8110 MB

  • tez.runtime.pipelined.sorter.sort.threads 2

  • tez.runtime.shuffle.fetch.buffer.percent 0.6

  • tez.runtime.shuffle.memory.limit.percent 0.25

  • tez.runtime.unordered.output.buffer.size-mb 460 MB

  • Enable Vectorization and Map Vectorization true

  • Enable Reduce Vectorization false

  • hive.vectorized.groupby.checkinterval 4096

  • hive.vectorized.groupby.flush.percent 0.1

  • hive.tez.container.size 682

More Updates:

When checking about vectorization on this link, I noticed I don't see Vectorized execution: true anywhere when I used explain. Another thing that caught my attention is the following: table:{"input format:":"org.apache.hadoop.mapred.TextInputFormat","output format:":"org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat","serde:":"org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe"}

Namely, when checking table itself: STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' and OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'

Any comparisons between spark and tez usually come to relatively same terms, but I'm seeing dramatic differences.

What shd be the first thing to check?

Thx

1
Try to find what exactly is running slow: mappers, reducers, check slow containers logs. How many mappers and reducers are running. Also Tez configuration is important. Now the question is too broad. Also it seems you are not using partition statistics for query calculation. For such simple query it should work fast. Better use count(*) instead of sum(1) - leftjoin
thank you, but this is just a sample query... it's pretty much like this for other types, too. i'll go into other things and update as I find things out - hummingBird
What is the file format -- CSV, AVRO, ORC, Parquet? Compressed? Does Hive run the query as hive on a different queue than your personal Spark session uses? Tez container size? Etc... - Samson Scharfrichter
hive.vectorized.execution.enabled? - Samson Scharfrichter
hive.tez.container.size? if not set, mapreduce.map.memory.mb? - Samson Scharfrichter

1 Answers

0
votes

In the end, we gave up and installed LLAP. I'm going to accept it as an answer, as I have sort of an OCD and this unanswered question has been poking my eyes for long enough.