I am having a hard time debugging why a simple query against a Hive external table (DynamoDB-backed) takes north of 10 minutes via spark-submit, while it takes only 4 seconds in the Hive shell.
The Hive external table refers to a DynamoDB table, say Employee[id, name, ssn, dept], where id is the partition key and ssn is the range key.
Using AWS EMR 5.29 with Spark, Hive, Tez, and Hadoop; 1 master and 4 core nodes, m5.l.
In the Hive shell: select name, dept, ssn from employee where id='123/ABC/X12I' returns results in 4 seconds.
Now, let's say I have the following code in code.py:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("test").enableHiveSupport().getOrCreate()
data = spark.sql("select name, dept, ssn from employee where id='123/ABC/X12I'")
# print data or get length, e.g. data.show() or data.count()
I submit the above on the master node as:
spark-submit --jars /pathto/emr-ddb-hive.jar,/pathto/emr-ddb-hadoop.jar code.py
(note: the --jars list must be comma-separated with no space, otherwise spark-submit treats the second jar as the application)
The above spark-submit takes a long time, 14+ minutes. I am not sure which parameter needs to be tweaked or set to get a better response time.
In the Hive shell I ran SET; to view the parameters the shell is using, and there are a gazillion of them.
I also tried querying DynamoDB directly with boto3, and it is far faster than my simple SQL-in-PySpark approach via spark-submit.
I am missing some fundamentals... Any idea or direction is appreciated.