I am having a hard time debugging why a simple query against a Hive external table (DynamoDB-backed) takes north of 10 minutes via spark-submit, while it takes only 4 seconds in the Hive shell.
The Hive external table refers to a DynamoDB table, say Employee[id, name, ssn, dept], where id is the partition key and ssn is the range key.
Using AWS EMR 5.29 with Spark, Hive, Tez, and Hadoop; 1 master and 4 core nodes, m5.l.
In the Hive shell: select name, dept, ssn from employee where id='123/ABC/X12I' returns results in 4 seconds.
Now, let's say I have the following code in code.py:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("test").enableHiveSupport().getOrCreate()
data = spark.sql("select name, dept, ssn from employee where id='123/ABC/X12I'")
# print data or get length, e.g. data.show() or data.count()
I submit the above on the master node as:
spark-submit --jars /pathto/emr-ddb-hive.jar,/pathto/emr-ddb-hadoop.jar code.py
(note: the --jars list must be comma-separated with no space, otherwise spark-submit treats the second jar as the application)
The above spark-submit takes a long time, 14+ minutes. I am not sure which parameter needs to be tweaked or set to get a better response time.
In the Hive shell I ran SET; to view the parameters the shell is using, and there are a gazillion of them.
I also tried querying DynamoDB directly with boto3, and it is far faster than my simple SQL-in-PySpark approach via spark-submit.
I am missing some fundamentals... Any idea or direction is appreciated.