I run Spark SQL queries that read from Hive tables, and they take a long time to execute (about 15 minutes). I want to optimize them, so I would like to understand how they are executed: does Spark SQL use Hive's execution engine, so that running these queries is equivalent to running them in the Hive editor? Or does Spark use the Hive Metastore only to look up where the table files live, and then read those files directly itself?
import os
import findspark
findspark.init()  # locate the local Spark installation and put it on sys.path
from pyspark.sql import SparkSession

# Build a YARN-backed session; dynamic allocation requires the external shuffle service.
spark = SparkSession.builder \
    .master("yarn") \
    .appName("src_count") \
    .config('spark.executor.cores', '5') \
    .config('spark.executor.memory', '29g') \
    .config('spark.driver.memory', '16g') \
    .config('spark.driver.maxResultSize', '12g') \
    .config("spark.dynamicAllocation.enabled", "true") \
    .config("spark.shuffle.service.enabled", "true") \
    .getOrCreate()
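# Note: reading Hive tables requires the Hive-backed catalog; my cluster enables it
# by default, otherwise .enableHiveSupport() would have to be added to the builder
# above. Sanity check (returns "hive" when the metastore-backed catalog is active):
print(spark.conf.get("spark.sql.catalogImplementation"))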
sql = "SELECT S.SERVICE, \
COUNT(DISTINCT CONTRACT_KEY) DISTINCT_CNT, \
COUNT(*) CNT ... "
df = spark.sql(sql)  # run the query and get back a DataFrame
df.toPandas()        # collect the full result set to the driver
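One thing I tried, in case it helps (a minimal sketch re-using the `spark` session and `sql` string above), is printing the physical plan and looking at the scan node. My understanding, which I'd like confirmed, is that a `FileScan` node means Spark got only the table location and schema from the Hive Metastore and scans the files with its own readers, while a `HiveTableScan` node means it reads through Hive's SerDe:

df = spark.sql(sql)
df.explain()  # look for "FileScan parquet/orc ..." vs "HiveTableScan ..."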