
I have Spark jobs running on an EMR cluster. EMR uses AWS Glue as the Hive metastore. The jobs write data to S3 through EMRFS in Parquet format. I read DataFrames with Spark SQL via the `SparkSession#table` method.

Is it possible to configure Spark's Cost-Based Optimizer (CBO) with AWS Glue?

AFAIK, Spark's CBO stores table-level statistics in the metastore. It works with Hive, but it doesn't work with Spark's default metastore (embedded Derby). So my confusion comes down to whether the CBO can use the Glue metastore, given that Spark SQL is already using Glue as its metastore. I suppose the answer is yes, but I'm still not sure.
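For context, the statistics in question are collected with `ANALYZE TABLE`, and the optimizer itself has to be switched on explicitly (it is off by default). A minimal Spark SQL sketch of what would need to work against the Glue metastore, with `my_db.my_table` and the column names as placeholders for your own:

```sql
-- Enable the cost-based optimizer for the session (off by default)
SET spark.sql.cbo.enabled=true;

-- Table-level statistics (row count, size in bytes), persisted to the metastore
-- my_db.my_table is a placeholder for an actual Glue-backed table
ANALYZE TABLE my_db.my_table COMPUTE STATISTICS;

-- Column-level statistics, used for selectivity and join-reordering estimates
-- col_a and col_b are placeholder column names
ANALYZE TABLE my_db.my_table COMPUTE STATISTICS FOR COLUMNS col_a, col_b;
```

Whether the persisted statistics actually land in (and can be read back from) Glue is exactly the open question here.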

You can apply whatever CBO settings Spark supports, even if you are running Spark inside a Glue job. – sumitya
@syadav Why do you think so? The CBO requires a metastore to store additional table metadata after the ANALYZE TABLE command. AFAIK, Glue has some limitations compared to the Hive HCatalog; e.g., per the answer below, Hive's CBO is not supported. I'm wondering whether Glue supports the CBO for Spark jobs. – VB_

1 Answer


Unfortunately, it is not supported. The EMR documentation lists this limitation when using AWS Glue as the metastore:

Cost-based Optimization in Hive is not supported. Changing the value of hive.cbo.enable to true is not supported.

Reference: https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-glue.html
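If you want to verify this empirically on your own cluster, one way is to check whether statistics survive a round trip through the metastore after running `ANALYZE TABLE`. A sketch, with `my_db.my_table` again standing in for a real Glue-backed table:

```sql
-- Placeholder table name; substitute a table registered in Glue
ANALYZE TABLE my_db.my_table COMPUTE STATISTICS;

-- If a "Statistics" entry (sizeInBytes and ideally rowCount) appears in the
-- output, the stats were persisted to and read back from the metastore
DESCRIBE TABLE EXTENDED my_db.my_table;
```

If the statistics entry is missing or only partially populated, the CBO will fall back to size-based estimates rather than the collected stats.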