
I have Spark jobs running on an EMR cluster. EMR uses AWS Glue as the Hive metastore. The jobs write data to S3 through EMRFS in Parquet format. I read DataFrames with Spark SQL via the `SparkSession#table` method.

Is it possible to configure Spark's Cost-Based Optimizer (CBO) with AWS Glue?

AFAIK, Spark's CBO stores table-level statistics in the metastore. It works with Hive, but it doesn't work with Spark's default metastore (embedded Derby). So my confusion comes down to whether the CBO can use the Glue metastore, given that Spark SQL is already using Glue as its metastore. I suppose the answer is yes, but I'm still not sure.
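For context, the statistics in question are collected with `ANALYZE TABLE`, and the optimizer itself has to be switched on explicitly (it is off by default). A minimal Spark SQL sketch of what would need to work against the Glue metastore, with `my_db.my_table` and the column names as placeholders for your own:

```sql
-- Enable the cost-based optimizer for the session (off by default)
SET spark.sql.cbo.enabled=true;

-- Table-level statistics (row count, size in bytes), persisted to the metastore
-- my_db.my_table is a placeholder for an actual Glue-backed table
ANALYZE TABLE my_db.my_table COMPUTE STATISTICS;

-- Column-level statistics, used for selectivity and join-reordering estimates
-- col_a and col_b are placeholder column names
ANALYZE TABLE my_db.my_table COMPUTE STATISTICS FOR COLUMNS col_a, col_b;
```

Whether the persisted statistics actually land in (and can be read back from) Glue is exactly the open question here.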

You can apply whatever CBO settings Spark supports, even if you are running Spark inside a Glue job. – sumitya
@syadav Why do you think so? The CBO requires a metastore to store additional table metadata after the ANALYZE TABLE command. AFAIK, Glue has some limitations compared to the Hive HCatalog; e.g., per the answer below, Hive's CBO is not supported. I'm wondering whether Glue supports the CBO for Spark jobs. – VB_

1 Answer


Unfortunately, it is not supported. The EMR documentation lists this limitation when using AWS Glue as the metastore:

Cost-based Optimization in Hive is not supported. Changing the value of hive.cbo.enable to true is not supported.

Reference: https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-glue.html
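If you want to verify this empirically on your own cluster, one way is to check whether statistics survive a round trip through the metastore after running `ANALYZE TABLE`. A sketch, with `my_db.my_table` again standing in for a real Glue-backed table:

```sql
-- Placeholder table name; substitute a table registered in Glue
ANALYZE TABLE my_db.my_table COMPUTE STATISTICS;

-- If a "Statistics" entry (sizeInBytes and ideally rowCount) appears in the
-- output, the stats were persisted to and read back from the metastore
DESCRIBE TABLE EXTENDED my_db.my_table;
```

If the statistics entry is missing or only partially populated, the CBO will fall back to size-based estimates rather than the collected stats.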