7
votes

Is there any way to run Spark SQL queries with a local master against AWS Glue?

I run this code on my local PC:

SparkSession.builder()
    .master("local")
    .enableHiveSupport()
    .config("hive.metastore.client.factory.class", "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory")
    .getOrCreate()
    .sql("show databases"); // this query isn't running against AWS Glue

EDIT: Based on some examples, it appears that the hive.metastore.uris configuration key should allow specifying a particular metastore URL. However, it's not clear how to get the relevant value for Glue:

SparkSession.builder()
    .master("local")
    .enableHiveSupport()
    .config("hive.metastore.client.factory.class", "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory")
    .config("hive.metastore.uris", "thrift://???:9083")
    .getOrCreate()
    .sql("show databases"); // this query isn't running against AWS Glue
1
I don't think it's possible, for two reasons: 1) You can run Glue code through the UI, boto3, or dev endpoints, and you can also use the AWS Glue Data Catalog in AWS EMR, but as far as I know those are all the options. 2) The Glue service is based on technologies such as Hive and Spark, but it isn't a pure version of them: there are limitations, and the service uses its own library. - j.b.gorski
@j.b.gorski It looks like our Glue serves only as a metadata store and doesn't transform data. So instead of mocking data for integration tests, I can replace the Glue reader with an S3 reader and read the data directly from S3 (enforcing the same schema). The only error-prone point here is enforcing the schema on the CSV dataset read from S3. - VB_
@j.b.gorski What's strange: session.catalog().listDatabases() returns the default database with Glue's description. Spark SQL also returns default when I run show databases. But it doesn't see the other Glue databases. - VB_
Did you manage to find a solution? - Ophir Yoktan
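
The schema-enforcement step mentioned in VB_'s comment can be illustrated with a minimal plain-Java sketch. It is not Spark or Glue code: CsvSchemaCheck, ColType, and parse are hypothetical names made up for this example, and the comma splitting is naive (no quoted fields). In Spark itself the usual route is to pass an explicit schema to the reader rather than let it be inferred. The sketch only shows the fail-fast idea: reject a wrong header, a wrong cell count, or a cell that does not parse as the declared type.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper (not part of Spark or Glue): validates CSV lines against an expected schema.
class CsvSchemaCheck {

    // Column types supported by this sketch.
    enum ColType { STRING, LONG, DOUBLE }

    // Parses CSV lines (header first) and throws on the first schema mismatch.
    static List<Map<String, Object>> parse(List<String> lines, LinkedHashMap<String, ColType> schema) {
        if (lines.isEmpty()) {
            throw new IllegalArgumentException("missing header line");
        }
        List<String> expected = new ArrayList<>(schema.keySet());
        // Naive split: assumes no quoted fields or embedded commas.
        List<String> header = Arrays.asList(lines.get(0).split(",", -1));
        if (!expected.equals(header)) {
            throw new IllegalArgumentException("header " + header + " does not match schema " + expected);
        }
        List<Map<String, Object>> rows = new ArrayList<>();
        for (int i = 1; i < lines.size(); i++) {
            String[] cells = lines.get(i).split(",", -1);
            if (cells.length != expected.size()) {
                throw new IllegalArgumentException(
                        "line " + (i + 1) + ": expected " + expected.size() + " cells, got " + cells.length);
            }
            Map<String, Object> row = new LinkedHashMap<>();
            for (int c = 0; c < cells.length; c++) {
                String name = expected.get(c);
                row.put(name, convert(cells[c], schema.get(name), i + 1, name));
            }
            rows.add(row);
        }
        return rows;
    }

    // Converts one raw cell to the declared type, or throws with the line and column.
    static Object convert(String raw, ColType type, int lineNo, String column) {
        try {
            switch (type) {
                case LONG:
                    return Long.parseLong(raw.trim());
                case DOUBLE:
                    return Double.parseDouble(raw.trim());
                default:
                    return raw;
            }
        } catch (NumberFormatException e) {
            throw new IllegalArgumentException(
                    "line " + lineNo + ", column " + column + ": not a " + type + ": " + raw, e);
        }
    }

    public static void main(String[] args) {
        LinkedHashMap<String, ColType> schema = new LinkedHashMap<>();
        schema.put("id", ColType.LONG);
        schema.put("name", ColType.STRING);
        // Sample data stands in for lines read from S3.
        System.out.println(parse(Arrays.asList("id,name", "1,alice", "2,bob"), schema));
    }
}
```

Failing on the first mismatch keeps an integration test from silently passing on data that the Glue-defined schema would have rejected.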

1 Answer

3
votes

Amazon provides this client, which should solve the problem (I haven't tried it yet):

https://github.com/awslabs/aws-glue-data-catalog-client-for-apache-hive-metastore