
I created an AWS EMR Hadoop cluster with 'AWS Glue Data Catalog' used 'for Spark table metadata'. Consequently, in Spark jobs or in spark-shell, I can write Spark SQL that uses Glue/Athena databases and tables.

What happens if someone changes the Athena table location while a Spark job running in EMR is reading the content of this table?

Let's imagine that I have an Athena table named "item" in the Glue database named "my_db". The Athena table location points to an S3 folder where the Parquet files containing the data are stored. This folder is s3://my_bucket/item_2020_03_02.

A Spark job running in EMR is launched and processes a Spark SQL string that reads the table content:

Dataset<Row> df = spark.sql("select * from my_db.item");

A few milliseconds later, someone runs this SQL query in the AWS Athena web console:

ALTER TABLE my_db.item SET LOCATION 's3://my_bucket/item_2020_03_03'

The previous location of the Athena table's data is neither deleted nor changed. The folder s3://my_bucket/item_2020_03_02 is unchanged.

What happens in the Spark job?

Does it happily continue to read the data that belonged to the Athena table when the job started, i.e. s3://my_bucket/item_2020_03_02?

Or will it face data inconsistency, with part of the data read from s3://my_bucket/item_2020_03_02 (the old location) and part from s3://my_bucket/item_2020_03_03 (the new location)?

Or will some AWS error be thrown?

Comment (Lamanus): Simply an error, I think.

1 Answer


Ideally, there should not be any error. If your Spark job is already running and has read the previous location by the time you change the table in Athena, Spark will end up writing data from s3://my_bucket/item_2020_03_02 to some_location_in_emr_hdfs.

If the change is performed before Spark starts reading the table data, it will read data from the new location.

It will read from either the old or the new location, depending on when the change was actually applied to the table in Athena.
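
To make the timing argument concrete, here is a toy model in plain Java. It is not Spark or Glue code. It only assumes, as described above, that the table location is looked up once when the read is set up and is not looked up again afterwards. The names LocationSnapshotDemo, catalog and PlannedRead are hypothetical and exist only for this illustration.

```java
import java.util.HashMap;
import java.util.Map;

public class LocationSnapshotDemo {
    // Hypothetical stand-in for the Glue Data Catalog: table name -> S3 location.
    static final Map<String, String> catalog = new HashMap<>();

    // Hypothetical stand-in for a read that has already been set up.
    // Assumption: the location is resolved once, in the constructor,
    // and later catalog changes are not seen by this object.
    static final class PlannedRead {
        private final String location;

        PlannedRead(String table) {
            this.location = catalog.get(table);
        }

        String location() {
            return location;
        }
    }

    // Returns { location seen by the early job, location seen by the late job }.
    static String[] run() {
        catalog.put("my_db.item", "s3://my_bucket/item_2020_03_02");

        // Job that resolved the table before the change.
        PlannedRead early = new PlannedRead("my_db.item");

        // Models: ALTER TABLE my_db.item SET LOCATION 's3://my_bucket/item_2020_03_03'
        catalog.put("my_db.item", "s3://my_bucket/item_2020_03_03");

        // Job that resolved the table after the change.
        PlannedRead late = new PlannedRead("my_db.item");

        return new String[] { early.location(), late.location() };
    }

    public static void main(String[] args) {
        String[] seen = run();
        System.out.println("resolved before the change: " + seen[0]);
        System.out.println("resolved after the change:  " + seen[1]);
    }
}
```

Under that assumption, a given read sees either the old folder or the new one, never a mix of both. Whether a real job behaves this way depends on when Spark actually resolves the table, so treat this as a sketch of the reasoning rather than a guarantee.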