I created an AWS EMR Hadoop cluster with 'AWS Glue Data Catalog' used for 'Spark table metadata'. Consequently, in Spark jobs or in spark-shell, I can write Spark SQL that uses Glue/Athena databases and tables.
What happens if someone changes the Athena table location while a Spark job running in EMR is reading the contents of this table?
Let's imagine that I have an Athena table named "item" in the Glue database named "my_db". The Athena table location points to an S3 folder where the Parquet files containing the data are stored. This folder is s3://my_bucket/item_2020_03_02.
A Spark job running in EMR is launched and executes a Spark SQL statement that reads the table contents:
Dataset<Row> df = spark.sql("select * from my_db.item");
df.write().parquet("some_location_in_emr_hdfs");
A few milliseconds later, someone runs this SQL query in the AWS Athena web console:
ALTER TABLE my_db.item SET LOCATION 's3://my_bucket/item_2020_03_03'
The previous location of the Athena table's data is neither deleted nor changed: the folder s3://my_bucket/item_2020_03_02 is left untouched.
What happens in the Spark job?
Does it happily continue to read the data that belonged to the Athena table when the job started, i.e. s3://my_bucket/item_2020_03_02?
Or will it face data inconsistency, with part of the data read from s3://my_bucket/item_2020_03_02 (the old location) and part from s3://my_bucket/item_2020_03_03 (the new location)?
Or will some AWS error be thrown?