I have multiple queries running on the same spark structured streaming session. The queries are writing parquet records to Google Bucket and checkpoint to Google Bucket.
val query1 = df1
.select(col("key").cast("string"),from_json(col("value").cast("string"), schema, Map.empty[String, String]).as("data"))
.select("key","data.*")
.writeStream.format("parquet").option("path", path).outputMode("append")
.option("checkpointLocation", checkpoint_dir1)
.partitionBy("key")/*.trigger(Trigger.ProcessingTime("5 seconds"))*/
.queryName("query1").start()
val query2 = df2.select(col("key").cast("string"),from_json(col("value").cast("string"), schema, Map.empty[String, String]).as("data"))
.select("key","data.*")
.writeStream.format("parquet").option("path", path).outputMode("append")
.option("checkpointLocation", checkpoint_dir2)
.partitionBy("key")/*.trigger(Trigger.ProcessingTime("5 seconds"))*/
.queryName("query2").start()
Problem: Sometimes job fails with ava.lang.IllegalStateException: Race while writing batch 4
Logs:
Caused by: java.lang.IllegalStateException: Race while writing batch 4
at org.apache.spark.sql.execution.streaming.ManifestFileCommitProtocol.commitJob(ManifestFileCommitProtocol.scala:67)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:187)
... 20 more
20/07/24 19:40:15 INFO SparkContext: Invoking stop() from shutdown hook