
I have a simple Spark Structured Streaming app that reads from Kafka and writes to HDFS. Today the app has mysteriously stopped working, with no code or configuration changes whatsoever (it had been running flawlessly for weeks).

So far, I have observed the following:

  • App has no active, failed or completed tasks
  • App UI shows no jobs and no stages
  • QueryProgress indicates 0 input rows every trigger
  • QueryProgress indicates offsets from Kafka were read and committed correctly (which means data is actually there)
  • Data is indeed available in the topic (writing to console shows the data)
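The console check mentioned above can be reproduced with a short sketch. This is a hedged example, not the exact code used: it assumes the same `spark` session and `bootstrap_servers` variable as the snippet below, and `"topic-name-here"` stands in for the real topic name.

```scala
// Debug sketch: read the same topic but write to the console sink instead of
// HDFS, to confirm records are actually arriving. Requires a running Spark
// session with the spark-sql-kafka connector on the classpath.
val debugQuery = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", bootstrap_servers)
  .option("subscribe", "topic-name-here")           // placeholder topic name
  .option("startingOffsets", "latest")
  .load()
  .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .writeStream
  .format("console")
  .start()

// Programmatic view of the QueryProgress observations above:
// numInputRows and the committed Kafka offsets for the last trigger.
println(debugQuery.lastProgress)
```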

Despite all of that, nothing is being written to HDFS anymore. Code snippet:

val inputData = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", bootstrap_servers)
  .option("subscribe", "topic-name-here")
  .option("startingOffsets", "latest")
  .option("failOnDataLoss", "false")
  .load()

inputData.toDF()
  .repartition(10)
  .writeStream
  .format("parquet")
  .option("checkpointLocation", "hdfs://...")
  .option("path", "hdfs://...")
  .outputMode(OutputMode.Append())
  .trigger(Trigger.ProcessingTime("60 seconds"))
  .start()

Any ideas why the UI shows no jobs/tasks?

[Screenshot: No jobs for the application]

[Screenshot: No tasks and basically no activity]

[Screenshot: Query Progress]


1 Answer


For anyone facing the same issue: I found the culprit:

Somehow, the _spark_metadata directory inside the HDFS output directory I was writing to had become corrupted.

The solution was to erase that directory and restart the application, which re-created it. After that, data started flowing again.
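For reference, the cleanup can be done from the driver with the Hadoop FileSystem API. This is only a sketch under assumptions: stop the streaming query first, and note that `/path/to/output` is a hypothetical placeholder for the question's elided `hdfs://...` output path, not a real location.

```scala
import org.apache.hadoop.fs.{FileSystem, Path}

// Hedged sketch: remove the corrupted file-sink metadata so the restarted
// query can re-create it. Uses the cluster's Hadoop configuration from the
// active Spark session.
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)

// "/path/to/output" is a placeholder for the actual HDFS output directory.
fs.delete(new Path("/path/to/output/_spark_metadata"), true) // recursive delete
```

The same effect can be had from the command line with `hdfs dfs -rm -r` on that directory; either way, restarting the application re-creates `_spark_metadata`.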