I am getting the below exception when processing input streams using Spark structured streaming.

org.apache.spark.SparkException: Job aborted due to stage failure: Task 22 in stage 5.0 failed 1 times, most recent failure: Lost task 22.0 in stage 5.0 (TID 403, localhost, executor driver): java.lang.OutOfMemoryError: Java heap space

I have set the watermark as shown below:

    .withWatermark("timestamp", "5 seconds")
    .groupBy(window($"timestamp", "1 second"), $"column")

What could be the issue? I have tried changing the trigger from the default to a fixed interval, but I am still facing the problem.
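
For context, the full query looks roughly like the following; `events` is the streaming DataFrame read from my source, the `.count()` stands in for whatever aggregation I run, and the sink, checkpoint path and 10-second interval are just placeholder values for what I tried:

    // Rough shape of the query; the sink, checkpoint path and aggregation are placeholders.
    import org.apache.spark.sql.functions.window
    import org.apache.spark.sql.streaming.Trigger
    import spark.implicits._  // `spark` is the active SparkSession; needed for the $"..." syntax

    val aggregated = events
      .withWatermark("timestamp", "5 seconds")
      .groupBy(window($"timestamp", "1 second"), $"column")
      .count()                                            // placeholder aggregation

    val query = aggregated.writeStream
      .outputMode("append")
      .format("console")                                  // placeholder sink
      .option("checkpointLocation", "/tmp/checkpoints")   // placeholder path
      .trigger(Trigger.ProcessingTime("10 seconds"))      // fixed interval instead of the default
      .start()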

1 Answer

I don't believe this issue is related to watermarks or triggers. OutOfMemoryError occurs for one of two reasons:

  1. Memory leak. This programming error causes your application to consume more and more memory: every time the leaking part of the application runs, it leaves objects behind in the Java heap. Over time the leaked objects consume all of the available heap space and trigger the error.

  2. Too much data for the resources allocated to it. Your cluster can only hold a certain amount of data. When the volume of data exceeds that threshold, a job that functioned normally before the spike stops working and throws java.lang.OutOfMemoryError: Java heap space.

Your error also mentions task 22.0 in stage 5.0, which means stages 1-4 completed successfully. To me, that signifies there was too much data for the resources allocated to the job, since a memory leak would typically cause failures across multiple runs rather than in a single stage. Try limiting the amount of data read per micro-batch, for example with spark.readStream.option("maxFilesPerTrigger", "6"), or increase the memory assigned to that cluster.
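
As a rough sketch (the input path, schema, and memory values are placeholders to adapt to your own job), those two options look something like this:

    // Sketch only: cap how much data each micro-batch reads, and raise the heap.
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.types._

    val spark = SparkSession.builder()
      .appName("streaming-oom-example")                 // placeholder app name
      .config("spark.executor.memory", "4g")            // placeholder: more heap per executor
      // In local mode (your log shows "executor driver") the heap is the driver JVM,
      // so set it via spark-submit --driver-memory instead of here.
      .getOrCreate()

    // File sources need an explicit schema for streaming reads; this one is a placeholder.
    val inputSchema = new StructType()
      .add("timestamp", TimestampType)
      .add("column", StringType)

    val input = spark.readStream
      .schema(inputSchema)
      .option("maxFilesPerTrigger", "6")                // read at most 6 new files per micro-batch
      .json("/path/to/input")                           // placeholder source path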