
I have 16 receivers in a Spark Streaming 2.2.1 job. After a while, some of the receivers process fewer and fewer records, eventually handling only one record per second. The behaviour can be seen in the screenshot:

[Screenshot: per-receiver record rates declining over time]

While I understand the root cause can be difficult to find and may not be obvious, is there a way I could debug this problem further? Currently I have no idea where to start digging. Could it be related to back-pressure?

Spark streaming properties:

spark.app.id    application_1599135282140_1222
spark.cores.max 64
spark.driver.cores  4
spark.driver.extraJavaOptions -XX:+PrintFlagsFinal -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/dump/ -Dlog4j.configuration=file:///tmp/4f892127ad794245aef295c97ccbc5c9/driver_log4j.properties
spark.driver.maxResultSize  3840m
spark.driver.memory 4g
spark.driver.port   36201
spark.dynamicAllocation.enabled false
spark.dynamicAllocation.maxExecutors    10000
spark.dynamicAllocation.minExecutors    1
spark.eventLog.enabled  false
spark.executor.cores    4
spark.executor.extraJavaOptions -XX:+PrintFlagsFinal -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/dump/
spark.executor.id   driver
spark.executor.instances    16
spark.executor.memory   4g
spark.jars  file:/tmp/4f892127ad794245aef295c97ccbc5c9/main-e41d1cc.jar
spark.master    yarn
spark.rpc.message.maxSize   512
spark.scheduler.maxRegisteredResourcesWaitingTime   300s
spark.scheduler.minRegisteredResourcesRatio 1.0
spark.scheduler.mode    FAIR
spark.shuffle.service.enabled   true
spark.sql.cbo.enabled   true
spark.streaming.backpressure.enabled    true
spark.streaming.backpressure.initialRate    25
spark.streaming.backpressure.pid.minRate    1
spark.streaming.concurrentJobs  1
spark.streaming.receiver.maxRate    100
spark.submit.deployMode client

1 Answer


It seems the problem started manifesting after the job had been running for about 30 minutes. I think back-pressure could be the reason. According to this article:

With activated backpressure, the driver monitors the current batch scheduling delays and processing times and dynamically adjusts the maximum rate of the receivers. The communication of new rate limits can be verified in the receiver log:

2016-12-06 08:27:02,572 INFO org.apache.spark.streaming.receiver.ReceiverSupervisorImpl Received a new rate limit: 51.
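To see how hard each receiver is being throttled over time, you can extract these rate-limit announcements from the aggregated executor logs. A minimal sketch, assuming the log lines follow the format quoted above:

```python
import re

# Matches the ReceiverSupervisorImpl message quoted above and captures the rate.
RATE_RE = re.compile(r"Received a new rate limit: (\d+(?:\.\d+)?)")

def extract_rate_limits(log_lines):
    """Return the sequence of rate limits announced in a receiver log."""
    rates = []
    for line in log_lines:
        m = RATE_RE.search(line)
        if m:
            rates.append(float(m.group(1)))
    return rates

sample = ["2016-12-06 08:27:02,572 INFO org.apache.spark.streaming.receiver."
          "ReceiverSupervisorImpl Received a new rate limit: 51."]
print(extract_rate_limits(sample))  # -> [51.0]
```

If the extracted rates trend steadily downward for the slow receivers, that is strong evidence the backpressure controller is the one throttling them.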

Here is what I would recommend trying:

  1. Check the receiver logs to see if backpressure is being triggered.
  2. Check your stream sink for errors.
  3. Check the YARN ResourceManager for resource utilization.
  4. Tune the Spark parameters to see if that makes a difference.
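Regarding point 4, note how the rate-related properties in the posted config interact: whatever rate the backpressure controller suggests is bounded below by spark.streaming.backpressure.pid.minRate (1 in your config) and above by spark.streaming.receiver.maxRate (100). A floor of 1 would match the observed one record/second exactly. The function below is only an illustrative clamp over those bounds, not Spark's actual PID implementation:

```python
def effective_rate(pid_suggested_rate: float,
                   pid_min_rate: float = 1.0,       # spark.streaming.backpressure.pid.minRate
                   receiver_max_rate: float = 100.0  # spark.streaming.receiver.maxRate
                   ) -> float:
    """Clamp the controller's suggested rate to the configured bounds.

    Illustrative sketch only: Spark derives the suggested rate from batch
    scheduling delay and processing time via a PID controller; this function
    just shows how the configured floor and ceiling bound the result.
    """
    return max(pid_min_rate, min(pid_suggested_rate, receiver_max_rate))

# With the posted config, a heavily throttled receiver bottoms out at 1 rec/s:
print(effective_rate(0.3))    # -> 1.0
print(effective_rate(250.0))  # -> 100.0
```

If the one-record/second floor is indeed the PID minimum kicking in, raising spark.streaming.backpressure.pid.minRate (or investigating why processing times grow enough to drive the rate that low) would be the next step.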