The Kafka direct consumer started limiting reads to 450 events (5 * 90 partitions) per 5-second batch. It had been running fine for 1 or 2 days before that, pulling roughly 5,000 to 40,000 events per batch.
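That cap works out to exactly one record per second per partition over the 5-second batch, which is why it looks like a rate limit rather than a data problem; a quick check of the arithmetic (the implied per-partition rate is an inference from those numbers, not a setting I've configured):

val partitions     = 90     // topic partition count
val batchSeconds   = 5      // batch interval
val eventsPerBatch = 450    // observed cap
// 450 / (90 * 5) = 1.0 record per second per partition
val impliedRatePerPartition = eventsPerBatch.toDouble / (partitions * batchSeconds)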
I'm using a Spark standalone cluster (Spark and spark-streaming-kafka version 1.6.1) running in AWS, with an S3 bucket as the checkpoint directory: StreamingContext.getOrCreate(config.sparkConfig.checkpointDir, createStreamingContext). There are no scheduling delays, and every worker node has enough disk space.
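For reference, the context is created roughly like this (a simplified sketch: the getOrCreate and checkpoint wiring match my setup, while the app name and the body of createStreamingContext are illustrative):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// config is the application's own configuration object mentioned above
def createStreamingContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("kafka-direct-consumer")  // illustrative name
  val ssc  = new StreamingContext(conf, Seconds(5))               // 5-second batches
  ssc.checkpoint(config.sparkConfig.checkpointDir)                // S3 checkpoint dir
  // ... build the direct stream and the processing graph here ...
  ssc
}

val ssc = StreamingContext.getOrCreate(config.sparkConfig.checkpointDir, createStreamingContext _)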
I didn't change any Kafka client initialization parameters, and I'm pretty sure Kafka's topic structure hasn't changed:
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils

val kafkaParams = Map("metadata.broker.list" -> kafkaConfig.broker)
val topics = Set(kafkaConfig.topic)
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, topics)
Also, why do I still need to use a checkpoint directory when creating the streaming context, when the direct consumer's description says "The consumed offsets are tracked by the stream itself"?
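If the stream already tracks its offsets, my guess is that the checkpoint is only needed so the driver can recover them (and any unfinished batches) after a restart; the alternative would be to read the offset ranges per batch and store them myself. A minimal sketch of that, using the HasOffsetRanges API from spark-streaming-kafka 1.6 (saveOffsets is a hypothetical placeholder, not something in my code):

import org.apache.spark.streaming.kafka.{HasOffsetRanges, OffsetRange}

// Placeholder: persist the ranges to an external store (ZooKeeper, a DB, etc.)
def saveOffsets(ranges: Array[OffsetRange]): Unit = ???

stream.foreachRDD { rdd =>
  // The direct stream's RDDs expose the Kafka offset ranges they were built from
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  // ... process the batch ...
  saveOffsets(offsetRanges)
}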
spark.streaming.backpressure.enabled set to true? - Yuval Itzchakov
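The flag from the comment, together with the per-partition cap that can also limit batch size, is set on the SparkConf; a minimal sketch with illustrative values, not taken from the setup above:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  // Lets Spark Streaming adapt the ingestion rate to the processing speed
  .set("spark.streaming.backpressure.enabled", "true")
  // Hard per-partition ceiling in records/sec (illustrative value)
  .set("spark.streaming.kafka.maxRatePerPartition", "1000")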