I'm trying to create a stream in Spark which gets data from Kafka. When I check the record count in the RDD, it seems the count is not the same as what the Web UI shows.
I execute a function for every RDD in the DStream (the code is written in Python):
rdds = KafkaUtils.createStream(...)
rdds = rdds.repartition(1)
rdds.foreachRDD(doJob)
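For context, here is a minimal, self-contained sketch of how this setup fits together. The SparkContext/StreamingContext names, batch interval, ZooKeeper address, consumer group, and topic are all placeholders I made up for illustration, not values from my actual job:

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext(appName="kafka-count-example")  # hypothetical app name
ssc = StreamingContext(sc, 10)                    # 10-second batches (assumed)

# Receiver-based Kafka stream; all parameters below are placeholders.
stream = KafkaUtils.createStream(ssc,
                                 "localhost:2181",     # ZooKeeper quorum (assumed)
                                 "my-consumer-group",  # consumer group id (assumed)
                                 {"my-topic": 1})      # topic -> number of threads (assumed)

stream = stream.repartition(1)
stream.foreachRDD(doJob)

ssc.start()
ssc.awaitTermination()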
In the doJob function I have a loop and a counter:

def doJob(time, p_rdd):
    if not p_rdd.isEmpty():
        batch_count = 0
        ...
        ...
        rdd_collected = p_rdd.collect()
        for record in rdd_collected:
            ...
            ...
            batch_count = batch_count + 1
        log("Count: " + str(batch_count))
My expectation is that batch_count should match the value shown on the Web UI at http://webui.adress/my_app_id/Streaming -> Completed Batches section -> Input Size, but it does not. Where should I look for the RDD record count in the Web UI? Am I missing something?
Thanks.