0
votes

I'm trying to create a stream in Spark which gets data from Kafka. When I check the record count in the RDD, it seems the count is not the same as the one shown in the Web UI.

I execute a function for each RDD in the DStream (the code is in Python):

rdds = KafkaUtils.createStream(...)
rdds = rdds.repartition(1)
rdds.foreachRDD(doJob)

In the doJob function I have a loop and a counter:

def doJob(time, p_rdd):
    # isEmpty is a method and must be called with parentheses;
    # a bare p_rdd.isEmpty is a truthy method object, so
    # `not p_rdd.isEmpty` is always False and the batch is skipped.
    if not p_rdd.isEmpty():
        batch_count = 0
        ...
        ...
        rdd_collected = p_rdd.collect()
        for record in rdd_collected:
            ...
            ...
            batch_count = batch_count + 1
        log("Count: " + str(batch_count))
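One pitfall worth noting with `isEmpty`: writing it without parentheses tests the method object itself, which is always truthy, so the `if not ...` branch silently never runs. A minimal sketch in plain Python (no Spark needed; `FakeRDD` is a hypothetical stand-in for a pyspark RDD, for illustration only):

```python
class FakeRDD:
    """Hypothetical stand-in for a pyspark RDD, for illustration only."""
    def __init__(self, records):
        self._records = records

    def isEmpty(self):
        return len(self._records) == 0

empty = FakeRDD([])

# Missing parentheses: this is the bound method object, which is truthy,
# so `not empty.isEmpty` evaluates to False regardless of the contents.
print(bool(empty.isEmpty))   # True

# Correct call: actually asks whether there are any records.
print(empty.isEmpty())       # True
```

The same applies to `p_rdd.isEmpty()` inside `doJob`: with the parentheses missing, the counting code never executes for any batch.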

My expectation is that batch_count should be the same as the value on the http://webui.adress/my_app_id/Streaming page -> Completed Batches section -> Input Size column. But it seems they are not. Where in the web UI should I check the RDD record count? Am I missing something?

Thanks.


1 Answer

0
votes

I don't think that's right; they should be the same. I have used the same logic, and the count correctly matched the records I processed.

You can check them at http://webui.adress/my_app_id/Streaming under Completed Batches -> Records, in the same Streaming table in the UI.