Hi, I am running a streaming job on Spark 2.2 and maintaining two states using mapWithState.
The batch interval is 4 minutes, and I checkpoint the Kinesis DStream every 20 minutes.
I also repartition and cache the Kinesis DStream, since it is used in multiple paths of execution.
When I look at the Storage tab, I always see 63 RDDs (21 MapPartitionsRDD, 21 MapWithStateRDD for STATE 1, and 21 MapWithStateRDD for STATE 2).
How can I reduce the storage used? Should I checkpoint the mapWithState DStream as well?
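
For reference, a minimal sketch of the wiring described above. It is illustrative only, not the actual job: the key/value types, the two state functions (sumPerKey, maxPerKey), the wire helper, and numPartitions are all hypothetical placeholders. It assumes the StreamingContext is created elsewhere with a 4-minute batch interval and a checkpoint directory already set, and that the Kinesis records have already been mapped to key/value pairs.

```scala
import org.apache.spark.streaming.{Minutes, State, StateSpec}
import org.apache.spark.streaming.dstream.DStream

object StateWiringSketch {
  // Hypothetical state function standing in for STATE 1: running sum per key.
  val sumPerKey = (key: String, value: Option[Long], state: State[Long]) => {
    val total = state.getOption.getOrElse(0L) + value.getOrElse(0L)
    state.update(total)
    (key, total)
  }

  // Hypothetical state function standing in for STATE 2: running maximum per key.
  val maxPerKey = (key: String, value: Option[Long], state: State[Long]) => {
    val best = math.max(state.getOption.getOrElse(Long.MinValue),
                        value.getOrElse(Long.MinValue))
    state.update(best)
    (key, best)
  }

  // `kinesis` stands for the Kinesis DStream after parsing into pairs (assumed shape).
  def wire(kinesis: DStream[(String, Long)], numPartitions: Int) = {
    // Repartition and cache once, because both state paths read this stream.
    val shared = kinesis.repartition(numPartitions).cache()
    // Checkpoint the shared stream every 20 minutes (5 batches of 4 minutes).
    shared.checkpoint(Minutes(20))

    // Two independent states built on the same cached input.
    val state1 = shared.mapWithState(StateSpec.function(sumPerKey))
    val state2 = shared.mapWithState(StateSpec.function(maxPerKey))
    (state1, state2)
  }
}
```

In terms of this sketch, the second question is whether I should also call checkpoint with an explicit interval on state1 and state2.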