spark streaming state storage - memory usage

Question

Hi, I am running a streaming job on Spark 2.2 and maintaining a couple of states using mapWithState.

The batch interval is 4 minutes. I checkpoint the Kinesis DStream every 20 minutes.

I also repartition and cache the Kinesis DStream, since it's used in multiple paths of execution.

When I look at the Storage tab, I always see 63 RDDs (21 MapPartitionsRDD, 21 MapWithStateRDD for STATE 1, and 21 MapWithStateRDD for STATE 2).

How can I reduce storage? Should I checkpoint the mapWithState DStream?

dasman · Accepted Answer · 2017-11-21T02:39:36

After reading the source code of MapWithStateDStream, I found that the remember duration is what determines how many batches of RDDs will be "remembered", i.e. cached in memory.

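The default remember duration is 2 * checkpoint_duration.

The default checkpoint_duration is 10 * batch_duration.

With both defaults the state stream remembers about 20 batch intervals, which lines up with the 21 RDDs per stream you see in the Storage tab.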

So you can set the checkpoint duration on the mapWithState DStream yourself by calling its checkpoint method, for example with 5 * batch_duration, which should reduce your storage by about 50%.
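A minimal sketch of what that call might look like, assuming a 4-minute batch interval as in the question. The socket source, the checkpoint directory, the app name and the running-count state function are all placeholders I made up for illustration; substitute your Kinesis DStream and your own state logic.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Minutes, State, StateSpec, StreamingContext}

object MapWithStateCheckpointSketch {
  def main(args: Array[String]): Unit = {
    // Master URL is expected to come from spark-submit.
    val conf = new SparkConf().setAppName("mapWithState-checkpoint-sketch")
    val batchInterval = Minutes(4)
    val ssc = new StreamingContext(conf, batchInterval)

    // mapWithState requires a checkpoint directory; this path is a placeholder.
    ssc.checkpoint("/tmp/streaming-checkpoints")

    // Placeholder input; in the question this would be the Kinesis DStream.
    val events = ssc.socketTextStream("localhost", 9999).map(line => (line, 1L))

    // Hypothetical state function: keeps a running count per key.
    val updateCount = (key: String, value: Option[Long], state: State[Long]) => {
      val total = state.getOption.getOrElse(0L) + value.getOrElse(0L)
      state.update(total)
      (key, total)
    }

    val counts = events.mapWithState(StateSpec.function(updateCount))

    // Override the default (10 * batch interval) with 5 * batch interval = 20 minutes.
    // The remember duration is derived from this, so fewer state RDDs stay cached.
    counts.checkpoint(batchInterval * 5)

    counts.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
```

The checkpoint call has to happen before ssc.start(). A shorter interval means fewer cached RDDs but more frequent checkpoint writes, so it is worth checking batch processing times after changing it.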
