
In my Spark application, I am reading a few Hive tables into Spark RDDs and then performing some transformations on those RDDs. To avoid recomputation, I cached those RDDs using the rdd.cache() or rdd.persist() and rdd.checkpoint() methods.

As per the Spark documentation and online references, I was of the opinion that checkpointing is costlier than caching: caching keeps the RDD lineage while checkpointing breaks it, but checkpointing has to write to and read from HDFS.
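For context, this is a minimal sketch of the two approaches I am comparing (the table name and checkpoint path are placeholders, and I am assuming a SparkSession built with Hive support):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

val spark = SparkSession.builder()
  .appName("cache-vs-checkpoint")
  .enableHiveSupport()
  .getOrCreate()
val sc = spark.sparkContext

// checkpoint() fails at runtime unless a checkpoint directory is set first.
sc.setCheckpointDir("hdfs:///tmp/rdd-checkpoints")

val rdd = spark.sql("SELECT * FROM my_db.my_table").rdd

// Option 1: keep partitions in executor memory; the lineage is retained,
// so lost partitions can be recomputed from it.
rdd.persist(StorageLevel.MEMORY_ONLY) // equivalent to rdd.cache()

// Option 2: write partitions to the checkpoint directory; the lineage
// is truncated once the files are written.
rdd.checkpoint()

// Both are lazy: nothing happens until the first action. Checkpointing
// runs the lineage a second time to write the files unless the RDD
// was also persisted beforehand.
rdd.count()
```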

The strange thing I observed in my case is that the checkpointing stage is faster (nearly 2x) than caching/persisting (memory only). I ran it multiple times and the results were still similar.

I am not able to understand why this is happening. Any help would be appreciated.


1 Answer


I have run similar benchmarks lately and I am experiencing the same thing: checkpoint() is faster despite the extra I/O. My explanation is that keeping the whole lineage around is a costly operation.

I ran benchmarks on 1, 10, 100, 1,000, 10,000, 100,000, 1,000,000, 2,000,000, 10 million, and more records, and checkpoint was always faster. The lineage was pretty simple (record filtering, then a couple of aggregations). Storage was local NVMe drives (not block storage over the network). I guess it really depends on a lot of criteria.
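To give an idea of the shape of the benchmark, here is an illustrative skeleton, not my exact harness; the record count and the timing helper are placeholders:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("cache-vs-checkpoint-bench")
val sc = new SparkContext(conf)
sc.setCheckpointDir("file:///tmp/checkpoints") // local NVMe in my runs

// Simple wall-clock timer around a block of actions.
def time[T](label: String)(body: => T): T = {
  val start = System.nanoTime()
  val result = body
  println(f"$label%s: ${(System.nanoTime() - start) / 1e6}%.1f ms")
  result
}

val records = sc.parallelize(1L to 10000000L)

time("cache") {
  val rdd = records.filter(_ % 3 == 0).cache()
  rdd.count() // materializes the cache
  rdd.sum()   // served from memory, lineage still attached
  rdd.max()
  rdd.unpersist()
}

time("checkpoint") {
  val rdd = records.filter(_ % 3 == 0) // fresh RDD, no job run on it yet
  rdd.checkpoint()
  rdd.count() // materializes and writes the checkpoint files
  rdd.sum()   // read back from the checkpoint, lineage truncated
  rdd.max()
}
```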