2 votes

I have an RDD that is not being cached. I called the default cache() on it and used count() to force an action.

The is_cached attribute returns True, but in the Spark UI I can't see the RDD in the Storage tab (and calling count() again takes exactly as long as the first time).

toDebugString returns:

(4) CoalescedRDD[8] at coalesce at NativeMethodAccessorImpl.java:-2 [Memory Serialized 1x Replicated]
 |  MapPartitionsRDD[7] at textFile at NativeMethodAccessorImpl.java:-2 [Memory Serialized 1x Replicated]
 |  s3://... HadoopRDD[6] at textFile at NativeMethodAccessorImpl.java:-2 [Memory Serialized 1x Replicated]

And StorageLevel:

StorageLevel(False, True, False, False, 1)
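
For reference, those fields are (useDisk, useMemory, useOffHeap, deserialized, replication), which matches what PySpark uses for cache() in recent versions; Python data is always stored serialized, hence the "Memory Serialized" in the debug string. A quick check, as a sketch:

from pyspark import StorageLevel

# In recent PySpark versions cache() is shorthand for persist(StorageLevel.MEMORY_ONLY);
# the fields are (useDisk, useMemory, useOffHeap, deserialized, replication),
# and in Python the data is always stored serialized.
print(StorageLevel.MEMORY_ONLY)  # StorageLevel(False, True, False, False, 1)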

The input data is 64 MB and I have 2 executors with 500 MB of memory remaining on each. Other RDDs are cached just fine.

Code:
(Before coalescing, the number of partitions is 5943.)

RDD = sc.textFile('s3://...', use_unicode=False).coalesce(4)
RDD.cache()
RDD.count()
Comment: We need a minimal reproducible code sample – Justin Pihony
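A minimal runnable sketch of the same steps, assuming a local SparkContext and a stand-in local file for the elided S3 path (both are assumptions, not the original setup):

from pyspark import SparkContext

sc = SparkContext("local[2]", "cache-repro")  # hypothetical local setup
# hypothetical stand-in path for the elided s3://... input
rdd = sc.textFile("/tmp/sample.txt", use_unicode=False).coalesce(4)
rdd.cache()
rdd.count()                   # first action: materializes the cache
print(rdd.is_cached)          # True
print(rdd.getStorageLevel())  # StorageLevel(False, True, False, False, 1)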

1 Answer

0 votes

You can see cached memory in the Executors tab of the Spark UI. As for memory and disk taking the same time: a single operation on an RDD/DataFrame will take the same time in both cases, because it still has to do the I/O either way.
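
A simple way to check whether the cache actually took effect is to time a second action: the first count() pays the read I/O, and the second should be noticeably faster once the data is in memory. A sketch, assuming the same hypothetical local setup and stand-in path as above:

import time
from pyspark import SparkContext

sc = SparkContext("local[2]", "cache-timing")  # hypothetical local setup
# hypothetical stand-in path for the elided s3://... input
rdd = sc.textFile("/tmp/sample.txt", use_unicode=False).coalesce(4).cache()

t0 = time.time()
rdd.count()                        # first action: pays the read I/O and fills the cache
first = time.time() - t0

t0 = time.time()
rdd.count()                        # second action: served from memory if caching worked
second = time.time() - t0

print("first: %.2fs, cached: %.2fs" % (first, second))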
