I have one RDD that is not being cached. I called the default cache() on it, then count() to force an action.
is_cached returns True, but the RDD never appears in the Storage tab of the Spark UI, and a second count() takes exactly as long as the first.
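For completeness, this is roughly the check I am running (same pipeline as in the Code section below; the S3 path is elided, and sc is the shell's SparkContext):

import time

RDD = sc.textFile('s3://...', use_unicode=False).coalesce(4)
RDD.cache()                      # default storage level

start = time.time()
RDD.count()                      # first action, should populate the cache
print('first count: %.1f s' % (time.time() - start))

print(RDD.is_cached)             # prints True
print(RDD.getStorageLevel())     # StorageLevel(False, True, False, False, 1)

start = time.time()
RDD.count()                      # expected to hit the cache...
print('second count: %.1f s' % (time.time() - start))  # ...but takes just as long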
toDebugString returns:
(4) RDD CoalescedRDD[8] at coalesce at NativeMethodAccessorImpl.java:-2 [Memory Serialized 1x Replicated]
 | MapPartitionsRDD[7] at textFile at NativeMethodAccessorImpl.java:-2 [Memory Serialized 1x Replicated]
 | s3://... HadoopRDD[6] at textFile at NativeMethodAccessorImpl.java:-2 [Memory Serialized 1x Replicated]
And the RDD's StorageLevel is:
StorageLevel(False, True, False, False, 1)
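If I decode the constructor flags correctly (useDisk, useMemory, useOffHeap, deserialized, replication), that is a serialized, memory-only level with a single copy, matching the [Memory Serialized 1x Replicated] annotations above:

lvl = RDD.getStorageLevel()
# Flag order: useDisk, useMemory, useOffHeap, deserialized, replication
print(lvl.useDisk, lvl.useMemory, lvl.useOffHeap, lvl.deserialized, lvl.replication)
# -> False True False False 1: memory only, serialized, no replication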
The input data is 64 MB, and I have 2 executors with roughly 500 MB of storage memory free on each, so the dataset should fit comfortably. Other RDDs in the same application are cached just fine.
Code:
# before coalescing, numPartitions is 5943
RDD = sc.textFile('s3://...', use_unicode=False).coalesce(4)
RDD.cache()   # default storage level
RDD.count()   # action to materialize the RDD and populate the cache
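For what it's worth, here is a way to cross-check the cache state outside the UI by going through the JVM SparkContext; sc._jsc is an internal, underscore-prefixed handle, so this is a version-specific sketch rather than a stable API:

# Dump per-RDD storage info straight from the JVM SparkContext
for info in sc._jsc.sc().getRDDStorageInfo():
    print('%s: %s cached partitions, %s bytes in memory'
          % (info.name(), info.numCachedPartitions(), info.memSize()))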
