I have a few questions about Spark RDDs. Can someone enlighten me, please?
I can see that RDDs are distributed across nodes. Does that mean the distributed RDD is cached in the memory of each node, or does the RDD data reside on the HDFS disks? Or does the RDD data only get cached in memory when an application runs?
My understanding is that when I create an RDD from a file stored in HDFS blocks, the RDD reads the data from those blocks (a disk I/O operation) the first time and then caches it persistently. So it has to read the data from disk at least once. Is that true?
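To make this concrete, here is a minimal Scala sketch of what I have in mind (the HDFS path is just a placeholder):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CacheExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("CacheExample"))

    // Lazily defined; nothing is read from HDFS yet.
    val lines = sc.textFile("hdfs:///some/path/input.txt")

    // Mark the RDD for caching; still no I/O at this point.
    lines.cache()

    // First action: reads the HDFS blocks from disk, then keeps
    // the partitions in executor memory.
    println(lines.count())

    // Second action: served from the in-memory cache (no disk read),
    // as long as the partitions still fit in memory.
    println(lines.count())

    sc.stop()
  }
}
```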
Is there any way I can cache external data directly into an RDD, instead of first storing the data in HDFS and then loading it into the RDD from HDFS blocks? My concern is that storing the data in HDFS first and then loading it into memory will add latency.
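For this last question, I was imagining something like the sketch below, which fetches the data on the driver and distributes it with `sc.parallelize` instead of going through HDFS (`fetchFromExternalSystem` is a hypothetical stand-in for the real source):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object DirectLoadExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("DirectLoadExample"))

    // Fetch the external data on the driver (hypothetical helper;
    // replace with the actual source: REST call, JDBC query, etc.).
    val externalRecords: Seq[String] = fetchFromExternalSystem()

    // Distribute the driver-side collection across the cluster as an RDD,
    // skipping the write-to-HDFS step entirely.
    val rdd = sc.parallelize(externalRecords)
    rdd.cache()
    println(rdd.count())

    sc.stop()
  }

  // Placeholder for whatever external system the data comes from.
  def fetchFromExternalSystem(): Seq[String] = Seq("record1", "record2")
}
```

I realize `parallelize` only works if the data fits in the driver's memory, so maybe there is a better approach for larger external datasets?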