
I have a few questions about Spark RDDs. Can someone enlighten me, please?

  1. I can see that RDDs are distributed across nodes. Does that mean the distributed RDD is cached in the memory of each node, or does the RDD data reside on HDFS disk? Or does the RDD data get cached in memory only when an application runs?

    1. My understanding is that when I create an RDD from a file stored in HDFS blocks, the RDD reads the data from the blocks the first time (an I/O operation) and then caches it persistently. So it has to read the data from disk at least once. Is that true?

    2. Is there any way to cache external data directly into an RDD instead of first storing it in HDFS and then loading it from the HDFS blocks? My concern is that storing data in HDFS first and then loading it into memory adds latency.


1 Answer

  1. RDDs are data structures similar to arrays and lists. When you create an RDD (for example, by loading a file), in local mode the data lives on your machine's local disk; if you are using HDFS, it lives in HDFS. Either way, remember: it stays ON DISK. If you want to keep it in memory (RAM), you can use the cache() function; see the sketch after the code below.
  2. Hopefully that also answers your second question: the first read always comes from disk, and only cached data is served from memory afterwards.
  3. Yes, you can load the data directly from your laptop without putting it into HDFS first:

// Read from the local filesystem instead of HDFS
val newfile = sc.textFile("file:///home/user/sample.txt")

Specify the full file path with its scheme. By default, Spark resolves paths against HDFS; you can change that with the file:// scheme, as in the line above.
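For contrast, here is a minimal sketch of the three path forms in the Spark shell (the paths themselves are hypothetical):

val fromDefault = sc.textFile("/data/sample.txt")             // resolved against the default filesystem (HDFS on a cluster)
val fromHdfs    = sc.textFile("hdfs:///data/sample.txt")      // explicit HDFS path
val fromLocal   = sc.textFile("file:///home/user/sample.txt") // explicit local filesystem path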

Don't forget to include the three slashes:

file:///
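And a minimal sketch of the caching behaviour from point 1, again in the Spark shell (the path is hypothetical): RDDs are lazy, the first action reads from disk, and after cache() later actions are served from memory.

val lines = sc.textFile("hdfs:///user/me/sample.txt") // lazy: nothing is read yet
lines.cache()  // mark the RDD for in-memory storage
lines.count()  // first action: reads the blocks from disk, then keeps the partitions in RAM
lines.count()  // second action: served from the cached partitions, no disk I/O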