I am having a hard time understanding the difference between the RDD partitions and the HDFS Input Splits. So essentially when you submit a Spark application:
When the Spark application wants to read from HDFS, that file on HDFS will have input splits (of let's say 64 mb each and each of these input splits are present on different data nodes).
Now let's say the Spark application wants to load that file from HDFS using the (sc.textFile(PATH_IN_HDFS)
). And the file is about 256 mb and has 4 input splits where 2 of the splits are on data node 1 and the other 2 splits are on data node 2.
Now when Spark loads this 256 mb into it's RDD abstraction, will it load each of the input splits (64mb) into 4 separate RDD's (where you will have 2 RDD's with 64mb of data in data node 1 and the other two RDD's of 64mb of data on data node 2). Or will the RDD further partition those input splits on Hadoop? Also how will these partitions be redistributed then? I do not understand if there is a correlation between the RDD partitions and the HDFS input splits?