Please help me understand the difference between HDFS data blocks and RDDs in Spark. HDFS distributes a dataset across multiple nodes in a cluster as fixed-size blocks, and those blocks are replicated multiple times and stored. RDDs are created as parallelized collections. Are the elements of a parallelized collection distributed across nodes, or are they kept in memory for processing? Is there any relation to HDFS data blocks?
1 Answer
> Is there any relation to HDFS' data blocks?
In general, no. They address different issues:
- RDDs are about distributing computation and handling computation failures.
- HDFS is about distributing storage and handling storage failures.
Distribution is the common denominator, but that is where the similarity ends; the failure-handling strategies are obviously different (DAG recomputation and replication, respectively).
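To make the first point concrete, here is a minimal Scala sketch (the app name and `local[4]` master are placeholders, not a real cluster setup): a parallelized collection is split into partitions that are distributed across executors, with no connection to HDFS blocks, and its elements are materialized in memory only when an action runs.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ParallelizeDemo {
  def main(args: Array[String]): Unit = {
    // Placeholder config: a local 4-thread "cluster" for illustration.
    val sc = new SparkContext(
      new SparkConf().setAppName("parallelize-demo").setMaster("local[4]"))

    // The collection is split into 8 partitions spread across executors;
    // nothing here reads from or writes to HDFS.
    val rdd = sc.parallelize(1 to 1000000, numSlices = 8)
    println(rdd.getNumPartitions) // 8

    // RDDs are lazy: partitions are computed (and, after cache(),
    // kept in executor memory) only when an action runs.
    rdd.cache()
    println(rdd.sum()) // this action triggers the distributed computation

    sc.stop()
  }
}
```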
Spark can use Hadoop InputFormats and read data from HDFS. In that case there will be a relationship between HDFS blocks and Spark splits. However, Spark doesn't require HDFS, and many components of the newer APIs don't use Hadoop InputFormats anymore.
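As a hedged illustration of that relationship (the namenode address and file path below are made-up placeholders, and `sc` is the SparkContext from the sketch above): `sc.textFile` uses Hadoop's `TextInputFormat`, and by default each input split, and therefore each RDD partition, corresponds to roughly one HDFS block.

```scala
// Sketch only: the namenode address and path are hypothetical.
val lines = sc.textFile("hdfs://namenode:8020/data/events.log")

// With TextInputFormat, one input split is roughly one HDFS block, so a
// 1 GB file stored in 128 MB blocks typically yields about 8 partitions.
println(lines.getNumPartitions)
```

By contrast, the DataFrame/Dataset readers in newer Spark APIs plan their own file splits rather than going through Hadoop InputFormats.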