Please help me understand the difference between HDFS data blocks and RDDs in Spark. HDFS distributes a dataset across multiple nodes in a cluster as fixed-size blocks, and those blocks are replicated multiple times and stored. RDDs are created as parallelized collections. Are the elements of a parallelized collection distributed across nodes, or are they kept in memory for processing? Is there any relation to HDFS data blocks?
1 Answer
> Is there any relation to HDFS' data blocks?
In general, no. They address different issues:
- RDDs are about distributing computation and handling computation failures.
- HDFS is about distributing storage and handling storage failures.
Distribution is the common denominator, but that is where the similarity ends; the failure-handling strategies are obviously different (DAG recomputation and replication, respectively).
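To make the first point concrete, here is a minimal Scala sketch (the app name and `local[4]` master are placeholders, not a real cluster setup): a parallelized collection is split into partitions that are distributed across executors, with no connection to HDFS blocks, and its elements are materialized in memory only when an action runs.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ParallelizeDemo {
  def main(args: Array[String]): Unit = {
    // Placeholder config: a local 4-thread "cluster" for illustration.
    val sc = new SparkContext(
      new SparkConf().setAppName("parallelize-demo").setMaster("local[4]"))

    // The collection is split into 8 partitions spread across executors;
    // nothing here reads from or writes to HDFS.
    val rdd = sc.parallelize(1 to 1000000, numSlices = 8)
    println(rdd.getNumPartitions) // 8

    // RDDs are lazy: partitions are computed (and, after cache(),
    // kept in executor memory) only when an action runs.
    rdd.cache()
    println(rdd.sum()) // this action triggers the distributed computation

    sc.stop()
  }
}
```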
Spark can use Hadoop InputFormats and read data from HDFS. In that case there will be a relationship between HDFS blocks and Spark splits. However, Spark doesn't require HDFS, and many components of the newer APIs don't use Hadoop InputFormats anymore.
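As a hedged illustration of that relationship (the namenode address and file path below are made-up placeholders, and `sc` is the SparkContext from the sketch above): `sc.textFile` uses Hadoop's `TextInputFormat`, and by default each input split, and therefore each RDD partition, corresponds to roughly one HDFS block.

```scala
// Sketch only: the namenode address and path are hypothetical.
val lines = sc.textFile("hdfs://namenode:8020/data/events.log")

// With TextInputFormat, one input split is roughly one HDFS block, so a
// 1 GB file stored in 128 MB blocks typically yields about 8 partitions.
println(lines.getNumPartitions)
```

By contrast, the DataFrame/Dataset readers in newer Spark APIs plan their own file splits rather than going through Hadoop InputFormats.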