
I want to understand the following things about the RDD concept in Spark.

  1. Is an RDD just a concept of copying the required data from HDFS storage into some node's RAM to speed up execution?

  2. If a file is split across the cluster, does the RDD bring all the required data for that single file from the other nodes?

  3. If point 2 is correct, how does Spark decide on which node's JVM to execute? How does data locality work here?


1 Answer


The RDD is at the core of Apache Spark: it is the data abstraction for a distributed collection of objects. An RDD is an immutable, distributed collection of elements of your data that can be stored in memory or on disk across a cluster of machines. The data is partitioned across the machines in your cluster and can be operated on in parallel through a low-level API that offers transformations and actions. RDDs are fault tolerant because they track data lineage information, which lets Spark rebuild lost data automatically on failure. Ref: https://databricks.com/blog/2016/06/22/apache-spark-key-terms-explained.html
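To make the transformation/action distinction concrete, here is a minimal Scala sketch (the HDFS path and application name are placeholders): transformations only record the RDD's lineage, the action at the end triggers the distributed computation, and cache() asks Spark to keep the computed partitions in memory.

```scala
import org.apache.spark.sql.SparkSession

object RddBasics {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("RddBasics")          // placeholder application name
      .getOrCreate()
    val sc = spark.sparkContext

    // Load a text file from HDFS as an RDD of lines (placeholder path).
    val lines = sc.textFile("hdfs:///data/sample.txt")

    // Transformations are lazy: they only extend the lineage graph.
    val wordCounts = lines
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    // Ask Spark to keep the computed partitions in memory after the first action;
    // if a partition is lost, it is rebuilt from the lineage recorded above.
    wordCounts.cache()

    // Actions trigger the actual distributed computation on the cluster.
    println(s"Distinct words: ${wordCounts.count()}")

    spark.stop()
  }
}
```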

If a file is split across the cluster upon loading, the calculations are done on the nodes where the RDD partitions reside. That is, the computation is performed where the data resides (as far as possible) to minimize the need for shuffles. For more information on Spark and data locality, please refer to: https://databricks.gitbooks.io/databricks-spark-knowledge-base/content/performance_optimization/data_locality.html.
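As an illustration of how data locality is exposed, the Scala sketch below (again with a placeholder HDFS path) loads a file and prints, for each partition, the hosts that hold the underlying HDFS block; this preferred-location information is what the scheduler consults when deciding on which node's JVM to run a task.

```scala
import org.apache.spark.sql.SparkSession

object LocalityInspection {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("LocalityInspection") // placeholder application name
      .getOrCreate()
    val sc = spark.sparkContext

    // Each HDFS block of the file typically becomes one RDD partition (placeholder path).
    val rdd = sc.textFile("hdfs:///data/large_file.txt")
    println(s"Number of partitions: ${rdd.getNumPartitions}")

    // For every partition, Spark records the hosts holding the underlying block;
    // the scheduler tries to run the task on one of those hosts (NODE_LOCAL)
    // rather than shipping the data to some other node.
    rdd.partitions.foreach { p =>
      val hosts = rdd.preferredLocations(p)
      println(s"Partition ${p.index} prefers hosts: ${hosts.mkString(", ")}")
    }

    spark.stop()
  }
}
```

How long the scheduler waits for a local slot before falling back to a less local one can be tuned with the spark.locality.wait setting.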

Note, for more information about Spark research, please refer to: http://spark.apache.org/research.html; more specifically, see Zaharia et al.'s paper, Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing (http://people.csail.mit.edu/matei/papers/2012/nsdi_spark.pdf).