The RDD is at the core of Apache Spark: it is the data abstraction for a distributed collection of objects. RDDs are immutable, distributed collections of elements of your data that can be stored in memory or on disk across a cluster of machines. The data is partitioned across the machines in your cluster and can be operated on in parallel with a low-level API that offers transformations and actions. RDDs are fault tolerant because they track data lineage information, which allows lost data to be rebuilt automatically on failure. Ref: https://databricks.com/blog/2016/06/22/apache-spark-key-terms-explained.html
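The following is a minimal PySpark sketch of these ideas, assuming a local Spark installation; the application name rdd-sketch and the sample data are illustrative only. Transformations such as map and filter are lazy and only extend the lineage, while actions such as collect and count trigger the distributed computation:

from pyspark import SparkContext

sc = SparkContext("local[*]", "rdd-sketch")

# Create an immutable, partitioned collection from a local Python sequence.
numbers = sc.parallelize(range(10), numSlices=4)

# Transformations (lazy): they only build up the lineage; nothing runs yet.
squares = numbers.map(lambda x: x * x)
evens = squares.filter(lambda x: x % 2 == 0)

# Actions: trigger the computation across the partitions and return results.
print(evens.collect())   # [0, 4, 16, 36, 64]
print(evens.count())     # 5

sc.stop()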
If a file is split across the cluster upon loading, the calculations are done on the nodes where the RDD partitions reside. That is, the compute is performed where the data resides (as far as possible) to minimize the need for shuffles. For more information concerning Spark and data locality, please refer to: https://databricks.gitbooks.io/databricks-spark-knowledge-base/content/performance_optimization/data_locality.html.
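As a short sketch of how partitioning surfaces in the RDD API (using an in-memory collection so it runs anywhere; the same applies to files loaded with sc.textFile, where Spark schedules each task close to the node holding that split):

from pyspark import SparkContext

sc = SparkContext("local[*]", "locality-sketch")

rdd = sc.parallelize(range(8), numSlices=4)
print(rdd.getNumPartitions())    # 4 partitions, processed in parallel

# glom() groups the elements of each partition into a list (small data only),
# making it easy to see how the data was split up.
print(rdd.glom().collect())      # [[0, 1], [2, 3], [4, 5], [6, 7]]

sc.stop()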
Note, for more information about Spark research, please refer to: http://spark.apache.org/research.html; more specifically, please refer to Zaharia et al.'s paper, Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing (http://people.csail.mit.edu/matei/papers/2012/nsdi_spark.pdf).