What is the advantage of using spark with HDFS as file storage system and YARN as resource manager?

Question

I am trying to understand whether Spark is an alternative to the vanilla MapReduce approach for analyzing big data. Since Spark keeps its operations on the data in memory, does it still take advantage of the distributed storage of HDFS when HDFS is used as its storage system? For instance, suppose I have a 100 GB CSV file stored in HDFS and I want to analyze it. If I load it from HDFS into Spark, will Spark load the complete data into memory to do the transformations, or will it use the distributed environment that HDFS provides for storage, the way MapReduce programs written for Hadoop do? If not, what is the advantage of using Spark on top of HDFS?

PS: I know Spark spills to disk when RAM overflows, but does this spill occur for the data on each node of the cluster (say 5 GB per node) or for the complete data (100 GB)?

Hi. This is a broad question, and the title and text do not fit together well. E.g., I do not see YARN come back in the text. — thebluephantom
YARN is the resource manager for handling Spark jobs throughout the entire question. — Akash Basudevan
Yeah, but I am trying to understand the advantage of using it over HDFS and YARN. — Akash Basudevan
I cannot glean that from the text. Only trying to help on clarity. — thebluephantom

OneCricketeer · Accepted Answer · 2019-01-27T19:14:59

Spark jobs can be configured to spill to the local executor disk if there is not enough memory to read your files. Alternatively, you can enable HDFS snapshots and caching between Spark stages.
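To make the 100 GB example concrete: Spark does not pull the whole file into memory at once. It reads a splittable file as many partitions (roughly one per HDFS block) and processes them as tasks spread across the executors, so any spill happens per executor on the partitions it is working on, not for the dataset as a whole. Here is a rough back-of-the-envelope sketch in plain Python. The 128 MiB block size is the usual HDFS default, while the node and core counts are hypothetical values chosen only for illustration. The figures are raw input bytes and ignore the in-memory overhead of deserialized rows.

```python
import math

MIB = 1024 ** 2
GIB = 1024 ** 3


def plan_read(file_bytes, block_bytes=128 * MIB, nodes=20, cores_per_node=4):
    """Estimate how a splittable file is divided into tasks.

    nodes and cores_per_node are hypothetical cluster sizes, not
    values taken from the question.
    """
    # Roughly one input partition (and one task) per HDFS block.
    partitions = math.ceil(file_bytes / block_bytes)
    # One task runs per core, so this many partitions are read at once.
    concurrent_tasks = nodes * cores_per_node
    # Number of rounds of tasks needed to get through the whole file.
    waves = math.ceil(partitions / concurrent_tasks)
    # Raw input being read on a single node at any moment.
    in_flight_bytes_per_node = cores_per_node * block_bytes
    return {
        "partitions": partitions,
        "concurrent_tasks": concurrent_tasks,
        "waves": waves,
        "in_flight_bytes_per_node": in_flight_bytes_per_node,
    }


plan = plan_read(100 * GIB)
print(plan)
```

Under these assumptions each node only has a few hundred MiB of raw input in flight at a time, which is why a cluster with far less than 100 GB of total RAM can still process the file.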

You mention CSV, which is just a bad format to keep in Hadoop in general. If you have 100 GB of CSV, you could just as easily have less than half that if it were written as Parquet or ORC...
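A one-off conversion from CSV to Parquet might look like the following PySpark sketch. The function name, application name, and any paths passed to it are hypothetical, and it assumes the CSV has a header row. How much smaller the output ends up depends on the data, so treat the "less than half" figure as a rule of thumb.

```python
def csv_to_parquet(src: str, dst: str) -> None:
    """Rewrite a CSV dataset as Parquet (hypothetical helper)."""
    # Imported lazily so this module can be loaded without Spark installed.
    from pyspark.sql import SparkSession

    # The master (e.g. YARN) is expected to come from spark-submit.
    spark = SparkSession.builder.appName("csv-to-parquet").getOrCreate()
    try:
        # inferSchema costs an extra pass over the data; an explicit
        # schema would avoid that for a file this large.
        df = spark.read.csv(src, header=True, inferSchema=True)
        # Parquet is columnar and compressed, so it is usually smaller
        # and faster to scan than the equivalent CSV.
        df.write.mode("overwrite").parquet(dst)
    finally:
        spark.stop()
```

It would be called with HDFS paths of your choosing, for example `csv_to_parquet("hdfs:///data/input.csv", "hdfs:///data/input_parquet")`, where both paths are placeholders.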

At the end of the day, you need some processing engine and some storage layer. For example, Spark on Mesos or Kubernetes might work just as well as on YARN, but those are separate systems and are not bundled and tied together as nicely as HDFS and YARN. Plus, like MapReduce, when using YARN you are moving the execution to the NodeManagers on the DataNodes rather than pulling data over the network, which you would be doing with other Spark execution modes. The NameNode and ResourceManager coordinate this communication about where data is stored and processed.

If you are convinced that MapReduceV2 can be better than Spark, I would encourage you to look at Tez instead.
