11 votes

The Spark FAQ specifically says you don't have to use HDFS:

Do I need Hadoop to run Spark?

No, but if you run on a cluster, you will need some form of shared file system (for example, NFS mounted at the same path on each node). If you have this type of filesystem, you can just deploy Spark in standalone mode.

So, what are the advantages/disadvantages of using Apache Spark with HDFS vs. other distributed file systems (such as NFS) if I'm not planning to use Hadoop MapReduce? Will I be missing an important feature if I use NFS instead of HDFS for node storage (checkpoints, shuffle spill, etc.)?
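For reference, a minimal sketch of the standalone-plus-shared-filesystem setup the FAQ describes; the master URL and the NFS mount path are hypothetical, and the mount must exist at the same path on every node:

```scala
import org.apache.spark.sql.SparkSession

object NfsReadExample {
  def main(args: Array[String]): Unit = {
    // Standalone cluster manager, no Hadoop/YARN involved
    val spark = SparkSession.builder()
      .appName("nfs-read-example")
      .master("spark://master-host:7077")
      .getOrCreate()

    // file:// works here because every worker sees the same NFS path
    val df = spark.read.text("file:///mnt/shared/input/data.txt")
    println(df.count())

    spark.stop()
  }
}
```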

1 – I have deleted my earlier answer. This SE question may be useful for a comparison of HDFS vs. other alternatives: stackoverflow.com/questions/32669187/… – Ravindra babu

1 Answer

15 votes

After a few months and some experience with both NFS and HDFS, I can now answer my own question:

NFS lets you view and change files on a remote machine as if they were stored on a local machine. HDFS can do that too, but it is distributed (as opposed to NFS), and it is also fault-tolerant and scalable.
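From Spark's point of view the difference is mostly the path scheme: the same code can target an NFS mount or HDFS just by changing the URI. A small sketch, continuing with the `spark` session from above (hostnames, ports, and paths are hypothetical):

```scala
// NFS mount visible at the same local path on every node
val fromNfs  = spark.read.parquet("file:///mnt/shared/events/")

// HDFS, addressed through the NameNode
val fromHdfs = spark.read.parquet("hdfs://namenode:8020/data/events/")

// Writes work the same way; HDFS additionally replicates and
// distributes the blocks across DataNodes
fromNfs.write.mode("overwrite").parquet("hdfs://namenode:8020/data/events_copy/")
```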

The advantage of using NFS is the simplicity of setup, so I would probably use it for QA environments or small clusters. The advantage of HDFS is of course its fault tolerance, but a bigger advantage, IMHO, is the ability to exploit data locality when HDFS is co-located with the Spark nodes, which gives the best performance for checkpoints, shuffle spill, etc.
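To make the checkpointing point concrete, here is a minimal sketch of pointing Spark's checkpoint directory at HDFS (all paths are hypothetical); note that shuffle spill itself goes to the scratch directories configured via `spark.local.dir`, which you would typically put on fast local disks regardless of which shared file system you use:

```scala
// Checkpoints are written to the distributed, fault-tolerant store
spark.sparkContext.setCheckpointDir("hdfs://namenode:8020/spark/checkpoints")

val counts = spark.read.text("hdfs://namenode:8020/data/logs/")
  .rdd
  .flatMap(_.getString(0).split("\\s+"))
  .map(word => (word, 1L))
  .reduceByKey(_ + _)

counts.checkpoint()  // materialized to the HDFS checkpoint dir on next action
counts.count()
```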