0
votes

I have been trying to deploy a Spark multi-node cluster on three machines (master, slave1 and slave2). I have successfully deployed the Spark cluster, but I am confused about how to distribute my HDFS data over the slaves. Do I need to manually put data on my slave nodes, and how can I specify where to read data from when submitting an application from the client? I have searched multiple forums but haven't been able to figure out how to use HDFS with Spark without using Hadoop.

1 Answer

1
votes

tl;dr Store the files to be processed by a Spark application on Hadoop HDFS, and the Spark executors will be told how to access them.
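For example, here is a minimal sketch of reading an HDFS-backed file from Spark (the NameNode address master:8020 and the path /user/spark/input.txt are placeholders, not values from your cluster):

```scala
import org.apache.spark.sql.SparkSession

object HdfsReadExample {
  def main(args: Array[String]): Unit = {
    // The cluster manager is chosen via --master at spark-submit time;
    // the HDFS location is just a path that executors resolve via the NameNode.
    val spark = SparkSession.builder()
      .appName("hdfs-read-example")
      .getOrCreate()

    // NameNode host/port and the file path below are assumptions for illustration
    val lines = spark.read.textFile("hdfs://master:8020/user/spark/input.txt")
    println(s"Number of lines: ${lines.count()}")

    spark.stop()
  }
}
```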


From HDFS Users Guide:

This document is a starting point for users working with Hadoop Distributed File System (HDFS) either as a part of a Hadoop cluster or as a stand-alone general purpose distributed file system.

A HDFS cluster primarily consists of a NameNode that manages the file system metadata and DataNodes that store the actual data.

So, HDFS is simply a distributed file system that you can use to store files and read them from a distributed application, including a Spark application.
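You would normally load data into HDFS with the hdfs dfs -put command, but it can also be done programmatically through Hadoop's FileSystem API. A sketch, assuming the NameNode address and both paths are placeholders:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import java.net.URI

object HdfsPutExample {
  def main(args: Array[String]): Unit = {
    val conf = new Configuration()
    // Point the client at the NameNode (hostname and port are assumptions)
    val fs = FileSystem.get(new URI("hdfs://master:8020"), conf)

    // Copy a local file into HDFS; the DataNodes store the actual blocks,
    // so there is no need to place data on individual slave nodes by hand.
    fs.copyFromLocalFile(new Path("/tmp/input.txt"), new Path("/user/spark/input.txt"))
    fs.close()
  }
}
```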


To my great surprise, it's only in HDFS Architecture that you can find an example of an HDFS URI, e.g. hdfs://localhost:8020/user/hadoop/delete/test1, which is an HDFS URL for the resource delete/test1 that belongs to the user hadoop.

A URL that starts with hdfs:// points at an HDFS cluster which, in the above example, is managed by a NameNode at localhost:8020.

That means that HDFS does not require Hadoop YARN; the two are usually deployed together simply because they ship together and are easy to use together.


Do I need to manually put data on my slave nodes and how can I specify where to read data from when submitting an application from the client?

Spark supports Hadoop HDFS with or without Hadoop YARN. A cluster manager (aka master URL) is an orthogonal concern to HDFS.
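To illustrate that orthogonality, a Spark standalone master can be paired with HDFS-backed input; in this sketch the master URL, NameNode address and path are all hypothetical values:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("standalone-with-hdfs")
  // Spark standalone cluster manager -- no YARN involved
  .master("spark://master:7077")
  .getOrCreate()

// Data still comes from HDFS, managed by the NameNode at master:8020
val df = spark.read.text("hdfs://master:8020/user/spark/logs")
println(df.count())

spark.stop()
```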

Wrapping it up, just use hdfs://hostname:port/path/to/directory to access files on HDFS.