I have been trying to deploy a Spark multi-node cluster on three machines (master, slave1 and slave2). I have successfully deployed the Spark cluster, but I am confused about how to distribute my HDFS data over the slaves. Do I need to manually put data on my slave nodes, and how can I specify where to read data from when submitting an application from the client? I have searched multiple forums but haven't been able to figure out how to use HDFS with Spark without using Hadoop.
1 Answer
tl;dr Store files to be processed by a Spark application on Hadoop HDFS and Spark executors will be told how to access them.
From HDFS Users Guide:
This document is a starting point for users working with Hadoop Distributed File System (HDFS) either as a part of a Hadoop cluster or as a stand-alone general purpose distributed file system.
A HDFS cluster primarily consists of a NameNode that manages the file system metadata and DataNodes that store the actual data.
So, HDFS is merely a file system that you can use to store files and use them in a distributed application, including a Spark application.
To my great surprise, it's only in HDFS Architecture where you can find an HDFS URI, e.g. hdfs://localhost:8020/user/hadoop/delete/test1, which is an HDFS URL to a resource delete/test1 that belongs to the user hadoop.
The URL that starts with hdfs points at an HDFS that, in the above example, is managed by a NameNode at localhost:8020.
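To make the parts of that URI concrete, here is a minimal sketch that splits the example URI with Python's standard urllib.parse. The variable names are mine, and nothing here talks to HDFS; it only shows which piece of the URI is the scheme, the NameNode address and the path.

```python
from urllib.parse import urlparse

# The example URI from the HDFS Architecture document.
uri = "hdfs://localhost:8020/user/hadoop/delete/test1"

parts = urlparse(uri)

scheme = parts.scheme            # "hdfs": tells the client this is an HDFS location
namenode_host = parts.hostname   # "localhost": host where the NameNode runs
namenode_port = parts.port       # 8020: port the NameNode listens on
path = parts.path                # "/user/hadoop/delete/test1": path inside HDFS

print(scheme, namenode_host, namenode_port, path)
```

Only the NameNode address appears in the URI. The DataNodes that hold the actual blocks are never named by the client.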
That means that HDFS does not require Hadoop YARN; the two are usually used together because they ship together and are simple to use together.
Do I need to manually put data on my slave nodes and how can I specify where to read data from when submitting an application from the client?
Spark supports Hadoop HDFS with or without Hadoop YARN. A cluster manager (aka master URL) is an orthogonal concern to HDFS.
Wrapping it up, just use hdfs://hostname:port/path/to/directory to access files on HDFS.
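As a rough illustration of that advice, the sketch below builds such a URI and hands it to Spark. The helper names hdfs_uri and read_lines, the host name master, the port 8020 and the file path are all assumptions for the example, not values from the question. The Spark part needs pyspark and a reachable HDFS, so it is imported lazily and not called here.

```python
def hdfs_uri(host: str, port: int, path: str) -> str:
    """Build an hdfs:// URI from a NameNode host, port and absolute path."""
    if not path.startswith("/"):
        raise ValueError("path must be absolute, e.g. /user/hadoop/data")
    return f"hdfs://{host}:{port}{path}"


def read_lines(uri: str):
    """Read a text file from HDFS as a DataFrame with one row per line.

    Requires pyspark and a running HDFS; imported lazily so that the
    URI helper above can be used without Spark installed.
    """
    from pyspark.sql import SparkSession

    # No master is set here, so it can be supplied at submit time.
    spark = SparkSession.builder.appName("hdfs-read-sketch").getOrCreate()
    return spark.read.text(uri)


# Hypothetical values: replace with your NameNode address and file path.
uri = hdfs_uri("master", 8020, "/user/hadoop/input.txt")
print(uri)
```

The master URL is deliberately left out of the code so that it can be passed with spark-submit's --master option. That keeps the choice of cluster manager separate from where the data lives, which is the point made above.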