2 votes

Hello people, and happy new year ;)!

I am building a lambda architecture with Apache Spark, HDFS and Elasticsearch. The following picture shows what I am trying to do: [architecture diagram]

So far, I have written the source code in Java for my Spark Streaming and Spark applications. I read in the Spark documentation that Spark can run on a Mesos or YARN cluster. As indicated in the picture, I already have a Hadoop cluster. Is it possible to run my Spark Streaming and Spark applications within the same Hadoop cluster? If yes, is there any particular configuration to do (for instance the number of nodes, RAM...)? Or do I have to add a Hadoop cluster specially for Spark Streaming?

I hope my explanation is clear.

Yassir


2 Answers

1 vote

You need not build a separate cluster for running Spark Streaming.

Change the spark.master property to yarn-client or yarn-cluster in the conf/spark-defaults.conf file. When specified this way, any Spark application you submit will be handled by a YARN ApplicationMaster and executed by the NodeManagers.
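For example, a minimal sketch of the relevant line in conf/spark-defaults.conf, assuming the Hadoop/YARN client configuration of your existing cluster is available on the machine you submit from (the class and jar names below are only placeholders):

# conf/spark-defaults.conf
# run the driver inside the cluster; use yarn-client to keep the driver on the submitting machine
spark.master    yarn-cluster

# alternatively, pass the master on the command line instead of editing the file:
# spark-submit --master yarn-cluster --class com.example.StreamingJob my-streaming-app.jar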

Additionally, modify these memory and core properties so that what Spark requests fits within what YARN can allocate (see the example values after these lists).

In spark-defaults.conf

spark.executor.memory
spark.executor.cores
spark.executor.instances

In yarn-site.xml

yarn.nodemanager.resource.memory-mb
yarn.nodemanager.resource.cpu-vcores

Otherwise it could lead to either deadlock or poor resource utilization of the cluster.
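As a rough sketch, the executor requests should fit inside the NodeManager limits, with some headroom for the executor memory overhead and the OS. The numbers below are only illustrative assumptions for nodes with 16 GB of RAM and 8 cores, not recommendations:

# conf/spark-defaults.conf (illustrative values)
spark.executor.memory     4g
spark.executor.cores      2
spark.executor.instances  6

# yarn-site.xml (illustrative values)
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>14336</value>  <!-- leave ~2 GB per node for the OS and Hadoop daemons -->
</property>
<property>
  <name>yarn.nodemanager.resource.cpu-vcores</name>
  <value>8</value>
</property>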

Refer here for details on cluster resource management when running Spark on YARN.

1 vote

It is possible. You can submit your streaming and batch applications to the same YARN cluster. But sharing cluster resources between these two jobs could be a bit tricky (as per my understanding).
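One common way to keep the long-running streaming job and the batch job from starving each other is to submit them to separate YARN queues. A minimal sketch, assuming you have defined queues named streaming and batch in YARN's scheduler configuration (the class and jar names are placeholders):

# long-running streaming job on its own queue
spark-submit --master yarn-cluster --queue streaming \
  --class com.example.StreamingJob my-streaming-app.jar

# periodic batch job on another queue
spark-submit --master yarn-cluster --queue batch \
  --class com.example.BatchJob my-batch-app.jar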

So I would suggest you look at Spark Jobserver to submit your applications. Spark Jobserver makes your life easier when you want to maintain multiple Spark contexts. All the required configuration for both applications will be in one place.
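For a rough idea of how that looks, Spark Jobserver exposes a REST API for uploading jars and triggering jobs. A minimal sketch, assuming the server is running on its default port 8090 and using my-app.jar / com.example.MyJob as placeholder names:

# upload the application jar under an app name
curl --data-binary @my-app.jar localhost:8090/jars/my-app

# trigger a job from the uploaded jar
curl -d "" "localhost:8090/jobs?appName=my-app&classPath=com.example.MyJob"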