How to run Spark on Docker?

Question

Can’t run Apache Spark on Docker.

When I try to communicate from my driver to spark master I receive next error:

15/04/03 13:08:28 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources

I had the same issue when I tried to run Spark Job Server inside Docker. The problem here is, Spark uses random ports to communicate between master, worker and driver. Docker is a closed system by design and you need to expose specific ports through EXPOSE. When Spark cannot communicate, this error comes. — Yohan Liyanage

Paul Paul · Accepted Answer · 2015-04-22T07:54:57

This error sounds like the workers have not registered with the master.

This can be checked at the master's spark web stool http://<masterip>:8080

You could also simply use a different docker image, or compare docker images with one that works and see what is different.

I have dockerized a spark master and spark worker.

If you have a Linux machine sitting behind a NAT router, like a home firewall, that allocates addresses in the private 192.168.1.* network to the machines, this script will download a spark 1.3.1 master and a worker to run in separate docker containers with addresses 192.168.1.10 and .11 respectively. You may need to tweak the addresses if 192.168.1.10 and 192.168.1.11 are already used on your LAN.

pipework is a utility for bridging the LAN to the container instead of using the internal docker bridge.

Spark requires all of the machines to be able to communicate with each other. As far as I can tell, spark is not hierarchical, I've seen the workers try to open ports to each other. So in the shell script I expose all the ports, which is OK if the machines are otherwise firewalled, such as behind a home NAT router.

./run-docker-spark

#!/bin/bash
sudo -v
MASTER=$(docker run --name="master" -h master --add-host master:192.168.1.10 --add-host spark1:192.168.1.11 --add-host spark2:192.168.1.12 --add-host spark3:192.168.1.13 --add-host spark4:192.168.1.14 --expose=1-65535 --env SPARK_MASTER_IP=192.168.1.10 -d drpaulbrewer/spark-master:latest)
sudo pipework eth0 $MASTER 192.168.1.10/[email protected]
SPARK1=$(docker run --name="spark1" -h spark1 --add-host home:192.168.1.8 --add-host master:192.168.1.10 --add-host spark1:192.168.1.11 --add-host spark2:192.168.1.12 --add-host spark3:192.168.1.13 --add-host spark4:192.168.1.14 --expose=1-65535 --env mem=10G --env master=spark://192.168.1.10:7077 -v /data:/data -v /tmp:/tmp -d drpaulbrewer/spark-worker:latest)
sudo pipework eth0 $SPARK1 192.168.1.11/[email protected]

After running this script I can see the master web report at 192.168.1.10:8080, or go to another machine on my LAN that has a spark distribution, and run ./spark-shell --master spark://192.168.1.10:7077 and it will bring up an interactive scala shell.

How to run Spark on Docker?

2 Answers