
I'm a new user of PySpark, working with Apache Zeppelin 0.7.1 to access my Spark cluster. I have configured two machines:

  • Machine-1: Spark Master + 2 workers + Apache Zeppelin
  • Machine-2: 2 workers

Situation:

  • The cluster works fine if I use the pyspark console from the master (Machine-1).

  • When I use Spark's local[*] configuration, everything works fine from Zeppelin.

Following the Zeppelin documentation, I set the master property of the Spark interpreter configuration to spark://Machine-1:7077. After that, some code runs fine from the cells of my Zeppelin notebook:

%spark
sc.version
sc.getConf.get("spark.home")
System.getenv().get("PYTHONPATH")
System.getenv().get("SPARK_HOME")

but other RDD transformations (such as the one below) never finish:

%pyspark
input_file = "/tmp/kddcup.data_10_percent.gz"
raw_rdd = sc.textFile(input_file)
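
If it helps to narrow things down, sc.textFile is only a lazy transformation; here is a minimal sketch (assuming the same /tmp/kddcup.data_10_percent.gz path is readable from the workers) that forces the job to run with an explicit action:

%pyspark
# Load the gzipped KDD Cup sample; this only defines the RDD, nothing runs yet
input_file = "/tmp/kddcup.data_10_percent.gz"
raw_rdd = sc.textFile(input_file)
# count() is an action, so this is the call that actually submits a job
# to the cluster and should show up in the Spark UI (machine-1:4040)
print(raw_rdd.count())

Greg's comment below about checking machine-1:4040 applies here: the job triggered by count() should appear there if the executors are actually getting resources.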

What's wrong? Any advice? Thank you in advance.

Do you see the job running on the Spark console (machine-1:4040)? – Greg

1 Answer


Eventually I realised that:

  1. The memory and core settings for the workers were not suitable for my cluster. I changed the values in the spark-env.sh files and it is now working (see the sketch after this list).
  2. The configuration parameters in Apache Zeppelin also had some mistakes (some extra Spark modules were needed).
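
As a minimal sketch of the kind of worker settings involved (the actual values are cluster-specific and the ones below are only illustrative), the standalone workers read their resource limits from conf/spark-env.sh on each machine:

# conf/spark-env.sh on each worker machine (hypothetical values)
export SPARK_WORKER_CORES=2      # cores each worker offers to executors
export SPARK_WORKER_MEMORY=2g    # memory each worker offers to executors

After changing these, the workers have to be restarted so the new limits show up on the master UI.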

Thank you, Greg, for your interest.