
I'm a new user of PySpark, working with Apache Zeppelin 0.7.1 to access my Spark cluster. I have configured two machines:

  • Machine-1: Spark Master + 2 workers + Apache Zeppelin
  • Machine-2: 2 workers

Situation:

  • The cluster works fine if I use the pyspark console from the master (Machine-1).

  • When I use Spark's local[*] configuration, everything works fine from Zeppelin.

Following the Zeppelin documentation, I set the master property of the Spark interpreter configuration to spark://Machine-1:7077. After that, some code runs fine from the cells of my Zeppelin notebook:

%spark
sc.version
sc.getConf.get("spark.home")
System.getenv().get("PYTHONPATH")
System.getenv().get("SPARK_HOME")

but other RDD transformations (such as the one below) never finish:

%pyspark
input_file = "/tmp/kddcup.data_10_percent.gz"
raw_rdd = sc.textFile(input_file)
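
If it helps to narrow things down, sc.textFile is only a lazy transformation; here is a minimal sketch (assuming the same /tmp/kddcup.data_10_percent.gz path is readable from the workers) that forces the job to run with an explicit action:

%pyspark
# Load the gzipped KDD Cup sample; this only defines the RDD, nothing runs yet
input_file = "/tmp/kddcup.data_10_percent.gz"
raw_rdd = sc.textFile(input_file)
# count() is an action, so this is the call that actually submits a job
# to the cluster and should show up in the Spark UI (machine-1:4040)
print(raw_rdd.count())

Greg's comment below about checking machine-1:4040 applies here: the job triggered by count() should appear there if the executors are actually getting resources.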

What's wrong? Any advice? Thank you in advance.

Do you see the job running on the Spark console (machine-1:4040)? – Greg

1 Answer


Eventually I realised that:

  1. The memory and core settings for the workers were not suitable for my cluster. I changed the values in the spark-env.sh files and it is now working (see the sketch after this list).
  2. The configuration parameters in Apache Zeppelin also had some mistakes (some extra Spark modules were needed).
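
As a minimal sketch of the kind of worker settings involved (the actual values are cluster-specific and the ones below are only illustrative), the standalone workers read their resource limits from conf/spark-env.sh on each machine:

# conf/spark-env.sh on each worker machine (hypothetical values)
export SPARK_WORKER_CORES=2      # cores each worker offers to executors
export SPARK_WORKER_MEMORY=2g    # memory each worker offers to executors

After changing these, the workers have to be restarted so the new limits show up on the master UI.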

Thank you, Greg, for your interest.