I'm trying to build a Spark DataFrame from a simple Pandas DataFrame. These are the steps I follow:
import pandas as pd

# Build a one-column Pandas DataFrame, then convert it with the SQLContext
# that the PySpark shell provides as sqlContext
pandas_df = pd.DataFrame({"Letters": ["X", "Y", "Z"]})
spark_df = sqlContext.createDataFrame(pandas_df)
spark_df.printSchema()
Up to this point everything is OK. The output is:
root
|-- Letters: string (nullable = true)
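In case it matters, I believe the conversion is equivalent through the SparkSession entry point (assuming spark is the session object the PySpark shell creates, which is my assumption rather than something shown above):
# Same conversion via the SparkSession entry point; `spark` is assumed to be
# the SparkSession the PySpark shell creates alongside sqlContext
spark_df = spark.createDataFrame(pandas_df)
spark_df.printSchema()  # prints the same root/Letters schema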
The problem comes when I try to print the DataFrame:
spark_df.show()
This is the result:
An error occurred while calling o158.collectToPython. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 5.0 failed 1 times, most recent failure: Lost task 0.0 in stage 5.0 (TID 5, localhost, executor driver): org.apache.spark.SparkException:
Error from python worker:
Error executing Jupyter command 'pyspark.daemon': [Errno 2] No such file or directory PYTHONPATH was:
/home/roldanx/soft/spark-2.4.0-bin-hadoop2.7/python/lib/pyspark.zip:/home/roldanx/soft/spark-2.4.0-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip:/home/roldanx/soft/spark-2.4.0-bin-hadoop2.7/jars/spark-core_2.11-2.4.0.jar:/home/roldanx/soft/spark-2.4.0-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip:/home/roldanx/soft/spark-2.4.0-bin-hadoop2.7/python/: org.apache.spark.SparkException: No port number in pyspark.daemon's stdout
These are my Spark specifications:
SparkSession - hive
SparkContext
Spark UI
Version: v2.4.0
Master: local[*]
AppName: PySparkShell
These are the environment variables set in my venv:
export PYSPARK_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS='lab'
Note:
As the error mentions, it has to do with running PySpark from Jupyter. Launching it with 'PYSPARK_PYTHON=python2.7' or with 'PYSPARK_PYTHON=python3.6' works fine. I suspect printSchema() succeeds while show() fails because show() is the first call that actually starts Python worker processes (pyspark.daemon), and those workers are launched through the PYSPARK_PYTHON command.
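For reference, my understanding (an assumption on my part, not confirmed anywhere above) is that PYSPARK_PYTHON selects the interpreter for the Python workers, while PYSPARK_DRIVER_PYTHON and PYSPARK_DRIVER_PYTHON_OPTS only configure the driver front end. A minimal sketch of the setup I believe is intended for JupyterLab, keeping the workers on a real interpreter:
# Sketch: point the driver at JupyterLab, keep workers on a real Python binary
export PYSPARK_DRIVER_PYTHON=jupyter      # driver front end only
export PYSPARK_DRIVER_PYTHON_OPTS='lab'   # open JupyterLab instead of the classic notebook
export PYSPARK_PYTHON=python3.6           # interpreter the pyspark.daemon workers run under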