Error while finding module specification for 'pyspark.worker' (ModuleNotFoundError: No module named 'pyspark')

Question

I'm trying to run a pyspark program, but I'm getting an error:

python.exe: Error while finding module specification for 'pyspark.worker' (ModuleNotFoundError: No module named 'pyspark')

SparkException: Python worker failed to connect back.

Code:

from pyspark.sql import SparkSession
from pyspark.sql import Row
import pyspark.sql.functions as func

spark = SparkSession\
    .builder\
    .appName("ReplaceNanByAverage")\
    .config("spark.master", "local")\
    .getOrCreate()

items = [(1,12),(1,float('Nan')),(1,14),(1,10),(2,22),(2,20),(2,float('Nan')),(3,300),
         (3,float('Nan'))]

sc = spark.sparkContext
rdd = sc.parallelize(items)
itemsRdd = rdd.map(lambda x: Row(id=x[0], col1=int(x[1])))
df = itemsRdd.toDF()

I've tried a lot of suggested solutions:

Downgrading spark version
Using findspark.init()
Using findspark.init('/path/to/spark_home')
Adding Content Root to the Project Structure.
Using .config('PYTHONPATH','pyspark.zip:py4j-0.10.7-src.zip')

But I keep getting the same error.

I'm working in PyCharm IDE on Windows.

Bitswazsky Bitswazsky · Accepted Answer · 2019-09-23T09:30:28

After downloading and extracting spark on your local, can you try adding these lines in bash_Profile or bashrc, depending on whether you're on mac or linux? Note: this is for spark 2.4.0, replace spark and py4j version according to your local installation.

export SPARK_HOME=<prefix-to-spark-path>/spark-2.4.0-bin-hadoop2.7 
export PYTHONPATH=${SPARK_HOME}/python:$PYTHONPATH 
export PYTHONPATH=${SPARK_HOME}/python/lib/py4j-0.10.7-src.zip:$PYTHONPATH

export PYSPARK_PYTHON=<path-to-your-python> 
export PYSPARK_DRIVER_PYTHON=<path-to-your-python>

PATH=$PATH:$SPARK_HOME/bin

Keep in mind, the syntax is little different in linux, so adjust accordingly. Once you've made the changes, execute source ~/.bash_profile or source ~/.bashrc. Then in your PyCharm project use the same python interpreter, that you linked in the previous file. That should work.

Error while finding module specification for 'pyspark.worker' (ModuleNotFoundError: No module named 'pyspark')

2 Answers