2 votes

I recently installed pyspark on Linux and get the following error when importing pyspark:

ModuleNotFoundError: No module named 'pyspark'

pyspark does show up in my pip list.

I added the following lines to my .bashrc:

export SPARK_HOME=~/Spark/spark-3.0.1-bin-hadoop2.7
export PATH=$PATH:$SPARK_HOME/bin
export PYTHONPATH=$SPARK_HOME/python/:$PYTHONPATH
export PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.10.9-src.zip:$PYTHONPATH
export PYSPARK_PYTHON=python3

If I type pyspark in the terminal, it works properly:

      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 3.0.1
      /_/
Using Python version 3.7.3 (default, Jul 25 2020 13:03:44)
SparkSession available as 'spark'.

In the terminal I can do all my coding; it just doesn't import pyspark from a Python script. It looks like my environment variables are okay.
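To double-check that, here is a small diagnostic (standard library only) that can be run from the same script, since an IDE such as Thonny may use its own interpreter, which never reads ~/.bashrc:

import os
import sys

# Which interpreter is actually executing this script?
print(sys.executable)

# Did SPARK_HOME and the PYTHONPATH entries from ~/.bashrc reach this process?
print(os.environ.get("SPARK_HOME"))
print([p for p in sys.path if "spark" in p.lower()])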

I then typed:

import findspark
print(findspark.init())

And it says: ValueError: Couldn't find Spark, make sure SPARK_HOME env is set or Spark is in an expected location (e.g. from homebrew installation)
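Based on that message, findspark looks for SPARK_HOME in the environment of the running process. A minimal sketch of supplying it from inside the script (using the same path as in my .bashrc) would be:

import os
import findspark

# Point findspark at the Spark install for this process only,
# since the interpreter apparently does not inherit it from ~/.bashrc
os.environ["SPARK_HOME"] = os.path.expanduser("~/Spark/spark-3.0.1-bin-hadoop2.7")
findspark.init()

import pyspark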

How do you run your script? Try running it with the matching Python version: python3.7 script.py - Brown Bear
Your solution indeed works. Good to know how I can run it successfully, but I still want to know how I can do it in my interpreter (I use Thonny) - Jeroen
What is the output when you type echo $SPARK_HOME in your terminal? - Henrique Branco

2 Answers

0 votes

Check whether your environment variables are set properly by running:

source ~/.bashrc
cd $SPARK_HOME/bin 

Or provide the complete path in the script:

import os
import findspark
# findspark does not expand "~", so expand the path explicitly
print(findspark.init(os.path.expanduser('~/Spark/spark-3.0.1-bin-hadoop2.7/')))
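If init() succeeds, a quick smoke test (local master, hypothetical app name) confirms that pyspark is now importable from the script:

import pyspark
from pyspark.sql import SparkSession

# Start a throwaway local session to confirm the import and the Spark install
spark = SparkSession.builder.master("local[*]").appName("smoke-test").getOrCreate()
print(spark.version)
spark.stop()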

0 votes

I had a similar problem when running pyspark code on a Mac.

It worked when I added the following line to my .bashrc:

export PYSPARK_SUBMIT_ARGS="--name job_name --master local --conf spark.dynamicAllocation.enabled=true pyspark-shell"

Or, when I added this to my Python code:

import os
os.environ['PYSPARK_SUBMIT_ARGS'] = """--name job_name --master local --conf spark.dynamicAllocation.enabled=true pyspark-shell"""
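One caveat: this variable is only read when pyspark launches its JVM, so it has to be set before the first SparkSession or SparkContext is created. A minimal sketch of that ordering (the --name and --master values are just the placeholders from above):

import os

# Must be set before the first SparkSession/SparkContext is created
os.environ['PYSPARK_SUBMIT_ARGS'] = """--name job_name --master local --conf spark.dynamicAllocation.enabled=true pyspark-shell"""

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
print(spark.version)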