1 vote

I'm new to Spark and I'm attempting to play with it on my local (Windows) machine using Jupyter Notebook.

I've been following several tutorials for setting environment variables, and have tried setting them multiple ways via Python and cmd, but I cannot get any introductory PySpark code to work.

When running (in Jupyter Notebook, using Python)

from pyspark import SparkContext
from pyspark.sql import SQLContext
sc = SparkContext('local', 'Spark SQL')

OR

from pyspark import SparkContext
from pyspark.sql import SQLContext
sc = SparkContext('C:\spark\spark-2.4.3-bin-hadoop2.7', 'Spark SQL') 

I get the error:

FileNotFoundError: [WinError 2] The system cannot find the file specified

Additionally,

I attempted using findspark and ran into this issue:

findspark.init()
OR
findspark.init("C:\spark\spark-2.4.3-bin-hadoop2.7")

I get the error:

IndexError: list index out of range

From other posts around this topic, I've been led to believe that the SPARK_HOME variable could be set incorrectly.

My environment variables are as follows (Spark was extracted to C:\spark\spark-2.4.3-bin-hadoop2.7):

HADOOP_HOME: C:\spark\spark-2.4.3-bin-hadoop2.7
SPARK_HOME: C:\spark\spark-2.4.3-bin-hadoop2.7
JAVA_HOME: C:\Program Files\Java\jdk1.8.0_201

All of these including %SPARK_HOME%\bin have been added to my PATH variable.

Lastly, when I run cd %SPARK_HOME% in cmd, it correctly brings me to the right directory, \spark\spark-2.4.3-bin-hadoop2.7.

As far as I can see there are no issues with my environment variables, so I'm unsure why PySpark through Jupyter Notebook cannot find my SPARK_HOME (or maybe that's not the issue).
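For reference, a minimal diagnostic you could run inside the notebook to confirm its process actually sees these variables (environment changes made after Jupyter starts won't be visible to it):

```python
import os

# Print the variables PySpark depends on, as seen by the notebook's
# own process; "<not set>" means Jupyter never inherited that variable.
for name in ("SPARK_HOME", "HADOOP_HOME", "JAVA_HOME"):
    print(name, "=", os.environ.get(name, "<not set>"))
```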

Would appreciate any and all help!

Thanks!

Check whether there is a bin directory inside spark-2.4.3-bin-hadoop2.7; if it is there, add that bin directory to your PATH as well. Also check that pyspark is in there.

1 Answer

0 votes

You seem to have done the rest of the process; just one step remains. In a Jupyter notebook, run the command below:

import os
# Point this at your own Spark installation directory
os.environ['SPARK_HOME'] = 'C:\\Users\\user_name\\Desktop\\spark'

This sets the path in the environment of the notebook's process. You can check that it is set as expected by running either of the commands below in the notebook:

%env

OR

for var in os.environ:
    print(var, ':', os.environ[var])

PS: Please mind the indentation of the code.