I'm new to Spark and I'm attempting to play with it on my local (Windows) machine using a Jupyter Notebook.
I've followed several tutorials for setting the environment variables, both through Python and cmd, but I cannot get any introductory PySpark code to work.
When running the following (in a Jupyter Notebook, using Python):
from pyspark import SparkContext
from pyspark.sql import SQLContext
sc = SparkContext('local', 'Spark SQL')
OR
from pyspark import SparkContext
from pyspark.sql import SQLContext
sc = SparkContext(r'C:\spark\spark-2.4.3-bin-hadoop2.7', 'Spark SQL')
I get the error:
FileNotFoundError: [WinError 2] The system cannot find the file specified
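One diagnostic that may help narrow this down (my assumption, not something confirmed above: [WinError 2] is what Windows raises when a child process cannot be launched, and for SparkContext that child process is usually java):

```python
import shutil

# [WinError 2] typically means Windows could not launch an executable.
# shutil.which reports whether a command resolves on this process's
# PATH, so this shows what the notebook kernel can actually see.
print("java resolves to:", shutil.which("java"))
print("nonexistent command:", shutil.which("no-such-command-xyz"))
```

If `java` prints as None here while it works in cmd, the notebook process has a different PATH than the shell.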
Additionally, I attempted to use findspark and ran into an issue. When running
import findspark
findspark.init()
OR
findspark.init(r"C:\spark\spark-2.4.3-bin-hadoop2.7")
I get the error:
IndexError: list index out of range
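For context on that IndexError (my understanding of findspark's internals, stated as an assumption: `findspark.init()` globs for the py4j zip under `SPARK_HOME/python/lib` and indexes the first match, so a SPARK_HOME without a Spark install produces an empty match list and the `[0]` lookup fails), here is a minimal sketch of that failure mode using a temporary directory as a stand-in for a bad SPARK_HOME:

```python
import glob
import os
import tempfile

# Stand-in for a SPARK_HOME that does not contain a Spark install.
bad_spark_home = tempfile.mkdtemp()

# Assumed findspark behavior: glob for py4j under python/lib and
# take the first match.
matches = glob.glob(os.path.join(bad_spark_home, "python", "lib", "py4j-*.zip"))
try:
    py4j_path = matches[0]
except IndexError:
    py4j_path = None  # the same "list index out of range" failure

print("matches:", matches)
print("py4j_path:", py4j_path)
```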
From other posts on this topic, I've been led to believe that my SPARK_HOME variable could be set incorrectly.
My Spark was extracted to C:\spark\spark-2.4.3-bin-hadoop2.7, and my environment variables are as follows:

HADOOP_HOME: C:\spark\spark-2.4.3-bin-hadoop2.7
SPARK_HOME: C:\spark\spark-2.4.3-bin-hadoop2.7
JAVA_HOME: C:\Program Files\Java\jdk1.8.0_201

All of these, including %SPARK_HOME%\bin, have been added to my PATH variable.
Lastly, when I run cd %SPARK_HOME% in cmd, it correctly brings me to the right directory, \spark\spark-2.4.3-bin-hadoop2.7.
As far as I can see, there are no issues with my environment variables, so I'm unsure why PySpark in a Jupyter Notebook cannot find my SPARK_HOME (or maybe that's not the issue).
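One more check worth noting (an assumption on my part: environment variables are inherited when a process starts, so a Jupyter kernel launched before the variables were set, or launched by something with a different environment, will not see them even if cmd does):

```python
import os

# Print what the notebook process itself sees; this can differ from
# what cmd reports if the kernel was started before the variables
# were set.
for var in ("SPARK_HOME", "HADOOP_HOME", "JAVA_HOME"):
    print(var, "=", os.environ.get(var, "<not set in this process>"))
```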
Would appreciate any and all help!
Thanks!