2
votes

I bought a book to try to learn Spark. After downloading Spark and following the steps, I have problems loading spark-shell and also pyspark. Could someone point out what I need to do in order to run spark-shell or pyspark?

Here is what I did.

I created the folder C:\spark and placed all the files from the Spark tar into it.

I also created c:\hadoop\bin and placed winutils.exe into that folder.

Did the following:

> set SPARK_HOME=c:\spark 
> set HADOOP_HOME=c:\hadoop 
> set PATH=%SPARK_HOME%\bin;%PATH%
> set PATH=%HADOOP_HOME%\bin;%PATH%
> set PYTHONPATH=C:\Users\AppData\Local\Continuum\anaconda3
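As a sanity check (a sketch of my own, not part of the original setup; the helper name is illustrative), the same verification can be done from Python, since `os.environ` sees exactly what the current cmd session exports. Note that `set` only affects the current cmd window; variables set this way are gone in a new window.

```python
import os

def check_env(name):
    """Return the value of an environment variable, or None if it is
    not set in the current session (prints a short status line)."""
    value = os.environ.get(name)
    if value is None:
        print(f"{name} is NOT set in this session")
    else:
        print(f"{name} = {value}")
    return value

# Variables exported by the cmd commands above
for var in ("SPARK_HOME", "HADOOP_HOME", "PYTHONPATH"):
    check_env(var)
```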

Created C:\tmp\hive and did the following:

> cd c:\hadoop\bin

> winutils.exe chmod -R 777 C:\tmp\hive

also did the following:

> set PYSPARK_PYTHON=C:\Users\AppData\Local\Continuum\anaconda3\python

> set PYSPARK_DRIVER_PYTHON=C:\Users\AppData\Local\Continuum\anaconda3\ipython
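Spark's "Missing Python executable" warning (seen in the output below) appears when the path in `PYSPARK_PYTHON`/`PYSPARK_DRIVER_PYTHON` does not resolve to a file. A small Python sketch (the helper name is my own) can check this directly; on Windows the `.exe` extension is often omitted from the variable, so it tries that too:

```python
import os

def resolve_interpreter(var):
    """Return the existing file that an interpreter environment
    variable points at, trying an added '.exe' suffix as Windows
    allows, or None if neither path exists."""
    path = os.environ.get(var, "")
    for candidate in (path, path + ".exe"):
        if candidate and os.path.isfile(candidate):
            return candidate
    return None

for var in ("PYSPARK_PYTHON", "PYSPARK_DRIVER_PYTHON"):
    print(var, "->", resolve_interpreter(var))
```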

Also, a quick question: I tried to check and confirm the SPARK_HOME environment variable by doing the following (I think this is how to do it; is this the right way to see whether I set the environment variable correctly?)

>echo %SPARK_HOME%

I just got back the literal text %SPARK_HOME%.

I also did:

>echo %PATH%

I did not see %SPARK_HOME%\bin nor %HADOOP_HOME%\bin in the directories printed on CMD.
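That PATH inspection can be scripted as well; a minimal Python sketch (the helper names are my own, and the directories come from the `set` commands above). The separator is `;` on Windows, which is what `os.pathsep` evaluates to there:

```python
import os

def dirs_on_path(path_value=None, sep=os.pathsep):
    """Split a PATH-style string into its directory entries."""
    if path_value is None:
        path_value = os.environ.get("PATH", "")
    return [d for d in path_value.split(sep) if d]

def on_path(directory, path_value=None, sep=os.pathsep):
    """Case-insensitive membership check (Windows paths ignore case)."""
    return directory.lower() in (d.lower() for d in dirs_on_path(path_value, sep))

# e.g. on_path(r"c:\spark\bin") should be True after the setup above
```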

When I finally tried to run pyspark:

C:\spark\bin>pyspark

I got the following error message:

Missing Python executable 'C:\Users\AppData\Local\Continuum\anaconda3\python', defaulting to 'C:\spark\bin\..' for SPARK_HOME environment variable. Please install Python or specify the correct Python executable in PYSPARK_DRIVER_PYTHON or PYSPARK_PYTHON environment variable to detect SPARK_HOME safely.
'C:\Users\AppData\Local\Continuum\anaconda3\ipython' is not recognized as an internal or external command, operable program or batch file.

When I tried to run spark-shell:

C:\spark\bin>spark-shell

I got the following error message:

Missing Python executable 'C:\Users\AppData\Local\Continuum\anaconda3\python', defaulting to 'C:\spark\bin\..' for SPARK_HOME environment variable. Please install Python or specify the correct Python executable in PYSPARK_DRIVER_PYTHON or PYSPARK_PYTHON environment variable to detect SPARK_HOME safely.
'C:\Users\AppData\Local\Continuum\anaconda3\ipython' is not recognized as an internal or external command, operable program or batch file.

C:\spark\bin>spark-shell
Missing Python executable 'C:\Users\AppData\Local\Continuum\anaconda3\python', defaulting to 'C:\spark\bin\..' for SPARK_HOME environment variable. Please install Python or specify the correct Python executable in PYSPARK_DRIVER_PYTHON or PYSPARK_PYTHON environment variable to detect SPARK_HOME safely.
2018-08-19 18:29:01 ERROR Shell:397 - Failed to locate the winutils binary in the hadoop binary path
java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries.
    at org.apache.hadoop.util.Shell.getQualifiedBinPath(Shell.java:379)
    at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:394)
    at org.apache.hadoop.util.Shell.<clinit>(Shell.java:387)
    at org.apache.hadoop.util.StringUtils.<clinit>(StringUtils.java:80)
    at org.apache.hadoop.security.SecurityUtil.getAuthenticationMethod(SecurityUtil.java:611)
    at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:273)
    at org.apache.hadoop.security.UserGroupInformation.ensureInitialized(UserGroupInformation.java:261)
    at org.apache.hadoop.security.UserGroupInformation.loginUserFromSubject(UserGroupInformation.java:791)
    at org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:761)
    at org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:634)
    at org.apache.spark.util.Utils$$anonfun$getCurrentUserName$1.apply(Utils.scala:2467)
    at org.apache.spark.util.Utils$$anonfun$getCurrentUserName$1.apply(Utils.scala:2467)
    at scala.Option.getOrElse(Option.scala:121)
    at org.apache.spark.util.Utils$.getCurrentUserName(Utils.scala:2467)
    at org.apache.spark.SecurityManager.<init>(SecurityManager.scala:220)
    at org.apache.spark.deploy.SparkSubmit$.secMgr$lzycompute$1(SparkSubmit.scala:408)
    at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$secMgr$1(SparkSubmit.scala:408)
    at org.apache.spark.deploy.SparkSubmit$$anonfun$doPrepareSubmitEnvironment$7.apply(SparkSubmit.scala:416)
    at org.apache.spark.deploy.SparkSubmit$$anonfun$doPrepareSubmitEnvironment$7.apply(SparkSubmit.scala:416)
    at scala.Option.map(Option.scala:146)
    at org.apache.spark.deploy.SparkSubmit$.doPrepareSubmitEnvironment(SparkSubmit.scala:415)
    at org.apache.spark.deploy.SparkSubmit$.prepareSubmitEnvironment(SparkSubmit.scala:250)
    at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:171)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:137)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
2018-08-19 18:29:01 WARN  NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
2018-08-19 18:29:08 WARN  Utils:66 - Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
Spark context Web UI available at http://NJ1-BCTR-10504.usa.fxcorp.prv:4041
Spark context available as 'sc' (master = local[*], app id = local-1534717748215).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.3.1
      /_/

Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_181)
Type in expressions to have them evaluated.
Type :help for more information.

scala>
2
you will find ipython under the anaconda3/Scripts folder, not under anaconda3. - natarajan k

2 Answers

0
votes

I see the following missing from your setup:

1.

Apache Spark 2.x needs Java 8 or above, so make sure to install a JDK and add its bin directory to your PATH environment variable, for example:

C:\Program Files\Java\jdk1.8.0_172\bin

Run the following simple Java command at the cmd prompt to validate that Java is correctly installed on your machine:

java -version

Once Java is installed successfully, set your SPARK_HOME environment variable for Spark, e.g.:

C:\Spark

Since you are running Spark on your local system, you won't necessarily need to set HADOOP_HOME, as Spark can run with its standalone resource manager.
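The version check above can be scripted too; a hedged Python sketch (the helpers are illustrative, not part of any Spark tooling) that parses the banner `java -version` prints to stderr, handling both the old `1.8.0_181`-style and the newer `11.0.2`-style numbering:

```python
import re
import subprocess

def parse_java_version(version_output):
    """Extract the major Java version from `java -version` output.
    Returns 8 for '1.8.0_181', 11 for '11.0.2', or None if no
    version string is found."""
    m = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if not m:
        return None
    major = int(m.group(1))
    if major == 1 and m.group(2):  # old scheme: "1.8..." means Java 8
        major = int(m.group(2))
    return major

def installed_java_version():
    """Run `java -version`; note the banner goes to stderr, not stdout."""
    out = subprocess.run(["java", "-version"],
                         capture_output=True, text=True)
    return parse_java_version(out.stderr)
```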

2.

For pyspark to work, you may have to install the pyspark Python package:

pip install pyspark
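To confirm the package landed in the interpreter you expect (the same one PYSPARK_PYTHON points at), a small Python sketch (the helper name is my own):

```python
import importlib.util

def has_package(name):
    """Return True if `name` is importable by this interpreter,
    without actually importing it."""
    return importlib.util.find_spec(name) is not None

print("pyspark importable:", has_package("pyspark"))
```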

Log setup (good to have):

Your log is quite verbose; you can control this with the log4j.properties file under the spark/conf folder so that INFO messages are not shown.
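For reference, the usual tweak (assuming a stock Spark 2.x layout) is to copy `conf/log4j.properties.template` to `conf/log4j.properties` and change the root level, which quiets the INFO output:

```
# Show WARN and above on the console instead of INFO
log4j.rootCategory=WARN, console
```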

0
votes

I would download a VM (a fully configured node) with all of that already set up: https://mapr.com/products/mapr-sandbox-hadoop/
You would be able to use Spark with HDFS, Hive, and any other tools.