11 votes

I have Spark and Hadoop installed on OS X. I successfully worked through an example where Hadoop ran locally, files were stored in HDFS, and I ran Spark with

spark-shell --master yarn-client

and from within the shell worked with HDFS. I'm having problems, however, trying to get Spark to run without HDFS, just locally on my machine. I looked at this answer, but messing around with environment variables doesn't feel right when the Spark documentation says

It’s easy to run locally on one machine — all you need is to have java installed on your system PATH, or the JAVA_HOME environment variable pointing to a Java installation.

If I run the basic SparkPi example, I get the correct output.

If I try to run the sample Java app they provide, I again get output, but this time with connection refused errors relating to port 9000. It sounds like it's trying to connect to Hadoop, but I don't know why, because I'm not specifying that:

    $SPARK_HOME/bin/spark-submit --class "SimpleApp" --master local[4] ~/study/scala/sampleJavaApp/target/simple-project-1.0.jar
    Exception in thread "main" java.net.ConnectException: Call From 37-2-37-10.tssg.org/10.37.2.37 to localhost:9000 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
...
...
...
org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:604)
        at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:699)
        at org.apache.hadoop.ipc.Client$Connection.access(Client.java:367)
        at org.apache.hadoop.ipc.Client.getConnection(Client.java:1462)
        at org.apache.hadoop.ipc.Client.call(Client.java:1381)
        ... 51 more
    15/07/31 11:05:06 INFO spark.SparkContext: Invoking stop() from shutdown hook
    15/07/31 11:05:06 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/metrics/json,null}
...
...
...
    15/07/31 11:05:06 INFO ui.SparkUI: Stopped Spark web UI at http://10.37.2.37:4040
    15/07/31 11:05:06 INFO scheduler.DAGScheduler: Stopping DAGScheduler
    15/07/31 11:05:06 INFO spark.MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
    15/07/31 11:05:06 INFO util.Utils: path = /private/var/folders/cg/vkq1ghks37lbflpdg0grq7f80000gn/T/spark-c6ba18f5-17a5-4da9-864c-509ec855cadf/blockmgr-b66cc31e-7371-472f-9886-4cd33d5ba4b1, already present as root for deletion.
    15/07/31 11:05:06 INFO storage.MemoryStore: MemoryStore cleared
    15/07/31 11:05:06 INFO storage.BlockManager: BlockManager stopped
    15/07/31 11:05:06 INFO storage.BlockManagerMaster: BlockManagerMaster stopped
    15/07/31 11:05:06 INFO scheduler.OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
    15/07/31 11:05:06 INFO spark.SparkContext: Successfully stopped SparkContext
    15/07/31 11:05:06 INFO util.Utils: Shutdown hook called
    15/07/31 11:05:06 INFO util.Utils: Deleting directory /private/var/folders/cg/vkq1ghks37lbflpdg0grq7f80000gn/T/spark-c6ba18f5-17a5-4da9-864c-509ec855cadf

Any pointers/explanations as to where I'm going wrong would be much appreciated!


UPDATE

It seems that having the environment variable HADOOP_CONF_DIR set is causing some issues. Under that directory I have core-site.xml, which contains the following:

<property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
</property>

If I change the value, e.g. to <value>hdfs://localhost:9100</value>, then when I attempt to run the Spark job, the connection refused error refers to this changed port:

Exception in thread "main" java.net.ConnectException: Call From 37-2-37-10.tssg.org/10.37.2.37 to localhost:9100 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused 

So for some reason, despite being instructed to run locally, Spark is trying to connect to HDFS. If I remove the HADOOP_CONF_DIR environment variable, the job runs fine.

are you setting up the master configuration inside your job as well? - eliasah
I'm not sure exactly what you mean (which could be a sign of why it's not working!). I am just running the command $SPARK_HOME/bin/spark-submit --class "SimpleApp" --master local[4] ~/study/scala/sampleJavaApp/target/simple-project-1.0.jar - Philip O'Brien
how are you setting up your SparkContext in your Java project? - eliasah
I'm using the exact official Spark Java example here - Philip O'Brien
Where is your core-site.xml? - Haha TTpro

3 Answers

16 votes

Apache Spark uses the Hadoop client libraries for file access when you use sc.textFile. That makes it possible to use, for example, an hdfs:// or s3n:// path. You can also use local paths, such as file:/home/robocode/foo.txt.

If you specify a file name without a scheme, fs.default.name is used. It defaults to file:, but you explicitly override it to hdfs://localhost:9000 in your core-site.xml. So if you don't specify the scheme, it's trying to read from HDFS.
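
You can watch this resolution happen outside Spark. Here is a minimal sketch (class name mine; it assumes the Hadoop client jars and your HADOOP_CONF_DIR end up on the classpath, as they do under spark-submit) that prints the configured default filesystem and shows how a schemeless path is qualified against it:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class DefaultFsCheck {
    public static void main(String[] args) throws Exception {
        // new Configuration() picks up core-site.xml from the classpath
        Configuration conf = new Configuration();

        // fs.default.name is the deprecated spelling of fs.defaultFS;
        // Hadoop maps one onto the other. This prints file:/// without
        // your override and hdfs://localhost:9000 with it.
        System.out.println(conf.get("fs.defaultFS"));

        // A schemeless path is qualified against the default filesystem.
        FileSystem fs = FileSystem.get(conf);
        System.out.println(fs.makeQualified(new Path("/home/robocode/foo.txt")));
    }
}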

The simplest solution is to specify the scheme:

JavaRDD<String> logData = sc.textFile("file:/home/robocode/foo.txt").cache();

4 votes

I had the same error: HADOOP_CONF_DIR was defined, so I just unset the environment variable.

unset HADOOP_CONF_DIR
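
If you need to keep HADOOP_CONF_DIR set for other jobs, another option is to override the default filesystem from inside the job via the context's Hadoop configuration. A rough sketch (class name mine, untested):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class LocalFsApp {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("Local FS App");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Undo whatever core-site.xml (picked up via HADOOP_CONF_DIR) set,
        // so schemeless paths resolve against the local filesystem again.
        sc.hadoopConfiguration().set("fs.defaultFS", "file:///");

        System.out.println(sc.textFile("/etc/hosts").count());
        sc.stop();
    }
}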

1 vote

I think the environment variables that you defined earlier for the Hadoop-related example are still interfering with your test.

Given that you are using the official Spark Java example:

public static void main(String[] args) {
    String logFile = "YOUR_SPARK_HOME/README.md"; // Should be some file on your system
    SparkConf conf = new SparkConf().setAppName("Simple Application");
    JavaSparkContext sc = new JavaSparkContext(conf);
    JavaRDD<String> logData = sc.textFile(logFile).cache();
...
}

I would suggest explicitly setting the master, since default values are picked up otherwise:

SparkConf conf = new SparkConf().setMaster("local").setAppName("Simple Application");

Check the SparkConf documentation for more information.
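
Putting both suggestions together, a self-contained variant of the official example might look like this (a sketch, not tested; the README path is the placeholder from the Spark docs, and the explicit file: scheme sidesteps fs.default.name entirely):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class SimpleApp {
    public static void main(String[] args) {
        // Explicit scheme, so core-site.xml cannot redirect the read to HDFS
        String logFile = "file:/YOUR_SPARK_HOME/README.md";

        SparkConf conf = new SparkConf()
                .setMaster("local[4]")            // local mode, 4 worker threads
                .setAppName("Simple Application");
        JavaSparkContext sc = new JavaSparkContext(conf);

        JavaRDD<String> logData = sc.textFile(logFile).cache();
        long numAs = logData.filter(s -> s.contains("a")).count();
        System.out.println("Lines with a: " + numAs);

        sc.stop();
    }
}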