
I am programming with PySpark in the Eclipse IDE and have been trying to transition to Spark 1.4.1 so that I can finally program in Python 3. The following program works in Spark 1.3.1 but throws an exception in Spark 1.4.1:

from pyspark import SparkContext, SparkConf
from pyspark.sql.types import *
from pyspark.sql import SQLContext

if __name__ == '__main__':
    conf = SparkConf().setAppName("MyApp").setMaster("local")

    global sc
    sc = SparkContext(conf=conf)

    global sqlc
    sqlc = SQLContext(sc)

    symbolsPath = 'SP500Industry.json'
    symbolsRDD = sqlc.read.json(symbolsPath)

    print("Done")

The traceback I'm getting is as follows:

Traceback (most recent call last):
  File "/media/gavin/20A6-76BF/Current Projects Luna/PySpark Test/Test.py", line 21, in <module>
    symbolsRDD = sqlc.read.json(symbolsPath)  # rdd with all symbols (and their industries)
  File "/home/gavin/spark-1.4.1-bin-hadoop2.6/python/pyspark/sql/context.py", line 582, in read
    return DataFrameReader(self)
  File "/home/gavin/spark-1.4.1-bin-hadoop2.6/python/pyspark/sql/readwriter.py", line 39, in __init__
    self._jreader = sqlContext._ssql_ctx.read()
  File "/home/gavin/spark-1.4.1-bin-hadoop2.6/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
  File "/home/gavin/spark-1.4.1-bin-hadoop2.6/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 304, in get_return_value
py4j.protocol.Py4JError: An error occurred while calling o18.read. Trace:
py4j.Py4JException: Method read([]) does not exist
    at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:333)
    at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:342)
    at py4j.Gateway.invoke(Gateway.java:252)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:207)
    at java.lang.Thread.run(Thread.java:745)

The external libraries I have for the project are:

- spark-1.4.1-bin-hadoop2.6/python
- spark-1.4.1-bin-hadoop2.6/python/lib/py4j-0.8.2.1-src.zip
- spark-1.4.1-bin-hadoop2.6/python/lib/pyspark.zip (tried both including and not including this)
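In case it helps, here is a minimal check (the print line is illustrative, not part of my project) for confirming which Python-side pyspark package the interpreter is actually importing from the build path:

import pyspark

# Shows the path of the pyspark package that was resolved at import time;
# it should point into spark-1.4.1-bin-hadoop2.6, not an older install.
print(pyspark.__file__)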

Can anybody help me out with what I'm doing wrong?


1 Answer


You need to set the format to 'json' before your call to load. Otherwise Spark assumes you are trying to load a Parquet file, which is the default data source.

symbolsRDD = sqlc.read.format('json').load(symbolsPath)
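For what it's worth, the json method on DataFrameReader is documented as a shortcut that sets the format for you, so in 1.4 the one-liner below should behave the same as the explicit format('json').load(...) form once read itself resolves:

symbolsRDD = sqlc.read.json(symbolsPath)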

However, I am still not able to figure out why you are getting an error on the read method itself; if the format were the problem, Spark should instead complain that it found an invalid Parquet file.
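One thing worth ruling out (an assumption on my part, since I can't see your setup): SQLContext.read and DataFrameReader were only added in Spark 1.4, so "Method read([]) does not exist" usually means the Python files on the build path are from 1.4.1 while the JVM that actually launches is an older Spark. A quick sanity check from the same script:

# Reports the version of the JVM-side Spark that actually launched;
# if this prints 1.3.x while the Python files are from 1.4.1, that
# mismatch would explain the missing read method.
print(sc.version)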