I get an error while creating a DataFrame in PySpark. Please let me know how to fix it. I am learning PySpark commands through Coursera.
Here are the commands that I used:
PYSPARK_DRIVER_PYTHON=ipython pyspark --packages com.databricks:spark-csv_2.10:1.4.0
This seemed to work fine.
Once in the shell, when I tried:
yelp_df = sqlCtx.load(source='com.databricks.spark.csv',
                      header='true',
                      inferSchema='true',
                      path='file:///usr/lib/hue/apps/search/examples/collections/solr_configs_yelp_demo/index_data.csv')
I get the following error:
Py4JJavaError Traceback (most recent call last)
in ()
3 header = 'true',
4 inferSchema = 'true',
----> 5 path = 'file:///usr/lib/hue/apps/search/examples/collections/solr_configs_yelp_demo/index_data.csv')
/usr/lib/spark/python/pyspark/sql/context.py in load(self, path, source, schema, **options)
480 self._sc._gateway._gateway_client)
481 if schema is None:
--> 482 df = self._ssql_ctx.load(source, joptions)
483 else:
484 if not isinstance(schema, StructType):
/usr/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py in __call__(self, *args)
536 answer = self.gateway_client.send_command(command)
537 return_value = get_return_value(answer, self.gateway_client,
--> 538 self.target_id, self.name)
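In case it helps narrow things down, here is a minimal sketch (plain Python 3, no Spark needed) of how the file:// path passed to load could be checked on the local filesystem. The helper name local_path_from_file_uri is hypothetical. The check only shows whether the file is present on the machine where it runs. It does not by itself explain the Py4JJavaError above.

```python
import os
from urllib.parse import urlparse, unquote


def local_path_from_file_uri(uri):
    """Return the local filesystem path encoded in a file:// URI.

    Hypothetical helper for illustration; not part of PySpark.
    """
    parsed = urlparse(uri)
    if parsed.scheme != "file":
        raise ValueError("expected a file:// URI, got: %r" % uri)
    # Undo any percent-encoding (e.g. %20 for a space) in the path part.
    return unquote(parsed.path)


# Same path as in the load(...) call above.
csv_uri = ("file:///usr/lib/hue/apps/search/examples/collections/"
           "solr_configs_yelp_demo/index_data.csv")
csv_path = local_path_from_file_uri(csv_uri)

# Report whether the CSV is actually present on this machine.
print(csv_path, "exists:", os.path.isfile(csv_path))
```

If this prints "exists: False", the path itself would be the first thing to fix before looking at the spark-csv package.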