
I get an error while creating a DataFrame in PySpark. Please let me know how to fix it. I am learning PySpark commands through Coursera.

Here are the commands that I used:

PYSPARK_DRIVER_PYTHON=ipython pyspark -- packages com.databricks:spark-csv_2.10:1.4.0

This seemed to work fine.

Once in the shell, when I tried:

yelp_df = sqlCtx.load(source="com.databricks.spark.csv",
                      header='true',
                      inferSchema='true',
                      path='file:///usr/lib/hue/apps/search/examples/collections/solr_configs_yelp_demo/index_data.csv')

I get the following error:

Py4JJavaError                             Traceback (most recent call last)
<ipython-input> in <module>()
      3                header = 'true',
      4                inferSchema = 'true',
----> 5                path = 'file:///usr/lib/hue/apps/search/examples/collections/solr_configs_yelp_demo/index_data.csv')

/usr/lib/spark/python/pyspark/sql/context.py in load(self, path, source, schema, **options)
    480                                             self._sc._gateway._gateway_client)
    481         if schema is None:
--> 482             df = self._ssql_ctx.load(source, joptions)
    483         else:
    484             if not isinstance(schema, StructType):

/usr/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py in __call__(self, *args)
    536         answer = self.gateway_client.send_command(command)
    537         return_value = get_return_value(answer, self.gateway_client,
--> 538             self.target_id, self.name)
Comments:

Alexis Benichoux: Your traceback log does not seem complete; what is the final error?

Bhavana Namboodiri: The dump is too long. I am adding some more here; the rest of the dump is:

    539
    540         for temp_arg in temp_args:

/usr/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
    298                 raise Py4JJavaError(
    299                     'An error occurred while calling {0}{1}{2}.\n'.
--> 300                     format(target_id, '.', name), value)
    301             else:
    302                 raise Py4JError(

Bhavana Namboodiri: I guess the main part of the error listing is:

Py4JJavaError: An error occurred while calling o19.load.
: java.lang.RuntimeException: Failed to load class for data source: com.databricks.spark.csv
    at scala.sys.package$.error(package.scala:27)
    at tCommand.invokeMethod(AbstractCommand.jav
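
For what it's worth, the exception the comments surface ("Failed to load class for data source: com.databricks.spark.csv") usually means the spark-csv package never made it onto the classpath. Note the stray space in "-- packages" in the launch command above; spark-submit only recognizes the flag written as --packages, so a corrected launch (same package, only the flag fixed) would look like:

PYSPARK_DRIVER_PYTHON=ipython pyspark --packages com.databricks:spark-csv_2.10:1.4.0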

1 Answer


Load it as a text file, split each line on your delimiter ',', and then convert the result to a DataFrame. With sc being your SparkContext:

sc.textFile('file:///usr/lib/hue/apps/search/examples/collections/solr_configs_yelp_demo/index_data.csv').map(lambda row: row.split(',')).toDF()
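
That one-liner keeps the header row as data and names the columns _1, _2, and so on. A slightly fuller sketch of the same idea, assuming the Spark 1.x API shown in the traceback (raw, header, and yelp_df are illustrative names), drops the header and reuses its fields as column names:

path = 'file:///usr/lib/hue/apps/search/examples/collections/solr_configs_yelp_demo/index_data.csv'
raw = sc.textFile(path)
header = raw.first()                                 # first line holds the column names
yelp_df = (raw.filter(lambda line: line != header)   # drop the header row
              .map(lambda line: line.split(','))     # naive comma split
              .toDF(header.split(',')))              # reuse header fields as column names
yelp_df.printSchema()

Every column still comes back as a string; unlike the spark-csv load with inferSchema = 'true', this path does no type inference, and a plain split(',') mis-handles quoted fields that contain commas.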