How to connect spark with cassandra using spark-cassandra-connector?

Question

Forgive my noobness, but I'm trying to set up a Spark cluster that connects to Cassandra from a Python script. Currently I am using DataStax Enterprise to run Cassandra in Solr search mode. I understand that, in order to use the spark-cassandra connector that DataStax provides, you must run Cassandra in analytics mode (using the -k option). So far I have only got it to work with the DSE Spark version, for which I followed these steps:

1. Start DSE Cassandra in analytics mode
2. Change the $PYTHONPATH env variable to /path/to/spark/dse/python:/path/to/spark/dse/python/lib/py4j-*.zip:$PYTHONPATH
3. Run the standalone script as root with python test-script.py
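
The PYTHONPATH step can also be done from inside the script. The sketch below assumes the directory layout from the path above (a python directory with py4j zip archives under lib/); the helper names are hypothetical. As far as I know, Python does not expand a wildcard such as py4j-*.zip inside PYTHONPATH, so the pattern is resolved here with glob instead.

```python
import glob
import os
import sys


def spark_python_paths(spark_python_dir):
    """Return the Spark python dir followed by any py4j zips found under lib/."""
    pattern = os.path.join(spark_python_dir, "lib", "py4j-*.zip")
    return [spark_python_dir] + sorted(glob.glob(pattern))


def add_to_sys_path(paths):
    """Prepend paths to sys.path, keeping their order and skipping duplicates."""
    new = [p for p in paths if p not in sys.path]
    sys.path[:0] = new


# Example (the path is the placeholder from the steps above):
# add_to_sys_path(spark_python_paths("/path/to/spark/dse/python"))
```

Calling add_to_sys_path before the first pyspark import should have the same effect as exporting the variable in the shell.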

Besides, I ran another test using plain Spark (not the DSE version), trying to include the Java packages that make the driver classes accessible. I did:

1. Add spark.driver.extraClassPath = /path/to/spark-cassandra-connector-SNAPSHOT.jar to the file spark-defaults.conf
2. Execute $SPARK_HOME/bin/spark-submit --packages com.datastax.spark:spark-cassandra...

I also tried running the pyspark shell and checking whether sc had the method cassandraTable, to see if the driver was loaded, but it didn't work out. In both cases I get the following error message:

AttributeError: 'SparkContext' object has no attribute 'cassandraTable'

My goal is to understand what I must do to make the non-DSE Spark version connect with Cassandra and have the methods from the driver available.

I also want to know if it is possible to use the DSE spark-cassandra connector with a Cassandra node that is NOT running with DSE.

Thanks for your help

Abhishek Anand · Accepted Answer · 2016-05-09T07:08:42

Here is how to connect spark-shell to Cassandra in the non-DSE version.

Copy the spark-cassandra-connector jar to spark/spark-hadoop-directory/jars/

spark-shell --jars ~/spark/spark-hadoop-directory/jars/spark-cassandra-connector-*.jar

In the Spark shell, execute these commands:

// Stop the context that spark-shell created so a new one can be configured
sc.stop
import com.datastax.spark.connector._, org.apache.spark.SparkContext, org.apache.spark.SparkContext._, org.apache.spark.SparkConf
import org.apache.spark.sql.cassandra._
// Point the connector at the Cassandra node
val conf = new SparkConf(true).set("spark.cassandra.connection.host", "localhost")
val sc = new SparkContext(conf)
val csc = new CassandraSQLContext(sc)

You will have to provide more parameters if your Cassandra has authentication set up, for example spark.cassandra.auth.username and spark.cassandra.auth.password. :)
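
For the Python script from the question, note that cassandraTable is added to SparkContext by the Scala implicits in com.datastax.spark.connector._, so it does not show up on the Python SparkContext even when the jar is loaded. From Python the connector is usually reached through the DataFrame source instead. Here is a rough sketch, assuming Spark 2.0 or later with a matching connector version on the classpath; the keyspace and table names and the helper functions are made up for illustration.

```python
def cassandra_read_options(keyspace, table):
    """Options passed to the connector's DataFrame source."""
    return {"keyspace": keyspace, "table": table}


def load_table(spark, keyspace, table):
    """Load a Cassandra table as a DataFrame through the connector."""
    return (spark.read
            .format("org.apache.spark.sql.cassandra")
            .options(**cassandra_read_options(keyspace, table))
            .load())


def main():
    # Imported here so the helpers above can be used without pyspark installed.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("cassandra-test")
             # Same connection setting as in the Scala commands above.
             .config("spark.cassandra.connection.host", "localhost")
             .getOrCreate())
    # "test_ks" and "test_table" are placeholder names, not from the question.
    df = load_table(spark, "test_ks", "test_table")
    df.show()
    spark.stop()
```

Call main() at the end of the script and launch it with spark-submit, passing the connector with --jars or --packages as shown earlier, so the data source class can be found.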
