I am very new to Cassandra and Spark. Here is what I have done so far:

  1. Installed Cassandra 2.1.8, added Lucene secondary indexes, and loaded test data.
  2. Downloaded a pre-built Spark 1.4.1.
  3. Obtained the Spark Cassandra connector jars.

I am able to use ./spark-shell --jars /path/to/spark-cassandra-connector/spark-cassandra-connector-assembly-1.5.0-M1-SNAPSHOT.jar and

./pyspark --jars /path/to/pyspark_cassandra-0.1.5.jar --driver-class-path /path/to/pyspark_cassandra-0.1.5.jar --py-files /path/to/pyspark_cassandra-0.1.5-py2.6.egg

Using both, I am able to query the Cassandra table.
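
For reference, a pyspark query looks roughly like the sketch below (a minimal sketch: the CassandraSparkContext and cassandraTable names follow the pyspark_cassandra README, and the host, keyspace, and table names are placeholders):

    # Sketch of reading a Cassandra table via pyspark_cassandra.
    # Keyspace/table names below are placeholders.
    from pyspark import SparkConf
    from pyspark_cassandra import CassandraSparkContext

    conf = (SparkConf()
            .setAppName("cassandra-read")
            .set("spark.cassandra.connection.host", "127.0.0.1"))

    sc = CassandraSparkContext(conf=conf)
    rows = sc.cassandraTable("my_keyspace", "my_table").collect()
    print(rows[:5])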

My requirement is as follows:

We have a PHP application on a remote server. This application will request data from the Spark/Cassandra layer, applying some filters.

  1. What is the best way to serve this request?
  2. Which is the preferred language, Python or Scala?
  3. For a REST API, which Scala framework is recommended?

Currently I am just trying out a simple Python script over cgi-bin. The problem is: how do I add the connector --jars in the Python script?

I have tried conf.set("spark.jars", "/jar/path"), which does not work.
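
For context, the programmatic equivalent of --jars would look roughly like this minimal sketch, assuming spark.jars has to be set before the SparkContext is created (settings applied afterwards are ignored); the jar path is a placeholder:

    # Minimal sketch: spark.jars is only honored if set before the
    # SparkContext is constructed; the jar path below is a placeholder.
    from pyspark import SparkConf, SparkContext

    conf = (SparkConf()
            .setAppName("cassandra-cgi")
            .set("spark.jars", "/path/to/spark-cassandra-connector-assembly.jar")
            .set("spark.cassandra.connection.host", "127.0.0.1"))

    sc = SparkContext(conf=conf)  # the jar is shipped when the context starts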

Any help would be highly appreciated.

Thanks in advance.


1 Answer

You have a few options; the easiest is to use a distribution from Spark Packages:

http://spark-packages.org/package/datastax/spark-cassandra-connector

> $SPARK_HOME/bin/pyspark --packages com.datastax.spark:spark-cassandra-connector_2.10:1.4.0-M3

With --packages you just specify the connector's Maven coordinates, and Spark resolves and ships the jar for you.

If you would like to use your own assembled jar, just use the --jars flag, as in your spark-shell example above.

You can use this without the TargetHolding jar if you only want DataFrame access. If you don't need the direct RDD API, I would recommend this route, because DataFrames used this way keep all the actual work in native Scala on the JVM, with no serialization back and forth between Python and the JVM.
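
A minimal sketch of that DataFrame route (the org.apache.spark.sql.cassandra source name and option keys follow the connector's documentation; the keyspace, table, and column names are placeholders):

    # DataFrame read through the connector's Spark SQL data source.
    # Keyspace, table, and column names below are placeholders.
    from pyspark.sql import SQLContext

    sqlContext = SQLContext(sc)  # sc: an existing SparkContext

    df = (sqlContext.read
          .format("org.apache.spark.sql.cassandra")
          .options(keyspace="my_keyspace", table="my_table")
          .load())

    # Filters and projections run inside the JVM, so no Python<->JVM
    # serialization of row data is needed for this query.
    df.filter(df.value > 10).show()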

I would not try to run this from a standalone script if you can help it; always run through spark-submit or pyspark.
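
For example (the script path is a placeholder):

> $SPARK_HOME/bin/spark-submit --packages com.datastax.spark:spark-cassandra-connector_2.10:1.4.0-M3 /path/to/your_script.py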