
I just set up a Spark cluster on Google Cloud using Dataproc, and I have a standalone installation of Cassandra running on a separate VM. I would like to install the DataStax spark-cassandra-connector so I can connect to Cassandra from Spark. How can I do this?

The connector can be downloaded here:

https://github.com/datastax/spark-cassandra-connector

The instructions on building are here: https://github.com/datastax/spark-cassandra-connector/blob/master/doc/12_building_and_artifacts.md

sbt is needed to build it.

Where can I find sbt on the Dataproc installation?

Would it be under $SPARK_HOME/bin? Where is Spark installed on Dataproc?

Does the connector need to be installed on the entire cluster, or could it be used via spark packages (which admittedly require a bit of a hack to use on Dataproc)? If packages are sufficient, consider using the 'short answer' on this question: stackoverflow.com/questions/33363189/… - Angus Davis

1 Answer


I'm going to follow up on the really helpful comment @angus-davis made not too long ago.

Where can I find sbt on the Dataproc installation?

At present, sbt is not included on Cloud Dataproc clusters. The sbt documentation contains information on how to install sbt manually. If you need sbt on your clusters, I highly recommend you create an initialization action that installs sbt when a cluster is created. After some research, it looks like sbt is covered under a BSD-3 license, which means we can probably (no promise) include it in Cloud Dataproc clusters.
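As a rough sketch of what such an initialization action could contain, the commands below follow the Debian install steps from the sbt documentation (the repository URL and signing key are taken from those docs and may change over time, so double-check them against the current sbt install page):

```shell
#!/bin/bash
# init-action sketch: install sbt on a Dataproc (Debian-based) node.
set -euxo pipefail

# Add the sbt apt repository and its signing key (per sbt's Debian docs;
# verify these values against the current sbt documentation before use).
echo "deb https://repo.scala-sbt.org/scalasbt/debian all main" \
  | tee /etc/apt/sources.list.d/sbt.list
curl -sL "https://keyserver.ubuntu.com/pks/lookup?op=get&search=0x2EE0EA64E40A89B84B2DF73499E82A75642AC823" \
  | apt-key add -

apt-get update
apt-get install -y sbt
```

You would upload this script to a GCS bucket and pass it via `--initialization-actions` when creating the cluster.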

Would it be under $SPARK_HOME/bin? Where is Spark installed on Dataproc?

The answer depends on what you mean:

  • binaries - /usr/bin
  • config - /etc/spark/conf
  • spark_home - /usr/lib/spark

Importantly, this same pattern is used for other major OSS components installed on Cloud Dataproc clusters, like Hadoop and Hive.
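If you want to confirm these locations on a node yourself, a few quick checks from an SSH session would look like this (output paths are what I'd expect from the layout above, not guaranteed for every image version):

```shell
# Locate the Spark launcher scripts on the node's PATH.
which spark-submit

# Spark's install root and its configuration directory.
ls /usr/lib/spark
ls /etc/spark/conf

# The same convention applies to other OSS components, e.g. Hadoop:
ls /usr/lib/hadoop
```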

I would like to install the DataStax spark-cassandra-connector so I can connect to Cassandra from Spark. How can I do this?

The Stack Overflow answer Angus linked is probably the easiest way, if the connector can be used as a Spark package. Based on what I can find, however, that is probably not an option here, which means you will need to install sbt and build the connector manually.
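For completeness, if the Spark-packages route does work for your versions, the submit-time invocation would look roughly like the following. The artifact coordinates and the Cassandra host are illustrative assumptions; pick the connector version matching your Spark and Scala versions, and point `spark.cassandra.connection.host` at your Cassandra VM:

```shell
# Pull the connector from Maven Central at submit time (no manual build).
# Coordinates below are an example -- match them to your Spark/Scala versions.
spark-submit \
  --packages com.datastax.spark:spark-cassandra-connector_2.11:2.3.2 \
  --conf spark.cassandra.connection.host=10.128.0.5 \
  my_cassandra_job.py
```

If that fails, fall back to building the assembly JAR with sbt and passing it via `--jars` instead.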