Apparently there is no built-in Cassandra sink for Spark Structured Streaming. I found this example online, which implements a custom Cassandra sink based on ForeachWriter:
https://dzone.com/articles/cassandra-sink-for-spark-structured-streaming
I understand that we need to create a ForeachWriter implementation that takes care of opening a connection to the sink (Cassandra), writing the data and closing the connection. So the CassandraSinkForeach and the CassandraDriver classes make sense.
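For reference, my understanding of the ForeachWriter part is roughly the following sketch. The Event record type and the keyspace/table names are hypothetical, and the connector is assumed to be built on the driver and passed in (CassandraConnector is serializable):

```scala
import org.apache.spark.sql.ForeachWriter
import com.datastax.spark.connector.cql.CassandraConnector

// Hypothetical record type and table names, just for illustration
case class Event(id: String, value: Double)

class CassandraSinkForeach(connector: CassandraConnector) extends ForeachWriter[Event] {

  // Called once per partition/epoch; return true to go ahead and process the partition
  override def open(partitionId: Long, epochId: Long): Boolean = true

  // withSessionDo borrows a session from the connector's internal pool
  override def process(event: Event): Unit = {
    connector.withSessionDo { session =>
      session.execute(
        s"INSERT INTO my_keyspace.events (id, value) VALUES ('${event.id}', ${event.value})")
    }
  }

  // Nothing to tear down explicitly; the connector manages its own connections
  override def close(errorOrNull: Throwable): Unit = ()
}
```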
However, I don't understand the need to make SparkSessionBuilder serializable, or the need to initialize a SparkSession instance inside the CassandraDriver class (i.e. on the executors). It seems like the only reason for doing this is to initialize the CassandraConnector from the session's SparkConf.
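As far as I can tell, the pattern in the article boils down to something like this (simplified, and the details may differ from the actual code):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import com.datastax.spark.connector.cql.CassandraConnector

// Serializable wrapper so the SparkConf can be shipped to executors and a
// SparkSession rebuilt there -- this is the part I don't understand
class SparkSessionBuilder(conf: SparkConf) extends Serializable {
  def build(): SparkSession = SparkSession.builder.config(conf).getOrCreate()
}

// On the executor: rebuild a SparkSession only to pull its SparkConf back out
// and derive the CassandraConnector from it
class CassandraDriver(builder: SparkSessionBuilder) extends Serializable {
  lazy val spark = builder.build()
  lazy val connector = CassandraConnector(spark.sparkContext.getConf)
}
```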
According to the CassandraConnector docs, a CassandraConnector object can be initialized directly from a CassandraConnectorConf (or a SparkConf) passed in: http://datastax.github.io/spark-cassandra-connector/ApiDocs/2.4.0/spark-cassandra-connector/#com.datastax.spark.connector.cql.CassandraConnector
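In other words, it seems like the connector could be built once on the driver and captured by the ForeachWriter directly, without ever touching a SparkSession on the executors. A minimal sketch of what I have in mind (the host, app name, and rate source are placeholders; CassandraSinkForeach and Event are from the sketch above):

```scala
import org.apache.spark.sql.SparkSession
import com.datastax.spark.connector.cql.{CassandraConnector, CassandraConnectorConf}

val spark = SparkSession.builder
  .appName("cassandra-sink-test")                          // placeholder app name
  .config("spark.cassandra.connection.host", "127.0.0.1")  // placeholder host
  .getOrCreate()
import spark.implicits._

// Built once on the driver; CassandraConnector is serializable, so it can be
// captured by the ForeachWriter and shipped to the executors as-is
val connector = CassandraConnector(spark.sparkContext.getConf)
// equivalently: new CassandraConnector(CassandraConnectorConf(spark.sparkContext.getConf))

// Placeholder streaming source, just to make the example self-contained
val events = spark.readStream.format("rate").load()
  .select($"value".cast("string").as("id"), $"value".cast("double").as("value"))
  .as[Event]

val query = events.writeStream
  .foreach(new CassandraSinkForeach(connector))
  .start()
```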
Can someone explain whether there is actually a need to initialize a SparkSession on the workers? Is this a general pattern, and if so, why is it required?