I'm trying to create a Structured Streaming job that reads JSON strings from Kafka into Spark. I want to parse the JSON into specific columns and then save the DataFrame to a Cassandra table with optimum speed. I'm using Spark 2.4 and Cassandra 2.11 (Apache, not DSE).
I previously tried a Direct Stream, which gives a DStream of a case class that I saved to Cassandra with foreachRDD, but that job hangs every 6-7 days. So now I'm trying Structured Streaming, which gives a DataFrame directly that can be saved to Cassandra.
val conf = new SparkConf()
.setMaster("local[3]")
.setAppName("Fleet Live Data")
.set("spark.cassandra.connection.host", "ip")
.set("spark.cassandra.connection.keep_alive_ms", "20000")
.set("spark.cassandra.auth.username", "user")
.set("spark.cassandra.auth.password", "pass")
.set("spark.streaming.stopGracefullyOnShutdown", "true")
.set("spark.executor.memory", "2g")
.set("spark.driver.memory", "2g")
.set("spark.submit.deployMode", "cluster")
.set("spark.executor.instances", "4")
.set("spark.executor.cores", "2")
.set("spark.cores.max", "9")
.set("spark.driver.cores", "9")
.set("spark.speculation", "true")
.set("spark.locality.wait", "2s")
val spark = SparkSession
.builder
.appName("Fleet Live Data")
.config(conf)
.getOrCreate()
println("Spark Session Config Done")
val sc = SparkContext.getOrCreate(conf)
sc.setLogLevel("ERROR")
val ssc = new StreamingContext(sc, Seconds(10))
val sqlContext = new SQLContext(sc)
val topics = Map("livefleet" -> 1)
import spark.implicits._
implicit val formats = DefaultFormats
val df = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "brokerIP:port")
.option("subscribe", "livefleet")
.load()
val collection = df.selectExpr("CAST(value AS STRING)").map(f => parse(f.toString()).extract[liveevent])
val query = collection.writeStream
.option("checkpointLocation", "/tmp/check_point/")
.format("kafka")
.format("org.apache.spark.sql.cassandra")
.option("keyspace", "trackfleet_db")
.option("table", "locationinfotemp1")
.outputMode(OutputMode.Update)
.start()
query.awaitTermination()
The expectation is to save the DataFrame to Cassandra, but instead I get this error:
Exception in thread "main" org.apache.spark.sql.AnalysisException: Queries with streaming sources must be executed with writeStream.start()
Comments:
.format("kafka").format("org.apache.spark.sql.cassandra") is not correct - OneCricketeer
writeStream.start() at the end of the code? - Soheil Pourbafrani
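As the comments suggest, chaining two .format() calls only keeps the last one, and the json4s map over the raw value is what usually triggers the "must be executed with writeStream.start()" analysis error when mixed with batch-style actions. A common pattern on Spark 2.4 is to parse the JSON with from_json against an explicit schema and write each micro-batch to Cassandra via foreachBatch, since there is no built-in streaming Cassandra sink outside DSE. A minimal sketch, assuming the spark-cassandra-connector is on the classpath and that the schema fields below (deviceid, latitude, longitude) are hypothetical stand-ins for the actual fields of the liveevent case class:

```scala
import org.apache.spark.sql.{DataFrame, SaveMode}
import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types._

// Hypothetical schema: replace with the real fields of liveevent
val schema = StructType(Seq(
  StructField("deviceid", StringType),
  StructField("latitude", DoubleType),
  StructField("longitude", DoubleType)
))

// Parse the Kafka value column from JSON into typed columns
val parsed = df
  .selectExpr("CAST(value AS STRING) AS json")
  .select(from_json(col("json"), schema).as("data"))
  .select("data.*")

// Write each micro-batch to Cassandra with the batch connector
val query = parsed.writeStream
  .option("checkpointLocation", "/tmp/check_point/")
  .foreachBatch { (batch: DataFrame, _: Long) =>
    batch.write
      .format("org.apache.spark.sql.cassandra")
      .option("keyspace", "trackfleet_db")
      .option("table", "locationinfotemp1")
      .mode(SaveMode.Append)
      .save()
  }
  .start()

query.awaitTermination()
```

With this approach the StreamingContext/SQLContext setup is unnecessary; the SparkSession alone drives the structured stream.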