I want to write Spark Structured Streaming data into Cassandra. My Spark version is 2.4.0.
I've researched some posts, and some of them use the DataStax Enterprise platform.
I didn't use it; instead I found the foreachBatch method, which helps write streaming data to a sink.
I reviewed the docs on the Databricks site and tried it on my own.
This is the code I've written:
parsed = parsed_opc \
    .withWatermark("sourceTimeStamp", "10 minutes") \
    .dropDuplicates(["id", "sourceTimeStamp"]) \
    .groupBy(
        window(parsed_opc.sourceTimeStamp, "4 seconds"),
        parsed_opc.id
    ) \
    .agg({"value": "avg"}) \
    .withColumnRenamed("avg(value)", "avg") \
    .withColumnRenamed("window", "sourceTime")
def writeToCassandra(writeDF, epochId):
    writeDF.write \
        .format("org.apache.spark.sql.cassandra") \
        .mode('append') \
        .options(table="opc", keyspace="poc") \
        .save()
parsed.writeStream \
    .foreachBatch(writeToCassandra) \
    .outputMode("update") \
    .start()
The schema of the parsed dataframe is:
root
|-- sourceTime: struct (nullable = false)
| |-- start: timestamp (nullable = true)
| |-- end: timestamp (nullable = true)
|-- id: string (nullable = true)
|-- avg: double (nullable = true)
I can successfully write this streaming df to the console like this:
query = parsed \
    .writeStream \
    .format("console") \
    .outputMode("complete") \
    .start()
And the output in the console is as follows:
+--------------------+----+---+
| sourceTime| id|avg|
+--------------------+----+---+
|[2019-07-20 18:55...|Temp|2.0|
+--------------------+----+---+
So, writing to the console is OK.
But when I query the table in cqlsh, no records have been appended.
This is the table create script in cassandra:
CREATE TABLE poc.opc (id text, avg float, sourceTime timestamp PRIMARY KEY);
So, can you tell me what is wrong?