
I want to insert a huge volume of data from Spark into Cassandra. The data has a timestamp column that determines the TTL, but it differs for each row. My question is: how can I handle the TTL while inserting data in bulk from Spark?

My current implementation -

    raw_data_final.write.format("org.apache.spark.sql.cassandra")
      .mode(SaveMode.Overwrite)
      .options(Map(
        "table" -> offerTable,
        "keyspace" -> keySpace,
        "spark.cassandra.output.ttl" -> ttl_seconds))
      .save()

Here raw_data_final has around a million records, with each record yielding a different TTL. So, is there a way to do a bulk insert and somehow specify the TTL from a column within raw_data_final?

Thanks.

Can you explain more? How are you going to use the timestamp column as the TTL? - Kaushal
There is an expr_dt column from which I can calculate the TTL (ttl = expr_dt - current timestamp), so I can have ttl as one of my input columns. - Sudha Viswanathan

1 Answer


This is supported by setting the WriteConf parameter with the TTLOption.perRow option. The official documentation has the following example for RDDs:

    import com.datastax.spark.connector.writer._
    ...
    rdd.saveToCassandra("test", "tab", writeConf = WriteConf(ttl = TTLOption.perRow("ttl")))

In your case, replace "ttl" with the name of the column that holds the per-row TTL value.

I'm not sure that you can set this directly on a DataFrame, but you can always get an RDD from the DataFrame and use saveToCassandra with that WriteConf.
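A minimal sketch of that DataFrame-to-RDD route could look like the following. The case class, the column names (offer_id, payload), and the expr_dt-based TTL derivation are assumptions based on the question, not your real schema; adjust them to the actual table:

    import com.datastax.spark.connector._
    import com.datastax.spark.connector.writer.{TTLOption, WriteConf}
    import org.apache.spark.sql.functions._
    import spark.implicits._  // spark is the active SparkSession

    // Hypothetical case class: fields should match the target table's columns,
    // plus one extra "ttl" field that only drives the per-row TTL.
    case class OfferWithTtl(offer_id: String, payload: String, ttl: Int)

    val offersWithTtl = raw_data_final
      // derive the TTL in seconds from the expiry timestamp, as described in the comments
      .withColumn("ttl",
        (unix_timestamp(col("expr_dt")) - unix_timestamp(current_timestamp())).cast("int"))
      .select("offer_id", "payload", "ttl")
      .as[OfferWithTtl]
      .rdd

    // The writer picks the data columns from the table definition; the extra "ttl"
    // field is only read by TTLOption.perRow to set the TTL of each row.
    offersWithTtl.saveToCassandra(
      keySpace,
      offerTable,
      writeConf = WriteConf(ttl = TTLOption.perRow("ttl"))
    )

Rows whose expr_dt is already in the past would get a non-positive TTL, so you may want to filter those out before saving.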

Update (September 2020): support for writetime and ttl in DataFrames was added in the Spark Cassandra Connector 2.5.0.
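With connector 2.5.0 or later, the per-row TTL can reportedly be supplied through the DataFrame writer itself. The sketch below assumes the writer accepts a "ttl" option whose value is the name of a DataFrame column holding the TTL in seconds; check the 2.5.0 documentation for the exact option name before relying on it. It reuses the same expr_dt-based derivation as above:

    import org.apache.spark.sql.SaveMode
    import org.apache.spark.sql.functions._

    // Assumption: with Spark Cassandra Connector 2.5.0+, the "ttl" writer option
    // can name a DataFrame column holding the per-row TTL in seconds.
    raw_data_final
      .withColumn("ttl_col",
        (unix_timestamp(col("expr_dt")) - unix_timestamp(current_timestamp())).cast("int"))
      .write
      .format("org.apache.spark.sql.cassandra")
      .options(Map(
        "table"    -> offerTable,
        "keyspace" -> keySpace,
        "ttl"      -> "ttl_col"))  // column name, not a literal number of seconds
      .mode(SaveMode.Append)       // Append used here for simplicity
      .save()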