2
votes

I need to periodically archive/cold store rows from C* tables to CSV. For example: export Jan-Jun 2016's rows in C* table my_table to a CSV my_table.2016-06-30.csv, export Jul-Dec 2016's rows in my_table to my_table.2016-12-31.csv, and so on.

I considered CQL to do this, but not all my tables have timestamp columns on their rows. It has been suggested that I use Spark DataFrames instead (so I can get at metadata like writeTime, which is available through the Spark Cassandra Connector).

I'm new to the Spark Cassandra Connector and this sort of Spark use.

Can anyone point me to a good example of how to write a Cassandra Dataframe to a CSV, where I am 'filtering' the Dataframe on writeTime or something like it?

Thanks in advance for any help, guidance, etc.


1 Answer

0
votes

For a table:

CREATE TABLE myKeySpace.myTable (
    id INT PRIMARY KEY,
    name TEXT
)

do:

import com.datastax.spark.connector._
import sqlContext.implicits._  // for .toDF on an RDD of case classes (Spark 1.x)

case class SimpleTable(id: Int, name: String, name_writeTime: Long)

val df = sc.cassandraTable[SimpleTable]("myKeySpace", "myTable")
           .select("id", "name", "name".writeTime as "name_writeTime")
           .toDF

// Filter on name_writeTime here as needed (see the sketch below)

df.select("id", "name").write
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .save("backup_all.csv")
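For the filtering step the question asks about, here is a minimal sketch, assuming writeTime values are microseconds since the epoch (the Cassandra convention) and using the Jan-Jun 2016 window from the question as hypothetical cutoffs:

// Hypothetical cutoffs for the Jan-Jun 2016 window; getTime returns
// milliseconds, so scale by 1000 to match writeTime's microseconds.
val start = java.sql.Timestamp.valueOf("2016-01-01 00:00:00").getTime * 1000L
val end   = java.sql.Timestamp.valueOf("2016-07-01 00:00:00").getTime * 1000L

val filtered = df.filter(df("name_writeTime") >= start && df("name_writeTime") < end)

// coalesce(1) yields a single part file inside the output directory;
// fine for archival-sized extracts, though it funnels all rows through one task
filtered.select("id", "name")
  .coalesce(1)
  .write
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .save("my_table.2016-06-30.csv")

Note that save() produces a directory named my_table.2016-06-30.csv containing a part file, not a bare file, so a rename step may be needed if you want a true single-file archive.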

I do see some old tickets on the DataStax JIRA regarding writeTime support for DataFrames, especially this PR on GitHub, but it is unclear whether writeTime is available for both reading and writing DataFrames.