2
votes

I need to periodically archive/cold store rows from C* tables to CSV. For example: export Jan-Jun 2016's rows in C* table my_table to a CSV my_table.2016-06-30.csv, export Jul-Dec 2016's rows in my_table to my_table.2016-12-31.csv, and so on.

I considered CQL to do this, but not all my tables have timestamp columns on their rows. It has been suggested that I use Spark DataFrames instead (so I can get at metadata like writeTime, which is available through the Spark Cassandra Connector).

I'm new to the Spark Cassandra Connector and this sort of Spark use.

Can anyone point me to a good example of how to write a Cassandra Dataframe to a CSV, where I am 'filtering' the Dataframe on writeTime or something like it?

Thanks in advance for any help, guidance, etc.


1 Answer

0
votes

For a table:

CREATE TABLE myKeySpace.myTable (
    id INT PRIMARY KEY,
    name TEXT
)

do:

import com.datastax.spark.connector._
import sqlContext.implicits._  // for .toDF on an RDD of case classes (Spark 1.x)

case class SimpleTable(id: Int, name: String, name_writeTime: Long)

val df = sc.cassandraTable[SimpleTable]("myKeySpace", "myTable")
           .select("id", "name", "name".writeTime as "name_writeTime")
           .toDF

// Filter on name_writeTime here as needed (see the sketch below)

df.select("id", "name").write
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .save("backup_all.csv")
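For the filtering step the question asks about, here is a minimal sketch, assuming writeTime values are microseconds since the epoch (the Cassandra convention) and using the Jan-Jun 2016 window from the question as hypothetical cutoffs:

// Hypothetical cutoffs for the Jan-Jun 2016 window; getTime returns
// milliseconds, so scale by 1000 to match writeTime's microseconds.
val start = java.sql.Timestamp.valueOf("2016-01-01 00:00:00").getTime * 1000L
val end   = java.sql.Timestamp.valueOf("2016-07-01 00:00:00").getTime * 1000L

val filtered = df.filter(df("name_writeTime") >= start && df("name_writeTime") < end)

// coalesce(1) yields a single part file inside the output directory;
// fine for archival-sized extracts, though it funnels all rows through one task
filtered.select("id", "name")
  .coalesce(1)
  .write
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .save("my_table.2016-06-30.csv")

Note that save() produces a directory named my_table.2016-06-30.csv containing a part file, not a bare file, so a rename step may be needed if you want a true single-file archive.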

I do see some old tickets on the DataStax JIRA regarding writeTime support for DataFrames, especially this PR on GitHub, but it is unclear whether writeTime is available for both reading and writing DataFrames.