
I've been trying to use the DataStax spark-cassandra-connector (https://github.com/datastax/spark-cassandra-connector) to import some data from CSV files. I understand that case classes are normally used for the import, but my rows have about 500 fields, so I can't use them without nesting (due to Scala's 22-field limit on case classes). Storing a map directly is also possible, but I don't think that's ideal either, since the fields have several different data types.

I may be missing something in the conversion from RDD[String] to RDD[(String, String, ...)], since a .split(",") just yields RDD[Array[String]].
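To make that concrete, what I have is roughly the following (the file name is just an example):

    val lines  = sc.textFile("data.csv")    // RDD[String]
    val fields = lines.map(_.split(","))    // RDD[Array[String]], not RDD[(String, String, ...)]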

I've done a fair amount of searching without much luck, so any help would be greatly appreciated! Thanks.


1 Answer


I would do something like this:

  1. Read your text file (or whatever file format you have)
  2. Use .map(...) to convert each line into an Array[Any] (or a Map[String, Any])
  3. Two options here:
    • Convert each Array[Any] into a CassandraRow (a CassandraRow is just columnNames: Array[String] plus columnValues: Array[Any]) and then write the RDD[CassandraRow]; see the sketch after this list.
    • Implement a RowWriterFactory[Array[Any]] and write the RDD[Array[Any]] using that custom RowWriterFactory. Look at CassandraRowWriter's code for reference.
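For the first option, here's a minimal sketch of the whole pipeline. The exact CassandraRow constructor has varied across connector versions; this assumes a 1.x-style (columnNames, columnValues) constructor, and the host, file path, column names, keyspace/table names, and per-column parsers are all placeholders to swap for your own:

    import org.apache.spark.{SparkConf, SparkContext}
    import com.datastax.spark.connector._

    object CsvImport {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("csv-import")
          .set("spark.cassandra.connection.host", "127.0.0.1") // placeholder host
        val sc = new SparkContext(conf)

        // Placeholder: with ~500 columns you'd likely read these from the CSV header
        // or from the table's schema rather than listing them by hand.
        val columnNames = IndexedSeq("id", "name", "score")

        // One parser per column, so each string lands in Cassandra as the right type.
        val parsers: IndexedSeq[String => AnyRef] = IndexedSeq(
          s => Int.box(s.toInt),       // id
          s => s,                      // name
          s => Double.box(s.toDouble)  // score
        )

        val rows = sc.textFile("data.csv").map { line =>
          val fields = line.split(",", -1) // -1 keeps trailing empty fields
          val values = fields.toIndexedSeq.zip(parsers).map { case (f, p) => p(f) }
          new CassandraRow(columnNames, values)
        }

        // The connector ships a writer for CassandraRow, so this needs no extra glue.
        rows.saveToCassandra("my_keyspace", "my_table")
      }
    }

For the second option, the RowWriterFactory/RowWriter traits have also changed between versions, so the CassandraRowWriter source in your connector version is the best template to copy from.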