
I've been trying to use the DataStax spark-cassandra-connector (https://github.com/datastax/spark-cassandra-connector) to import some data from CSV files. I understand that case classes are normally used for the import, but my rows have about 500 fields, so I can't use them without nesting (due to Scala's 22-field limit on case classes). Storing a map directly is also possible, but I don't think that's ideal either, since the fields have several different data types.

I may be missing something in the conversion from RDD[String] to RDD[(String, String, ...)], since a .split(",") just yields RDD[Array[String]].
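To make that concrete, what I have is roughly the following (the file name is just an example):

    val lines  = sc.textFile("data.csv")    // RDD[String]
    val fields = lines.map(_.split(","))    // RDD[Array[String]], not RDD[(String, String, ...)]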

I've done a fair amount of searching without much luck, so any help would be greatly appreciated! Thanks.


1 Answer


I would do something like this:

  1. Read your text file (or whatever file format you have)
  2. Use .map(...) to convert each line into an Array[Any] (or a Map[String, Any])
  3. Two options here:
    • Convert each Array[Any] into a CassandraRow (a CassandraRow is just columnNames: Array[String] plus columnValues: Array[Any]) and then write the RDD[CassandraRow]; see the sketch after this list.
    • Implement a RowWriterFactory[Array[Any]] and write the RDD[Array[Any]] using that custom RowWriterFactory. Look at CassandraRowWriter's code for reference.
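For the first option, here's a minimal sketch of the whole pipeline. The exact CassandraRow constructor has varied across connector versions; this assumes a 1.x-style (columnNames, columnValues) constructor, and the host, file path, column names, keyspace/table names, and per-column parsers are all placeholders to swap for your own:

    import org.apache.spark.{SparkConf, SparkContext}
    import com.datastax.spark.connector._

    object CsvImport {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("csv-import")
          .set("spark.cassandra.connection.host", "127.0.0.1") // placeholder host
        val sc = new SparkContext(conf)

        // Placeholder: with ~500 columns you'd likely read these from the CSV header
        // or from the table's schema rather than listing them by hand.
        val columnNames = IndexedSeq("id", "name", "score")

        // One parser per column, so each string lands in Cassandra as the right type.
        val parsers: IndexedSeq[String => AnyRef] = IndexedSeq(
          s => Int.box(s.toInt),       // id
          s => s,                      // name
          s => Double.box(s.toDouble)  // score
        )

        val rows = sc.textFile("data.csv").map { line =>
          val fields = line.split(",", -1) // -1 keeps trailing empty fields
          val values = fields.toIndexedSeq.zip(parsers).map { case (f, p) => p(f) }
          new CassandraRow(columnNames, values)
        }

        // The connector ships a writer for CassandraRow, so this needs no extra glue.
        rows.saveToCassandra("my_keyspace", "my_table")
      }
    }

For the second option, the RowWriterFactory/RowWriter traits have also changed between versions, so the CassandraRowWriter source in your connector version is the best template to copy from.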