saveToCassandra based on content from RDD

Question

I'm using Spark in Scala to build a generic application that parallelizes HTTP calls. Is it possible to perform the saveToCassandra action based on the content of the RDD? The responses should go into different tables.

To provide more clarity,

val queries: List[Query] = List(Query("google", "fish"), Query("yahoo", "chicken"))
val inputRDD = sc.parallelize(queries)

where

case class Query(dataSource: String, query: String)

Each query is then mapped into a list of tuples to be saved into Cassandra. Depending on the data source in the query, the data for google should go into the Cassandra table for google, and the data for yahoo into its own table.

TIA

zero323 · Accepted Answer · 2015-11-05T15:47:17

I would simply filter and save individual subsets:

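import com.datastax.spark.connector._  // adds saveToCassandra to RDDs

// Data source -> target Cassandra table
val keywords = Map("google" -> "googletab", "yahoo" -> "yahootab")
val keyspace: String = ???

// One filtered RDD per data source
val subsets = keywords.keys.map(k =>
  k -> inputRDD.filter { case Query(x, _) => x == k })

// Save each subset to its own table
subsets.foreach { case (k, rdd) =>
  rdd.saveToCassandra(keyspace, keywords(k), SomeColumns(???))
}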
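
One thing to watch with the filter-and-save approach: each saveToCassandra call is a separate Spark job, so an RDD that is not persisted gets recomputed once per table, which here could mean repeating the HTTP calls. Below is a minimal sketch of how persisting the results first might look. The Result case class, the column names "query" and "body", and the saveAll helper are assumptions made for illustration, not part of the original code; adjust them to the real table schemas.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel
import com.datastax.spark.connector._

// Hypothetical row type holding one HTTP response per query.
case class Result(dataSource: String, query: String, body: String)

object RouteBySource {
  // Data source -> table name, same mapping as in the answer above.
  val tables = Map("google" -> "googletab", "yahoo" -> "yahootab")

  def saveAll(results: RDD[Result], keyspace: String): Unit = {
    // Persist so the lineage (including the HTTP calls) is evaluated once,
    // not once per target table.
    results.persist(StorageLevel.MEMORY_AND_DISK)
    try {
      tables.foreach { case (source, table) =>
        results
          .filter(_.dataSource == source)
          .map(r => (r.query, r.body))
          // "query" and "body" are assumed column names.
          .saveToCassandra(keyspace, table, SomeColumns("query", "body"))
      }
    } finally {
      results.unpersist()
    }
  }
}
```

Usage would be something like RouteBySource.saveAll(responses, "my_keyspace"), where responses is the RDD produced after the HTTP calls and "my_keyspace" is a placeholder. The trade-off is memory and disk for the persisted data versus the cost of redoing the calls for every table.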
