
I'm using Spark in Scala to build a generic application that parallelizes HTTP calls. My concern is whether it is possible to perform a saveToCassandra action based on the content of the RDD, since the responses should go into different tables.

To provide more clarity:

val queries: List[Query] = List(Query("google", "fish"), Query("yahoo", "chicken"))
val inputRDD = sc.parallelize(queries)

where

case class Query(dataSource: String, query: String)

Each query is then mapped into a list of tuples to be saved into Cassandra. Based on the data source in the query, the data for google should go into the Cassandra table for google, and the data for yahoo into its own table.

TIA

1 Answer


I would simply filter and save individual subsets:

val keywords = Map("google" -> "googletab", "yahoo" -> "yahootab")
val keyspace: String = ???

// One filtered RDD per data source, keyed by the source name.
val subsets = keywords.keys.map(k =>
  k -> inputRDD.filter { case Query(x, _) => x == k })

// Save each subset to its own table.
subsets.foreach { case (k, rdd) =>
  rdd.saveToCassandra(keyspace, keywords(k), SomeColumns(???))
}
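Note that each filter triggers a separate pass over inputRDD, so it is usually worth calling inputRDD.cache() before the loop. The split logic itself can be sketched locally with plain collections, no cluster needed; SplitDemo and splitBySource are hypothetical names for this sketch, while Query and the table map come from above:

```scala
// Local sketch of the per-source split, assuming the Query case
// class and the source -> table map from the question.
case class Query(dataSource: String, query: String)

object SplitDemo {
  val keywords = Map("google" -> "googletab", "yahoo" -> "yahootab")

  // Group queries by data source, keep only known sources, and key
  // each group by its target table name (mirrors the filter-per-key
  // approach above, but in a single pass).
  def splitBySource(queries: List[Query]): Map[String, List[Query]] =
    queries.groupBy(_.dataSource).collect {
      case (src, qs) if keywords.contains(src) => keywords(src) -> qs
    }
}

val split = SplitDemo.splitBySource(
  List(Query("google", "fish"), Query("yahoo", "chicken")))
println(split("googletab").map(_.query)) // prints: List(fish)
```

On an RDD the single-pass equivalent is not a plain groupBy (which shuffles); the filter-per-subset version above is the simpler choice when the number of tables is small.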