
I am applying the following through the Spark Cassandra Connector:

import com.datastax.spark.connector._

val links = sc.textFile("linksIDs.txt")
links.map { link_id =>
  // One Cassandra query per link id, issued through the SparkContext
  val link_speed_records = sc.cassandraTable[Double]("freeway", "records")
    .select("speed")
    .where("link_id = ?", link_id)
  val average = link_speed_records.mean()
  average
}

I would like to ask if there is a way to run the above sequence of queries more efficiently, given that the only parameter I ever change is 'link_id'.

The 'link_id' value is the only partition key of my Cassandra 'records' table. I am using Cassandra v2.0.13, Spark v1.2.1 and Spark Cassandra Connector v1.2.1.

I was wondering whether it is possible to open a Cassandra session directly in order to run those queries and still get 'link_speed_records' back as a Spark RDD, along the lines of the sketch below.
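For illustration, something like this is what I had in mind (just a rough sketch; it assumes 'link_id' is a text column and uses the connector's CassandraConnector helper to open sessions on the executors):

import scala.collection.JavaConverters._
import com.datastax.spark.connector.cql.CassandraConnector

val connector = CassandraConnector(sc.getConf)

val averages = links.mapPartitions { linkIds =>
  connector.withSessionDo { session =>
    val stmt = session.prepare(
      "SELECT speed FROM freeway.records WHERE link_id = ?")
    linkIds.map { linkId =>
      val speeds = session.execute(stmt.bind(linkId)).all().asScala
        .map(_.getDouble("speed"))
      (linkId, if (speeds.isEmpty) 0.0 else speeds.sum / speeds.size)
    }.toList.iterator // materialize before the session is closed
  }
}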

I am curious how you are able to run the code you've posted without getting an NPE, since sc is not available to the workers from within an RDD. - Metropolis
All the requests (queries) were sent through the Spark driver using the Spark context. The workers were solely responsible for computing the resulting CassandraRDD, so there was no need for sc to be available to the workers. - raschild

1 Answer


Use the joinWithCassandraTable method to use an RDD of keys to pull data out of a Cassandra table. The approach given in the question will be extremely expensive by comparison, and it also does not function well as a parallelizable request.

https://github.com/datastax/spark-cassandra-connector/blob/master/doc/2_loading.md#performing-efficient-joins-with-cassandra-tables-since-12
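A minimal sketch of that approach, assuming 'link_id' is a text column and that the per-link average is the end goal (the aggregation at the end is ordinary Spark, not part of the connector API):

import com.datastax.spark.connector._

// Build an RDD of partition keys; joinWithCassandraTable expects
// tuples (or case classes) matching the table's primary key columns.
val linkIds = sc.textFile("linksIDs.txt").map(Tuple1(_))

// Pull only the matching partitions, in parallel, on the executors.
val speeds = linkIds
  .joinWithCassandraTable("freeway", "records")
  .select("speed")
  .map { case (Tuple1(linkId), row) => (linkId, (row.getDouble("speed"), 1L)) }

// Average per link without funnelling every query through the driver.
val averages = speeds
  .reduceByKey { case ((s1, c1), (s2, c2)) => (s1 + s2, c1 + c2) }
  .mapValues { case (sum, count) => sum / count }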