
I'm working on Spark 1.2.1 with the datastax/spark-cassandra-connector and a C* table filled with 1B+ rows (DataStax Enterprise, DSE 4.7.0). I need to perform a range filter/where query on a timestamp parameter.

What is the best way to do this without loading the whole 1B+ row table into Spark's memory (which could take hours), i.e. effectively pushing the query down to C*?

Should I use an RDD with JoinWithCassandraTable, or a DataFrame with predicate pushdown? Is there something else? A rough sketch of what I mean by each option is below.
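
To make the two options concrete, here is roughly what I have in mind. All keyspace, table, and column names (ks, events, ts) are placeholders, not my real schema:

    import com.datastax.spark.connector._
    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    val sc = new SparkContext(new SparkConf().setAppName("range-query"))

    // Option 1: RDD + joinWithCassandraTable. Each RDD element is a full
    // partition key, so only the matching partitions are read from C*.
    val keys = sc.parallelize(Seq(Tuple1(1430000000000L), Tuple1(1430003600000L)))
    val joined = keys.joinWithCassandraTable("ks", "events")

    // Option 2: DataFrame with pushdown (needs connector/Spark 1.4+). The
    // range filter on ts is pushed down to C* where the connector supports it.
    val sqlContext = new SQLContext(sc)
    val df = sqlContext.read
      .format("org.apache.spark.sql.cassandra")
      .options(Map("keyspace" -> "ks", "table" -> "events"))
      .load()
      .filter("ts >= 1430000000000 AND ts < 1430003600000")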

Here is a demo from DSE on how they perform a join with C* and Spark. Of course they don't use a DataFrame, because the feature was not available then. – eliasah
@eliasah In the link they perform a join between two tables on an id parameter... But I need to perform a join between an array of timestamps (as longs) and a huge table, i.e. a range filter/where query. – Reshef
DataFrames are available as of 1.4, which ships with the latest DSE. JoinWithCassandraTable is a good option in many cases. – phact
@phact I tried JoinWithCassandraTable (with 1.2.1) and it didn't work, as I described here: stackoverflow.com/questions/33329494/…. Maybe I need to upgrade to 1.4... – Reshef
You might also check this out. – rs_atl

1 Answer


JoinWithCassandraTable turned out to be the best solution in my case. I learned a lot from this post: http://www.datastax.com/dev/blog/zen-art-spark-maintenance and posted an answer to the linked question: Spark JoinWithCassandraTable on TimeStamp partition key STUCK. A minimal sketch of what the job ended up looking like is below.
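
The keyspace, table, and column names here (ks, events, value) are illustrative, not my real schema:

    import com.datastax.spark.connector._
    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("ts-join"))

    // The timestamps (as longs) to fetch; each one is a complete partition
    // key, so the connector issues targeted reads instead of a table scan.
    val timestamps = sc.parallelize(1430000000000L to 1430086400000L by 3600000L)

    val values = timestamps
      .map(Tuple1(_))
      // optional: move each lookup onto a Spark node holding a replica
      .repartitionByCassandraReplica("ks", "events")
      .joinWithCassandraTable("ks", "events")
      .map { case (_, row) => row.getLong("value") }

The repartitionByCassandraReplica step is optional and only helps with data locality; the join works without it.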

It all comes down to modeling your C* table the right way for your future queries; choosing good partition keys is especially important, as the sketch below illustrates.
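
Here the timestamp itself is the partition key, so each value in the join RDD above maps to exactly one C* partition (again a sketch under assumed names, not my actual schema):

    import com.datastax.spark.connector.cql.CassandraConnector

    CassandraConnector(sc.getConf).withSessionDo { session =>
      session.execute(
        """CREATE TABLE IF NOT EXISTS ks.events (
          |  ts    bigint,  -- the timestamp we join on: the partition key
          |  id    uuid,    -- clustering column keeping rows within ts unique
          |  value bigint,
          |  PRIMARY KEY (ts, id)
          |)""".stripMargin)
    }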