I am following the instructions found here to connect my spark program to read data from Cassandra. Here is how I have configured spark:
val configBuilder = SparkSession.builder
.config("spark.sql.extensions", "com.datastax.spark.connector.CassandraSparkExtensions")
.config("spark.cassandra.connection.host", cassandraUrl)
.config("spark.cassandra.connection.port", 9042)
.config("spark.sql.catalog.myCatalogName", "com.datastax.spark.connector.datasource.CassandraCatalog")
According to the documentation, once this is done I should be able to query Cassandra like this:
spark.sql("select * from myCatalogName.myKeyspace.myTable where myPartitionKey = something")
however when I do so I get the following error message:
mismatched input '.' expecting <EOF>(line 1, pos 43)
== SQL ==
select * from myCatalog.myKeyspace.myTable where myPartitionKey = something
----------------------------------^^^
When I try in the following format I am successful at retrieving entries from Cassandra:
val frame = spark
.read
.format("org.apache.spark.sql.cassandra")
.options(Map("keyspace" -> "myKeyspace", "table" -> "myTable"))
.load()
.filter(col("timestamp") > startDate && col("timestamp") < endDate)
However this query requires a full table scan to be performed. The table contains a few million entries and I would prefer to avail myself of the predicate Pushdown functionality, which it would seem is only available via the SQL API.
I am using spark-core_2.11:2.4.3, spark-cassandra-connector_2.11:2.5.0 and Cassandra 3.11.6
Thanks!
To set up a catalog put the following configuration into your SparkSession configuration (or any other Spark Configuration file or Object)- Jewels