
I am working on a Spring Java project and integrating Apache Spark and Cassandra using the DataStax connector.

I have autowired SparkSession, and the lines of code below seem to work.

Map<String, String> configMap = new HashMap<>();
configMap.put("keyspace", "key1");
configMap.put("table", tableName.toLowerCase());

Dataset<Row> ds = sparkSession.sqlContext().read()
        .format("org.apache.spark.sql.cassandra")
        .options(configMap)
        .load();
ds.show();

In the step above I am loading the Dataset, and in the step below I am filtering on a datetime field.

String s1 = "2020-06-23 18:51:41";
String s2 = "2020-06-23 18:52:21";

Timestamp from = Timestamp.valueOf(s1);
Timestamp to = Timestamp.valueOf(s2);
ds = ds.filter(ds.col("datetime").between(from, to));

Is it possible to apply this filter condition during the load itself? If so, can someone suggest how to do this?

Thanks in advance.


2 Answers


You don't have to do anything explicitly here: the spark-cassandra-connector has predicate pushdown, so your filtering condition will be applied during the data selection itself.

Source: https://github.com/datastax/spark-cassandra-connector/blob/master/doc/14_data_frames.md

The connector will automatically pushdown all valid predicates to Cassandra. The Datasource will also automatically only select columns from Cassandra which are required to complete the query. This can be monitored with the explain command.
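Because Spark evaluates transformations lazily, nothing is read until an action runs, so the filter from the question can simply be chained onto the read and the connector can still push it down. Here is a minimal Java sketch (keyspace, timestamps, and the autowired sparkSession follow the question; the table name is hypothetical) that also calls explain() to verify the pushdown:

import java.sql.Timestamp;
import java.util.HashMap;
import java.util.Map;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.functions;

Map<String, String> configMap = new HashMap<>();
configMap.put("keyspace", "key1");
configMap.put("table", "table1"); // hypothetical table name

// Chain the filter directly onto the read; the connector pushes
// valid predicates down to Cassandra when the query executes.
Dataset<Row> ds = sparkSession.read()
        .format("org.apache.spark.sql.cassandra")
        .options(configMap)
        .load()
        .filter(functions.col("datetime").between(
                Timestamp.valueOf("2020-06-23 18:51:41"),
                Timestamp.valueOf("2020-06-23 18:52:21")));

// Print the physical plan; a pushed predicate shows up under PushedFilters.
ds.explain();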


This filter will be effectively pushed down only if the column you are filtering on is the first clustering column. As Rayan pointed out, we can use the explain command on the dataset to check that predicate pushdown happened; the pushed predicates are marked with a * character, like this:

val dcf3 = dc.filter("event_time >= cast('2019-03-10T14:41:34.373+0000' as timestamp) " +
  "AND event_time <= cast('2019-03-10T19:01:56.316+0000' as timestamp)")

// dcf3.explain
// == Physical Plan ==
// *Scan org.apache.spark.sql.cassandra.CassandraSourceRelation [uuid#21,event_time#22,id#23L,value#24] 
// PushedFilters: [ *GreaterThanOrEqual(event_time,2019-03-10 14:41:34.373), *LessThanOrE..., 
// ReadSchema: struct<uuid:string,event_time:timestamp,id:bigint,value...

If a predicate is not pushed down, we will see an additional Filter step after the scan, meaning the filtering happens at the Spark level.
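To illustrate the difference in Java (a hedged sketch reusing a hypothetical Dataset<Row> ds with an event_time clustering column, as in the Scala example above): a plain comparison on the column can be pushed down, while wrapping the column in a function produces an expression the connector cannot translate, so Spark filters after the scan.

import java.sql.Timestamp;
import org.apache.spark.sql.functions;

// Plain range comparison on the first clustering column: pushable.
ds.filter(functions.col("event_time")
        .geq(Timestamp.valueOf("2019-03-10 14:41:34")))
  .explain(); // expect *GreaterThanOrEqual(event_time,...) under PushedFilters

// Wrapping the column in a function prevents pushdown, so the plan
// gains a separate Filter node above the Cassandra scan.
ds.filter(functions.year(functions.col("event_time")).equalTo(2019))
  .explain(); // expect Filter(...) applied on the Spark side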