I have to filter a Cassandra table in Spark: after fetching the data from the table via Spark, I apply a filter function on the returned RDD. We don't want to use the WHERE clause of the Cassandra API — it can filter, but it requires a custom SASI index on the filter column, which has disk overhead due to multiple SSTable scans in Cassandra. For example:
val ct = sc.cassandraTable("keyspace1", "table1")
val fltr = ct.filter(x => x.contains("zz"))
The fields of table1 are:
- dirid uuid
- filename text
- event int
- eventtimestamp bigint
- fileid int
- filetype int
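To make the intent concrete, this is the kind of predicate we are trying to write — a minimal sketch, assuming the `filename` text column can be read as a plain String with `getString` (the substring "zz" is just a placeholder):

```scala
import com.datastax.spark.connector._

// Read the table and keep only rows whose filename contains the substring.
// getString("filename") returns the text column as a plain Scala String,
// so any standard String predicate (contains, startsWith, matches, ...) applies.
val ct   = sc.cassandraTable("keyspace1", "table1")
val fltr = ct.filter(row => row.getString("filename").contains("zz"))
```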
Basically we need to filter the data based on filename with an arbitrary string. Since the returned RDD is of type
com.datastax.spark.connector.rdd.CassandraTableScanRDD[com.datastax.spark.connector.CassandraRow],
the filter predicate is restricted to the methods available on the CassandraRow type, listed below.
val ct = sc.cassandraTable("keyspace1", "table1")
scala> ct
res140: com.datastax.spark.connector.rdd.CassandraTableScanRDD[com.datastax.spark.connector.CassandraRow] = CassandraTableScanRDD[171] at RDD at CassandraRDD.scala:19
When I hit Tab after "x." in the filter function below, the shell shows the following methods of the CassandraRow class:
scala> ct.filter(x=>x.
columnValues getBooleanOption getDateTime getFloatOption getLongOption getString getUUIDOption length
contains getByte getDateTimeOption getInet getMap getStringOption getVarInt metaData
copy getByteOption getDecimal getInetOption getRaw getTupleValue getVarIntOption nameOf
dataAsString getBytes getDecimalOption getInt getRawCql getTupleValueOption hashCode size
equals getBytesOption getDouble getIntOption getSet getUDTValue indexOf toMap
get getDate getDoubleOption getList getShort getUDTValueOption isNullAt toString
getBoolean getDateOption getFloat getLong getShortOption getUUID iterator
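Among the methods listed above, `getString` and `getStringOption` extract a column as a Scala String, after which ordinary String operations are available inside the filter. A sketch using `getStringOption`, which returns an Option[String] and so tolerates a null filename without a NullPointerException:

```scala
// Keep rows whose filename is non-null and contains the placeholder substring "zz".
// getStringOption("filename") yields None for a null column value,
// and Option.exists applies the predicate only when a value is present.
val safe = ct.filter(_.getStringOption("filename").exists(_.contains("zz")))
```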