I need to filter a Cassandra table in Spark: after loading the table through the Spark connector, I want to apply a filter function to the returned RDD. We don't want to use the where clause of the Cassandra API. It can filter, but it requires a custom SASI index on the filter column, which carries disk overhead because of multiple SSTable scans in Cassandra. For example:

    val ct = sc.cassandraTable("keyspace1", "table1")
    val fltr = ct.filter(x => x.contains("zz"))

The fields of table1 are:

  • dirid uuid
  • filename text
  • event int
  • eventtimestamp bigint
  • fileid int
  • filetype int

Basically, we need to filter the data by filename against an arbitrary string. The returned RDD is of type com.datastax.spark.connector.rdd.CassandraTableScanRDD[com.datastax.spark.connector.CassandraRow], so filter operations are restricted to the methods of the CassandraRow type:

    val ct = sc.cassandraTable("keyspace1", "table1")
    scala> ct
    res140: com.datastax.spark.connector.rdd.CassandraTableScanRDD[com.datastax.spark.connector.CassandraRow] = CassandraTableScanRDD[171] at RDD at CassandraRDD.scala:19

When I hit Tab after "x." in the filter function below, the shell shows these methods of the CassandraRow class:

    scala> ct.filter(x=>x.
    columnValues   getBooleanOption   getDateTime         getFloatOption   getLongOption    getString             getUUIDOption     length
    contains       getByte            getDateTimeOption   getInet          getMap           getStringOption       getVarInt         metaData
    copy           getByteOption      getDecimal          getInetOption    getRaw           getTupleValue         getVarIntOption   nameOf
    dataAsString   getBytes           getDecimalOption    getInt           getRawCql        getTupleValueOption   hashCode          size
    equals         getBytesOption     getDouble           getIntOption     getSet           getUDTValue           indexOf           toMap
    get            getDate            getDoubleOption     getList          getShort         getUDTValueOption     isNullAt          toString
    getBoolean     getDateOption      getFloat            getLong          getShortOption   getUUID               iterator

1 Answer

You need to get the string field from the CassandraRow object and then filter on it. The code will look like this:

val fltr = ct.filter(x => x.getString("filename").contains("zz"))
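
If some rows might have a null filename, getString may fail on them. A minimal sketch of a null-tolerant variant follows, using getStringOption from the method list in the question. It assumes `sc` is an already configured SparkContext, and the `select` projection is an optional addition that is not part of the original answer. The substring match itself still runs in Spark after a full table scan.

```scala
import com.datastax.spark.connector._

// Assumption: `sc` is an existing SparkContext that already points at the
// Cassandra cluster (spark.cassandra.connection.host is set).
val ct = sc.cassandraTable("keyspace1", "table1")

// Optional: read only the columns that are needed, so less data is
// transferred from Cassandra to Spark.
val names = ct.select("dirid", "filename")

// getStringOption returns None for a null column, so rows without a
// filename are simply dropped by the filter instead of raising an error.
val fltr = names.filter(row => row.getStringOption("filename").exists(_.contains("zz")))
```

Any String predicate can replace `contains` here, for example `_.endsWith(".log")` or a regular expression match.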