
I am getting an RDD using Spark and HBase. Now I want to filter that RDD and get a specific value from it. How can I proceed?

Here is what I have done so far:

import org.apache.spark.SparkContext
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat

val sc = new SparkContext(sparkConf)
val conf = HBaseConfiguration.create()
conf.set(TableInputFormat.INPUT_TABLE, "tbl_date")
val hBaseRDD = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
  classOf[ImmutableBytesWritable],
  classOf[Result])

Now I want to use that RDD (hBaseRDD) and get the data of a specific column by passing a parameter to it. How can I achieve this?


1 Answer


What you already have:

val sc = new SparkContext(sparkConf)
val conf = HBaseConfiguration.create()
conf.set(TableInputFormat.INPUT_TABLE, "tbl_date")
val hBaseRDD = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
  classOf[ImmutableBytesWritable],
  classOf[Result])

Add the following:

import org.apache.hadoop.hbase.CellUtil
import org.apache.hadoop.hbase.util.Bytes

val localData = hBaseRDD.collect()  // Array of (ImmutableBytesWritable, Result) pairs
val filteredData = localData.map { case (_, result) =>
    // assuming you want just the first cell; you could also take all of them
    result.getColumnCells(Bytes.toBytes("MyColFamily"), Bytes.toBytes("MyColName")).get(0)
  }.filter { cell => Bytes.toString(CellUtil.cloneValue(cell)).startsWith("SomePrefix") }

The above uses placeholder/dummy values and functions for:

  • get(0): you need to decide whether you want just the first cell or all of them
  • Bytes.toString(CellUtil.cloneValue(cell)): you need to convert the value to the proper data type
  • .startsWith(..): you need to decide what to do with the data

But in any case, the above gives you the flow and an outline of how to process the HBase cell data.
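
If you would rather not collect the whole table to the driver, the same idea can be applied directly to the RDD. Below is a minimal sketch, assuming a hypothetical helper called getColumnValues and placeholder column names (none of these come from the original code):

import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.rdd.RDD

// Hypothetical helper: pull the String value of one column out of every row,
// with the column family and qualifier passed in as parameters.
def getColumnValues(rdd: RDD[(ImmutableBytesWritable, Result)],
                    family: String, qualifier: String): RDD[String] =
  rdd.flatMap { case (_, result) =>
    // getValue returns null when the row has no such column, so wrap it in Option
    Option(result.getValue(Bytes.toBytes(family), Bytes.toBytes(qualifier)))
      .map(bytes => Bytes.toString(bytes))
  }

// Usage (column family/qualifier and prefix are placeholders):
val values   = getColumnValues(hBaseRDD, "MyColFamily", "MyColName")
val matching = values.filter(_.startsWith("SomePrefix"))
matching.take(10).foreach(println)

This keeps the filtering distributed across the executors and only brings the final, much smaller result back to the driver.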