2 votes

Is it possible to set up a Spark Streaming job to keep track of an HBase table and read new/updated rows every batch? The blog here says that HDFS files are among the supported sources, but it seems to use the following static API:

sc.newAPIHadoopRDD(..)

I can't find any documentation around this. Is it possible to stream from HBase using a Spark StreamingContext? Any help is appreciated.

Thanks!


1 Answer

1 vote

The link provided does the following:

  1. It reads the streaming data, converts it into HBase Puts, and then adds them to an HBase table. Up to this point it is streaming, i.e. your ingestion process is streaming. (A rough sketch of this ingestion path follows the read-side snippet below.)

  2. The stats-calculation part, I think, is batch: it uses newAPIHadoopRDD. This method treats whatever it reads as input files; in this case the "files" come from HBase, which is why the following input format and key/value classes are passed:

val hBaseRDD = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat], classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable], classOf[org.apache.hadoop.hbase.client.Result])
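For completeness, here is a rough, self-contained sketch of how that call is typically wired up. The table name ("sensor_data"), column family, and column are placeholders I have assumed, not details from the post:

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("hbase-batch-read"))

// Point the TableInputFormat at the table to scan (table name is a placeholder).
val conf = HBaseConfiguration.create()
conf.set(TableInputFormat.INPUT_TABLE, "sensor_data")

// Each record is a (rowkey, Result) pair for one HBase row.
val hBaseRDD = sc.newAPIHadoopRDD(
  conf,
  classOf[TableInputFormat],
  classOf[ImmutableBytesWritable],
  classOf[Result])

// Example: pull one column value out of each Result and count the rows.
val values = hBaseRDD.map { case (_, result) =>
  Bytes.toString(result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("col")))
}
println(s"rows read: ${values.count()}")

Going back to point 1, the ingestion side usually looks something like the sketch below. It reuses the SparkContext from above; the socket source, table name, and "rowkey,value" record format are assumptions made only to keep it self-contained:

import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Put}
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(10))

// Any DStream will do; a socket source is used here purely as a placeholder.
val lines = ssc.socketTextStream("localhost", 9999)

lines.foreachRDD { rdd =>
  rdd.foreachPartition { records =>
    // Create the HBase connection inside the partition so it is not serialized.
    val conn = ConnectionFactory.createConnection(HBaseConfiguration.create())
    val table = conn.getTable(TableName.valueOf("sensor_data"))
    records.foreach { line =>
      // Assume "rowkey,value" records; turn each one into a Put.
      val Array(rowKey, value) = line.split(",", 2)
      val put = new Put(Bytes.toBytes(rowKey))
      put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes(value))
      table.put(put)
    }
    table.close()
    conn.close()
  }
}

ssc.start()
ssc.awaitTermination()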

If you want to read the updates in HBase as a stream, you need a handle on HBase's WAL (write-ahead log) at the back end and then perform your operations on it. hbase-indexer is a good place to start for reading any updates in HBase.

I have used hbase-indexer to read HBase updates at the back end and direct them to Solr as they arrive. Hope this helps.