I have an incoming Spark data stream that ingests messages containing device IDs:
{deviceId: 123, deviceState: "turned off"}
I want to join this to a table of device information:
{deviceId: 123, deviceInfo: "The red refrigerator"}
To get denormalized tuples such as:
{deviceId: 123, deviceState: "turned off", deviceInfo: "The red refrigerator"}
The device_info table is stored in HBase. Here is the problem: every once in a while the device_info table is altered: a new device is added, the info for an existing device changes, and so on. These changes are NOT real time; I can tolerate a few minutes of latency on the updates.
I can see three approaches to the problem:
Not using Spark joins: for each entry in the DStream, perform a single HBase lookup of the device_info by ID.
- This should work, but it seems extremely low level and potentially inefficient.
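
For illustration, here is a minimal sketch of that per-record lookup. It assumes deviceStream is a DStream[(String, String)] of (deviceId, deviceState) pairs and an HBase table named device_info with column family "d" and qualifier "info"; all of those names are hypothetical placeholders for your own schema:

```scala
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Get}
import org.apache.hadoop.hbase.util.Bytes

val enriched = deviceStream.mapPartitions { records =>
  // One HBase connection per partition rather than per record.
  val conn  = ConnectionFactory.createConnection(HBaseConfiguration.create())
  val table = conn.getTable(TableName.valueOf("device_info"))

  val out = records.map { case (deviceId, deviceState) =>
    val result = table.get(new Get(Bytes.toBytes(deviceId)))
    val info   = Option(result.getValue(Bytes.toBytes("d"), Bytes.toBytes("info")))
      .map(b => Bytes.toString(b))
      .getOrElse("unknown")
    (deviceId, deviceState, info)
  }.toList // materialize before the connection is closed

  table.close()
  conn.close()
  out.iterator
}
```

This is essentially a map-side lookup; the per-partition connection keeps it from being one connection per record, but each record still costs one Get.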
Create an RDD from HBase at the beginning of the program using "newAPIHadoopRDD", and then join it with each new microbatch in the DStream.
- I would potentially miss any updates to the HBase table. (Would the HBase table ever get rescanned?)
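
As a rough sketch of this option (again with hypothetical table and column names, and assuming an existing SparkContext sc and the same deviceStream keyed by deviceId):

```scala
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.hadoop.hbase.util.Bytes

val hbaseConf = HBaseConfiguration.create()
hbaseConf.set(TableInputFormat.INPUT_TABLE, "device_info")

// Scanned once at program start; cached because it is reused in every batch.
val deviceInfoRdd = sc.newAPIHadoopRDD(
    hbaseConf,
    classOf[TableInputFormat],
    classOf[ImmutableBytesWritable],
    classOf[Result])
  .map { case (rowKey, result) =>
    val id   = Bytes.toString(rowKey.copyBytes())
    val info = Bytes.toString(result.getValue(Bytes.toBytes("d"), Bytes.toBytes("info")))
    (id, info)
  }
  .cache()

// Join every microbatch against this static snapshot.
val joined = deviceStream.transform(batchRdd => batchRdd.join(deviceInfoRdd))
```

Because deviceInfoRdd is built and cached once, it behaves like a frozen snapshot: Spark will not rescan the table for you, so new or changed rows simply never appear (a cache eviction could trigger a recompute, but that is nothing you can rely on for freshness).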
For each microbatch in the incoming DStream (ds.foreachRDD): create an RDD from HBase (newAPIHadoopRDD) and then call join.
- This seems dangerous: creating a new RDD from HBase for every Spark Streaming microbatch might add too much latency.
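
A sketch of what that per-batch rescan might look like, reusing the hypothetical names and imports from the snippet above. foreachRDD runs once per batch interval, so loadDeviceInfo() triggers a fresh HBase scan every time, which is exactly where the latency concern comes from:

```scala
import org.apache.spark.rdd.RDD

// Hypothetical helper: builds a fresh (deviceId, deviceInfo) RDD from HBase.
def loadDeviceInfo(): RDD[(String, String)] = {
  val conf = HBaseConfiguration.create()
  conf.set(TableInputFormat.INPUT_TABLE, "device_info")
  sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
      classOf[ImmutableBytesWritable], classOf[Result])
    .map { case (rowKey, result) =>
      (Bytes.toString(rowKey.copyBytes()),
       Bytes.toString(result.getValue(Bytes.toBytes("d"), Bytes.toBytes("info"))))
    }
}

deviceStream.foreachRDD { batchRdd =>
  // A new scan of device_info for every microbatch.
  val joined = batchRdd.join(loadDeviceInfo())
  joined.foreach(println) // or write to whatever sink you actually use
}
```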
Which approach should I take?