I have an incoming Spark data stream that ingests messages containing device IDs:
{deviceId: 123, deviceState: "turned off"}
I want to join this to a table of device information:
{deviceId: 123, deviceInfo: "The red refrigerator"}
To get denormalized tuples such as:
{deviceId: 123, deviceState: "turned off", deviceInfo: "The red refrigerator"}
The device_info table is stored in HBase. Here is the problem: every once in a while the device_info table is altered: a new device is added, the info for an existing device changes, and so on. These changes are NOT real time; I can tolerate a few minutes of latency on the updates.
I can see three approaches to the problem:
Not using Spark joins: for each entry in the DStream, perform a single HBase lookup of the device_info by ID.
- This should work, but it seems extremely low level and potentially inefficient.
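
For illustration, here is a minimal sketch of that per-record lookup. It assumes deviceStream is a DStream[(String, String)] of (deviceId, deviceState) pairs and an HBase table named device_info with column family "d" and qualifier "info"; all of those names are hypothetical placeholders for your own schema:

```scala
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Get}
import org.apache.hadoop.hbase.util.Bytes

val enriched = deviceStream.mapPartitions { records =>
  // One HBase connection per partition rather than per record.
  val conn  = ConnectionFactory.createConnection(HBaseConfiguration.create())
  val table = conn.getTable(TableName.valueOf("device_info"))

  val out = records.map { case (deviceId, deviceState) =>
    val result = table.get(new Get(Bytes.toBytes(deviceId)))
    val info   = Option(result.getValue(Bytes.toBytes("d"), Bytes.toBytes("info")))
      .map(b => Bytes.toString(b))
      .getOrElse("unknown")
    (deviceId, deviceState, info)
  }.toList // materialize before the connection is closed

  table.close()
  conn.close()
  out.iterator
}
```

This is essentially a map-side lookup; the per-partition connection keeps it from being one connection per record, but each record still costs one Get.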
Create an RDD from HBase at the beginning of the program using "newAPIHadoopRDD", and then join it with each new microbatch in the DStream.
- I would potentially miss any updates to the HBase table. (Would the HBase table ever get rescanned?)
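
As a rough sketch of this option (again with hypothetical table and column names, and assuming an existing SparkContext sc and the same deviceStream keyed by deviceId):

```scala
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.hadoop.hbase.util.Bytes

val hbaseConf = HBaseConfiguration.create()
hbaseConf.set(TableInputFormat.INPUT_TABLE, "device_info")

// Scanned once at program start; cached because it is reused in every batch.
val deviceInfoRdd = sc.newAPIHadoopRDD(
    hbaseConf,
    classOf[TableInputFormat],
    classOf[ImmutableBytesWritable],
    classOf[Result])
  .map { case (rowKey, result) =>
    val id   = Bytes.toString(rowKey.copyBytes())
    val info = Bytes.toString(result.getValue(Bytes.toBytes("d"), Bytes.toBytes("info")))
    (id, info)
  }
  .cache()

// Join every microbatch against this static snapshot.
val joined = deviceStream.transform(batchRdd => batchRdd.join(deviceInfoRdd))
```

Because deviceInfoRdd is built and cached once, it behaves like a frozen snapshot: Spark will not rescan the table for you, so new or changed rows simply never appear (a cache eviction could trigger a recompute, but that is nothing you can rely on for freshness).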
For each microbatch in the incoming DStream (ds.foreachRDD): create an RDD from HBase (newAPIHadoopRDD) and then call join.
- This seems dangerous: creating a new RDD from HBase for every Spark Streaming microbatch might add too much latency.
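
A sketch of what that per-batch rescan might look like, reusing the hypothetical names and imports from the snippet above. foreachRDD runs once per batch interval, so loadDeviceInfo() triggers a fresh HBase scan every time, which is exactly where the latency concern comes from:

```scala
import org.apache.spark.rdd.RDD

// Hypothetical helper: builds a fresh (deviceId, deviceInfo) RDD from HBase.
def loadDeviceInfo(): RDD[(String, String)] = {
  val conf = HBaseConfiguration.create()
  conf.set(TableInputFormat.INPUT_TABLE, "device_info")
  sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
      classOf[ImmutableBytesWritable], classOf[Result])
    .map { case (rowKey, result) =>
      (Bytes.toString(rowKey.copyBytes()),
       Bytes.toString(result.getValue(Bytes.toBytes("d"), Bytes.toBytes("info"))))
    }
}

deviceStream.foreachRDD { batchRdd =>
  // A new scan of device_info for every microbatch.
  val joined = batchRdd.join(loadDeviceInfo())
  joined.foreach(println) // or write to whatever sink you actually use
}
```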
Which approach should I take?