Say I have an application that reads a batch of messages from Kafka, uses the keys of the incoming messages to query HBase (reading the current values for those keys), does some computation, and writes the results back to HBase for the same set of keys. For example:
{K1, V1}, {K2, V2}, {K3, V3} (incoming messages from Kafka) --> my application reads the current values of K1, K2 and K3 from HBase, combines them with the incoming values V1, V2 and V3, and writes the new values K1 (V1+x), K2 (V2+y) and K3 (V3+z) back to HBase once processing is complete.
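A minimal sketch of this read-modify-write loop, with plain dicts standing in for the Kafka batch and the HBase table (all names here are hypothetical, for illustration only):

```python
# Plain dicts stand in for the Kafka batch and the HBase table.
hbase = {"K1": 100, "K2": 200, "K3": 300}   # current values "in HBase"
batch = [("K1", 1), ("K2", 2), ("K3", 3)]   # incoming Kafka messages

def process_batch(batch, store):
    """Read the current value for each key in the batch, combine it
    with the incoming value, and write the result back for that key."""
    for key, incoming in batch:
        current = store.get(key, 0)      # read current value
        store[key] = current + incoming  # compute and write back

process_batch(batch, hbase)
print(hbase)  # {'K1': 101, 'K2': 202, 'K3': 303}
```

With a single consumer this loop is safe: no one else reads or writes the same keys between the read and the write-back.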
Now, let’s say the Kafka topic has one partition and one consumer; my application has a single consumer thread processing the data.
The problem: if HBase goes down, my application stops processing messages and a huge lag builds up in Kafka. Even though I have the ability to increase the number of partitions (and correspondingly the number of consumers), I cannot do so, because of race conditions in HBase. HBase doesn’t expose row-level locking to clients, so if I increase the partition count, messages for the same key can land in two different partitions and therefore go to two different consumers, who may end up racing each other: whoever writes last wins. I would have to wait until all pending messages are processed before I can increase the number of partitions.
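Why increasing partitions remaps keys: Kafka’s default partitioner picks a partition as murmur2(key) mod the partition count, so changing the count can change the target partition of a key while older messages for that key remain in the original partition. A rough illustration (crc32 stands in for murmur2; the function name is hypothetical):

```python
import zlib

def partition_for(key: str, num_partitions: int) -> int:
    """Illustrative stand-in for Kafka's default partitioner:
    hash the key and take it modulo the partition count."""
    return zlib.crc32(key.encode()) % num_partitions

# With one partition, every key necessarily maps to partition 0 ...
assert all(partition_for(k, 1) == 0 for k in ("K1", "K2", "K3"))

# ... but with two partitions a key may map to partition 0 or 1, so
# old (unprocessed) and new messages for the same key can end up in
# different partitions, handled by different consumers.
print(partition_for("K3", 1), partition_for("K3", 2))
```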
For example:
HBase goes down --> initially the topic has one partition, holding an unprocessed message {K3, V3} in partition 0 --> I increase the number of partitions, and new messages with key K3 now land in, say, partition 1 --> the consumer on partition 0 (with the old message) and the consumer on partition 1 (with the new one) end up competing to write K3 to HBase.
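The lost-update race this produces can be reproduced deterministically by interleaving the two consumers’ read and write steps by hand (a dict again stands in for HBase; the interleaving shown is the worst case, where both consumers read before either writes):

```python
# A dict stands in for the HBase table.
hbase = {"K3": 10}

# Consumer 0 (old message in partition 0) and consumer 1 (new message
# in partition 1) both read the current value of K3 ...
read_by_c0 = hbase["K3"]   # reads 10
read_by_c1 = hbase["K3"]   # also reads 10

# ... both compute and write back; the last writer wins.
hbase["K3"] = read_by_c0 + 5   # consumer 0 writes 15
hbase["K3"] = read_by_c1 + 7   # consumer 1 overwrites with 17

print(hbase["K3"])  # 17 -- consumer 0's update (+5) is silently lost;
                    # applying both updates should have produced 22
```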
Is there a solution to this problem? Of course, having each consumer lock key K3 while processing its message is not the answer, since we are dealing with Big Data.