Bulk loading Avro to HBase with NiFi

Question

I'm ingesting flowfiles containing Avro records with NiFi and need to insert them into HBase. The flowfiles vary in size, but some contain 10,000,000+ records. I use SplitAvro twice (first to split into chunks of 10,000 records, then to split those into single records), then an ExecuteScript processor to pull out the row key for HBase and add it as a flowfile attribute. Finally, I use PutHBaseCell (with a batch size of 10,000) to write to HBase using the row key attribute.

The SplitAvro processor that splits down to 1 record is very slow (Concurrent Tasks is set to 5). Is there a way to speed that up? And is there a better way to load this Avro data into HBase?

(I am using a 2-node NiFi v1.2 cluster running on VMs; each node has 16 CPUs and 16 GB RAM.)

Please format your question and single out the question because it's a wall of text and it's unclear what you are asking. — Maciej Jureczko

Bryan Bende · Accepted Answer · 2017-10-02T13:28:48

There is a new PutHBaseRecord processor that will be part of the next release (there is a 1.4.0 release being voted upon right now).

With this processor you can avoid splitting your flow files altogether: send a flow file with millions of Avro records straight to PutHBaseRecord, and configure PutHBaseRecord with an Avro reader.

You should get significantly better performance with this approach.
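
To illustrate why the record-oriented approach avoids the splitting cost, here is a minimal Python sketch of the general idea: stream the records out of one large input, take the row key from a field of each record, and write in fixed-size batches. It is not NiFi or HBase code. The names `to_put`, `batched_puts`, and `row_id_field` are hypothetical, and the sketch assumes the row key is available as a field in every record.

```python
from itertools import islice
from typing import Dict, Iterable, Iterator, List, Tuple

Record = Dict[str, object]
Put = Tuple[str, Dict[str, object]]  # (row key, column name -> value)


def to_put(record: Record, row_id_field: str) -> Put:
    """Use one field as the row key and the remaining fields as columns."""
    row_key = str(record[row_id_field])
    columns = {name: value for name, value in record.items() if name != row_id_field}
    return row_key, columns


def batched_puts(
    records: Iterable[Record], row_id_field: str, batch_size: int
) -> Iterator[List[Put]]:
    """Stream records lazily and yield lists of at most batch_size puts.

    No per-record intermediate object (the analogue of a one-record
    flowfile) is created; only one batch is held in memory at a time.
    """
    puts = (to_put(record, row_id_field) for record in records)
    while True:
        batch = list(islice(puts, batch_size))
        if not batch:
            return
        yield batch
```

Each yielded batch stands in for one bulk write to the table. The point of the design is that the input is read once as a stream, so the cost of creating and tracking millions of single-record flowfiles disappears.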
