NIFI: Proper way to consume kafka and store data into hive

Question

I have a task to create a Kafka consumer that extracts messages from Kafka, transforms them, and stores them in a Hive table.

The Kafka topic contains a lot of messages, each one a JSON object.

I would like to add some fields to each message and insert the result into Hive.

I created a flow with the following NiFi processors:

ConsumeKafka_2_0
JoltTransformJSON - to transform the JSON
ConvertRecord - to convert the JSON into INSERT queries for Hive
PutHiveQL

The topic will be fairly heavily loaded, handling about 5 GB of data per day.

So, are there any ways to optimize my flow? (I think it's a bad idea to send a huge number of INSERT queries to Hive.) Maybe it would be better to use an external table and the PutHDFS processor? In that case, how should I handle partitioning and merge the incoming JSON into a single file?

mattyb · Accepted Answer · 2020-05-19T15:49:13

As you suspect, using PutHiveQL to perform a large number of individual INSERTs is not very performant. Using your external table approach will likely be much better. If the table is in ORC format, you could use ConvertAvroToORC (for Hive 1.2) or PutORC (for Hive 3) which both generate Hive DDL to help create the external table.
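To make the external table idea more concrete, here is a minimal Python sketch of the logic involved, not NiFi configuration: it groups incoming JSON messages by a date partition so that each group can be written as one file, and builds the Hive statement that registers the partition directory. The field name `event_time` (epoch seconds), the partition column `dt`, the table name `events`, and the `/data/events` location are all assumptions made for illustration; none of them come from the question.

```python
import json
from collections import defaultdict
from datetime import datetime, timezone


def partition_key(record: dict) -> str:
    """Derive the partition value (a UTC date) from a record.

    'event_time' is a hypothetical field holding epoch seconds.
    """
    ts = datetime.fromtimestamp(record["event_time"], tz=timezone.utc)
    return ts.strftime("%Y-%m-%d")


def batch_by_partition(messages):
    """Group parsed messages by partition value.

    Each group is meant to be written as a single file, instead of
    issuing one INSERT per message.
    """
    batches = defaultdict(list)
    for message in messages:
        record = json.loads(message)
        batches[partition_key(record)].append(record)
    return dict(batches)


def add_partition_ddl(table: str, base_dir: str, dt: str) -> str:
    """Build the Hive statement that registers one partition directory.

    The table name, base directory, and 'dt' column are assumptions.
    """
    return (
        f"ALTER TABLE {table} ADD IF NOT EXISTS "
        f"PARTITION (dt='{dt}') LOCATION '{base_dir}/dt={dt}'"
    )


if __name__ == "__main__":
    sample = [
        '{"id": 1, "event_time": 1589850000}',
        '{"id": 2, "event_time": 1589853600}',
        '{"id": 3, "event_time": 1589932860}',
    ]
    for dt, records in sorted(batch_by_partition(sample).items()):
        print(dt, len(records), add_partition_ddl("events", "/data/events", dt))
```

With this layout the partition value lives in the directory name (`dt=...`), so a day's data ends up in a small number of larger files rather than many single-row inserts.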

There are also Hive streaming processors, but if you are using Hive 1.2 PutHiveStreaming is not very performant either (but should still be better than PutHiveQL with INSERTs). For Hive 3, PutHive3Streaming should be much more performant and is my recommended solution.
