
I am trying to import data from MySQL into Hive using the QueryDatabaseTable and PutHiveQL processors, but an error occurs.

I have some questions:

  1. What is the output format of PutHiveQL?
  2. Should the output table be created beforehand or will the processor do that?
  3. Where can I find a template for the MySQL-to-Hive process?

1 Answer


Here is some information about your questions:

  1. The flow files that go into PutHiveQL come out unchanged after the statements they contain have been sent to Hive (or if the send fails), so the output format and contents are identical to the input format and contents.

  2. The output table should be created beforehand, but you could send PutHiveQL a "CREATE TABLE IF NOT EXISTS" statement first and it will create the table for you (see the HiveQL sketches after the list below).

  3. I'm not aware of an existing template, but a basic approach could be the following:

QueryDatabaseTable -> ConvertAvroToJSON -> SplitJson -> EvaluateJsonPath -> UpdateAttribute (optional) -> ReplaceText -> PutHiveQL

  • QueryDatabaseTable will do incremental fetches of your MySQL table.

  • ConvertAvroToJSON will get the records into a format you can manipulate (there currently aren't many processors that handle Avro)

  • SplitJson will create a flow file for each of the records/rows

  • EvaluateJsonPath can extract values from the records and put them in flow file attributes

  • UpdateAttribute could add attributes containing type information. This is optional and only needed if you are using prepared statements with PutHiveQL.

  • ReplaceText builds a HiveQL statement (an INSERT, for example), either with parameters (if you want prepared statements) or with values hard-coded from the attributes (see the sketch after this list)

  • PutHiveQL executes the statement(s) to get the records into Hive
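For example, ReplaceText's Replacement Value could use NiFi Expression Language to build an INSERT statement from the attributes extracted by EvaluateJsonPath. This is only a sketch; the table name, columns, and attribute names below are placeholders you would adapt to your own schema:

    INSERT INTO my_mysql_table VALUES (${id}, '${name}', '${updated_at}')

Each flow file then carries one such statement for PutHiveQL to execute.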
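Similarly, for question 2, the very first flow file you send to PutHiveQL could carry a statement along these lines so the table is created if it does not exist yet (again, the table and column definitions here are hypothetical and should match your MySQL schema):

    CREATE TABLE IF NOT EXISTS my_mysql_table (id INT, name STRING, updated_at TIMESTAMP)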

In NiFi 1.0 there will be a ConvertAvroToORC processor, which offers a more efficient way to get data into Hive (as well as to query it from Hive). With that approach, you convert the results of QueryDatabaseTable to ORC files, place them in HDFS (using PutHDFS), and ConvertAvroToORC generates a partial Hive DDL statement for creating the table (using the type information from the Avro records). You pass that statement (after filling in the target location) to PutHiveQL, and you can immediately start querying your table.
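As a rough illustration of that last step (the table name, columns, and HDFS path here are made up), the completed DDL you pass to PutHiveQL might look something like:

    CREATE EXTERNAL TABLE IF NOT EXISTS my_mysql_table (id INT, name STRING, updated_at TIMESTAMP)
    STORED AS ORC
    LOCATION '/data/my_mysql_table'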

There are also plans for a PutHiveStreaming processor that takes Avro records as input, so that flow would just be QueryDatabaseTable -> PutHiveStreaming, which would insert the records directly into Hive (and is much more efficient than issuing many individual INSERT statements).
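One thing to keep in mind with that approach: Hive streaming ingest writes to transactional tables, so the target table would typically need to be bucketed, stored as ORC, and have the transactional property set. A hedged sketch (the names and bucket count are placeholders):

    CREATE TABLE my_mysql_table (id INT, name STRING, updated_at TIMESTAMP)
    CLUSTERED BY (id) INTO 4 BUCKETS
    STORED AS ORC
    TBLPROPERTIES ('transactional' = 'true')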