I have a Map-only job that processes a large text file. Each line is analyzed and categorized, and MultipleOutputs is used to write each category to its own set of files. Eventually all of the data ends up in a Hive table dedicated to each category. My current workflow does the job but is a bit cumbersome, and since I am about to add a couple of categories, I thought I might be able to streamline the process. I have a couple of ideas and was looking for some input.
Current Workflow:
- Map-only job divides the large file into categories (a rough sketch of this kind of MultipleOutputs mapper is included below the workflow list). The output looks like this:
categ1-m-00000
categ1-m-00001
categ1-m-00002
categ2-m-00000
categ2-m-00001
categ2-m-00002
categ3-m-00000
categ3-m-00001
categ3-m-00002
- An external (non-Hadoop) process copies the output files into a separate directory for each category (a sketch of the same move done with the HDFS FileSystem API is also below the list):
categ1/00000
categ1/00001
categ1/00002
categ2/00000
categ2/00001
categ2/00002
categ3/00000
categ3/00001
categ3/00002
- An external table is created over each category's directory, and the data is then inserted into the permanent Hive table for that category (sketched below as HiveQL issued over JDBC).
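
For context, here is roughly what the categorizing mapper looks like. This is a minimal sketch: `categorize()` is a placeholder for the actual analysis, and it uses the base-output-path variant of `MultipleOutputs.write()` so the files come out named `categ1-m-00000` and so on.

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

public class CategorizeMapper extends Mapper<LongWritable, Text, NullWritable, Text> {
    private MultipleOutputs<NullWritable, Text> out;

    @Override
    protected void setup(Context context) {
        out = new MultipleOutputs<>(context);
    }

    @Override
    protected void map(LongWritable key, Text line, Context context)
            throws IOException, InterruptedException {
        // categorize() is a stand-in for whatever analysis assigns the line a category.
        String category = categorize(line.toString());
        // The category name becomes the base output path, so the job
        // produces files like categ1-m-00000, categ2-m-00000, etc.
        out.write(NullWritable.get(), line, category);
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        out.close();
    }

    private String categorize(String line) {
        // Placeholder classification logic.
        return line.contains("foo") ? "categ1" : "categ2";
    }
}
```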
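The copy step is currently handled outside Hadoop, but for illustration, here is a sketch of the same move done with the HDFS FileSystem API (the paths are placeholders):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SortOutputs {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // Placeholder paths; adjust to the job's actual output directory.
        Path jobOutput = new Path("/data/job-output");
        Path sortedRoot = new Path("/data/sorted");

        // Match files named like categ1-m-00000 and move each into a
        // per-category directory, e.g. /data/sorted/categ1/00000.
        FileStatus[] parts = fs.globStatus(new Path(jobOutput, "*-m-*"));
        if (parts != null) {
            for (FileStatus status : parts) {
                String name = status.getPath().getName();            // e.g. categ1-m-00000
                String category = name.substring(0, name.indexOf("-m-"));
                String part = name.substring(name.indexOf("-m-") + 3); // e.g. 00000
                Path targetDir = new Path(sortedRoot, category);
                fs.mkdirs(targetDir);
                fs.rename(status.getPath(), new Path(targetDir, part));
            }
        }
    }
}
```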
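And the Hive step, sketched as HiveQL issued over JDBC. The HiveServer2 URL, credentials, and table names are placeholders, and the single-string-column schema is just an assumption for the example:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class LoadCategory {
    public static void main(String[] args) throws Exception {
        // May be unnecessary with JDBC 4 driver auto-loading.
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        // Placeholder HiveServer2 endpoint and credentials.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:hive2://hiveserver:10000/default", "hive", "");
             Statement stmt = conn.createStatement()) {

            // Point an external table at the per-category directory...
            stmt.execute(
                "CREATE EXTERNAL TABLE IF NOT EXISTS categ1_staging (line STRING) " +
                "LOCATION '/data/sorted/categ1'");

            // ...then copy its rows into the permanent table for that category.
            stmt.execute(
                "INSERT INTO TABLE categ1_perm SELECT line FROM categ1_staging");
        }
    }
}
```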
Possible new workflows:
- Use Spark to loop through the output files and, based on the file name, insert the data into the appropriate permanent Hive table (see the Spark sketch at the end of the post).
- Use HCatalog to insert the data into the permanent Hive tables directly from the Mapper, or perhaps from a Reducer or set of Reducers dedicated to each category (see the HCatOutputFormat sketch at the end of the post).
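
For the Spark idea, here is a sketch of what the loop could look like. The category list, paths, and table names are placeholders, and it assumes each permanent table has a single string column matching the raw lines; a real version would parse each line into the table's columns first. It also needs Hive support enabled (hive-site.xml on the classpath):

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class LoadCategories {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("load-categories")
                .enableHiveSupport()   // requires the Hive metastore config on the classpath
                .getOrCreate();

        // Placeholder category names and paths.
        String[] categories = {"categ1", "categ2", "categ3"};
        for (String cat : categories) {
            // Glob matches all map output parts for this category, e.g. categ1-m-00000.
            Dataset<Row> lines = spark.read().text("/data/job-output/" + cat + "-m-*");
            // insertInto matches columns by position; assumes a single string column.
            lines.write().mode("append").insertInto("default." + cat + "_perm");
        }
        spark.stop();
    }
}
```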
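For the HCatalog idea, here is a sketch of a map-only job that writes one category straight into its Hive table with HCatOutputFormat. Database and table names are placeholders, the table is again assumed to have a single string column, and the exact HCatOutputFormat signatures vary a bit between Hive/HCatalog versions. As far as I can tell, one HCatOutputFormat job targets one table, so this would mean either one job per category or per-category Reducers with HCatalog's multi-output support:

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hive.hcatalog.data.DefaultHCatRecord;
import org.apache.hive.hcatalog.data.HCatRecord;
import org.apache.hive.hcatalog.mapreduce.HCatOutputFormat;
import org.apache.hive.hcatalog.mapreduce.OutputJobInfo;

public class HCatLoadJob {

    public static class LineMapper
            extends Mapper<LongWritable, Text, WritableComparable, HCatRecord> {
        @Override
        protected void map(LongWritable key, Text line, Context context)
                throws IOException, InterruptedException {
            // Assumes the target table has a single string column.
            HCatRecord record = new DefaultHCatRecord(1);
            record.set(0, line.toString());
            context.write(null, record);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "hcat-load-categ1");
        job.setJarByClass(HCatLoadJob.class);
        job.setMapperClass(LineMapper.class);
        job.setNumReduceTasks(0);  // map-only, as in the current workflow

        job.setInputFormatClass(TextInputFormat.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));

        // Placeholder database/table names; null means no static partition values.
        HCatOutputFormat.setOutput(job,
                OutputJobInfo.create("default", "categ1_perm", null));
        HCatOutputFormat.setSchema(job,
                HCatOutputFormat.getTableSchema(job.getConfiguration()));
        job.setOutputFormatClass(HCatOutputFormat.class);
        job.setOutputKeyClass(WritableComparable.class);
        job.setOutputValueClass(DefaultHCatRecord.class);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```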