
I have a Map-only job that processes a large text file. Each line is analyzed and categorized, and MultipleOutputs is used to write each category to a separate set of files. Eventually all the data is added to a Hive table dedicated to each category. My current workflow does the job but is a bit cumbersome. I am about to add a couple of categories and thought I might be able to streamline the process. I have a couple of ideas and was looking for some input.
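For reference, the Map-only job looks roughly like this (a sketch; CategorizerMapper, the categorize() helper, and the category names are stand-ins for my actual per-line logic):

    import java.io.IOException;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

    public class CategorizerMapper extends Mapper<LongWritable, Text, NullWritable, Text> {

        private MultipleOutputs<NullWritable, Text> mos;

        @Override
        protected void setup(Context context) {
            mos = new MultipleOutputs<>(context);
        }

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // categorize() stands in for the real per-line analysis
            String category = categorize(value.toString());   // "categ1", "categ2", "categ3", ...
            // Using the category as the base output path produces files like categ1-m-00000
            mos.write(NullWritable.get(), value, category);
        }

        @Override
        protected void cleanup(Context context) throws IOException, InterruptedException {
            mos.close();
        }

        private String categorize(String line) {
            // placeholder for the real classification logic
            return line.startsWith("A") ? "categ1" : line.startsWith("B") ? "categ2" : "categ3";
        }
    }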

Current Workflow:

  1. A Map-only job divides the large file into categories. The output looks like this:
    categ1-m-00000
    categ1-m-00001
    categ1-m-00002
    categ2-m-00000
    categ2-m-00001
    categ2-m-00002
    categ3-m-00000
    categ3-m-00001
    categ3-m-00002
  2. An external (non-Hadoop) process copies the output files into separate directories for each category:
    categ1/00000
    categ1/00001
    categ1/00002
    categ2/00000
    categ2/00001
    categ2/00002
    categ3/00000
    categ3/00001
    categ3/00002
  3. An external table is created for each category, and the data is then inserted into the permanent Hive table for that category (see the sketch after this list).
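
Step 3 amounts to something like the following for each category (a sketch using the Hive JDBC driver; the connection string, directory, column definition, and table names are placeholders for my actual setup):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class LoadCategory {
        public static void main(String[] args) throws Exception {
            String category = args[0];                    // e.g. "categ1"
            String hdfsDir  = "/staging/" + category;     // per-category directory from step 2

            try (Connection conn = DriverManager.getConnection(
                         "jdbc:hive2://hiveserver:10000/default", "user", "");
                 Statement stmt = conn.createStatement()) {

                // Temporary external table pointing at the per-category directory
                stmt.execute("CREATE EXTERNAL TABLE IF NOT EXISTS " + category + "_ext "
                           + "(line STRING) "
                           + "LOCATION '" + hdfsDir + "'");

                // Copy the rows into the permanent table for that category
                stmt.execute("INSERT INTO TABLE " + category + "_final "
                           + "SELECT * FROM " + category + "_ext");

                // Drop the staging definition; the data files stay in HDFS
                stmt.execute("DROP TABLE " + category + "_ext");
            }
        }
    }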

Possible new workflows:

  • Use Spark to loop through the output files and, based on the file name, insert the data into the appropriate permanent Hive table (see the sketch after this list).
  • Use HCatalog to insert the data into the permanent Hive tables directly from the Mapper, or perhaps from a Reducer or a set of Reducers dedicated to each category.
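
A rough sketch of the Spark idea (assuming the permanent tables already exist, that each has a single string column, and that the paths and table names below are placeholders):

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class LoadCategories {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                    .appName("load-categorized-output")
                    .enableHiveSupport()
                    .getOrCreate();

            String[] categories = {"categ1", "categ2", "categ3"};

            for (String category : categories) {
                // Glob matches categ1-m-00000, categ1-m-00001, ... from the Map-only job
                Dataset<Row> lines = spark.read().text("/job/output/" + category + "-m-*");

                // Append into the permanent table dedicated to this category
                lines.write().mode("append").insertInto(category + "_final");
            }

            spark.stop();
        }
    }

For the HCatalog idea, as far as I can tell HCatOutputFormat is configured for a single target table per job, so it would probably mean either one job per category or some extra plumbing to route each category to its own output.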

1 Answer


For MultipleOutputs, set the output path to the base folder where your Hive external tables are located, then write the data to "<table_name>/<filename_prefix>". The data will then already be sitting in your target tables.
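
Roughly like this (a sketch; the staging path and table layout are placeholders, and the mapper's write call becomes mos.write(NullWritable.get(), value, category + "/part") so each record lands under its table's directory):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.LazyOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

    public class CategorizerDriver {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "categorize");
            job.setJarByClass(CategorizerDriver.class);
            job.setMapperClass(CategorizerMapper.class);   // the Map-only mapper from the question
            job.setNumReduceTasks(0);
            job.setOutputKeyClass(NullWritable.class);
            job.setOutputValueClass(Text.class);

            FileInputFormat.addInputPath(job, new Path(args[0]));

            // Base folder; create the external tables with LOCATION
            // '/staging/run1/categ1', '/staging/run1/categ2', ... to match.
            // Note that FileOutputFormat requires this folder not to exist yet.
            FileOutputFormat.setOutputPath(job, new Path("/staging/run1"));

            // LazyOutputFormat avoids empty part-m-xxxxx files from the default output
            LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class);

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

With output like /staging/run1/categ1/part-m-00000, each external table sees its data as soon as the job finishes, and the separate copy step goes away.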