I have a Map-only job that processes a large text file. Each line is analyzed and categorized, and MultipleOutputs is used to write each category to its own set of files. Eventually all of the data ends up in a Hive table dedicated to each category. My current workflow does the job but is a bit cumbersome, and since I am about to add a couple of categories, I thought I might be able to streamline the process. I have a couple of ideas and was looking for some input.
Current Workflow:
- Map-only job divides the large file into categories (a rough sketch of this kind of MultipleOutputs mapper is included below the workflow list). The output looks like this:
categ1-m-00000
categ1-m-00001
categ1-m-00002
categ2-m-00000
categ2-m-00001
categ2-m-00002
categ3-m-00000
categ3-m-00001
categ3-m-00002
- An external (non-Hadoop) process copies the output files into a separate directory for each category (a sketch of the same move done with the HDFS FileSystem API is also below the list):
categ1/00000
categ1/00001
categ1/00002
categ2/00000
categ2/00001
categ2/00002
categ3/00000
categ3/00001
categ3/00002
- An external table is created over each category's directory, and the data is then inserted into the permanent Hive table for that category (sketched below as HiveQL issued over JDBC).
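
For context, here is roughly what the categorizing mapper looks like. This is a minimal sketch: `categorize()` is a placeholder for the actual analysis, and it uses the base-output-path variant of `MultipleOutputs.write()` so the files come out named `categ1-m-00000` and so on.

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

public class CategorizeMapper extends Mapper<LongWritable, Text, NullWritable, Text> {
    private MultipleOutputs<NullWritable, Text> out;

    @Override
    protected void setup(Context context) {
        out = new MultipleOutputs<>(context);
    }

    @Override
    protected void map(LongWritable key, Text line, Context context)
            throws IOException, InterruptedException {
        // categorize() is a stand-in for whatever analysis assigns the line a category.
        String category = categorize(line.toString());
        // The category name becomes the base output path, so the job
        // produces files like categ1-m-00000, categ2-m-00000, etc.
        out.write(NullWritable.get(), line, category);
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        out.close();
    }

    private String categorize(String line) {
        // Placeholder classification logic.
        return line.contains("foo") ? "categ1" : "categ2";
    }
}
```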
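The copy step is currently handled outside Hadoop, but for illustration, here is a sketch of the same move done with the HDFS FileSystem API (the paths are placeholders):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SortOutputs {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // Placeholder paths; adjust to the job's actual output directory.
        Path jobOutput = new Path("/data/job-output");
        Path sortedRoot = new Path("/data/sorted");

        // Match files named like categ1-m-00000 and move each into a
        // per-category directory, e.g. /data/sorted/categ1/00000.
        FileStatus[] parts = fs.globStatus(new Path(jobOutput, "*-m-*"));
        if (parts != null) {
            for (FileStatus status : parts) {
                String name = status.getPath().getName();            // e.g. categ1-m-00000
                String category = name.substring(0, name.indexOf("-m-"));
                String part = name.substring(name.indexOf("-m-") + 3); // e.g. 00000
                Path targetDir = new Path(sortedRoot, category);
                fs.mkdirs(targetDir);
                fs.rename(status.getPath(), new Path(targetDir, part));
            }
        }
    }
}
```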
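And the Hive step, sketched as HiveQL issued over JDBC. The HiveServer2 URL, credentials, and table names are placeholders, and the single-string-column schema is just an assumption for the example:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class LoadCategory {
    public static void main(String[] args) throws Exception {
        // May be unnecessary with JDBC 4 driver auto-loading.
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        // Placeholder HiveServer2 endpoint and credentials.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:hive2://hiveserver:10000/default", "hive", "");
             Statement stmt = conn.createStatement()) {

            // Point an external table at the per-category directory...
            stmt.execute(
                "CREATE EXTERNAL TABLE IF NOT EXISTS categ1_staging (line STRING) " +
                "LOCATION '/data/sorted/categ1'");

            // ...then copy its rows into the permanent table for that category.
            stmt.execute(
                "INSERT INTO TABLE categ1_perm SELECT line FROM categ1_staging");
        }
    }
}
```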
Possible new workflows:
- Use Spark to loop through the output files and, based on the file name, insert the data into the appropriate permanent Hive table (see the Spark sketch at the end of the post).
- Use HCatalog to insert the data into the permanent Hive tables directly from the Mapper, or perhaps from a Reducer or set of Reducers dedicated to each category (see the HCatOutputFormat sketch at the end of the post).
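
For the Spark idea, here is a sketch of what the loop could look like. The category list, paths, and table names are placeholders, and it assumes each permanent table has a single string column matching the raw lines; a real version would parse each line into the table's columns first. It also needs Hive support enabled (hive-site.xml on the classpath):

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class LoadCategories {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("load-categories")
                .enableHiveSupport()   // requires the Hive metastore config on the classpath
                .getOrCreate();

        // Placeholder category names and paths.
        String[] categories = {"categ1", "categ2", "categ3"};
        for (String cat : categories) {
            // Glob matches all map output parts for this category, e.g. categ1-m-00000.
            Dataset<Row> lines = spark.read().text("/data/job-output/" + cat + "-m-*");
            // insertInto matches columns by position; assumes a single string column.
            lines.write().mode("append").insertInto("default." + cat + "_perm");
        }
        spark.stop();
    }
}
```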
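For the HCatalog idea, here is a sketch of a map-only job that writes one category straight into its Hive table with HCatOutputFormat. Database and table names are placeholders, the table is again assumed to have a single string column, and the exact HCatOutputFormat signatures vary a bit between Hive/HCatalog versions. As far as I can tell, one HCatOutputFormat job targets one table, so this would mean either one job per category or per-category Reducers with HCatalog's multi-output support:

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hive.hcatalog.data.DefaultHCatRecord;
import org.apache.hive.hcatalog.data.HCatRecord;
import org.apache.hive.hcatalog.mapreduce.HCatOutputFormat;
import org.apache.hive.hcatalog.mapreduce.OutputJobInfo;

public class HCatLoadJob {

    public static class LineMapper
            extends Mapper<LongWritable, Text, WritableComparable, HCatRecord> {
        @Override
        protected void map(LongWritable key, Text line, Context context)
                throws IOException, InterruptedException {
            // Assumes the target table has a single string column.
            HCatRecord record = new DefaultHCatRecord(1);
            record.set(0, line.toString());
            context.write(null, record);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "hcat-load-categ1");
        job.setJarByClass(HCatLoadJob.class);
        job.setMapperClass(LineMapper.class);
        job.setNumReduceTasks(0);  // map-only, as in the current workflow

        job.setInputFormatClass(TextInputFormat.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));

        // Placeholder database/table names; null means no static partition values.
        HCatOutputFormat.setOutput(job,
                OutputJobInfo.create("default", "categ1_perm", null));
        HCatOutputFormat.setSchema(job,
                HCatOutputFormat.getTableSchema(job.getConfiguration()));
        job.setOutputFormatClass(HCatOutputFormat.class);
        job.setOutputKeyClass(WritableComparable.class);
        job.setOutputValueClass(DefaultHCatRecord.class);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```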