I have a Hive table in which one column is an array of strings. I also have a set of custom UDFs that manipulate individual strings. I would like Hive to apply one of my custom UDFs to each element of the array and return the results as a new array.

This seems like a simple requirement, but I wasn't able to find a simple solution for it. I found two possibilities, neither of which is really simple:

  1. Do some Hive SQL gymnastics with explode and lateral view, invoke the UDF on each exploded row, then aggregate the results back into an array. This seems like overkill, as I don't see it executing in fewer than 2 mapreduce jobs (but I could be wrong here).
  2. Reimplement each of my UDFs as a GenericUDF that is supplied with an array, processes each element in it, and returns an array again. This requires a lot more development.
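
To make option 2 concrete: the per-element loop is the same for every UDF, so it could in principle be written once. The following is only a minimal plain-Java sketch of that loop, with no Hive classes involved. `ElementWise` and `lift` are hypothetical names, and a real GenericUDF would additionally have to handle Hive's ObjectInspectors in its `initialize` and `evaluate` methods.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

public class ElementWise {

    // Lifts a per-string function to a per-array function.
    // A null array is passed through as null; null elements are
    // handed to the per-element function unchanged.
    public static Function<List<String>, List<String>> lift(
            Function<String, String> perElement) {
        return array -> {
            if (array == null) {
                return null;
            }
            List<String> out = new ArrayList<>(array.size());
            for (String s : array) {
                out.add(perElement.apply(s));
            }
            return out;
        };
    }

    public static void main(String[] args) {
        // Hypothetical stand-in for one of the existing string UDFs.
        Function<String, String> trim = s -> s == null ? null : s.trim();
        System.out.println(lift(trim).apply(Arrays.asList(" a ", "b ")));
    }
}
```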

Is there any simple way to do this?

Choice (1) seems reasonable to me. Is there a reason you want to avoid 2 mapreduce jobs? - gobrewers14
You want to execute a query in as few mapreduce jobs as possible to keep IO to a minimum. That makes the difference between a slow query and a fast one. - miljanm
Yes, I am aware 2 > 1. My question is: is rewriting your UDFs for this specific case MORE efficient than simply waiting through 1 extra MR job? What if the 2nd job takes 1 min? - gobrewers14
Well, I happen to know enough about my data to know that one mapreduce job versus two will make a big difference. For the general case, yes, it may be unnecessary to implement custom UDFs. - miljanm
I filed issues.apache.org/jira/browse/HIVE-13993 for adding built-in support for this to Hive. - erwaman

1 Answer

I don't know of any way to do it without either writing more custom UDF code or, as you say, running more MR jobs.

But I would suggest a possible third option - write a GenericUDF that takes two arguments: an array and the class name of another UDF. Instantiate and call the UDF through reflection, pass it everything in the array, and return the resulting array. This might be a bit difficult to write, but at least then you won't have to rewrite all of your existing UDFs, as you mentioned.
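
The reflection part of this idea can be sketched in plain Java. This is only an illustration under stated assumptions, not a working Hive UDF: `ReflectiveArrayMapper`, `mapArray`, and `UpperCaseUdf` are hypothetical names, the wrapped class is assumed to have a public no-argument constructor and a public `evaluate(String)` method, and the ObjectInspector handling a real GenericUDF needs is left out entirely. Existing UDFs whose `evaluate` takes Hadoop `Text` rather than `String` would need a different method lookup.

```java
import java.lang.reflect.Method;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ReflectiveArrayMapper {

    // Hypothetical stand-in for an existing string UDF. A real one
    // would live in its own file and extend a Hive UDF base class.
    public static class UpperCaseUdf {
        public String evaluate(String s) {
            return s == null ? null : s.toUpperCase();
        }
    }

    // Instantiates the named class via reflection and calls its
    // evaluate(String) method on every element of the input list.
    public static List<String> mapArray(List<String> input, String udfClassName)
            throws Exception {
        if (input == null) {
            return null;
        }
        Class<?> udfClass = Class.forName(udfClassName);
        Object udf = udfClass.getDeclaredConstructor().newInstance();
        // Assumption: the wrapped class exposes evaluate(String).
        Method evaluate = udfClass.getMethod("evaluate", String.class);

        List<String> result = new ArrayList<>(input.size());
        for (String element : input) {
            result.add((String) evaluate.invoke(udf, element));
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        List<String> out = mapArray(
                Arrays.asList("a", null, "bc"),
                UpperCaseUdf.class.getName());
        System.out.println(out); // prints [A, null, BC]
    }
}
```

In an actual GenericUDF it would probably make sense to do the class lookup and instantiation once during initialization rather than once per row, since reflection lookups are comparatively expensive.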