Summary: This question is about creating UDFs in Hive.

Dear friends, I am new to creating UDFs in Hive (I have read about it via Google but do not have a very clear idea yet). My first question is which language, Java, Python, or something else, is the best choice for writing Hive UDFs.

My second question is on what basis I should make that choice. Which parameters should I look at?

Please note that I have a few functions, listed below, for which UDFs need to be written:
1. To return the select and group by clauses required by another function when "no aggregation" is needed.
2. To return the select and group by clauses required when "aggregation" is needed.
3. For vector_indexes, to return the SUM and LISTAGG strings for the data collection query.
4. To return the WHERE clause used by another function.
5. To return the nth item in a comma-separated string.
6. Percentile value function for narrow data.
7. To calculate the percentile for a given counter name. Along with the percentile, it also outputs the number of samples used in the calculation, the peak, and the average.

Thank you very much in advance,

1 Answer

This question probably isn't within the guidelines because you are asking for an opinion.

Having said that, I would propose the following:

A) Pick a language that you know.

B) If you know both, then pick based on the features you need.

C) Consider performance. I believe (but cannot confirm) that a compiled Java JAR runs inside the existing Hive Java instance, so no separate Java runtime has to be launched just to support that module. To run a Python module, a new Python interpreter has to be started and the data transferred via interprocess communication. Java is therefore possibly slightly more performant, especially if the algorithm is simple. However, unless you are processing huge data sets, you probably would not even notice.
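To give a feel for the Python route, here is a minimal sketch of your function 5 (nth item in a comma-separated string) written as a streaming script. It assumes the script is called through Hive's TRANSFORM clause, which feeds each row to the script as a tab-separated line on stdin and reads result rows back from stdout, with NULL encoded as \N by default. The names nth_item and transform, the 1-based indexing, and the choice to return NULL for bad input are my own assumptions, not anything Hive requires.

```python
import sys

NULL = r"\N"  # Hive's default text encoding of NULL in TRANSFORM streams


def nth_item(csv, n):
    """Return the n-th (1-based) item of a comma-separated string, or None."""
    items = csv.split(",")
    if 1 <= n <= len(items):
        return items[n - 1].strip()
    return None


def transform(lines):
    """Map each 'csv<TAB>n' input row to exactly one output row."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # Malformed rows and NULL inputs produce a NULL output (an assumption).
        if len(fields) != 2 or NULL in fields:
            yield NULL
            continue
        csv, n = fields
        try:
            item = nth_item(csv, int(n))
        except ValueError:  # n was not an integer
            item = None
        yield NULL if item is None else item


if __name__ == "__main__":
    # Hive pipes rows in on stdin and reads results from stdout.
    for out in transform(sys.stdin):
        print(out)
```

If the script were saved as nth_item.py (a made-up file name), you would register it with ADD FILE nth_item.py and call it with something like SELECT TRANSFORM(csv_col, n) USING 'python nth_item.py' AS (item) FROM some_table, where csv_col, n, and some_table are placeholders. Every row makes the round trip through stdin and stdout, which is the interprocess overhead mentioned above.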

Finally, you could probably implement all of the functions you asked about in the Hive query language (HiveQL) itself.