6
votes

I am relatively new to Pig scripting. Is there a way to pass parameters to Java UDFs in Pig?

Here is the scenario: I have a log file which has several columns (each holding a primary key from another table). My task is to get the count of distinct primary key values in a selected column. I have written a Pig script which gets the distinct primary keys and counts them. However, I am now supposed to write a new UDF for each column. Is there a better way to do this? If I could pass a column number as a parameter to the UDF, I would not need to write multiple UDFs.


3 Answers

3
votes

The way to do it is by using DEFINE and the constructor of the UDF. Here is an example of a custom "splitter":

REGISTER com.sample.MyUDFs.jar;
DEFINE CommaSplitter com.sample.MySplitter(',');

B = FOREACH A GENERATE f1, CommaSplitter(f2);

Hopefully that conveys the idea.
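On the Java side, the quoted arguments in DEFINE are handed to the UDF's constructor as Strings. The following is only a minimal sketch of that pattern, not the actual com.sample.MySplitter: the class name MySplitterSketch is hypothetical, and a List<Object> stands in for Pig's Tuple so that the example compiles without Pig on the classpath. A real UDF would extend org.apache.pig.EvalFunc and receive an org.apache.pig.data.Tuple in exec.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical stand-in for a splitter UDF. A real Pig UDF would extend
// org.apache.pig.EvalFunc and take a Tuple; List<Object> replaces the Tuple
// here (an assumption made only to keep the sketch free of Pig dependencies).
class MySplitterSketch {
    private final String delimiter;

    // Pig passes each quoted DEFINE argument to the constructor as a String.
    MySplitterSketch(String delimiter) {
        this.delimiter = delimiter;
    }

    // Splits the first input field on the delimiter given at construction time.
    List<String> exec(List<Object> input) {
        if (input == null || input.isEmpty() || input.get(0) == null) {
            return null; // nothing to split
        }
        List<String> parts = new ArrayList<>();
        // Pattern.quote treats the delimiter literally, so "|" or "." also work.
        for (String part : input.get(0).toString().split(Pattern.quote(delimiter), -1)) {
            parts.add(part);
        }
        return parts;
    }
}
```

For the column-counting case in the question, the same idea would let one class serve every column: define one alias per column, each with a different constructor argument.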

1
votes

To pass parameters, you do the following in your Pig script:

UDF(document, '$param1', '$param2', '$param3')

Edit: I am not sure whether those params need to be wrapped in ' ' or not.

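while in your UDF you read them from the input tuple, for example:

import java.io.IOException;
import java.io.InputStream;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;
import org.apache.pig.impl.util.UDFContext;

public class UDF extends EvalFunc<Boolean> {

    public Boolean exec(Tuple input) throws IOException {

        // Field 0 is the document; fields 1 to 3 are the parameters.
        if (input == null || input.size() < 4)
            return false;

        FileSystem fs = FileSystem.get(UDFContext.getUDFContext().getJobConf());

        // Each parameter is treated here as a path to open.
        String var1 = input.get(1).toString();
        InputStream var1In = fs.open(new Path(var1));

        String var2 = input.get(2).toString();
        InputStream var2In = fs.open(new Path(var2));

        String var3 = input.get(3).toString();
        InputStream var3In = fs.open(new Path(var3));

        // doyourthing stands for your own logic; close the streams once you are done with them.
        return doyourthing(input.get(0).toString());
    }
}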

0
votes

Yes, you can pass any parameter as an extra field in the Tuple argument of your UDF's exec method:

exec(Tuple input)

and access it using

input.get(index)
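Applied to the question, the column number could travel as a second field next to the record. This is only a rough sketch under assumptions: ColumnPickerSketch is a hypothetical name, and plain Lists stand in for Pig's Tuple so the code runs without Pig. In a real UDF, exec would take an org.apache.pig.data.Tuple and be declared to throw IOException.

```java
import java.util.List;

// Hypothetical sketch: field 0 is the record (a list of column values),
// field 1 is the column index passed from the script as an extra argument.
// List<Object> is a stand-in for Pig's Tuple (an assumption for this example).
class ColumnPickerSketch {
    Object exec(List<Object> input) {
        if (input == null || input.size() < 2
                || input.get(0) == null || input.get(1) == null) {
            return null; // missing record or missing column index
        }
        List<?> row = (List<?>) input.get(0);
        // The index may arrive as a number or as a quoted string; parse it either way.
        int column = Integer.parseInt(input.get(1).toString());
        if (column < 0 || column >= row.size()) {
            return null; // index outside the record
        }
        return row.get(column);
    }
}
```

With this shape, a single UDF covers every column; only the extra argument in the script changes.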