I have to format data in a flat file before it is loaded into a Hive table.

CF32|4711|00010101Z| +34.883|  98562AS1D |N8594ãä| 00   | 2

The file is pipe-separated, and I need to apply different cleaning and formatting functions to different columns. I have several functions: Clean_Text, Format_Date, Format_TimeStamp, Format_Integer, etc.

My idea is to pass the schema to my UDF's constructor and call the different functions on the flat file in Pig.

A = LOAD 'call_detail_records'  USING org.apache.hcatalog.pig.HCatLoader();
DESCRIBE A;

REGISTER ZPigUdfs.jar;
DEFINE DFormat com.zna.pig.udf.DataColumnFormatter(A);

B = FOREACH A GENERATE DFormat($0);
DUMP B;

But how can I pass the schema? DUMP A dumps the entire table, but I need only the metadata. My current UDF pseudocode looks like this:

import java.io.IOException;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

public class DataColumnFormatter extends EvalFunc<String> {

    // Schema I would like to receive from the Pig script
    private Tuple schema;

    public DataColumnFormatter(Tuple schema) {
        this.schema = schema;
    }

    @Override
    public String exec(Tuple inputTuple) throws IOException {

        if (inputTuple != null && inputTuple.size() > 0) {
            String inpString = inputTuple.get(0).toString();
            System.out.println(inpString);
            System.out.println(schema);

            /*
             * Logic for splitting the string on the pipe character and
             * applying functions based on schema positions:
             *
             *   if (schema[1] -> date) { formatDate(input); }
             *   else if (schema[1] -> INT) { formatInt(input); }
             */
        }

        return null;
    }
}

How can I get the schema in a Pig UDF, or is there an alternative way to achieve this?

Thanks in advance.

(1) Where should this schema come from? (2) What is stopping you from defining the schema as a constant? (3) Is it possible that different lines within the same table will have different schemas? - Zach Beniash
The schema should come from HCatalog. I have multiple files and I don't want to define the schema every time I run the scripts. No, all records in a table have the same schema. - Abhi

1 Answer

From within your EvalFunc you can call this.getInputSchema() (available at least since Pig v0.12, maybe earlier). You shouldn't need to do anything special to pass in the schema: since you loaded A through HCatLoader, it already carries its schema.
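
As a rough sketch of the per-column dispatch once the field types are known: the plain-Java class below deliberately uses no Pig classes, and the type-name strings are hypothetical stand-ins for whatever getInputSchema() reports for each field. ColumnDispatch, format, and formatField are made-up names, and the sample record is a shortened version of the one in the question with assumed column types.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: none of these names are part of Pig.
class ColumnDispatch {

    // Format a single field according to an assumed type name.
    static String formatField(String raw, String type) {
        switch (type) {
            case "int":
                // Trim padding and normalize, e.g. " 00   " becomes "0".
                return String.valueOf(Integer.parseInt(raw.trim()));
            case "double":
                return String.valueOf(Double.parseDouble(raw.trim()));
            case "chararray":
                return raw.trim();
            default:
                // Unknown type names leave the field untouched.
                return raw;
        }
    }

    // Split one pipe-separated record and format each field by position.
    static List<String> format(String line, List<String> types) {
        String[] fields = line.split("\\|", -1); // -1 keeps trailing empty fields
        if (fields.length != types.size()) {
            throw new IllegalArgumentException(
                "expected " + types.size() + " fields, got " + fields.length);
        }
        List<String> out = new ArrayList<>();
        for (int i = 0; i < fields.length; i++) {
            out.add(formatField(fields[i], types.get(i)));
        }
        return out;
    }

    public static void main(String[] args) {
        // Column types are assumed for illustration only.
        List<String> types = Arrays.asList("chararray", "int", "double", "int", "int");
        System.out.println(format("CF32|4711| +34.883| 00   | 2", types));
    }
}
```

Inside a real UDF, the list of type names would be derived from the input schema instead of being hard-coded, and the formatted fields would be returned as a Tuple rather than a List.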

Alternatively, you could break the logic out into a separate UDF for each data type. Something like B = FOREACH A GENERATE dateFormat($0), cleanText($1), dateFormat($2);
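
To give an idea of how small each per-type function can be, here is a minimal plain-Java sketch of two of them. The behavior is assumed, since the question does not say what Clean_Text or Format_Integer actually do: cleanText here drops everything outside printable ASCII and trims, and formatInteger trims and parses, returning null for bad input. ColumnFormatters and both method names are hypothetical; in Pig, each body would sit inside the exec method of its own EvalFunc.

```java
// Hypothetical helpers: one small function per data type, no Pig dependency.
class ColumnFormatters {

    // Assumed meaning of "clean text": drop anything outside printable
    // ASCII, then trim surrounding whitespace.
    static String cleanText(String raw) {
        if (raw == null) {
            return null;
        }
        return raw.replaceAll("[^\\x20-\\x7E]", "").trim();
    }

    // Assumed meaning of "format integer": trim, then parse; return null
    // for input that is missing or not a valid integer.
    static Integer formatInteger(String raw) {
        if (raw == null) {
            return null;
        }
        try {
            return Integer.valueOf(raw.trim());
        } catch (NumberFormatException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        System.out.println(cleanText("  98562AS1D "));
        System.out.println(formatInteger(" 00   "));
    }
}
```

Keeping one function per type means the Pig script itself states which formatter applies to which column, so the UDFs never need to inspect the schema.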