I have to format data in a flat file before it is loaded into a Hive table.
CF32|4711|00010101Z| +34.883| 98562AS1D |N8594ãä| 00 | 2
The file is pipe-separated, and I need to apply different cleaning and formatting functions to the different columns in the flat file. I have multiple functions such as Clean_Text, Format_Date, Format_TimeStamp, Format_Integer, etc.
My idea is to pass the schema to my UDF's constructor and call the different functions on the flat file from Pig:
A = LOAD 'call_detail_records' USING org.apache.hcatalog.pig.HCatLoader();
DESCRIBE A;
REGISTER ZPigUdfs.jar;
DEFINE DFormat com.zna.pig.udf.DataColumnFormatter(A);
B = FOREACH A GENERATE DFormat($0);
DUMP B;
But how can I pass the schema? DUMP A dumps the entire table, but I need only the metadata. My current UDF pseudo-code looks like this:
import java.io.IOException;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

public class DataColumnFormatter extends EvalFunc<String> {

    private Tuple schema;

    public DataColumnFormatter(Tuple schema) {
        this.schema = schema;
    }

    @Override
    public String exec(Tuple inputTuple) throws IOException {
        if (inputTuple != null && inputTuple.size() > 0) {
            String inpString = inputTuple.get(0).toString();
            System.out.println(inpString);
            System.out.println(schema);
            /*
             * Logic for splitting the string on the pipe and applying
             * functions based on the positions in the schema:
             *
             * if (schema[i] is DATE)     -> formatDate(input)
             * else if (schema[i] is INT) -> formatInt(input)
             */
        }
        return null;
    }
}
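To make the dispatch logic in that comment concrete, here is a minimal, Pig-free sketch of what I mean. All names here are hypothetical (ColumnFormatter, formatRow, and the pipe-separated schema string are my own illustration, not Pig API): the schema string drives which formatting is applied to each column position.

```java
// Hypothetical helper illustrating schema-driven per-column formatting.
// The schema is a pipe-separated type list, e.g. "CHARARRAY|INT|CHARARRAY",
// matched positionally against a pipe-separated data row.
public class ColumnFormatter {

    public static String formatRow(String schema, String row) {
        String[] types = schema.split("\\|");
        String[] cols = row.split("\\|", -1); // keep trailing empty columns
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < cols.length; i++) {
            String value = cols[i].trim();
            // Fall back to CHARARRAY if the row has more columns than the schema.
            String type = (i < types.length) ? types[i] : "CHARARRAY";
            if (type.equals("INT")) {
                // Example "Format_Integer": strip a leading '+' and normalize.
                value = String.valueOf(Integer.parseInt(value.replace("+", "")));
            }
            // CHARARRAY (and any unknown type) is only trimmed here;
            // a real version would dispatch to Clean_Text, Format_Date, etc.
            if (i > 0) {
                out.append('|');
            }
            out.append(value);
        }
        return out.toString();
    }
}
```

The open part of my question is where the schema string/tuple itself should come from at UDF construction time, rather than hard-coding it.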
How can I get the schema in a Pig UDF, or is there an alternative way to achieve this?
Thanks in advance.