I am looking at ways to write a general data cleansing framework that cleans an entire row, applying an operation to each field based on its position and the type configured for a given data set.
A sample input record from the data set looks like this:
100| John | Mary | 10Sep2013 | 10,23,4
The configuration would be keyed by field position (starting from index 1). For example: at position 2, trim the spaces; at position 4, convert to the Hive standard date format; at position 5, remove the commas. This is configured at the data set level.
Now if these have to be plugged into Hive or Pig, there should be a way for the Hive/Pig UDFs to accept the entire row as input. The UDF should parse the row using the configurable field separator and apply the field/column-specific operations based on position. This way it does not matter whether Pig, Hive, or anything else is used for such row-based operations. I know it is a bit more involved to abstract the Hive/Pig-specific row types and provide a generic position-based getter.
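To make the idea concrete, here is a rough sketch of the engine-independent core I have in mind. The names RowCleaner, sampleRules and toHiveDate are placeholders I made up for illustration, and I am assuming yyyy-MM-dd as the Hive date format. A Hive UDF or a Pig EvalFunc would then only be a thin wrapper that hands the raw line to clean().

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.function.UnaryOperator;
import java.util.regex.Pattern;

// Hypothetical engine-independent cleaner: splits a raw line on a configurable
// separator and applies one operation per configured 1-based position.
public class RowCleaner {
    private final String separator;
    private final Map<Integer, UnaryOperator<String>> rules;

    public RowCleaner(String separator, Map<Integer, UnaryOperator<String>> rules) {
        this.separator = separator;
        this.rules = rules;
    }

    public String clean(String row) {
        // Pattern.quote treats the separator literally ("|" is a regex metacharacter);
        // limit -1 keeps trailing empty fields.
        String[] fields = row.split(Pattern.quote(separator), -1);
        for (Map.Entry<Integer, UnaryOperator<String>> rule : rules.entrySet()) {
            int index = rule.getKey() - 1; // configuration positions start at 1
            if (index >= 0 && index < fields.length) {
                fields[index] = rule.getValue().apply(fields[index]);
            }
        }
        return String.join(separator, fields);
    }

    // Assumption: input dates look like 10Sep2013 and the target is yyyy-MM-dd.
    static String toHiveDate(String raw) {
        DateTimeFormatter input = DateTimeFormatter.ofPattern("ddMMMyyyy", Locale.ENGLISH);
        return LocalDate.parse(raw.trim(), input).toString();
    }

    // The example configuration from the question, hard-coded for illustration.
    static Map<Integer, UnaryOperator<String>> sampleRules() {
        Map<Integer, UnaryOperator<String>> rules = new HashMap<>();
        rules.put(2, String::trim);              // position 2: trim the spaces
        rules.put(4, RowCleaner::toHiveDate);    // position 4: convert the date
        rules.put(5, s -> s.replace(",", ""));   // position 5: remove the commas
        return rules;
    }

    public static void main(String[] args) {
        RowCleaner cleaner = new RowCleaner("|", sampleRules());
        System.out.println(cleaner.clean("100| John | Mary | 10Sep2013 | 10,23,4"));
    }
}
```

In a real setup the rules would be loaded from the data set level configuration rather than hard-coded, and positions without a rule (1 and 3 here) are passed through unchanged.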
It may also make sense to call the UDF once for the entire row rather than once per column, to make things faster.
Is there a way for Hive/Pig UDFs to accept the entire line of text as input?