I am looking for a way to write a general data cleansing framework that cleans an entire row, based on the position and type configured for each field of a given data set.

A sample input record from the data set looks like this:

100| John |  Mary | 10Sep2013 | 10,23,4

The configuration is based on field position (starting from index 1). For example: at position 2, trim the spaces; at position 4, convert to the Hive standard date format; at position 5, remove the commas. This is configured at the data set level.

If this has to be plugged into Hive or Pig, there should be a way for the Hive/Pig UDFs to accept the entire row as input. The UDF should parse the row using the configurable field separator and apply the field-specific operations based on position. That way it does not matter whether Pig, Hive, or anything else is used for such row-based operations. I know it is a bit more involved to abstract the Hive/Pig-specific row types and provide a generic position-based getter.
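To make the idea concrete, here is a minimal sketch of the engine-independent part in plain Java. It has no Hive or Pig dependency: it only splits a line on a configurable separator and applies per-position rules. The names `RowCleaner`, `rule`, `clean` and `toHiveDate` are made up for illustration, and it assumes the input dates look like `10Sep2013` and that the target is the `yyyy-MM-dd` form.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.function.UnaryOperator;
import java.util.regex.Pattern;

// Hypothetical helper: cleans a delimited row using rules keyed by 1-based position.
class RowCleaner {
    private final String separator;
    private final Map<Integer, UnaryOperator<String>> rules = new HashMap<>();

    public RowCleaner(String separator) {
        this.separator = separator;
    }

    // Register the operation to apply to the field at the given 1-based position.
    public RowCleaner rule(int position, UnaryOperator<String> op) {
        rules.put(position, op);
        return this;
    }

    // Split the row, apply the configured rule (if any) to each field, and re-join.
    public String clean(String row) {
        // Pattern.quote treats the separator literally ("|" is a regex metacharacter);
        // the -1 limit keeps trailing empty fields.
        String[] fields = row.split(Pattern.quote(separator), -1);
        for (int i = 0; i < fields.length; i++) {
            UnaryOperator<String> op = rules.get(i + 1);
            if (op != null) {
                fields[i] = op.apply(fields[i]);
            }
        }
        return String.join(separator, fields);
    }

    // Assumed input format: ddMMMyyyy with English month abbreviations, e.g. 10Sep2013.
    public static String toHiveDate(String raw) {
        DateTimeFormatter in = DateTimeFormatter.ofPattern("ddMMMyyyy", Locale.ENGLISH);
        return LocalDate.parse(raw.trim(), in).toString(); // ISO form, yyyy-MM-dd
    }

    public static void main(String[] args) {
        RowCleaner cleaner = new RowCleaner("|")
            .rule(2, String::trim)
            .rule(4, RowCleaner::toHiveDate)
            .rule(5, s -> s.trim().replace(",", ""));
        // prints 100|John|  Mary |2013-09-10|10234
        System.out.println(cleaner.clean("100| John |  Mary | 10Sep2013 | 10,23,4"));
    }
}
```

Position 3 has no rule, so it passes through unchanged. A Hive or Pig UDF wrapper would then only need to hand the raw line to `clean`, which keeps the engine-specific code thin.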

It may also make sense to call the UDF once for the entire row rather than once per column, to make things faster.

Is there a way for Hive/Pig UDFs to accept the entire line of text as input?

1 Answer

The only way to take the entire row as input is to keep the whole text as one column. To then treat the columns separately, you can use a UDTF, which takes one column as input but outputs multiple columns that Hive or Pig can use.

The other option is to keep the values in separate columns but build a UDF which is smart enough to understand the format of the data and produce the appropriate output. But that UDF will take one column as input and return one column as output.
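As a rough sketch of what "smart enough" could mean for the second option, the plain Java method below takes one value and returns one value, picking the cleanup from the shape of the data. `SmartColumnCleaner` is a hypothetical name, the two patterns are assumptions based on the sample record in the question, and the Hive/Pig UDF wrapper around it is left out.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.Locale;

// Hypothetical one-value-in, one-value-out cleaner that guesses the format.
class SmartColumnCleaner {
    // Assumed input date format, e.g. 10Sep2013.
    private static final DateTimeFormatter INPUT_DATE =
        DateTimeFormatter.ofPattern("ddMMMyyyy", Locale.ENGLISH);

    public static String clean(String value) {
        if (value == null) {
            return null;
        }
        String v = value.trim();
        // Looks like ddMMMyyyy: try to convert to yyyy-MM-dd.
        if (v.matches("\\d{2}[A-Za-z]{3}\\d{4}")) {
            try {
                return LocalDate.parse(v, INPUT_DATE).toString();
            } catch (DateTimeParseException e) {
                return v; // not a parseable date after all; keep the trimmed text
            }
        }
        // Digits separated by commas: drop the commas.
        if (v.matches("\\d+(,\\d+)+")) {
            return v.replace(",", "");
        }
        // Anything else is only trimmed.
        return v;
    }
}
```

The trade-off is that guessing from the value is less predictable than position-based configuration: a text column that happens to contain something like `1,2` would also lose its commas.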