I have written the Pig UDF below to test whether a chararray column holds a valid 'yyyy-MM-dd' date. When I test it with the script below, I get the following error. Is there a problem with the data? I handle null tuples so that NULL values in the data, as well as non-existent values, are covered. Also, should I remove the empty line in the data file?
Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 2114: Expected input to be chararray, but got NULL
at IsValidDateTime.exec(IsValidDateTime.java:41)
at IsValidDateTime.exec(IsValidDateTime.java:18)
dates.txt
2019-12-27,2020-08-20
2017-05-09,2018-10-04
2016-09-25,2020-01-19
,2020-08-20
NULL,2017-09-28
2016-11-15,NULL
2018-04-17,Thu Aug-20 2020
2017-05-09,2020-08-20
Mon Jan-20 2020,2020-08-20
<empty line>
------------------------------------------
dates_valid (expected all valid 'yyyy-MM-dd' start_dt and end_dt)
2019-12-27,2020-08-20
2017-05-09,2018-10-04
2016-09-25,2020-01-19
2017-05-09,2020-08-20
Pig script
REGISTER 'IsValidDateTime.jar';
DEFINE IsValidDateTime IsValidDateTime();
dates = LOAD 'dates.txt' USING PigStorage(',') AS (start_dt:chararray, end_dt:chararray);
DUMP dates;
dates_valid = FILTER dates BY (IsValidDateTime(start_dt) AND IsValidDateTime(end_dt));
DUMP dates_valid;
IsValidDateTime Filter UDF
import org.apache.pig.FilterFunc;
import org.apache.pig.PigException;
import org.apache.pig.backend.executionengine.ExecException;
import org.apache.pig.data.DataType;
import org.apache.pig.data.Tuple;
import java.io.IOException;
import java.text.ParseException;
import java.text.SimpleDateFormat;
public class IsValidDateTime extends FilterFunc {
    private static String datePattern = "yyyy-MM-dd";
    public Boolean exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0)
            return false;
        try {
            Object date = input.get(0);
            if (DataType.findType(date) == DataType.CHARARRAY) {
                String dateStr = String.valueOf(date);
                if (dateStr != null && dateStr.length() != 0) {
                    try {
                        SimpleDateFormat format = new SimpleDateFormat(datePattern);
                        format.setLenient(false);
                        format.parse(dateStr);
                    } catch (ParseException | IllegalArgumentException e) {
                        return false; //date string does not match 'yyyy-MM-dd' format
                    }
                    return true; //date string is of valid format 'yyyy-MM-dd'
                }
                return false; //empty or null date string
            } else {
                int errCode = 2114;
                String msg = "Expected input to be chararray, but got " + DataType.findTypeName(date);
                throw new ExecException(msg, errCode, PigException.BUG);
            }
        } catch (ExecException ee) {
            throw ee;
        }
    }
}
I use
if (input == null || input.size() == 0) return false;
to return false for null or empty tuples. Inside the try block, I get the column value from the tuple with
Object date = input.get(0);
and check its type. Empty or non-existent values are treated as NULL in Pig, but if the value is the literal string NULL, will it be of NULL type? It should be treated as a chararray with an invalid format. – somnathchakrabarti
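One likely reading of the error: the tuple itself is not null, but a field inside it is. PigStorage loads an empty field (the first column of the ",2020-08-20" row, and both columns of the empty line) as null, so input.get(0) returns null, the type check sees NULL instead of chararray, and the else branch throws. The literal text NULL is an ordinary chararray and simply fails to parse. The sketch below is plain Java with no Pig classes, so it only approximates the UDF: the names DateCheckSketch, isValidDate, splitRow and validRows are hypothetical, and splitRow is an assumption that mimics how empty fields arrive as null. It treats a null field as "not a valid date" instead of raising an error.

```java
import java.text.ParsePosition;
import java.text.SimpleDateFormat;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-alone sketch; it is not the Pig UDF itself.
class DateCheckSketch {
    private static final String DATE_PATTERN = "yyyy-MM-dd";

    // Returns false for null, non-String, empty, or unparseable values.
    static boolean isValidDate(Object field) {
        if (!(field instanceof String)) {
            return false; // instanceof is false for null, so null fields land here
        }
        String s = (String) field;
        if (s.isEmpty()) {
            return false;
        }
        // SimpleDateFormat is not thread-safe, so a new instance is created per call.
        SimpleDateFormat format = new SimpleDateFormat(DATE_PATTERN);
        format.setLenient(false);
        ParsePosition pos = new ParsePosition(0);
        // parse(String, ParsePosition) returns null on failure instead of throwing.
        boolean parsed = format.parse(s, pos) != null;
        // Require the whole string to be consumed so trailing text is rejected.
        return parsed && pos.getIndex() == s.length();
    }

    // Assumption: approximates a two-column load where empty fields become null.
    static Object[] splitRow(String line) {
        String[] parts = line.split(",", -1);
        Object[] fields = new Object[2];
        for (int i = 0; i < 2; i++) {
            fields[i] = (i < parts.length && !parts[i].isEmpty()) ? parts[i] : null;
        }
        return fields;
    }

    // Keeps only the rows where both columns pass the date check.
    static List<String> validRows(String[] lines) {
        List<String> out = new ArrayList<>();
        for (String line : lines) {
            Object[] f = splitRow(line);
            if (isValidDate(f[0]) && isValidDate(f[1])) {
                out.add(line);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Rows copied from dates.txt, including the trailing empty line.
        String[] lines = {
            "2019-12-27,2020-08-20", "2017-05-09,2018-10-04", "2016-09-25,2020-01-19",
            ",2020-08-20", "NULL,2017-09-28", "2016-11-15,NULL",
            "2018-04-17,Thu Aug-20 2020", "2017-05-09,2020-08-20",
            "Mon Jan-20 2020,2020-08-20", ""
        };
        for (String row : validRows(lines)) {
            System.out.println(row);
        }
    }
}
```

Run against the rows from dates.txt, this keeps the same four rows listed under dates_valid. In the UDF, the equivalent change would presumably be to return false when input.get(0) is null before checking the type; with that in place the empty line should not need to be removed from the file.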