I have written the Pig UDF below to test whether a chararray column holds a valid 'yyyy-MM-dd' date. When I test it with the script below, I get the following error. Is there a problem with the data? I handle null tuples so that both NULL values and non-existent values in the data are covered. Also, should I remove the empty line at the end of the data file?

Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 2114: Expected input to be chararray, but got NULL
        at IsValidDateTime.exec(IsValidDateTime.java:41)
        at IsValidDateTime.exec(IsValidDateTime.java:18) 

dates.txt

2019-12-27,2020-08-20
2017-05-09,2018-10-04
2016-09-25,2020-01-19
,2020-08-20
NULL,2017-09-28
2016-11-15,NULL
2018-04-17,Thu Aug-20 2020
2017-05-09,2020-08-20
Mon Jan-20 2020,2020-08-20
<empty line>
------------------------------------------

dates_valid (expected all valid 'yyyy-MM-dd' start_dt and end_dt)

2019-12-27,2020-08-20
2017-05-09,2018-10-04
2016-09-25,2020-01-19
2017-05-09,2020-08-20

Pig script

REGISTER 'IsValidDateTime.jar'

DEFINE IsValidDateTime IsValidDateTime();

dates = LOAD 'dates.txt' USING PigStorage(',') AS (start_dt:chararray, end_dt:chararray);
DUMP dates;
dates_valid = FILTER dates BY (IsValidDateTime(start_dt) AND IsValidDateTime(end_dt));
DUMP dates_valid;

IsValidDateTime Filter UDF

import org.apache.pig.FilterFunc;
import org.apache.pig.PigException;
import org.apache.pig.backend.executionengine.ExecException;
import org.apache.pig.data.DataType;
import org.apache.pig.data.Tuple;

import java.io.IOException;
import java.text.ParseException;
import java.text.SimpleDateFormat;
public class IsValidDateTime extends FilterFunc {
    private static String datePattern = "yyyy-MM-dd";
    public Boolean exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0)
            return false;
        try {
            Object date = input.get(0);
            if(DataType.findType(date) == DataType.CHARARRAY){
                String dateStr = String.valueOf(date);
                if(dateStr != null && dateStr.length() != 0) {
                    try {
                        SimpleDateFormat format = new SimpleDateFormat(datePattern);
                        format.setLenient(false);
                        format.parse(dateStr);
                    } catch (ParseException | IllegalArgumentException e) {
                        return false; //date string does not match 'yyyy-MM-dd' format
                    }
                    return true; //date string is of valid format 'yyyy-MM-dd'
                }
                return false; //empty or null date string
            } else {
                int errCode = 2114;
                String msg = "Expected input to be chararray, but got " +  DataType.findTypeName(date) ;
                throw new ExecException(msg, errCode, PigException.BUG);
            }
        } catch(ExecException ee) {
            throw ee;
        }
    }
}
Isn't the UDF working as expected? It throws the error you defined with error code 2114 because it received a NULL value instead of a chararray in the input. You can rewrite the UDF to return false when it receives a NULL type, or FILTER out NULL values before applying the UDF. - saph_top
@saph_top I have added if (input == null || input.size() == 0) return false; to return false for null or empty tuples. Inside the try block, I read the column value from the tuple with Object date = input.get(0); and check its type. Empty or non-existent values are treated as NULL in Pig, but if the value is the literal text NULL, will it be of NULL type? I expected it to be treated as a chararray with an invalid format. - somnathchakrabarti

1 Answer

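The failure came from the outer type check. PigStorage loads an empty field as null, and DataType.findType returns DataType.NULL for a null value, so the else branch threw error 2114 for rows such as ',2020-08-20'. The literal text NULL in the data is an ordinary chararray and simply fails to parse. I removed the outer if-condition that checked DataType.CHARARRAY; the code now just reads the value from the input Tuple into a String and checks whether it is null or empty. That is the only condition needed. Below is the final code.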

import org.apache.pig.FilterFunc;
import org.apache.pig.data.Tuple;

import java.io.IOException;
import java.text.ParseException;
import java.text.SimpleDateFormat;

public class IsValidDateTime extends FilterFunc {
    private static final String DATE_PATTERN = "yyyy-MM-dd";

    @Override
    public Boolean exec(Tuple input) throws IOException {
        // Empty fields arrive as null, so the cast yields null for them.
        String date = (String) input.get(0);
        if (date == null || date.isEmpty()) {
            return false; //empty or null date string
        }
        try {
            // SimpleDateFormat is not thread-safe, so a new instance is created per call.
            SimpleDateFormat format = new SimpleDateFormat(DATE_PATTERN);
            format.setLenient(false);
            format.parse(date);
            return true; //date string is of valid format 'yyyy-MM-dd'
        } catch (ParseException | IllegalArgumentException e) {
            return false; //date string does not match 'yyyy-MM-dd' format
        }
    }
}
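
To see how the UDF's checks treat each kind of value in dates.txt, the same logic can be tried outside Pig. The following is only a minimal sketch: the class name DateCheckSketch and the method isValidDate are made up for illustration, and the method takes the field value directly instead of a Pig Tuple. A null stands in for an empty or missing field as PigStorage would load it, while "NULL" is the literal text from the data file.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;

// Hypothetical stand-alone class; it is not part of the UDF jar.
class DateCheckSketch {
    // Mirrors the null/empty check and the parse step of exec().
    static boolean isValidDate(String date) {
        if (date == null || date.isEmpty()) {
            return false; // empty or missing field
        }
        try {
            SimpleDateFormat format = new SimpleDateFormat("yyyy-MM-dd");
            format.setLenient(false);
            format.parse(date);
            return true;
        } catch (ParseException e) {
            return false; // text such as "NULL" or "Thu Aug-20 2020"
        }
    }

    public static void main(String[] args) {
        // Sample values taken from dates.txt, plus null and "" for empty fields.
        String[] samples = {"2019-12-27", null, "", "NULL", "Thu Aug-20 2020"};
        for (String s : samples) {
            System.out.println("[" + s + "] -> " + isValidDate(s));
        }
    }
}
```

Only the first sample is accepted. The null, the empty string, and the literal "NULL" all end up as false, but through different branches: the first two through the null/empty check, the last through the failed parse.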

Running the Pig script with this UDF gives the expected output:

(2019-12-27,2020-08-20)
(2017-05-09,2018-10-04)
(2016-09-25,2020-01-19)
(2017-05-09,2020-08-20)
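
One caveat that does not affect the sample data: SimpleDateFormat.parse(String) is documented as possibly not using the entire input, so a value with trailing text after a valid date may still pass. If stricter matching is wanted, a possible alternative is java.time, sketched below. This swaps in a different API from the one used in the UDF above; the class name StrictDateCheck and the method isStrictlyValid are hypothetical.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.time.format.ResolverStyle;

// Hypothetical stand-alone class illustrating a stricter check.
class StrictDateCheck {
    // "uuuu" is the proleptic year; STRICT resolution rejects dates such as 2020-02-30.
    private static final DateTimeFormatter FORMAT =
            DateTimeFormatter.ofPattern("uuuu-MM-dd").withResolverStyle(ResolverStyle.STRICT);

    static boolean isStrictlyValid(String date) {
        if (date == null || date.isEmpty()) {
            return false; // empty or missing field
        }
        try {
            // LocalDate.parse requires the whole string to match the pattern.
            LocalDate.parse(date, FORMAT);
            return true;
        } catch (DateTimeParseException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String[] samples = {"2020-08-20", "2020-08-20xyz", "2020-8-20", "2020-02-30", "NULL"};
        for (String s : samples) {
            System.out.println("[" + s + "] -> " + isStrictlyValid(s));
        }
    }
}
```

Inside the UDF, the body of isStrictlyValid could replace the SimpleDateFormat block; DateTimeFormatter is immutable and thread-safe, so a single static instance can be shared.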