I'm using Pig to read a large CSV file (more than 29,000 lines) that looks like this:
The columns I'm interested in are begin and end, which are dates.
I'm trying to find items that were active in 1930. First, I loaded the file using this statement:
stations = LOAD '/mytp/isd-history.csv'
    USING PigStorage(',')
    AS
    (
        id:int,
        wban:long,
        name:chararray,
        country:chararray,
        state:chararray,
        icao:chararray,
        lat:double,
        lon:double,
        ele:double,
        begin:chararray,
        end:chararray
    );
Then I used this query to filter by date:
items_active_1930 = FILTER stations
    BY ToDate(begin, 'yyyy-MM-dd') >= ToDate('1930-01-01')
    AND ToDate(end, 'yyyy-MM-dd') <= ToDate('1930-12-31');
When I try to DUMP items_active_1930, the job fails with the following error:
Unable to open iterator for alias items_active_1930. Backend error : Exception while executing [POUserFunc (Name: POUserFunc(org.apache.pig.builtin.ToDate2ARGS)[datetime] - scope-172 Operator Key: scope-172) children: null at []]: java.lang.IllegalArgumentException: Invalid format: "begin"
I would like to know whether it's possible, in FILTER, to first check that both begin and end are valid dates matching the specified date format, so that ToDate() doesn't throw an error.
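Since the error quotes the literal value "begin", I suspect the header line of the CSV is reaching ToDate(). To make the question concrete, here is a rough sketch of the kind of pre-check I have in mind, using Pig's MATCHES operator. It assumes the values really are shaped like 1930-01-01 (the format string I pass to ToDate); the alias valid_stations is just a name I made up. The pattern only checks the shape of the string, so a value like 1930-13-45 would still get through, and I'm not sure Pig guarantees the first FILTER is evaluated before ToDate() runs.

```pig
-- Sketch only: keep rows whose begin and end look like yyyy-MM-dd.
-- Assumption: the header row ("begin", "end") and other malformed values
-- fail this pattern. MATCHES must match the whole string.
valid_stations = FILTER stations
    BY begin MATCHES '[0-9]{4}-[0-9]{2}-[0-9]{2}'
    AND end MATCHES '[0-9]{4}-[0-9]{2}-[0-9]{2}';

-- Same date filter as above, applied to the pre-checked rows.
items_active_1930 = FILTER valid_stations
    BY ToDate(begin, 'yyyy-MM-dd') >= ToDate('1930-01-01')
    AND ToDate(end, 'yyyy-MM-dd') <= ToDate('1930-12-31');
```

Is something along these lines the right approach, or is there a built-in way to validate a date string before converting it?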