
I am having trouble using the fields (and their schema) after flattening a string in Pig. I have the following code:

Data = load 'data/*.txt' using PigStorage( ) AS(...., date:chararray, .....);

B = foreach Data generate FLATTEN(REGEX_EXTRACT_ALL(date, '"(.*)/(.*)/(.*) (.*):(.*):(.*)"')) AS (month:int, day:int, year:int, hour:int, min:int, second:int);

--B = filter B by year==2015;

--B = filter B by month ==1 OR month ==2;

C = foreach B generate speed, month, day, year, hour, min;

store C into 'data/out_files' using PigStorage(',');

where date is in the form '2/23/2015 23:56:49'.

This works fine. But when I enable either filter on B (year == 2015, or month == 1 OR month == 2), the code no longer works. How can I use a field after flattening the string? Thank you for your help.


1 Answer


Can you try this?

Input:

2/23/2015 23:56:49
1/23/2014 23:56:49
9/23/2014 23:56:49
8/23/2014 23:56:49

PigScript:

A = LOAD 'input' AS (date:chararray);
B = FOREACH A GENERATE FLATTEN(REGEX_EXTRACT_ALL(date, '([0-9]+)/([0-9]+)/([0-9]+)\\s+([0-9]+):([0-9]+):([0-9]+)')) AS (month,day,year,hour,min,second);
C = FILTER B BY (month==1) OR (month==2) OR (year==2015);
D = FOREACH C GENERATE month,day,year,hour,min,second;
DUMP D;

Output:

(2,23,2015,23,56,49)
(1,23,2014,23,56,49)

The following regex also works:

B = FOREACH A GENERATE FLATTEN(REGEX_EXTRACT_ALL(date, '(\\d+)/(\\d+)/(\\d+)\\s+(\\d+):(\\d+):(\\d+)')) AS (month,day,year,hour,min,second);
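If the filter still misbehaves, one thing worth trying is making the types explicit. As far as I know, REGEX_EXTRACT_ALL returns the captured groups as chararray values, so casting them to int before filtering makes the comparisons numeric. This is only a sketch: it assumes the same 'input' file as above, and the relation names P, T and F are made up for illustration.

```pig
-- Assumption: 'input' holds one date per line, e.g. 2/23/2015 23:56:49
A = LOAD 'input' AS (date:chararray);

-- Flatten the captured groups and declare them as chararray (assumed element type)
P = FOREACH A GENERATE FLATTEN(REGEX_EXTRACT_ALL(date, '(\\d+)/(\\d+)/(\\d+)\\s+(\\d+):(\\d+):(\\d+)'))
        AS (month:chararray, day:chararray, year:chararray, hour:chararray, min:chararray, second:chararray);

-- Cast explicitly so the filter below compares integers rather than strings
T = FOREACH P GENERATE (int)month AS month, (int)day AS day, (int)year AS year,
                       (int)hour AS hour, (int)min AS min, (int)second AS second;

-- Same condition as in the script above
F = FILTER T BY (month == 1) OR (month == 2) OR (year == 2015);
DUMP F;
```

Keeping the cast in its own FOREACH also makes it easy to DESCRIBE T and check the schema before adding the filter.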