0
votes

Need help with PIG

A = load 'input.txt' as (line:chararray);
B = foreach A generate FLATTEN(TOBAG(*));
C = FOREACH B GENERATE REPLACE($0, '\\s+', ' ');

I need help with the last line. I want to replace multiple spaces with a single space, remove " (double quotes), and strip leading zeros (00) using Apache Pig.

Note: The approach should NOT be field-specific, as there are more than 70 fields. Basically, I am looking for help with a REPLACE, SUBSTRING, or REGEX function that can perform the mentioned operations on a whole line.

Input.txt

00595, ab 000cdef      california "state,   00USA
00733, 0ds ds "ARIZONA 00state, USA

Expected Output

595, ab cdef california state, USA
733, ds ds ARIZONA state, USA

2 Answers

0
votes

You can use the REPLACE function in Pig to do the cleaning. Loading the first field as INT removes the leading zeros from the number.

A = LOAD '/usr/pigfiles/pigo.txt' using PigStorage(',') as (value: INT, state: chararray, country: chararray);  
B = FOREACH A GENERATE value,REPLACE(REPLACE(state,'  ', ' ' ),'\\"',''),  country; 
DUMP B;

Output

0
votes

Nesting several REPLACE calls in a single FOREACH quickly becomes hard to read. It is cleaner to apply a series of operations to your data to get the desired result.

Try the code below on your data. It works on the sample you provided.

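a = LOAD 'ip.txt' USING TextLoader() AS (line:chararray);
b = FOREACH a GENERATE REPLACE($0,'\\s+',' ');    -- collapse runs of whitespace into one space
c = FOREACH b GENERATE REPLACE($0,'"','');        -- remove double quotes
d = FOREACH c GENERATE REPLACE($0,'\\s+0+',' ');  -- strip zeros that follow a space
e = FOREACH d GENERATE REPLACE($0,'^0+','');      -- strip zeros at the very start of the line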
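
If you would rather keep everything in one statement, here is a minimal sketch that folds the same clean-up into a single expression. It assumes your Pig version accepts nested built-in calls (the first answer relies on the same thing) and that REPLACE treats its second argument as a Java regular expression, as the '\\s+' pattern in the question already does. The alias names `lines` and `cleaned` are made up for illustration, and 'input.txt' is the file name from the question.

```pig
-- Sketch only: works on the whole line, so it does not depend on the number of fields.
lines = LOAD 'input.txt' USING TextLoader() AS (line:chararray);

-- Innermost call first:
--   1. remove double quotes
--   2. collapse runs of whitespace into a single space
--   3. strip zeros at the start of a token, but only when another
--      word character follows, so a lone 0 is left alone
cleaned = FOREACH lines GENERATE
    REPLACE(
        REPLACE(
            REPLACE(line, '"', ''),
            '\\s+', ' '),
        '\\b0+(?=\\w)', '') AS line;

DUMP cleaned;
```

Removing the quotes before collapsing whitespace matters: a quote that stands between two spaces would otherwise leave a double space behind. The '\\b0+(?=\\w)' pattern treats any word boundary as the start of a token, so zeros right after punctuation (for example in "a-007") are stripped as well. Tighten the pattern if that is not what you want.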