0
votes

I want to get the last element of a line using a Pig script. I can't use a fixed positional reference ($n) because the index of the last element is not fixed. I tried a regular expression, but it is not working. I also tried $-1, but that didn't work either. I am posting only a sample, as my actual file contains more PIDs.

Sample:

MSH|^~\&|LAB|LAB|HEATH|HEA-HEAL|20247||OU^R01|M1738000000001|P|2.3|||ER|ER|
PID|1|YXQ120185751001|YXQ120185751001||ELJKDP@#PDUB||19790615|F||| H LGGH VW^^ZHVW FKHVWHU^SD^19380|||||||4002C340778A|000009561|ELJKDP@#PDUB19790615F

I want to get the last value of the PID line, i.e. ELJKDP@#PDUB19790615F. I have tried the code below, but it is not working.

Code 1:

STOCK_A = LOAD '/user/rt/PARSED' USING PigStorage('|'); 
data = FILTER STOCK_A BY ($0 matches '.*PID.*'); 
MSH_DATA = FOREACH data GENERATE $2 AS id, $5 AS ame , $7 AS dob, $8 AS gender, $-1 AS rk;

Code 2:

STOCK_A = LOAD '/user/rt/PARSED' USING PigStorage('|'); 
data = FILTER STOCK_A BY ($0 matches '.*PID.*'); 
MSH_DATA = FOREACH data GENERATE $2 AS id, $5 AS ame , $7 AS dob, $8 AS gender, REGEX_EXTRACT(data,'\\s*(\\w+)$',1) AS rk;

Error for Code 2:

ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: Pig script failed to parse: Invalid scalar projection: data : A column needs to be projected from a relation for it to be used as a scalar

Please help

1
Your best bet is to write a UDF and pass the entire record and look for the index of the last delimiter | and get anything after that. - VK_217
@inquisitive_mind Can you please give an example? It would be easier for me. - Ironman
If it's simply a regex you can do this: ^PID\|.*\|(.*) that would put in the capturing group anything after the last | for the row that starts with PID - sniperd
@sniperd I used REGEX_EXTRACT('^PID\|.*\|(.*)') but I am getting ERROR 1200: <line 6, column 108> Unexpected character '|' - Ironman
That's weird. I wonder if there needs to be some special escaping for Pig? The \ should escape the | - sniperd

1 Answer

1
votes

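This should work, provided the first argument is a chararray field that holds the whole line (called line here) rather than the relation name data. Passing the relation itself is what triggers ERROR 1200 in Code 2, and after loading with PigStorage('|') no single field contains the full line, so the record has to be loaded as one field (for example with TextLoader).

REGEX_EXTRACT(line,'([^|]+$)',1) AS rk

[^|]+$ matches the run of non-pipe characters at the end of the line, i.e. everything to the right of the last pipe character. If a line ends with a pipe, as the MSH line in the sample does, there is no match and the result is null.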
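To show how the regex fits into a full script, here is a minimal sketch. It assumes each record is loaded as a single line with TextLoader, so that the regex can see the whole record. The relation and field names (raw, pid, last_field, line) are hypothetical and not from the original post. The pipe is written as [|] so that no backslash escaping is needed inside the Pig string literal.

```pig
-- Sketch only: assumes each record should be read as one whole line.
-- 'raw', 'pid', 'last_field' and 'line' are hypothetical names.
raw = LOAD '/user/rt/PARSED' USING TextLoader() AS (line:chararray);

-- Keep only the records that start with the PID segment.
-- [|] matches a literal pipe without needing a backslash escape.
pid = FILTER raw BY line matches 'PID[|].*';

-- Capture the run of non-pipe characters at the end of the line.
last_field = FOREACH pid GENERATE REGEX_EXTRACT(line, '([^|]+)$', 1) AS rk;

DUMP last_field;
```

For the sample PID line, rk should come out as ELJKDP@#PDUB19790615F, assuming the line has no trailing whitespace or carriage return after the last field.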
