I'm stuck on string parsing in Pig.
I have looked at the documentation around regex_extract and regex_extract_all and hoped to use one of those functions.
I have a file '/logs/test.log':
cat '/logs/test.log'
user=242562&friend=6226&friend=93856&age=35&friend=35900
I want to extract the friend tags from the URL; in this case there are 3 tags with the same key. REGEX_EXTRACT only returns the first match, which is what I expected, and REGEX_EXTRACT_ALL seems to require that I know the pattern of the whole string, which changes on each row of the source file.
REGEX_EXTRACT runs fine, but it only gives me the first friend:
[root@test]# pig -x local
A = LOAD './test.log';
B = FOREACH A GENERATE REGEX_EXTRACT($0, 'friend=([0-9]*)',1);
dump B;
(6226)
The examples I see for REGEX_EXTRACT_ALL use a regex that spells out every tag in the string:
B = FOREACH A GENERATE FLATTEN(REGEX_EXTRACT_ALL($0, 'user=([0-9]+?)&friend=([0-9]+?)&friend=([0-9]+?)&.+?'));
dump B;
(242562,6226,93856)
That seems to work, but I really just want to extract the friends: (6226,93856,35900). I also have cases where there might be more or fewer than 3 friends per user.
Any ideas?
I'm also looking at using something like FLATTEN(TOKENIZE($0,'&')) and then keeping only the tokens where SUBSTRING($0,0,INDEXOF($0,'=')) == 'friend', or something like that, but I wanted to see if anyone knows a good regex approach.
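For reference, here is a rough, untested sketch of that tokenize-and-filter idea. The field names (line, user, pair, friend) are placeholders of my own, and it assumes every row carries exactly one user tag. It swaps the SUBSTRING/INDEXOF comparison for a MATCHES filter, which I believe has to match the whole token:

```pig
-- Untested sketch; relation and field names are hypothetical.
A = LOAD './test.log' AS (line:chararray);

-- Keep the user id next to each '&'-separated token so rows can be regrouped later.
B = FOREACH A GENERATE
        REGEX_EXTRACT(line, 'user=([0-9]+)', 1) AS user,
        FLATTEN(TOKENIZE(line, '&')) AS pair;

-- Keep only the tokens of the form friend=<digits>.
C = FILTER B BY pair MATCHES 'friend=[0-9]+';

-- Strip the 'friend=' prefix, leaving one (user, friend) row per friend.
D = FOREACH C GENERATE user, REGEX_EXTRACT(pair, 'friend=([0-9]+)', 1) AS friend;

-- Collect the friends back into one bag per user.
E = GROUP D BY user;
F = FOREACH E GENERATE group AS user, D.friend AS friends;
DUMP F;
```

If this works the way I expect, it should cope with any number of friends per row, since each friend becomes its own row before the GROUP.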