I am pretty new to Pig and have a question about log parsing. I currently parse the important tags out of my URL string with REGEX_EXTRACT, but I am thinking I should transform the whole string into a map instead. I am working on a sample data set using Pig 0.10, but I am starting to get really lost. In reality, my URL string has repeated tags, so my map should actually be a map with bags as the values. Then I could write any subsequent job using flatten.
Here is my test data. The last entry shows my problem with repeated tags.
`pig -x local`
grunt> cat test.log
test1 user=3553&friend=2042&system=262
test2 user=12523&friend=26546&browser=firfox
test2 user=205&friend=3525&friend=353
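To make the target concrete, here is a minimal sketch in plain Python (not Pig) of the shape I am after for each url string: every tag maps to a collection of all its values. The function name `parse_tags` is hypothetical and only for illustration, and it assumes each pair is separated by `&` with the tag and value separated by the first `=`.

```python
def parse_tags(url):
    """Split 'k=v&k=v' into a dict mapping each tag to a list of its values."""
    tags = {}
    for pair in url.split('&'):
        if not pair:
            # Skip empty segments such as a trailing '&'.
            continue
        # partition splits on the first '=' only; a missing '=' gives an empty value.
        key, _, value = pair.partition('=')
        # Repeated tags accumulate in the same list instead of overwriting.
        tags.setdefault(key, []).append(value)
    return tags


if __name__ == '__main__':
    # Third line of the test data: 'friend' appears twice.
    print(parse_tags('user=205&friend=3525&friend=353'))
```

For the third test line, both friend values end up under the single key `friend`, which is the "map of bags" result I would like to get out of Pig.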
I am using TOKENIZE to generate an inner bag.
grunt> A = load 'test.log' as (f:chararray, url:chararray);
grunt> B = foreach A generate f, TOKENIZE(url,'&') as attr;
grunt> describe B;
B: {f: chararray,attr: {tuple_of_tokens: (token: chararray)}}
grunt> dump B;
(test1,{(user=3553),(friend=2042),(system=262)})
(test2,{(user=12523),(friend=26546),(browser=firfox)})
(test2,{(user=205),(friend=3525),(friend=353)})
Next I use a nested foreach on these relations, but I think nested foreach has some limitations I am not aware of.
grunt> C = foreach B {
>> D = foreach attr generate STRSPLIT($0,'=');
>> generate f, D as taglist;
>> }
grunt> dump C;
(test1,{((user,3553)),((friend,2042)),((system,262))})
(test2,{((user,12523)),((friend,26546)),((browser,firfox))})
(test2,{((user,205)),((friend,3525)),((friend,353))})
grunt> G = foreach C {
>> H = foreach taglist generate TOMAP($0.$0, $0.$1) as tagmap;
>> generate f, H as alltags;
>> }
grunt> describe G;
G: {f: chararray,alltags: {tuple_of_tokens: (tagmap: map[])}}
grunt> dump G;
(test1,{([user#3553]),([friend#2042]),([system#262])})
(test2,{([user#12523]),([friend#26546]),([browser#firfox])})
(test2,{([user#205]),([friend#3525]),([friend#353])})
grunt> MAPTEST = foreach G generate f, flatten(alltags.tagmap);
grunt> describe MAPTEST;
MAPTEST: {f: chararray,null::tagmap: map[]}
grunt> res = foreach MAPTEST generate $1#'user';
grunt> dump res;
(3553)
()
()
(12523)
()
()
(205)
()
()
grunt> res = foreach MAPTEST generate $1#'friend';
grunt> dump res;
()
(2042)
()
()
(26546)
()
()
(3525)
(353)
So that's not terrible. I think it's close, but not perfect. My bigger concern is that I need to group the tags, since the last line has two "friend" tags, at least before I add them to the map.
grunt> dump C;
(test1,{((user,3553)),((friend,2042)),((system,262))})
(test2,{((user,12523)),((friend,26546)),((browser,firfox))})
(test2,{((user,205)),((friend,3525)),((friend,353))})
I tried a nested foreach with a group, but that causes an error.
grunt> G = foreach C {
>> H = foreach taglist generate *;
>> I = group H by $1;
>> generate I;
>> }
2013-01-18 14:56:31,434 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: <line 34, column 10> Syntax error, unexpected symbol at or near 'H'
Does anyone have ideas on how to get closer to turning this URL string into a map of bags? I figured there'd be a Pig macro or something, since this seems like a common use case. Any ideas are very much appreciated.