
I am pretty new to Pig and have a question about log parsing. I currently parse out the important tags in my URL string via REGEX_EXTRACT, but I'm thinking I should transform the whole string into a map. I am working on a sample set of data using Pig 0.10, but I'm starting to get really lost. In reality, my URL string has repeated tags, so my map should actually be a map with bags as the values. Then I could just write any subsequent job using FLATTEN.
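To show what I mean, here is the structure I'm after, sketched in plain Python outside of Pig (the function name and sample string are just mine for illustration):

```python
# Illustration only: the map-of-bags structure I want Pig to build.
def parse_query(query):
    result = {}
    for pair in query.split("&"):
        key, value = pair.split("=", 1)
        # Repeated keys accumulate into a list (Pig's "bag")
        result.setdefault(key, []).append(value)
    return result

print(parse_query("user=205&friend=3525&friend=353"))
# {'user': ['205'], 'friend': ['3525', '353']}
```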

Here is my test data. The last entry shows my problem with repeated tags.

`pig -x local`
grunt> cat test.log
test1   user=3553&friend=2042&system=262
test2   user=12523&friend=26546&browser=firfox
test2   user=205&friend=3525&friend=353

I am using TOKENIZE to generate an inner bag.

grunt> A = load 'test.log' as (f:chararray, url:chararray);
grunt> B = foreach A generate f, TOKENIZE(url,'&') as attr;
grunt> describe B;
B: {f: chararray,attr: {tuple_of_tokens: (token: chararray)}}

grunt> dump B;
(test1,{(user=3553),(friend=2042),(system=262)})
(test2,{(user=12523),(friend=26546),(browser=firfox)})
(test2,{(user=205),(friend=3525),(friend=353)})

I then use a nested foreach on these relations, but I think it has some limitations I am not aware of.

grunt> C = foreach B {
>> D = foreach attr generate STRSPLIT($0,'=');
>> generate f, D as taglist;
>> }

grunt> dump C;
(test1,{((user,3553)),((friend,2042)),((system,262))})
(test2,{((user,12523)),((friend,26546)),((browser,firfox))})
(test2,{((user,205)),((friend,3525)),((friend,353))})

grunt> G = foreach C {
>> H = foreach taglist generate TOMAP($0.$0, $0.$1) as tagmap;
>> generate f, H as alltags;
>> }

grunt> describe G;
G: {f: chararray,alltags: {tuple_of_tokens: (tagmap: map[])}}

grunt> dump G;
(test1,{([user#3553]),([friend#2042]),([system#262])})
(test2,{([user#12523]),([friend#26546]),([browser#firfox])})
(test2,{([user#205]),([friend#3525]),([friend#353])})

grunt> MAPTEST = foreach G generate f, flatten(alltags.tagmap);
grunt> describe MAPTEST;
MAPTEST: {f: chararray,null::tagmap: map[]}

grunt> res = foreach MAPTEST generate $1#'user';
grunt> dump res;
(3553)
()
()
(12523)
()
()
(205)
()
()

grunt> res = foreach MAPTEST generate $1#'friend';
grunt> dump res;
()
(2042)
()
()
(26546)
()
()
(3525)
(353)
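The empty tuples make sense once you see that each flattened row of MAPTEST carries a single-entry map, so a lookup only hits on that row's own key. A rough Python analogy (the variable names are just mine):

```python
# Each flattened row holds a one-entry map; looking up any other key misses.
rows = [{"user": "205"}, {"friend": "3525"}, {"friend": "353"}]
print([m.get("friend") for m in rows])
# [None, '3525', '353']
```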

So that's not terrible. I think it's close, but not perfect. My bigger concern is that I need to group the tags, since the last line has two tags for "friend", at least before I add it to the map.

grunt> dump C;
(test1,{((user,3553)),((friend,2042)),((system,262))})
(test2,{((user,12523)),((friend,26546)),((browser,firfox))})
(test2,{((user,205)),((friend,3525)),((friend,353))})

I tried the nested foreach with a GROUP, but that causes an error.

grunt> G = foreach C {
>> H = foreach taglist generate *;
>> I = group H by $1;
>> generate I;
>> }
2013-01-18 14:56:31,434 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200:   <line 34, column 10>  Syntax error, unexpected symbol at or near 'H'

Does anyone have ideas on how to get closer to turning this URL string into a map of bags? I figured there would be a Pig macro or something for this, since it seems like a common use case. Any ideas are very much appreciated.


2 Answers


Good news and bad news. The good news is it is pretty simple to achieve this. The bad news is that you will not be able to achieve what I would presume is the ideal -- all of the tag/value pairs in a single map -- without resorting to a UDF.

First, a couple tips: FLATTEN the result of STRSPLIT so that you don't have a useless level of nesting in your tuples, and FLATTEN again inside the nested foreach so that you don't need to do it later. Also, STRSPLIT has an optional third argument to give the maximum number of output strings. Use that to guarantee a schema for its output. Here's a modified version of your script:

A = load 'test.log' as (f:chararray, url:chararray);
B = foreach A generate f, TOKENIZE(url,'&') as attr;
C = foreach B {
    D = foreach attr generate FLATTEN(STRSPLIT($0,'=',2)) AS (key:chararray, val:chararray);
    generate f, FLATTEN(D);
};
E = foreach (group C by (f, key)) generate group.f, TOMAP(group.key, C.val);
dump E;

Output:

(test1,[user#{(3553)}])
(test1,[friend#{(2042)}])
(test1,[system#{(262)}])
(test2,[user#{(12523),(205)}])
(test2,[friend#{(26546),(3525),(353)}])
(test2,[browser#{(firfox)}])

After you've finished splitting out the tags and values, group by the tag as well to get your bag of values, then put that into a map. Note that this assumes that if you have two lines with the same id (test2, here), you want to combine them. If that isn't the case, you'll need to construct a unique identifier for each line.
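The grouping step is the heart of it; here is the same group-by-(id, tag) logic sketched in plain Python (not Pig, just to make the shape of E concrete):

```python
# Python sketch of relation E: group (id, key, value) rows by (id, key),
# then collect each group's values into a "bag" (a list here).
rows = [
    ("test2", "user", "12523"), ("test2", "friend", "26546"),
    ("test2", "user", "205"), ("test2", "friend", "3525"),
    ("test2", "friend", "353"),
]
groups = {}
for f, key, val in rows:
    groups.setdefault((f, key), []).append(val)

print(groups[("test2", "friend")])
# ['26546', '3525', '353']
```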

Unfortunately, there is apparently no way to combine maps without resorting to a UDF, but this should be just about the simplest of all possible UDFs. Something like (untested):

import java.io.IOException;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;

public class COMBINE_MAPS extends EvalFunc<Map> {
    public Map<String, Object> exec(Tuple input) throws IOException {
        if (input == null || input.size() != 1) { return null; }

        // Input tuple is a singleton containing the bag of maps
        DataBag b = (DataBag) input.get(0);

        // Create the map that we will populate and return
        Map<String, Object> m = new HashMap<String, Object>();

        // Iterate through the bag, adding the entries from each map
        Iterator<Tuple> iter = b.iterator();
        while (iter.hasNext()) {
            Tuple t = iter.next();
            m.putAll((Map<String, Object>) t.get(0));
        }

        return m;
    }
}

With a UDF like that, you can do:

F = foreach (group E by f) generate COMBINE_MAPS(E.$1);

Note that in this UDF, if any of the input maps have overlap in their keys, one will overwrite the other and there is no way to tell ahead of time which will "win". If this could be a problem, you would need to add some sort of error-checking code to the UDF.
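This last-writer-wins behavior is the same as repeatedly updating a plain dictionary; a quick Python sketch of what Map.putAll does on a key clash:

```python
# Map.putAll in the UDF behaves like dict.update: the later map wins on clashes.
combined = {}
for m in [{"user": "205"}, {"user": "999"}]:
    combined.update(m)

print(combined)
# {'user': '999'}
```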


I thought I would update this in case anyone tries to do this in the future. I never got the Pig Latin to work, but I went the full UDF route. Sadly, I am not really a programmer by trade, so the Java examples had me lost for a while, but I managed to hack together a Python UDF that has been working so far. I still need to clean it up to handle errors and whatnot, but it is usable for now. I'm sure there is a better Java way to do this as well.

#!/usr/bin/python

@outputSchema("tagmap:map[{(value:chararray)}]")
def inst_url_parse(url_query):
    query_vals = url_query.split("&")
    url_query_map = {}
    for each_val in query_vals:
        kv = each_val.split("=")
        if kv[0] in url_query_map:
            url_query_map[kv[0]].append(kv[1])
        else:
            url_query_map[kv[0]] = [kv[1]]
    return url_query_map

I really love that our URL query is stored this way, since each key can have 0, 1, or N values. Downstream jobs just call flatten(tagmap#'key') in the eval, and it's pretty painless compared to what I was doing before. We can develop much faster using this. We also store the data in HCatalog as

querymap<string, array<string>> 

and it seems to work fine for Hive queries/views using LATERAL VIEW too. Who knew?
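For anyone curious what flatten(tagmap#'key') buys downstream, here is a rough Python analogy of that lookup-then-flatten step (the data is just the test2 row from the question):

```python
# Rough analogy for flatten(tagmap#'friend'): one output row per bag element,
# and a missing key simply yields no rows.
tagmap = {"user": ["205"], "friend": ["3525", "353"]}
rows = [(v,) for v in tagmap.get("friend", [])]
print(rows)
# [('3525',), ('353',)]
```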

Sorry if this is too opinionated for a Q and A site.