We want to use Pig for large-scale analysis of our server logs. I need to load a Pig map datatype from a file.

I tried running a sample Pig script with the following data.

A line in my CSV file, named 'test' (to be processed by Pig), looks like this:

151364,[ref#R813,highway#secondary]

My Pig script:

a = LOAD 'test' USING PigStorage(',') AS (id:INT, m:MAP[]);
DUMP a;

The idea is to load the first element as an int and the second as a map. However, when I dump, the int field is parsed correctly (and is printed in the dump), but the map field is not parsed, resulting in a parsing error.

Can someone please explain if I am missing something?

2 Answers

1
votes

I think this is a delimiter-related problem: the field delimiter is somehow affecting the parsing of the map field, or it is being confused with the delimiter between map entries.

When this input data is used (notice I used a semicolon as the field delimiter):

151364;[ref#R813,highway#secondary]

below is the output from my grunt shell:

grunt> a = LOAD '/tmp/temp2.txt' using PigStorage(';') AS (id:int, m:[]);
grunt> dump a;
...
(151364,[highway#secondary,ref#R813])

grunt> b = foreach a generate m#'ref'; 
grunt> dump b;
(R813)
1
votes

At last, I figured out the problem. Just change the field delimiter from ',' to another character, say a pipe ('|'). The field delimiter was being confused with the ',' that separates entries inside the map :)

The string 151364,[ref#R813,highway#secondary] was being split into three fields:
field1: 151364  field2: [ref#R813  field3: highway#secondary]
Since '[ref#R813' is not a valid map value, there is a parse error.
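
To make the "not a valid map" point concrete, here is a minimal sketch of a check for the [key#value,key#value] shape seen in the data above. The class MapLiteralDemo and the method parseMapLiteral are hypothetical names made up for illustration. This is not Pig's actual map parser, only a simplified stand-in that assumes keys and values contain no '#', ',' or brackets.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration only; not Pig's actual map parser.
class MapLiteralDemo {

    // Returns the entries if text has the shape [key#value,key#value],
    // or null if the shape does not match (simplified assumption).
    static Map<String, String> parseMapLiteral(String text) {
        if (text.length() < 2 || !text.startsWith("[") || !text.endsWith("]")) {
            return null; // missing the opening or closing bracket
        }
        Map<String, String> result = new LinkedHashMap<>();
        String body = text.substring(1, text.length() - 1);
        if (body.isEmpty()) {
            return result; // "[]" is treated as an empty map
        }
        for (String entry : body.split(",")) {
            int hash = entry.indexOf('#');
            if (hash < 0) {
                return null; // entry without a key#value separator
            }
            result.put(entry.substring(0, hash), entry.substring(hash + 1));
        }
        return result;
    }

    public static void main(String[] args) {
        // The fragment left over after splitting on ',' has no closing bracket.
        System.out.println(parseMapLiteral("[ref#R813"));
        // The intact field has the expected shape.
        System.out.println(parseMapLiteral("[ref#R813,highway#secondary]"));
    }
}
```

Under these assumptions the truncated fragment is rejected, while the intact field yields both entries, so a lookup of "ref" would return R813.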

I also looked into the source code of PigStorage and confirmed the parsing logic in getNext() (abridged):

@Override
public Tuple getNext() throws IOException {
    for (int i = 0; i < len; i++) {
        // skipping some stuff
        if (buf[i] == fieldDel) {      // found a field delimiter
            readField(buf, start, i);  // extract the field between the previous delimiter and this one
            start = i + 1;             // the next field starts right after the delimiter
            fieldID++;
        }
    }
    // rest of the method omitted
}

Thus, PigStorage splits each line on every occurrence of the field delimiter without looking at brackets, so the commas inside the map are treated as field separators and the map is cut apart before it can be parsed.
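
The effect can be reproduced outside Pig with a small sketch written in the same spirit as the getNext() loop quoted above. The class DelimiterScanDemo and the method splitFields are hypothetical names for illustration, not part of Pig, and the sketch works on a String rather than Pig's byte buffer.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class; it mimics the delimiter scan but is not Pig code.
class DelimiterScanDemo {

    // Splits a line on every occurrence of fieldDel, with no awareness of
    // brackets, so a delimiter inside [...] also ends a field.
    static List<String> splitFields(String line, char fieldDel) {
        List<String> fields = new ArrayList<>();
        int start = 0;
        for (int i = 0; i < line.length(); i++) {
            if (line.charAt(i) == fieldDel) {
                fields.add(line.substring(start, i)); // field from previous delimiter to this one
                start = i + 1;
            }
        }
        fields.add(line.substring(start)); // last field, after the final delimiter
        return fields;
    }

    public static void main(String[] args) {
        List<String> byComma = splitFields("151364,[ref#R813,highway#secondary]", ',');
        List<String> byPipe = splitFields("151364|[ref#R813,highway#secondary]", '|');
        System.out.println(byComma.size() + " fields with ','"); // 3 fields: the map is cut in two
        System.out.println(byPipe.size() + " fields with '|'");  // 2 fields: the map stays intact
    }
}
```

With ',' the second field comes back as [ref#R813, which is the broken fragment described above. With '|' the second field is the whole [ref#R813,highway#secondary], which is why switching the delimiter fixes the load.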