I am using the -tagsource option of PigStorage while loading the input data so that I can identify the source file of each record. Later, when I project only selected fields from the input tuple, Pig seems to make some assumptions: a certain field is projected every time, even though I try to leave it out.

Take a look at my script.

rawdata = load 'data/201212*' using PigStorage(' ', '-tagsource') as (filename:chararray, ts: int, ip: chararray, domain: chararray, answer: chararray);

A = foreach rawdata generate ts, ip, domain, answer, CONCAT(CONCAT(filename, '_'), UPPER(SUBSTRING(domain, 0, 1))) as domain_index, filename as filename;

B = foreach A generate ip as ip, SUBSTRING(domain, 0, 1) as domain_first_char, filename;

dump A;

dump B;

ILLUSTRATE B;

B should contain only selected fields from A: ip, the first character of domain, and filename. However, when I dump B, the first field holds the 'ts' value (the first field in A) instead of ip. In ILLUSTRATE B, on the other hand, everything looks as expected.

Dump of A:

(100,123.98.11.123,google.com,{(google)},20121201_G,20121201)

(95,500.98.11.123,yahoo.com,{(yahoo)},20121201_Y,20121201)

(107,123.98.11.123,google.com,{(google)},20121201_G,20121201)

(156,123.98.11.123,cnn.com,{(cnn)},20121201_C,20121201)

Dump of B:

(100,g,20121201)

(95,y,20121201)

(107,g,20121201)

(156,c,20121201)

ILLUSTRATE B:

B | ip:chararray | domain_first_char:chararray | filename:chararray

| 123.98.11.123 | g | 20121202

As the dump of B shows, the first field holds the ts value rather than the ip value that ILLUSTRATE B reports.

1 Answer

After searching around, I found that this is a bug in PigStorage and that there is a workaround.

Starting Pig with the flag -t ColumnMapKeyPrune, which turns off the ColumnMapKeyPrune optimizer rule, fixes the issue. For example, start Pig with the command pig -x local -t ColumnMapKeyPrune sample.pig.
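If you run the script often, the workaround can be wrapped in a small shell script. This is only a sketch: the wrapper itself and the default script name sample.pig are assumptions for illustration. It uses -optimizer_off, which is the long form of -t on the Pig command line.

```shell
#!/bin/sh
# Hypothetical wrapper: run a Pig script in local mode with the
# ColumnMapKeyPrune optimizer rule turned off.
# -optimizer_off is the long form of the -t flag.
SCRIPT="${1:-sample.pig}"   # script to run; defaults to sample.pig (assumed name)
CMD="pig -x local -optimizer_off ColumnMapKeyPrune $SCRIPT"

echo "Running: $CMD"
if command -v pig >/dev/null 2>&1; then
  $CMD
else
  # Pig is not installed or not on PATH; report and do nothing else.
  echo "pig not found on PATH" >&2
fi
```

Turning the rule off means Pig no longer prunes unused columns for this run, so on large inputs the script may read more data than strictly needed.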

Thanks to Reremiah Rounds of the Pig user group.