I am using the -tagsource option of PigStorage when loading the input data, so that each tuple carries the name of its source file. Later, when I project only selected fields from the input tuple, Pig seems to make some assumptions: certain fields get projected every time, even though I try to leave them out.
Here is my script:
rawdata = load 'data/201212*' using PigStorage(' ', '-tagsource') as (filename:chararray, ts: int, ip: chararray, domain: chararray, answer: chararray);
A = foreach rawdata generate ts, ip, domain, answer, CONCAT(CONCAT(filename, '_'), UPPER(SUBSTRING(domain, 0, 1))) as domain_index, filename as filename;
B = foreach A generate ip as ip, SUBSTRING(domain, 0, 1) as domain_first_char, filename;
dump A;
dump B;
ILLUSTRATE B;
When creating B, I include only selected fields from A. However, when I dump B, the 'ts' field (the first field in A) shows up in place of 'ip'. In ILLUSTRATE B, on the other hand, everything looks as expected.
Dump of A:
(100,123.98.11.123,google.com,{(google)},20121201_G,20121201)
(95,500.98.11.123,yahoo.com,{(yahoo)},20121201_Y,20121201)
(107,123.98.11.123,google.com,{(google)},20121201_G,20121201)
(156,123.98.11.123,cnn.com,{(cnn)},20121201_C,20121201)
Dump of B:
(100,g,20121201)
(95,y,20121201)
(107,g,20121201)
(156,c,20121201)
ILLUSTRATE B:
| B | ip:chararray  | domain_first_char:chararray | filename:chararray |
|   | 123.98.11.123 | g                           | 20121202           |
As the dump of B shows, the first field holds the ts value rather than the ip value that ILLUSTRATE B reports. For the first row, I expected (123.98.11.123,g,20121201) instead of (100,g,20121201).