FOREACH multiple data in Pig Latin

Question

Can I do something like this in Pig Latin?

data1 = LOAD 'hadoop/text1.txt' AS (line:chararray);
data2 = LOAD 'hadoop/text2.txt' AS (line:chararray);

mixed = FOREACH data1, data2 GENERATE data1:line, data2:line;

Could you give a short example of what you actually want to achieve? — Frederic
Basically, my data is split across two files. Both have the same number of records, and corresponding records are on the same lines. I want to combine them so I can process the records together. — divinedragon
Is there a common key for the lines in both files? In that case you could do a JOIN. — Frederic
Is it too onerous to go through your files and add line numbers? This would serve as the key for your JOIN. In general, this would be a good idea -- what happens if one line gets accidentally deleted from one of the files? Now you have no way to match them up again. It would be good practice, if your data is in separate files, to ensure that those files have keys that can be matched up. — reo katoa

delmet · Accepted Answer · 2012-11-15T18:55:27

In general, it wouldn't make sense to do what you are asking, as the data will be loaded by multiple mappers, perhaps one line at a time. There is no guarantee that the corresponding lines will be seen by the same mapper, and no guarantee that the mappers know what line of what block they are reading. As WinnieNicklaus mentioned, the best thing to do is to label the lines and do a join.
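
A minimal sketch of the label-and-join approach, assuming each file has already been preprocessed outside Pig so that every line reads "<line number><TAB><text>" (for example with something like awk '{print NR "\t" $0}'). The numbered file names and the line_no field are hypothetical and not part of the original question:

```pig
-- Assumption: both files were numbered beforehand, one "<line_no>\t<text>" record per line.
-- The *_numbered.txt paths are hypothetical names for those preprocessed files.
data1 = LOAD 'hadoop/text1_numbered.txt' USING PigStorage('\t') AS (line_no:long, line:chararray);
data2 = LOAD 'hadoop/text2_numbered.txt' USING PigStorage('\t') AS (line_no:long, line:chararray);

-- Match records that carry the same line number.
joined = JOIN data1 BY line_no, data2 BY line_no;

-- After a JOIN, fields are disambiguated with the relation::field syntax.
mixed = FOREACH joined GENERATE data1::line_no AS line_no,
                                data1::line    AS line1,
                                data2::line    AS line2;
```

Because this is an inner join, a line number that appears in only one file is dropped from mixed, which also makes a missing or deleted line easy to detect by comparing record counts.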
