0
votes

The LOAD operator loads data from HDFS or the local file system. For example:

grunt> employees = LOAD 'hdfs://localhost:9090/pig_dir/data.txt' USING PigStorage(',') AS (id:int, salary:int, ...etc);

After that, further Pig commands can be executed, such as:

grunt> wellpaid_employees = FILTER employees BY salary > 100000;

So I started wondering where Pig stores the "employees" relation, which is needed for further processing, i.e. for generating wellpaid_employees.
1) employees relation - If Pig just saved employees in a temp directory (chosen by configuration), what would be the benefit? It can read the data from HDFS every time anyway, and the file can be large, from 1 GB to 1 TB or even more. So I assume that LOAD does not duplicate the data anywhere else: it works lazily and uses the original files in HDFS when running Pig jobs (which are MR jobs behind the scenes).
2) wellpaid_employees relation - When Pig processes the employees relation to generate the wellpaid_employees relation, where does it save the result? I ask because I may have to do further processing on "wellpaid_employees", for example to get all well-paid employees in a particular city:

grunt> wellpaid_employees_in_newyork = FILTER wellpaid_employees BY city == 'NY';

In this case I see the benefit of Pig storing all the intermediate and end results/relations somewhere. Is this how Pig works?

So how (format etc.) and where (physical location) does Pig store the intermediate results/relations, and how can these aspects be configured?

But if the intermediate results are also too big, say several GB, how does the trade-off work (between reprocessing the previous stages every time and storing the result)? Can that be configured as well?

2
Pig will keep the relation in memory until it runs out. Then it spills to disk. But it is not storing the relation, nothing is permanent. You can't access it in another Pig job, unless you actively store it. - Andrew
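
To illustrate "actively store it", here is a minimal sketch. The output path /pig_dir/wellpaid_employees and the position of the city field in the schema are assumptions made for the example; they are not taken from the question.

```pig
-- Script 1: compute the relation and persist it explicitly.
-- Assumption: city is the third field; the real schema was elided in the question.
employees = LOAD 'hdfs://localhost:9090/pig_dir/data.txt' USING PigStorage(',')
            AS (id:int, salary:int, city:chararray);
wellpaid_employees = FILTER employees BY salary > 100000;

-- Hypothetical output directory; it must not exist before the job runs.
STORE wellpaid_employees INTO '/pig_dir/wellpaid_employees' USING PigStorage(',');

-- Script 2 (a separate Pig job, run later): reload the stored result
-- instead of recomputing it from the original file.
wellpaid_employees = LOAD '/pig_dir/wellpaid_employees' USING PigStorage(',')
                     AS (id:int, salary:int, city:chararray);
wellpaid_employees_in_newyork = FILTER wellpaid_employees BY city == 'NY';
DUMP wellpaid_employees_in_newyork;
```

Whether the extra STORE pays off depends on how expensive the earlier stages are compared with writing and re-reading the result.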

2 Answers

1
votes

All transformations are lazy: they do not compute their results right away. Only DUMP and STORE trigger execution; for every other statement Pig just checks the syntax for errors.

The statements are kept in memory and executed in order once an action (STORE or DUMP) is used.
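
A small sketch of what this looks like in practice, reusing the relations from the question. The city field and its position in the schema are assumed, since the original schema was elided.

```pig
-- These three statements only build up a plan; no job is launched yet.
employees = LOAD 'hdfs://localhost:9090/pig_dir/data.txt' USING PigStorage(',')
            AS (id:int, salary:int, city:chararray);
wellpaid_employees = FILTER employees BY salary > 100000;
wellpaid_employees_in_newyork = FILTER wellpaid_employees BY city == 'NY';

-- EXPLAIN prints the logical, physical and MapReduce plans without running them.
EXPLAIN wellpaid_employees_in_newyork;

-- DUMP is the action that actually triggers execution of the whole chain.
DUMP wellpaid_employees_in_newyork;
```

In a simple chain like this one, both filters would normally be applied in a single pass over the input, so wellpaid_employees is never written out as a separate data set. The EXPLAIN output should show how many jobs Pig plans to run.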

1
votes

Pig has a configurable property in its config files which lets you set the HDFS location, say /user/cloudera/loc1 (generally done by the admin), for the results of the intermediate MR jobs. So when the MR compiler comes out with a chain of MR jobs, e.g. MR1-->MR2-->MR3-->MR4 where MR4 is the final step, the compiler tells MR1, MR2 and MR3 to write to this configured location. MR4 of course uses the location that you specify with the STORE statement.
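
As a rough sketch of such a chain: the property in question is pig.temp.dir (default /tmp), which is usually set in pig.properties or passed on the command line. The script below is only an illustration; the city field, the output path and the exact number of jobs Pig generates are assumptions.

```pig
-- Assumption: pig.temp.dir was set before the script starts, for example
--   pig -Dpig.temp.dir=/user/cloudera/loc1 script.pig
-- The directory must already exist on HDFS.
employees = LOAD 'hdfs://localhost:9090/pig_dir/data.txt' USING PigStorage(',')
            AS (id:int, salary:int, city:chararray);

-- GROUP needs a reduce phase, so this part becomes one MR job.
by_city = GROUP employees BY city;
avg_salary = FOREACH by_city GENERATE group AS city,
             AVG(employees.salary) AS avg_sal;

-- ORDER BY typically adds further jobs (sampling, then sorting); the data
-- handed from one job to the next is written under pig.temp.dir.
ranked = ORDER avg_salary BY avg_sal DESC;

-- Only the final job writes to the location named here (hypothetical path).
STORE ranked INTO '/pig_dir/avg_salary_by_city';
```

Pig cleans up these temporary files itself; they are not meant to be read by other jobs.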

More here: https://pig.apache.org/docs/latest/start.html#data-store