Under the hood: where does Pig save intermediate results/relation data?

Question

The LOAD command loads data from HDFS or the local file system, e.g.:

grunt> employees = LOAD 'hdfs://localhost:9090/pig_dir/data.txt' USING PigStorage(',') AS (id:int, salary:int, ...etc);

After that, further Pig commands can be executed, such as:

grunt> wellpaid_employees = FILTER employees BY salary > 100000;

So I started wondering where Pig stores the employees data/relation, which is used when further processing is needed, i.e. when generating wellpaid_employees.
1) The employees relation: if Pig just saved employees in a temp directory (set by configuration), what would be the benefit? It can read the data from HDFS every time anyway, and the file can be large, from 1 GB to 1 TB or even more. So I assume that LOAD does not duplicate the data anywhere else: it works lazily and uses the original files in HDFS when running Pig jobs (which are MR jobs behind the scenes).
2) The wellpaid_employees relation: when Pig processes the employees relation to generate wellpaid_employees, where does it save the result? I ask because I may need to do further processing on wellpaid_employees, for example to get all well-paid employees in a particular city:

grunt> wellpaid_employees_in_newyork = FILTER wellpaid_employees BY city == 'NY';

In this case I see the benefit of Pig storing all the intermediate and final results/relations somewhere. Is this how Pig works?

So how (format etc.) and where (physical location) does Pig store the intermediate results/relations, and how can these aspects be configured?

And if the intermediate results are also too big, say several GB, how does the trade-off work (between reprocessing the previous stages every time and storing the result)? Can that be configured as well?

Pig will keep the relation in memory until it runs out. Then it spills to disk. But it is not storing the relation, nothing is permanent. You can't access it in another Pig job, unless you actively store it. — Andrew

Ravinder Karra · Accepted Answer · 2016-11-16T03:47:24

All transformations are lazy: they do not compute their results right away. Only DUMP and STORE trigger execution; for every other statement Pig only checks the syntax for errors.

The statements are kept in memory and executed in order once an action (STORE or DUMP) is issued.
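
To make the lazy behaviour concrete, here is a minimal sketch in Pig Latin. It reuses the relations from the question; the three-field schema (with a city column) and the output path pig_dir/wellpaid are assumptions made up for illustration, since the original schema is elided. DESCRIBE and EXPLAIN are included because they show the schema and the planned jobs without running anything.

```pig
-- Assumed schema and paths, for illustration only.
employees = LOAD 'hdfs://localhost:9090/pig_dir/data.txt'
            USING PigStorage(',')
            AS (id:int, salary:int, city:chararray);

-- Nothing is read or computed yet: these statements only extend the plan.
wellpaid_employees = FILTER employees BY salary > 100000;
wellpaid_employees_in_newyork = FILTER wellpaid_employees BY city == 'NY';

-- Inspect the schema and the execution plan without launching a job.
DESCRIBE wellpaid_employees_in_newyork;
EXPLAIN wellpaid_employees_in_newyork;

-- STORE is an action: it triggers execution and writes a permanent copy.
STORE wellpaid_employees INTO 'hdfs://localhost:9090/pig_dir/wellpaid'
      USING PigStorage(',');

-- A later script can load the stored result instead of recomputing it.
wellpaid = LOAD 'hdfs://localhost:9090/pig_dir/wellpaid'
           USING PigStorage(',')
           AS (id:int, salary:int, city:chararray);
```

If wellpaid_employees is never stored, each DUMP or STORE of a relation derived from it re-reads the original HDFS file and re-applies the filters, so storing an intermediate relation explicitly is the way to trade disk space for recomputation.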
