The LOAD operator loads data from HDFS or the local file system. For example:
grunt> employees = LOAD 'hdfs://localhost:9090/pig_dir/data.txt' USING PigStorage(',') AS (id:int, salary:int, ...etc);
After that, further Pig statements can be executed, such as:
grunt> wellpaid_employees = FILTER employees BY salary > 100000;
This got me thinking: where does Pig store the "employees" relation, which is needed for further processing, i.e. for generating wellpaid_employees?
1) employees relation - If Pig simply copied employees into a temp directory (set by configuration), what would be the benefit? It can read the data from HDFS every time anyway, and the file can be large, from 1 GB to 1 TB or even more. So I assume that LOAD does not duplicate the data anywhere else, that it works lazily, and that it uses the original files in HDFS when running Pig jobs (which are MapReduce jobs behind the scenes).
2) wellpaid_employees relation - When Pig processes the employees relation to generate the wellpaid_employees relation, where does it save the result?
I ask because I may need to do further processing on "wellpaid_employees", for example to get all well-paid employees in a particular city:
grunt> wellpaid_employees_in_newyork = FILTER wellpaid_employees BY city == 'NY';
In this case I can see the benefit of Pig storing all the intermediate and final results/relations somewhere. Is this how Pig works?
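To make the question concrete, here is the whole chain as I currently picture it. This is only a sketch: the city field (and its chararray type) is my assumption about the rest of the schema, and I am assuming that nothing actually runs until a DUMP or STORE is reached, with EXPLAIN only printing the plans.

```pig
-- Sketch only: path and delimiter are taken from the example above;
-- the 'city' field is an assumed part of the schema.
employees = LOAD 'hdfs://localhost:9090/pig_dir/data.txt'
            USING PigStorage(',') AS (id:int, salary:int, city:chararray);

wellpaid_employees = FILTER employees BY salary > 100000;
wellpaid_employees_in_newyork = FILTER wellpaid_employees BY city == 'NY';

-- Prints the logical, physical and MapReduce plans without running a job.
EXPLAIN wellpaid_employees_in_newyork;

-- Only at this point would a job actually be launched.
DUMP wellpaid_employees_in_newyork;
```

If the plan printed by EXPLAIN shows both filters inside a single job that reads data.txt directly, that would suggest neither employees nor wellpaid_employees is written out separately.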
So how (in what format) and where (in what physical location) does Pig store the intermediate results/relations, and how can these aspects be configured?
And if the intermediate results are themselves very large, say several GB, how does the trade-off work between recomputing the previous stages every time and storing the result? Can that be configured as well?
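The only way I know to force an intermediate result to be kept is to write it out myself with STORE and LOAD it again later. A minimal sketch, assuming a hypothetical output directory 'pig_dir/wellpaid' and the same assumed city field as part of the schema:

```pig
-- Sketch only: the output path below is hypothetical,
-- and the 'city' field is an assumed part of the schema.
employees = LOAD 'hdfs://localhost:9090/pig_dir/data.txt'
            USING PigStorage(',') AS (id:int, salary:int, city:chararray);
wellpaid_employees = FILTER employees BY salary > 100000;

-- Explicitly materialize the intermediate relation in HDFS.
STORE wellpaid_employees INTO 'hdfs://localhost:9090/pig_dir/wellpaid'
      USING PigStorage(',');

-- A later script or session could start from the stored result
-- instead of re-reading and re-filtering the original file.
wellpaid_stored = LOAD 'hdfs://localhost:9090/pig_dir/wellpaid'
                  USING PigStorage(',') AS (id:int, salary:int, city:chararray);
wellpaid_employees_in_newyork = FILTER wellpaid_stored BY city == 'NY';
DUMP wellpaid_employees_in_newyork;
```

What I would like to know is whether Pig ever does something like this on its own, or whether it is entirely up to me.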