A few times I've had Pig workflows where I store multiple aliases. For example, the script looks roughly like this:
A = LOAD 'data1' USING PigStorage();
B = LOAD 'data2' USING PigStorage();
C = ...  -- transformation of A
D = ...  -- transformation of B
E = JOIN C BY fieldA, D BY fieldB;
-- STORE E INTO 'foo';
F = ...  -- transformation of E
STORE F INTO 'bar';
I would have thought that uncommenting the STORE of E would add at most one MapReduce job, since the results of E should already be sitting in a temporary Hadoop output file by that point. In practice it always adds multiple jobs, as if Pig is reloading A and B and recomputing E from scratch.
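For reference, this is how I've been counting jobs before running: I check the compiled plan with Pig's EXPLAIN operator from the Grunt shell (the alias name F matches the script sketched above):

EXPLAIN F;  -- prints the logical, physical, and MapReduce plans;
            -- the number of MapReduce jobs in the last section is what
            -- grows when I uncomment the STORE of E

With the STORE of E commented out I see the smaller plan; with it uncommented, the MapReduce plan shows the extra jobs I described.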
When does Pig need to do this, and how do you prevent it?
I'm using Pig 0.11.0.