
A few times I've had Pig workflows where I store multiple aliases. For example, I'll have something roughly like:

A = LOAD 'data1' USING PigStorage();
B = LOAD 'data2' USING PigStorage();
C = ... -- transformation of A
D = ... -- transformation of B
E = JOIN C BY fieldA, D BY fieldB;
-- STORE E INTO 'foo';
F = ... -- transformation of E
STORE F INTO 'bar';

I would think that if I uncommented the `STORE E` line, it would add at most one map-reduce job, since the intermediate result E should already exist in a temporary Hadoop output file. In practice, it always adds multiple jobs, as if Pig is reloading A and B and recomputing E from scratch.

When does Pig need to do this, and how do you prevent it?

Using version 0.11.0.

1 Answer


Make sure that Multi-Query Optimization is enabled (it is on by default) and that you are running the script in batch mode, like

bash> pig script.pig

rather than copying and pasting the statements into the Grunt shell. In interactive mode, each `STORE` is executed as soon as you enter it, so Pig cannot combine the two `STORE` statements into a shared plan.
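As a sketch (using the same aliases as the question; the flag names below are from Pig 0.11's command-line options), a script with both `STORE` statements run in batch mode lets the optimizer compute E once and feed both outputs:

```pig
-- script.pig: two STOREs sharing the work up to E
A = LOAD 'data1' USING PigStorage();
B = LOAD 'data2' USING PigStorage();
C = ... -- transformation of A
D = ... -- transformation of B
E = JOIN C BY fieldA, D BY fieldB;
STORE E INTO 'foo';
F = ... -- transformation of E
STORE F INTO 'bar';
```

```shell
# Batch mode: multi-query optimization merges the two STOREs
# into a shared execution plan.
pig script.pig

# To see the difference, you can explicitly disable the optimization
# with -no_multiquery (alias -M); Pig then executes each STORE
# as an independent plan, recomputing E.
pig -no_multiquery script.pig
```

If you want to stay in Grunt, running the script via `exec script.pig` also executes it as a batch, unlike typing the statements in one at a time.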