6 votes

Our workflow uses an AWS Elastic MapReduce cluster to run a series of Pig jobs that manipulate a large amount of data into aggregated reports. Unfortunately, the input data is potentially inconsistent and can result in either no input files or 0-byte files being given to the pipeline, or even being produced by some stages of the pipeline.

During a LOAD statement, Pig fails spectacularly if it either finds no input files at all or if any of the input files are 0 bytes.

Is there any good way to work around this (hopefully within the Pig configuration or script or the Hadoop cluster configuration, without writing a custom loader...)?

(Since we're using AWS Elastic MapReduce, we're stuck with Pig 0.6.0 and Hadoop 0.20.)

2
I stopped using Pig because of issues like this. It's also next to impossible to write a custom loader in 0.6.0 (they improved the loader API in 0.8.0). Consider using Hive. – Spike Gronim

2 Answers

1 vote

(For posterity, a sub-par solution we've come up with:)

To deal with the 0-byte problem, we've found that we can detect the situation and substitute a file containing a single newline. This causes a message like:

Encountered Warning ACCESSING_NON_EXISTENT_FIELD 13 time(s).

but at least Pig doesn't crash with an exception.

Alternatively, we could produce a line with the appropriate number of '\t' characters for that file, which would avoid the warning, but it would insert garbage into the data that we would then have to filter out.

The same idea can be used to work around the no-input-files condition by creating a dummy file, but it has the same downsides listed above.
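
For illustration, here is a rough shell sketch of both workarounds. The paths are made up, and it assumes the inputs live on HDFS and that the Hadoop 0.20 "hadoop fs" client is on the PATH:

    #!/bin/bash
    # Sketch only: pad missing or 0-byte inputs with a single-newline file so
    # a later Pig LOAD has something to read. Paths are hypothetical.
    printf '\n' > /tmp/single_newline

    for path in /data/stage1/part-00000 /data/stage2/part-00000; do
        # Skip inputs that exist and are non-empty.
        if hadoop fs -test -e "$path" && ! hadoop fs -test -z "$path"; then
            continue
        fi
        # Replace (or create) the file with the one-newline placeholder.
        hadoop fs -rm "$path" 2>/dev/null
        hadoop fs -put /tmp/single_newline "$path"
    done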

1 vote

The approach I've been using is to run Pig scripts from a shell. I have one job that gets data from six different input directories, so I've written a script fragment for each input.

The shell script checks for the existence of each input and assembles the final Pig script from the fragments whose inputs exist.

It then executes the final Pig script. I know it's a bit of a Rube Goldberg approach, but so far so good. :-)
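
Roughly, the wrapper looks something like the sketch below. The directory layout, fragment names, and input paths are placeholders, and it assumes the hadoop and pig commands are available on the machine launching the job:

    #!/bin/bash
    # Sketch of assembling the final Pig script from per-input fragments.
    FRAGMENTS=./pig-fragments        # one .pig fragment per input directory
    FINAL=./final.pig

    cat "$FRAGMENTS/header.pig" > "$FINAL"

    for input in /data/source1 /data/source2 /data/source3; do
        name=$(basename "$input")
        if hadoop fs -test -e "$input"; then
            # Input is present, so include the fragment that LOADs it.
            cat "$FRAGMENTS/$name.pig" >> "$FINAL"
        else
            echo "skipping $name: no input found" >&2
        fi
    done

    cat "$FRAGMENTS/footer.pig" >> "$FINAL"
    pig -f "$FINAL"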