0
votes

Pig fails mysteriously when I specify the root of a large directory tree as the LOAD input in my script. The backend error exception that it throws offers no insight into what's happening. The same script works perfectly when there are fewer files.

It's a very simple script as you can see below:

SET pig.noSplitCombination true;
raw_record = LOAD '/data/directory/tree/root' USING PigStorage(',');
filtered = FILTER raw_record by $1 == 251068;
filtered_data = FOREACH filtered GENERATE (chararray)$0, (chararray)$1, (chararray)$2;
STORE filtered_data INTO '/data/output/directory/' USING PigStorage();

Here's the error message I see :

   ERROR 2244: Job scope-594 failed, hadoop does not return any error message
org.apache.pig.backend.executionengine.ExecException: ERROR 2244: Job scope-594 failed, hadoop does not return any error message
    at org.apache.pig.tools.grunt.GruntParser.executeBatch(GruntParser.java:178)
    at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:232)
    at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:203)
    at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:81)
    at org.apache.pig.Main.run(Main.java:608)
    at org.apache.pig.Main.main(Main.java:156)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:212)

How many files can PIG process at once?

1
Seems like it's already failing in the frontend. Can you access the job setup in the job tracker? Why are you setting pig.noSplitCombination? - LiMuBei

1 Answers

0
votes

Pig can process any number of files, there is no limitation for pig in terms of processing. In your case try to give data types of each field while loading and try with quotes in FILTER statement.

raw_record = LOAD '/data/directory/tree/root' USING PigStorage(',') as (col1: chararray, col2:chararray;

filtered = FILTER raw_record by $1 == '251068';

If you are still getting errors try to provide sample data