0
votes

I have a pig script I got from Hortonworks that works fine with pig-0.9.2.15 with Hadoop-1.0.3.16. But when I run it with pig-0.12.1(recompiled with -Dhadoopversion=23) or pig-0.13.0 on Hadoop-2.4.0, it won't work.

It seems the following line is where the problem is.

max_runs = FOREACH grp_data GENERATE group as grp, MAX(runs.runs) as max_runs;

Here's the whole script.

batting = load 'pig_data/Batting.csv' using PigStorage(',');
runs = FOREACH batting GENERATE $0 as playerID, $1 as year, $8 as runs;
grp_data = GROUP runs by (year);
max_runs = FOREACH grp_data GENERATE group as grp, MAX(runs.runs) as max_runs;
join_max_run = JOIN max_runs by ($0, max_runs), runs by (year,runs);
join_data = FOREACH join_max_run GENERATE $0 as year, $2 as playerID, $1 as runs;
STORE join_data INTO './join_data';

And here's the hadoop error info:

2014-07-29 18:03:02,957 [main] ERROR org.apache.pig.tools.pigstats.PigStats - ERROR 0: org.apache.pig.backend.executionengine.ExecException: ERROR 0: Exception while executing (Name: grp_data: Local Rearrange[tuple]{bytearray}(false) - scope-34 Operator Key: scope-34): org.apache.pig.backend.executionengine.ExecException: ERROR 2106: Error executing an algebraic function 2014-07-29 18:03:02,958 [main] ERROR org.apache.pig.tools.pigstats.mapreduce.MRPigStatsUtil - 1 map reduce job(s) failed!

How can I fix this if I still want to use "MAX" function? Thank you!

Here's the complete information:

14/07/29 17:50:11 INFO pig.ExecTypeProvider: Trying ExecType : LOCAL 14/07/29 17:50:11 INFO pig.ExecTypeProvider: Trying ExecType : MAPREDUCE 14/07/29 17:50:11 INFO pig.ExecTypeProvider: Picked MAPREDUCE as the ExecType 2014-07-29 17:50:12,104 [main] INFO org.apache.pig.Main - Apache Pig version 0.13.0 (r1606446) compiled Jun 29 2014, 02:27:58 2014-07-29 17:50:12,104 [main] INFO org.apache.pig.Main - Logging error messages to: /root/hadooptestingsuite/scripts/tests/pig_test/hadoop2/pig_1406677812103.log 2014-07-29 17:50:13,050 [main] INFO org.apache.pig.impl.util.Utils - Default bootup file /root/.pigbootup not found 2014-07-29 17:50:13,415 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address 2014-07-29 17:50:13,415 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS 2014-07-29 17:50:13,415 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://namenode.cmda.hadoop.com:8020 2014-07-29 17:50:14,302 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to map-reduce job tracker at: namenode.cmda.hadoop.com:8021 2014-07-29 17:50:14,990 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS 2014-07-29 17:50:15,570 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS 2014-07-29 17:50:15,665 [main] WARN org.apache.pig.newplan.BaseOperatorPlan - Encountered Warning IMPLICIT_CAST_TO_DOUBLE 1 time(s). 2014-07-29 17:50:15,705 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - mapred.textoutputformat.separator is deprecated. Instead, use mapreduce.output.textoutputformat.separator 2014-07-29 17:50:15,791 [main] INFO org.apache.pig.tools.pigstats.ScriptState - Pig features used in the script: HASH_JOIN,GROUP_BY 2014-07-29 17:50:15,873 [main] INFO org.apache.pig.newplan.logical.optimizer.LogicalPlanOptimizer - {RULES_ENABLED=[AddForEach, ColumnMapKeyPrune, GroupByConstParallelSetter, LimitOptimizer, LoadTypeCastInserter, MergeFilter, MergeForEach, PartitionFilterOptimizer, PushDownForEachFlatten, PushUpFilter, SplitFilter, StreamTypeCastInserter], RULES_DISABLED=[FilterLogicExpressionSimplifier]} 2014-07-29 17:50:16,319 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler - File concatenation threshold: 100 optimistic? false 2014-07-29 17:50:16,377 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.CombinerOptimizer - Choosing to move algebraic foreach to combiner 2014-07-29 17:50:16,410 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler$LastInputStreamingOptimizer - Rewrite: POPackage->POForEach to POPackage(JoinPackager) 2014-07-29 17:50:16,417 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size before optimization: 3 2014-07-29 17:50:16,418 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - Merged 1 map-reduce splittees. 2014-07-29 17:50:16,418 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - Merged 1 out of total 3 MR operators. 2014-07-29 17:50:16,418 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size after optimization: 2 2014-07-29 17:50:16,493 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS 2014-07-29 17:50:16,575 [main] INFO org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at namenode.cmda.hadoop.com/10.0.3.1:8050 2014-07-29 17:50:16,973 [main] INFO org.apache.pig.tools.pigstats.mapreduce.MRScriptState - Pig script settings are added to the job 2014-07-29 17:50:17,007 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - mapred.job.reduce.markreset.buffer.percent is deprecated. Instead, use mapreduce.reduce.markreset.buffer.percent 2014-07-29 17:50:17,007 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3 2014-07-29 17:50:17,007 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - mapred.output.compress is deprecated. Instead, use mapreduce.output.fileoutputformat.compress 2014-07-29 17:50:17,020 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Reduce phase detected, estimating # of required reducers. 2014-07-29 17:50:17,020 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Using reducer estimator: org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.InputSizeReducerEstimator 2014-07-29 17:50:17,064 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.InputSizeReducerEstimator - BytesPerReducer=1000000000 maxReducers=999 totalInputFileSize=6398990 2014-07-29 17:50:17,067 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting Parallelism to 1 2014-07-29 17:50:17,067 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - mapred.reduce.tasks is deprecated. Instead, use mapreduce.job.reduces 2014-07-29 17:50:17,068 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - This job cannot be converted run in-process 2014-07-29 17:50:17,068 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - creating jar file Job2337803902169382273.jar 2014-07-29 17:50:20,957 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - jar file Job2337803902169382273.jar created 2014-07-29 17:50:20,957 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - mapred.jar is deprecated. Instead, use mapreduce.job.jar 2014-07-29 17:50:21,001 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting up multi store job 2014-07-29 17:50:21,036 [main] INFO org.apache.pig.data.SchemaTupleFrontend - Key [pig.schematuple] is false, will not generate code. 2014-07-29 17:50:21,036 [main] INFO org.apache.pig.data.SchemaTupleFrontend - Starting process to move generated code to distributed cacche 2014-07-29 17:50:21,046 [main] INFO org.apache.pig.data.SchemaTupleFrontend - Setting key [pig.schematuple.classes] with classes to deserialize [] 2014-07-29 17:50:21,310 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 1 map-reduce job(s) waiting for submission. 2014-07-29 17:50:21,311 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - mapred.job.tracker.http.address is deprecated. Instead, use mapreduce.jobtracker.http.address 2014-07-29 17:50:21,332 [JobControl] INFO org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at namenode.cmda.hadoop.com/10.0.3.1:8050 2014-07-29 17:50:21,366 [JobControl] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS 2014-07-29 17:50:22,606 [JobControl] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1 2014-07-29 17:50:22,606 [JobControl] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1 2014-07-29 17:50:22,629 [JobControl] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths (combined) to process : 1 2014-07-29 17:50:22,729 [JobControl] INFO org.apache.hadoop.mapreduce.JobSubmitter - number of splits:1 2014-07-29 17:50:22,745 [JobControl] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS 2014-07-29 17:50:23,026 [JobControl] INFO org.apache.hadoop.mapreduce.JobSubmitter - Submitting tokens for job: job_1406677482986_0003 2014-07-29 17:50:23,258 [JobControl] INFO org.apache.hadoop.yarn.client.api.impl.YarnClientImpl - Submitted application application_1406677482986_0003 2014-07-29 17:50:23,340 [JobControl] INFO org.apache.hadoop.mapreduce.Job - The url to track the job: http://namenode.cmda.hadoop.com:8088/proxy/application_1406677482986_0003/ 2014-07-29 17:50:23,340 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - HadoopJobId: job_1406677482986_0003 2014-07-29 17:50:23,340 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Processing aliases batting,grp_data,max_runs,runs 2014-07-29 17:50:23,340 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - detailed locations: M: batting[3,10],runs[5,7],max_runs[7,11],grp_data[6,11] C: max_runs[7,11],grp_data[6,11] R: max_runs[7,11] 2014-07-29 17:50:23,340 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - More information at: http://namenode.cmda.hadoop.com:50030/jobdetails.jsp?jobid=job_1406677482986_0003 2014-07-29 17:50:23,357 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete 2014-07-29 17:50:23,357 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1406677482986_0003] 2014-07-29 17:51:15,564 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 50% complete 2014-07-29 17:51:15,564 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1406677482986_0003] 2014-07-29 17:51:18,582 [main] WARN org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Ooops! Some job has failed! Specify -stop_on_failure if you want Pig to stop immediately on failure. 2014-07-29 17:51:18,582 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - job job_1406677482986_0003 has failed! Stop running all dependent jobs 2014-07-29 17:51:18,582 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete 2014-07-29 17:51:18,825 [main] ERROR org.apache.pig.tools.pigstats.PigStats - ERROR 0: org.apache.pig.backend.executionengine.ExecException: ERROR 0: Exception while executing (Name: grp_data: Local Rearrange[tuple]{bytearray}(false) - scope-73 Operator Key: scope-73): org.apache.pig.backend.executionengine.ExecException: ERROR 2106: Error executing an algebraic function 2014-07-29 17:51:18,825 [main] ERROR org.apache.pig.tools.pigstats.mapreduce.MRPigStatsUtil - 1 map reduce job(s) failed! 2014-07-29 17:51:18,826 [main] INFO org.apache.pig.tools.pigstats.mapreduce.SimplePigStats - Script Statistics:

HadoopVersion PigVersion UserId StartedAt FinishedAt Features 2.4.0 0.13.0 root 2014-07-29 17:50:16 2014-07-29 17:51:18 HASH_JOIN,GROUP_BY

Failed!

Failed Jobs: JobId Alias Feature Message Outputs job_1406677482986_0003 batting,grp_data,max_runs,runs MULTI_QUERY,COMBINER Message: Job failed!

Input(s): Failed to read data from "hdfs://namenode.cmda.hadoop.com:8020/user/root/pig_data/Batting.csv"

Output(s):

Counters: Total records written : 0 Total bytes written : 0 Spillable Memory Manager spill count : 0 Total bags proactively spilled: 0 Total records proactively spilled: 0

Job DAG: job_1406677482986_0003 -> null, null

2014-07-29 17:51:18,826 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Failed! 2014-07-29 17:51:18,827 [main] ERROR org.apache.pig.tools.grunt.GruntParser - ERROR 2106: Error executing an algebraic function Details at logfile: /root/hadooptestingsuite/scripts/tests/pig_test/hadoop2/pig_1406677812103.log 2014-07-29 17:51:18,828 [main] ERROR org.apache.pig.tools.grunt.GruntParser - ERROR 2244: Job scope-58 failed, hadoop does not return any error message Details at logfile: /root/hadooptestingsuite/scripts/tests/pig_test/hadoop2/pig_1406677812103.log

3
can you post the job logs or any more information from jt logs about this particular error? Did it launch a job at all? - hobgoblin
Certianly. Here's the one for pig-0.13.0. I can also post for pig-0.12.0. I have tested other pig scripts before running this one. Other scripts passed. - user2921752
Since your job job_1406677482986_0003 has failed, it must have useful information to debug the root cause. If you can check the logs of this job (log of failed map task), please paste the any stack trace in those logs. - hobgoblin
Thank you, I'll pull the log. I've also tried local mode, still, it works with another pig script but not this one - user2921752

3 Answers

1
votes

try by casting MAX function

max_runs = FOREACH grp_data GENERATE group as grp, (int)MAX(runs.runs) as max_runs;

hope it will work

1
votes

You should use data types in your load statement.

runs = FOREACH batting GENERATE $0 as playerID:chararray, $1 as year:int, $8 as runs:int;

If this doesn't help for some reason, try explicit casting.

max_runs = FOREACH grp_data GENERATE group as grp, MAX((int)runs.runs) as max_runs;
0
votes

Thank both @BigData and @Mikko Kupsu for the hint. The issue does indeed have something to do the datatype casting.

After specifying the data type of each column as follows everything runs great.

batting = 
    LOAD '/user/root/pig_data/Batting.csv' USING PigStorage(',')
    AS (playerID: CHARARRAY, yearID: INT, stint: INT, teamID: CHARARRAY, lgID: CHARARRAY,
    G: INT, G_batting: INT, AB: INT, R: INT, H: INT, two_B: INT, three_B: INT, HR: INT, RBI: INT, 
    SB: INT, CS: INT, BB:INT, SO: INT, IBB: INT, HBP: INT, SH: INT, SF: INT, GIDP: INT, G_old: INT);