I'm trying to run a single-node Hadoop cluster on my personal computer (Linux Mint 17, Linux kernel 3.13). I would like to run some Pig scripts for an online course I'm taking, but since I'm not familiar with Hadoop itself or with Pig (even though I write Hive queries on a daily basis), I've got stuck.
I've installed both Hadoop 2.5.0 and Pig 0.13.0 following these two guides:
- Install Hadoop 2.2.0 on Ubuntu Linux 13.04 (Single-Node Cluster)
- How to Install Pig & Hive on Linux Mint VM
As far as I know, Pig has two modes of execution: local mode and MapReduce mode.
Local Mode
Local mode is usually used to verify and debug Pig queries and/or scripts on smaller datasets that a single machine can handle. It runs in a single JVM and accesses the local filesystem. To run in local mode, you pass local to the -x or -exectype parameter when starting Pig. This starts the interactive shell called Grunt.
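For example, assuming pig is on the PATH, my understanding is that local mode is invoked like this:

```shell
# Run a script in local mode: single JVM, local filesystem, no Hadoop daemons needed
pig -x local example.pig

# Or start the interactive Grunt shell in local mode
pig -x local
```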
MapReduce Mode
In this mode, Pig translates the queries into MapReduce jobs and runs them on the Hadoop cluster. The cluster can be pseudo-distributed or fully distributed.
The assignment on my course asks questions such as:
- How many MapReduce jobs are generated by example.pig?
- How many reduce tasks are within the first MapReduce job? How many reduce tasks are within later MapReduce jobs?
- How long does each job take? How long does the entire script take?
- What is the schema of the tuples after each of the following commands in example.pig?
Considering the kind of questions, I assume I have to go for MapReduce mode, using the single-node Hadoop cluster I've just created.
In the guide I read that I have to run this command too, which I think connects Pig to Hadoop in some way:
$ export PIG_CLASSPATH=$HADOOP_HOME/conf/
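One thing I'm unsure about: that guide seems to have been written for an older Hadoop, where the configuration lived in $HADOOP_HOME/conf/. In Hadoop 2.x the configuration files (core-site.xml, hdfs-site.xml, etc.) are under etc/hadoop instead, so my assumption is that the export should actually be:

```shell
# Hadoop 2.x keeps its configuration under etc/hadoop rather than conf/
export PIG_CLASSPATH=$HADOOP_HOME/etc/hadoop/
```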
Running the pig command in the terminal prints a lot of log messages:
$ pig
14/08/19 15:21:18 INFO pig.ExecTypeProvider: Trying ExecType : LOCAL
14/08/19 15:21:18 INFO pig.ExecTypeProvider: Trying ExecType : MAPREDUCE
14/08/19 15:21:18 INFO pig.ExecTypeProvider: Picked MAPREDUCE as the ExecType
2014-08-19 15:21:18,409 [main] INFO org.apache.pig.Main - Apache Pig version 0.13.0 (r1606446) compiled Jun 29 2014, 02:27:58
2014-08-19 15:21:18,409 [main] INFO org.apache.pig.Main - Logging error messages to: /home/gianluca/Dropbox/Data Analysis/online courses/intro to data science (coursera)/datasci_course_materials/assignment4/pigtest/pig_1408458078408.log
2014-08-19 15:21:18,429 [main] INFO org.apache.pig.impl.util.Utils - Default bootup file /home/gianluca/.pigbootup not found
2014-08-19 15:21:18,837 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
2014-08-19 15:21:18,837 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
2014-08-19 15:21:18,837 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://localhost:9000
2014-08-19 15:21:19,670 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
Now, my doubt is: do I have to start the Hadoop cluster before launching Pig in MapReduce mode? If I don't start the cluster and just run the script in MapReduce mode, I get this error message:
$ pig -x mapreduce example.pig
14/08/19 15:56:46 INFO pig.ExecTypeProvider: Trying ExecType : LOCAL
14/08/19 15:56:46 INFO pig.ExecTypeProvider: Trying ExecType : MAPREDUCE
14/08/19 15:56:46 INFO pig.ExecTypeProvider: Picked MAPREDUCE as the ExecType
2014-08-19 15:56:46,818 [main] INFO org.apache.pig.Main - Apache Pig version 0.13.0 (r1606446) compiled Jun 29 2014, 02:27:58
2014-08-19 15:56:46,818 [main] INFO org.apache.pig.Main - Logging error messages to: /home/gianluca/Dropbox/Data Analysis/online courses/intro to data science (coursera)/datasci_course_materials/assignment4/pigtest/pig_1408460206817.log
2014-08-19 15:56:47,418 [main] INFO org.apache.pig.impl.util.Utils - Default bootup file /home/gianluca/.pigbootup not found
2014-08-19 15:56:47,630 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
2014-08-19 15:56:47,630 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
2014-08-19 15:56:47,630 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://localhost:9000
2014-08-19 15:56:48,524 [main] ERROR org.apache.pig.PigServer - exception during parsing: Error during parsing. Unable to check name hdfs://localhost:9000/user/gianluca
Failed to parse: Pig script failed to parse:
<file example.pig, line 7, column 6> pig script failed to validate: org.apache.pig.backend.datastorage.DataStorageException: ERROR 6007: Unable to check name hdfs://localhost:9000/user/gianluca
at org.apache.pig.parser.QueryParserDriver.parse(QueryParserDriver.java:196)
at org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1712)
at org.apache.pig.PigServer$Graph.access$000(PigServer.java:1420)
at org.apache.pig.PigServer.parseAndBuild(PigServer.java:364)
at org.apache.pig.PigServer.executeBatch(PigServer.java:389)
at org.apache.pig.PigServer.executeBatch(PigServer.java:375)
at org.apache.pig.tools.grunt.GruntParser.executeBatch(GruntParser.java:170)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:232)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:203)
at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:81)
at org.apache.pig.Main.run(Main.java:608)
at org.apache.pig.Main.main(Main.java:156)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
Caused by:
<file example.pig, line 7, column 6> pig script failed to validate: org.apache.pig.backend.datastorage.DataStorageException: ERROR 6007: Unable to check name hdfs://localhost:9000/user/gianluca
at org.apache.pig.parser.LogicalPlanBuilder.buildLoadOp(LogicalPlanBuilder.java:881)
at org.apache.pig.parser.LogicalPlanGenerator.load_clause(LogicalPlanGenerator.java:3568)
at org.apache.pig.parser.LogicalPlanGenerator.op_clause(LogicalPlanGenerator.java:1625)
at org.apache.pig.parser.LogicalPlanGenerator.general_statement(LogicalPlanGenerator.java:1102)
at org.apache.pig.parser.LogicalPlanGenerator.statement(LogicalPlanGenerator.java:560)
at org.apache.pig.parser.LogicalPlanGenerator.query(LogicalPlanGenerator.java:421)
at org.apache.pig.parser.QueryParserDriver.parse(QueryParserDriver.java:188)
... 16 more
Caused by: org.apache.pig.backend.datastorage.DataStorageException: ERROR 6007: Unable to check name hdfs://localhost:9000/user/gianluca
at org.apache.pig.backend.hadoop.datastorage.HDataStorage.isContainer(HDataStorage.java:207)
at org.apache.pig.backend.hadoop.datastorage.HDataStorage.asElement(HDataStorage.java:128)
at org.apache.pig.backend.hadoop.datastorage.HDataStorage.asElement(HDataStorage.java:138)
at org.apache.pig.parser.QueryParserUtils.getCurrentDir(QueryParserUtils.java:90)
at org.apache.pig.parser.LogicalPlanBuilder.buildLoadOp(LogicalPlanBuilder.java:873)
... 22 more
Caused by: java.net.ConnectException: Call From gianluca-Aspire-S3-391/127.0.1.1 to localhost:9000 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:783)
at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:730)
at org.apache.hadoop.ipc.Client.call(Client.java:1415)
at org.apache.hadoop.ipc.Client.call(Client.java:1364)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
at com.sun.proxy.$Proxy14.getFileInfo(Unknown Source)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
at com.sun.proxy.$Proxy14.getFileInfo(Unknown Source)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:707)
at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1785)
at org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1068)
at org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1064)
at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1064)
at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1398)
at org.apache.pig.backend.hadoop.datastorage.HDataStorage.isContainer(HDataStorage.java:200)
... 26 more
Caused by: java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:529)
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:493)
at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:606)
at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:700)
at org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:367)
at org.apache.hadoop.ipc.Client.getConnection(Client.java:1463)
at org.apache.hadoop.ipc.Client.call(Client.java:1382)
... 44 more
2014-08-19 15:56:48,530 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 6007: Unable to check name hdfs://localhost:9000/user/gianluca
Details at logfile: /home/gianluca/Dropbox/Data Analysis/online courses/intro to data science (coursera)/datasci_course_materials/assignment4/pigtest/pig_1408460206817.log
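The "Connection refused" on localhost:9000 makes me suspect that the NameNode simply isn't running. Is the fix just to start the Hadoop daemons before launching Pig, something like this (paths assuming a standard Hadoop 2.x layout, which may differ on other setups)?

```shell
# Start HDFS (NameNode, DataNode, SecondaryNameNode)
$HADOOP_HOME/sbin/start-dfs.sh

# Start YARN (ResourceManager, NodeManager)
$HADOOP_HOME/sbin/start-yarn.sh

# Verify that the daemons are up
jps
```

And then, once the daemons are running, launch the script with pig -x mapreduce example.pig as before?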