I'm trying to run a single-node Hadoop cluster on my personal computer (Linux Mint 17, Linux kernel 3.13). I would like to run some Pig scripts for an online course I'm taking, but since I'm not familiar with Hadoop itself or with Pig (even though I write Hive queries on a daily basis), I've got stuck.
I've installed both Hadoop 2.5.0 and Pig 0.13.0 following these two guides:
- Install Hadoop 2.2.0 on Ubuntu Linux 13.04 (Single-Node Cluster)
- How to Install Pig & Hive on Linux Mint VM
As far as I know, Pig has two modes of execution: local mode and MapReduce mode.
Local Mode
Local mode is usually used to verify and debug Pig queries and/or scripts on smaller datasets that a single machine can handle. It runs in a single JVM and accesses the local filesystem. To run in local mode, you pass local to the -x or -exectype parameter when starting Pig. This starts the interactive shell called Grunt.
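For example, assuming pig is on the PATH, my understanding is that local mode is invoked like this:

```shell
# Run a script in local mode: single JVM, local filesystem, no Hadoop daemons needed
pig -x local example.pig

# Or start the interactive Grunt shell in local mode
pig -x local
```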
MapReduce Mode
In this mode, Pig translates the queries into MapReduce jobs and runs them on the Hadoop cluster. The cluster can be pseudo-distributed or fully distributed.
The assignment on my course asks questions such as:
- How many MapReduce jobs are generated by example.pig?
- How many reduce tasks are within the first MapReduce job? How many reduce tasks are within later MapReduce jobs?
- How long does each job take? How long does the entire script take?
- What is the schema of the tuples after each of the following commands in example.pig?
Considering the kind of questions, I assume I have to go for MapReduce mode, using the single-node Hadoop cluster I've just created.
In the guide I read that I have to run this command too, which I think connects Pig to Hadoop in some way:
$ export PIG_CLASSPATH=$HADOOP_HOME/conf/
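One thing I'm unsure about: that guide seems to have been written for an older Hadoop, where the configuration lived in $HADOOP_HOME/conf/. In Hadoop 2.x the configuration files (core-site.xml, hdfs-site.xml, etc.) are under etc/hadoop instead, so my assumption is that the export should actually be:

```shell
# Hadoop 2.x keeps its configuration under etc/hadoop rather than conf/
export PIG_CLASSPATH=$HADOOP_HOME/etc/hadoop/
```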
Running the pig command in the terminal prints a lot of log messages:
$ pig
14/08/19 15:21:18 INFO pig.ExecTypeProvider: Trying ExecType : LOCAL
14/08/19 15:21:18 INFO pig.ExecTypeProvider: Trying ExecType : MAPREDUCE
14/08/19 15:21:18 INFO pig.ExecTypeProvider: Picked MAPREDUCE as the ExecType
2014-08-19 15:21:18,409 [main] INFO org.apache.pig.Main - Apache Pig version 0.13.0 (r1606446) compiled Jun 29 2014, 02:27:58
2014-08-19 15:21:18,409 [main] INFO org.apache.pig.Main - Logging error messages to: /home/gianluca/Dropbox/Data Analysis/online courses/intro to data science (coursera)/datasci_course_materials/assignment4/pigtest/pig_1408458078408.log
2014-08-19 15:21:18,429 [main] INFO org.apache.pig.impl.util.Utils - Default bootup file /home/gianluca/.pigbootup not found
2014-08-19 15:21:18,837 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
2014-08-19 15:21:18,837 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
2014-08-19 15:21:18,837 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://localhost:9000
2014-08-19 15:21:19,670 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
Now, my doubt is: do I have to start the Hadoop cluster before launching Pig in MapReduce mode? If I don't start the cluster and just run the script in MapReduce mode, I get this error message:
$ pig -x mapreduce example.pig
14/08/19 15:56:46 INFO pig.ExecTypeProvider: Trying ExecType : LOCAL
14/08/19 15:56:46 INFO pig.ExecTypeProvider: Trying ExecType : MAPREDUCE
14/08/19 15:56:46 INFO pig.ExecTypeProvider: Picked MAPREDUCE as the ExecType
2014-08-19 15:56:46,818 [main] INFO org.apache.pig.Main - Apache Pig version 0.13.0 (r1606446) compiled Jun 29 2014, 02:27:58
2014-08-19 15:56:46,818 [main] INFO org.apache.pig.Main - Logging error messages to: /home/gianluca/Dropbox/Data Analysis/online courses/intro to data science (coursera)/datasci_course_materials/assignment4/pigtest/pig_1408460206817.log
2014-08-19 15:56:47,418 [main] INFO org.apache.pig.impl.util.Utils - Default bootup file /home/gianluca/.pigbootup not found
2014-08-19 15:56:47,630 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
2014-08-19 15:56:47,630 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
2014-08-19 15:56:47,630 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://localhost:9000
2014-08-19 15:56:48,524 [main] ERROR org.apache.pig.PigServer - exception during parsing: Error during parsing. Unable to check name hdfs://localhost:9000/user/gianluca
Failed to parse: Pig script failed to parse:
<file example.pig, line 7, column 6> pig script failed to validate: org.apache.pig.backend.datastorage.DataStorageException: ERROR 6007: Unable to check name hdfs://localhost:9000/user/gianluca
at org.apache.pig.parser.QueryParserDriver.parse(QueryParserDriver.java:196)
at org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1712)
at org.apache.pig.PigServer$Graph.access$000(PigServer.java:1420)
at org.apache.pig.PigServer.parseAndBuild(PigServer.java:364)
at org.apache.pig.PigServer.executeBatch(PigServer.java:389)
at org.apache.pig.PigServer.executeBatch(PigServer.java:375)
at org.apache.pig.tools.grunt.GruntParser.executeBatch(GruntParser.java:170)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:232)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:203)
at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:81)
at org.apache.pig.Main.run(Main.java:608)
at org.apache.pig.Main.main(Main.java:156)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
Caused by:
<file example.pig, line 7, column 6> pig script failed to validate: org.apache.pig.backend.datastorage.DataStorageException: ERROR 6007: Unable to check name hdfs://localhost:9000/user/gianluca
at org.apache.pig.parser.LogicalPlanBuilder.buildLoadOp(LogicalPlanBuilder.java:881)
at org.apache.pig.parser.LogicalPlanGenerator.load_clause(LogicalPlanGenerator.java:3568)
at org.apache.pig.parser.LogicalPlanGenerator.op_clause(LogicalPlanGenerator.java:1625)
at org.apache.pig.parser.LogicalPlanGenerator.general_statement(LogicalPlanGenerator.java:1102)
at org.apache.pig.parser.LogicalPlanGenerator.statement(LogicalPlanGenerator.java:560)
at org.apache.pig.parser.LogicalPlanGenerator.query(LogicalPlanGenerator.java:421)
at org.apache.pig.parser.QueryParserDriver.parse(QueryParserDriver.java:188)
... 16 more
Caused by: org.apache.pig.backend.datastorage.DataStorageException: ERROR 6007: Unable to check name hdfs://localhost:9000/user/gianluca
at org.apache.pig.backend.hadoop.datastorage.HDataStorage.isContainer(HDataStorage.java:207)
at org.apache.pig.backend.hadoop.datastorage.HDataStorage.asElement(HDataStorage.java:128)
at org.apache.pig.backend.hadoop.datastorage.HDataStorage.asElement(HDataStorage.java:138)
at org.apache.pig.parser.QueryParserUtils.getCurrentDir(QueryParserUtils.java:90)
at org.apache.pig.parser.LogicalPlanBuilder.buildLoadOp(LogicalPlanBuilder.java:873)
... 22 more
Caused by: java.net.ConnectException: Call From gianluca-Aspire-S3-391/127.0.1.1 to localhost:9000 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:783)
at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:730)
at org.apache.hadoop.ipc.Client.call(Client.java:1415)
at org.apache.hadoop.ipc.Client.call(Client.java:1364)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
at com.sun.proxy.$Proxy14.getFileInfo(Unknown Source)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
at com.sun.proxy.$Proxy14.getFileInfo(Unknown Source)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:707)
at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1785)
at org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1068)
at org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1064)
at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1064)
at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1398)
at org.apache.pig.backend.hadoop.datastorage.HDataStorage.isContainer(HDataStorage.java:200)
... 26 more
Caused by: java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:529)
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:493)
at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:606)
at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:700)
at org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:367)
at org.apache.hadoop.ipc.Client.getConnection(Client.java:1463)
at org.apache.hadoop.ipc.Client.call(Client.java:1382)
... 44 more
2014-08-19 15:56:48,530 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 6007: Unable to check name hdfs://localhost:9000/user/gianluca
Details at logfile: /home/gianluca/Dropbox/Data Analysis/online courses/intro to data science (coursera)/datasci_course_materials/assignment4/pigtest/pig_1408460206817.log
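The "Connection refused" on localhost:9000 makes me suspect that the NameNode simply isn't running. Is the fix just to start the Hadoop daemons before launching Pig, something like this (paths assuming a standard Hadoop 2.x layout, which may differ on other setups)?

```shell
# Start HDFS (NameNode, DataNode, SecondaryNameNode)
$HADOOP_HOME/sbin/start-dfs.sh

# Start YARN (ResourceManager, NodeManager)
$HADOOP_HOME/sbin/start-yarn.sh

# Verify that the daemons are up
jps
```

And then, once the daemons are running, launch the script with pig -x mapreduce example.pig as before?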