1 vote

I tried to run the following Pig commands for a Yelp assignment:

-- *******   PIG LATIN SCRIPT for Yelp Assignment ******************

-- 0. get function defined for CSV loader

register /usr/lib/pig/piggybank.jar;

define CSVLoader org.apache.pig.piggybank.storage.CSVLoader();



-- The data-fu jar file has a CSVLoader with more options, like reading multiline records,

--  but for this assignment we don't need it, so the next line is commented out

-- register /home/cloudera/incubator-datafu/datafu-pig/build/libs/datafu-pig-incubating-1.3.0-SNAPSHOT.jar;

-- 1 load data

Y      = LOAD '/usr/lib/hue/apps/search/examples/collections/solr_configs_yelp_demo/index_data.csv' USING CSVLoader() AS (business_id:chararray,cool,date,funny,id,stars:int,text:chararray,type,useful:int,user_id,name,full_address,latitude,longitude,neighborhoods,open,review_count,state);

Y_good = FILTER Y BY (useful is not null and stars is not null);

--2 Find max useful

Y_all = GROUP Y_good ALL;

Umax  = FOREACH Y_all GENERATE MAX(Y_good.useful);

DUMP Umax;

Unfortunately, I get the following error:

Failed!

Failed Jobs:
JobId   Alias   Feature Message Outputs
job_1455222366557_0010  Umax,Y,Y_all,Y_good     GROUP_BY,COMBINER
Message: org.apache.pig.backend.executionengine.ExecException: ERROR 2118: Input path does not exist: hdfs://quickstart.cloudera:8020/usr/lib/hue/apps/search/examples/collections/solr_configs_yelp_demo/index_data.csv
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.getSplits(PigInputFormat.java:288)
    at org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(JobSubmitter.java:597)
    at org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:614)
    at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:492)
    at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1306)
    at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1303)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
    at org.apache.hadoop.mapreduce.Job.submit(Job.java:1303)
    at org.apache.hadoop.mapreduce.lib.jobcontrol.ControlledJob.submit(ControlledJob.java:335)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.pig.backend.hadoop23.PigJobControl.submit(PigJobControl.java:128)
    at org.apache.pig.backend.hadoop23.PigJobControl.run(PigJobControl.java:191)
    at java.lang.Thread.run(Thread.java:745)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher$1.run(MapReduceLauncher.java:270)
Caused by: org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: hdfs://quickstart.cloudera:8020/usr/lib/hue/apps/search/examples/collections/solr_configs_yelp_demo/index_data.csv
    at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:321)
    at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:264)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigTextInputFormat.listStatus(PigTextInputFormat.java:36)
    at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:385)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.getSplits(PigInputFormat.java:274)
    ... 18 more
hdfs://quickstart.cloudera:8020/tmp/temp864992621/tmp897146964,

Input(s): Failed to read data from "/usr/lib/hue/apps/search/examples/collections/solr_configs_yelp_demo/index_data.csv"

Output(s): Failed to produce result in "hdfs://quickstart.cloudera:8020/tmp/temp864992621/tmp897146964"

Counters:
Total records written : 0
Total bytes written : 0
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0

Job DAG: job_1455222366557_0010

2016-02-15 06:22:16,643 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Failed!
2016-02-15 06:22:16,686 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1066: Unable to open iterator for alias Umax
Details at logfile: /home/cloudera/pig_1455546020789.log

I have checked the path to the file (see screenshot below):

[screenshot showing the file's path]

It appears to be the same path shown in the error:

hdfs://quickstart.cloudera:8020/usr/lib/hue/apps/search/examples/collections/solr_configs_yelp_demo/index_data.csv

So I do not know how else to resolve it! Could there be something else I am not seeing? Any help will be appreciated. Thanks in advance.
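For reference, one way to check whether the file actually exists on HDFS (as opposed to only on the local filesystem) is to list the same path with both tools; the path below is the one from the script:

# does the path exist where Pig looks for it? (HDFS)
hdfs dfs -ls /usr/lib/hue/apps/search/examples/collections/solr_configs_yelp_demo/index_data.csv

# does the path exist on the local filesystem?
ls -l /usr/lib/hue/apps/search/examples/collections/solr_configs_yelp_demo/index_data.csv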


2 Answers

1 vote

You need to upload your CSV into HDFS (using hadoop dfs -put) and give that HDFS path in the LOAD command (LOAD '{hdfs path of csv file}').
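For example, a minimal sketch (the target directory /user/cloudera/yelp is an arbitrary choice, hdfs dfs is the current spelling of hadoop dfs, and the local source path and schema are the ones from the question):

# create a target directory in HDFS and copy the local file into it
hdfs dfs -mkdir -p /user/cloudera/yelp
hdfs dfs -put /usr/lib/hue/apps/search/examples/collections/solr_configs_yelp_demo/index_data.csv /user/cloudera/yelp/

-- then load it in Pig using the HDFS path
Y = LOAD '/user/cloudera/yelp/index_data.csv' USING CSVLoader() AS (business_id:chararray,cool,date,funny,id,stars:int,text:chararray,type,useful:int,user_id,name,full_address,latitude,longitude,neighborhoods,open,review_count,state);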

0 votes

Pig reads from HDFS, and a relative path in LOAD resolves under your HDFS home directory, which on the quickstart VM is /user/cloudera/.

So the way out is to put the file somewhere under that directory, for example:

hdfs dfs -put /usr/lib/hue/apps/search/examples/collections/solr_configs_yelp_demo/index_data.csv /user/cloudera/pigin

and use the following path in the LOAD:

Y = LOAD 'pigin/index_data.csv' USING CSVLoader() AS (business_id:chararray,cool,date,funny,id,stars:int,text:chararray,type,useful:int,user_id,name,full_address,latitude,longitude,neighborhoods,open,review_count,state);
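The relative path works because Pig resolves it against the HDFS home directory, so 'pigin/index_data.csv' becomes /user/cloudera/pigin/index_data.csv. Note that if pigin does not already exist as a directory, hdfs dfs -put will create a file named pigin instead; creating the directory first avoids that, and a listing confirms the upload:

# create the target directory before copying, then confirm the file landed
hdfs dfs -mkdir -p /user/cloudera/pigin
hdfs dfs -ls /user/cloudera/pigin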