
I am running Nutch in a Hadoop multi-node cluster environment.

Hadoop throws an error when Nutch is executed with the following command:

$ bin/hadoop jar /home/nutch/nutch/runtime/deploy/nutch-1.5.1.job org.apache.nutch.crawl.Crawl urls -dir urls -depth 1 -topN 5

Error:

Exception in thread "main" java.io.IOException: Not a file: hdfs://master:54310/user/nutch/urls/crawldb
    at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:170)
    at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:515)
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:753)
    at com.bdc.dod.dashboard.BDCQueryStatsViewer.run(BDCQueryStatsViewer.java:829)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at com.bdc.dod.dashboard.BDCQueryStatsViewer.main(BDCQueryStatsViewer.java:796)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:585)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:155)
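From the trace, FileInputFormat is tripping over hdfs://master:54310/user/nutch/urls/crawldb, which is a directory inside the input path rather than a plain file. A quick way to see what the job is actually reading is to list the HDFS input path (a hypothetical check; the path is taken from the error message above):

$ bin/hadoop dfs -ls /user/nutch/urls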

I have tried the usual ways of solving this and fixed the known issues, such as setting http.agent.name under the local/conf path. I had installed Nutch earlier and it ran smoothly.
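For reference, a quick way to double-check that the property is really set is to grep the conf file (a rough sketch; the exact conf path depends on the install, assumed here from a standard Nutch layout):

$ grep -A 1 http.agent.name runtime/local/conf/nutch-site.xml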

Can anybody suggest a solution?

By the way, I followed this link for installing and running.


1 Answer


I was able to solve this issue. When copying the files from the local file system to the destination HDFS filesystem, I had been using: bin/hadoop dfs -put ~/nutch/urls urls.

However, it should be "bin/hadoop dfs -put ~/nutch/urls/* urls"; here urls/* takes care of the subdirectories.
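For completeness, the re-upload on my setup looked roughly like this (a sketch, assuming the old urls directory in HDFS should be removed first; adjust the paths to your install):

$ bin/hadoop dfs -rmr urls
$ bin/hadoop dfs -put ~/nutch/urls/* urls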