I have installed fully distributed Hadoop 1.2.1. I was trying to integrated nutch with steps below:
- Download apache-nutch-1.9-src.zip
- Add value http.agent.name into nutch-site.xml
- Copy
hadoop-env.sh
,core-site.xml
,hdfs-site.xml
,mapred-site.xml
,masters
,slaves
into $NUTCH_HOME/conf - compile using
ant runtime
- create
urls/seed.txt
and put on hadoop dfs - edit $NUTCH_HOME/conf/regex-urlfilter.txt
Test crawl using command:
bin/hadoop -jar nutch-1.9.job org.apache.nutch.crawl.Crawl urls -dir urls -depth 1 -topN 5
and get this error:
Exception in thread "main" java.lang.ClassNotFoundException: org.apache.nutch.crawl.Crawl at java.net.URLClassLoader$1.run(URLClassLoader.java:366) at java.net.URLClassLoader$1.run(URLClassLoader.java:355) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:354) at java.lang.ClassLoader.loadClass(ClassLoader.java:425) at java.lang.ClassLoader.loadClass(ClassLoader.java:358) at java.lang.Class.forName0(Native Method) at java.lang.Class.forName(Class.java:270) at org.apache.hadoop.util.RunJar.main(RunJar.java:153)
I tried extract nutch-1.9.job and I didn't find out class Crawl in org/apache/nutch/crawl.
Do I need to config something?