Apache Nutch 1.9 on Hadoop 1.2.1 no Crawl class in jar file

Question

I'm running a cluster of five Cubieboards (Raspberry Pi-like ARM boards) with Hadoop 1.2.1 installed on them; I chose that version because the boards are 32-bit. There is one name node and four slave nodes.

For my final paper I want to install Apache Nutch 1.9 and Solr for big data analysis. I followed the setup described here: http://wiki.apache.org/nutch/NutchHadoopTutorial#Deploy_Nutch_to_Multiple_Machines

When I submit the job file to deploy Nutch across the whole cluster, I get a ClassNotFoundException, because there has been no Crawl class since Nutch 1.7 (see http://wiki.apache.org/nutch/bin/nutch%20crawl). It has already been removed from the source as well.

This is the command and the error it produces:

hadoop jar apache-nutch-1.9.job org.apache.nutch.crawl.Crawl urls -dir crawl -depth 3 -topN 5

Warning: $HADOOP_HOME is deprecated.

Exception in thread "main" java.lang.ClassNotFoundException: org.apache.nutch.crawl.Crawl
    at java.net.URLClassLoader$1.run(URLClassLoader.java:217)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:205)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:321)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:266)
    at java.lang.Class.forName0(Native Method)
    at java.lang.Class.forName(Class.java:266)

Other classes I found in the package seem to work, so there should be no problem with the environment settings.
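One way to confirm which classes the job file actually contains is to list its entries, since a .job file is an ordinary zip/jar archive. The following is only a rough sketch: the helper names class_to_entry and job_has_class are made up for illustration, and it assumes unzip is installed on the node.

```shell
#!/bin/sh
# Hypothetical helpers, not part of Nutch or Hadoop.

# Convert a fully qualified class name into its path inside the archive,
# e.g. org.apache.nutch.crawl.Injector -> org/apache/nutch/crawl/Injector.class
class_to_entry() {
    printf '%s.class\n' "$(printf '%s' "$1" | tr '.' '/')"
}

# Succeed if the archive ($1) lists an entry for the class ($2).
# Assumption: unzip is available on this machine.
job_has_class() {
    unzip -l "$1" | grep -F -q "$(class_to_entry "$2")"
}

# Usage (file and class names taken from the command above):
#   job_has_class apache-nutch-1.9.job org.apache.nutch.crawl.Crawl \
#       && echo "present" || echo "missing"
```

If the check reports the class as missing, the ClassNotFoundException comes from the archive contents rather than from the Hadoop environment.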

What alternatives are there for performing a crawl over the whole cluster? Nutch 2.0 and later have a Crawler class, but 1.9 does not :(

Any help is much appreciated. Thank you.

aalbahem aalbahem · Accepted Answer · 2015-01-25T10:51:21

I believe you should use the bin/crawl script instead of submitting the Nutch job to Hadoop yourself. To do that, you need to do the following:

Download the Nutch 1.9 source code; let's say you extracted it into nutch-1.9.
Navigate to nutch-1.9 and run:
```
ant build
```

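Once the build has finished, change into runtime/deploy, upload your seed list to HDFS, and start the crawl:

cd runtime/deploy

hadoop fs -put yourseed yourseedlist

bin/crawl yourseedlist crawl http://yoursolrip/solr/yoursolrcore 3

Here yourseedlist is the HDFS path the seed URLs were uploaded to, crawl is the crawl directory, and the trailing 3 is the number of crawl rounds, which the script expects as its fourth argument (it plays the role of -depth 3 in your original command).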
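For orientation, bin/crawl stands in for the removed Crawl class by running the individual Nutch steps one after another, round by round. The sketch below is a simplified outline, not a copy of the script: it only prints the commands instead of running them, print_inject and print_round are made-up helpers, and the segment name and -topN value are placeholders. The real script does more, for example indexing into Solr.

```shell
#!/bin/sh
# Simplified outline of a Nutch 1.x crawl; prints commands, runs nothing.
# print_inject and print_round are hypothetical helpers, not part of Nutch.

# One-time step: seed the crawl database from the seed directory in HDFS.
print_inject() {
    crawl_dir=$1
    seed_dir=$2
    echo "bin/nutch inject $crawl_dir/crawldb $seed_dir"
}

# One crawl round: select URLs, fetch them, parse them, update the crawl db.
# The generate step creates the segment; its name is passed in here as a
# placeholder because nothing is actually executed.
print_round() {
    crawl_dir=$1
    segment=$2
    top_n=$3
    echo "bin/nutch generate $crawl_dir/crawldb $crawl_dir/segments -topN $top_n"
    echo "bin/nutch fetch $crawl_dir/segments/$segment"
    echo "bin/nutch parse $crawl_dir/segments/$segment"
    echo "bin/nutch updatedb $crawl_dir/crawldb $crawl_dir/segments/$segment"
}

print_inject crawl yourseedlist
print_round crawl 20150125105121 5
```

When these commands are run from runtime/deploy, bin/nutch finds the .job file there and hands each step to Hadoop, so every step runs as a job on the cluster.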

I hope that will help.
