Crawl websites from a Java web application without using bin/nutch

Question

I am trying to use Nutch (1.1) without bin/nutch from my (Java) Mojarra 2.0.2 web app. I have searched Google for examples, but found none that show how to do this. I get an exception and the job fails (I think it has something to do with Hadoop). Here is my code:

  import java.io.File;
  import org.apache.nutch.crawl.Crawl;

  // Builds the arguments that "bin/nutch crawl" would receive and runs the crawl in-process.
  public void run() throws Exception {
      final String[] args = new String[] {
          // Seed URL directory, followed by the crawl output directory
          String.format("%s%s%s%s", JSFUtils.getWebAppRoot(), "nutch", File.separator, DIRECTORY_URLS),
          "-dir", String.format("%s%s%s%s", JSFUtils.getWebAppRoot(), "nutch", File.separator, DIRECTORY_CRAWL),
          "-threads", this.preferences.get("threads"),
          "-depth", this.preferences.get("depth"),
          "-topN", this.preferences.get("topN"),
          "-solr", this.preferences.get("solr")
      };
      Crawl.main(args);
  }

And part of the log output:

10/05/17 10:42:54 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
10/05/17 10:42:54 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
10/05/17 10:42:54 INFO mapred.FileInputFormat: Total input paths to process : 1
10/05/17 10:42:54 INFO mapred.JobClient: Running job: job_local_0001
10/05/17 10:42:54 INFO mapred.FileInputFormat: Total input paths to process : 1
10/05/17 10:42:55 INFO mapred.MapTask: numReduceTasks: 1
10/05/17 10:42:55 INFO mapred.MapTask: io.sort.mb = 100
java.io.IOException: Job failed!
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1232)
        at org.apache.nutch.crawl.Injector.inject(Injector.java:211)
        at org.apache.nutch.crawl.Crawl.main(Crawl.java:124)
        at lan.localhost.process.NutchCrawling.run(NutchCrawling.java:108)
        at lan.localhost.main.Index.indexing(Index.java:71)
        at lan.localhost.bean.FeedingBean.actionStart(FeedingBean.java:25)
        ....

Can someone help me or tell me how I can crawl from a Java application? I have increased Xms to 256m and Xmx to 768m, but nothing changed.

best regards marcel

Check this repo of mine: github.com/yegor256/nutch-in-java It does what you are trying to do and it works. You can use it as an example. — yegor256

Pascal Dimassimo · Accepted Answer · 2010-05-17T13:19:01

You probably have to add the Nutch config files to your classpath. Normally, the bin/nutch script takes care of this using the NUTCH_CONF_DIR environment variable.

The -Dhadoop.log.dir system property might also need to be set.

Take the time to read the bin/nutch script to learn more about both.
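To check both points from inside the web app before calling Crawl.main, a small preflight step along these lines might help. This is only a sketch under assumptions: the class name NutchPreflight and its helper methods are made up for illustration, the file names nutch-default.xml and nutch-site.xml are the config files I would expect Nutch 1.x to look for, and the log directory under java.io.tmpdir is an arbitrary choice. In a web app, the config files would typically have to sit somewhere the webapp class loader can see them, such as WEB-INF/classes.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Nutch or Hadoop.
final class NutchPreflight {

    // Assumed names of the config files that bin/nutch exposes via NUTCH_CONF_DIR.
    static final String[] NUTCH_CONFIGS = { "nutch-default.xml", "nutch-site.xml" };

    /** Returns the resource names that the given class loader cannot find. */
    static List<String> missingResources(ClassLoader loader, String[] names) {
        List<String> missing = new ArrayList<String>();
        for (String name : names) {
            if (loader.getResource(name) == null) {
                missing.add(name);
            }
        }
        return missing;
    }

    /** Sets hadoop.log.dir only if it is unset, like passing -Dhadoop.log.dir. */
    static String ensureHadoopLogDir(String fallbackDir) {
        if (System.getProperty("hadoop.log.dir") == null) {
            System.setProperty("hadoop.log.dir", fallbackDir);
        }
        return System.getProperty("hadoop.log.dir");
    }

    public static void main(String[] args) {
        // In a servlet container the context class loader is usually the webapp loader.
        ClassLoader loader = Thread.currentThread().getContextClassLoader();
        List<String> missing = missingResources(loader, NUTCH_CONFIGS);
        if (!missing.isEmpty()) {
            System.err.println("Not on the classpath: " + missing);
        }

        // Arbitrary log location chosen for this sketch.
        String logDir = System.getProperty("java.io.tmpdir") + File.separator + "nutch-logs";
        System.out.println("hadoop.log.dir = " + ensureHadoopLogDir(logDir));

        // The call to Crawl.main(...) from the question would follow here.
    }
}
```

If the first check reports missing files, the crawl is likely to fail no matter how much heap is assigned, so that is the first thing I would fix.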
