1
votes

Solr and Nutch are already set up locally (in separate directories). I want to crawl a URL, index it, and then integrate that index into Solr.

Running this crawl in the terminal:

                $ bin/nutch crawl urls -solr http://localhost:8983/solr/ -depth 3 -topN 5

Reports this error on the command line:

                Exception in thread "main" java.io.IOException: Job failed!
                        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1357)
                        at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:123)
                        at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:81)
                        at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:65)
                        at org.apache.nutch.crawl.Crawl.run(Crawl.java:155)
                        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
                        at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)

When I then try to push the crawl data into Solr as a separate step, I run this command:

                $ bin/nutch solrindex http://localhost:8983/solr/ crawl/crawldb -linkdb crawl/linkdb crawl/segments/*

Which reports this error on the command line:

                2013-10-23 13:23:38.347 java[15444:1203] Unable to load realm info from SCDynamicStore
                Indexer: java.io.IOException: Job failed!
                        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1357)
                        at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:123)
                        at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:185)
                        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
                        at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:195)

My environment and app versions are as follows:

  • Nutch 1.7
  • Solr 4.5
  • Mac OS X (10.8.5)
  • java version "1.6.0_51"

Suggestions would be appreciated.

2
Can you put your logs/hadoop.log file in pastebin? - nimeshjm
1. For the crawl with Nutch - pastebin.com/zDhips3x 2. For the index to Solr - pastebin.com/mMNSWuwg - markreyes
Just noticed you are using "localhost:8983/solr" as the Solr index URL. Can you try again using a URL that includes your Solr core name, e.g. "localhost:8983/solr/collection1"? - nimeshjm
Crawl and index were both successful. I can't thank you enough, @nimeshjm - markreyes
No worries, happy to help :) - nimeshjm

2 Answers

0
votes

Mr. markreyes, did you get an answer to your Nutch problem?

0
votes

I had the same issue and resolved it by including the core name in the Solr URL passed to the command.

  1. Find your core name

    1a. Go to http://localhost:8983/solr

    1b. In the left-hand navigation there is a pull-down menu titled "Core Selector". Click on it to see the list of Solr cores.

    1c. Write down the core name (e.g. collection1).

  2. Put the core name in the command

    2a. $ bin/nutch solrindex http://localhost:8983/solr/collection1 crawl/crawldb -linkdb crawl/linkdb crawl/segments/*
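
The two steps above could also be scripted. The following is only a minimal sketch: `solr_core_url` is a hypothetical helper of my own (not part of Nutch or Solr), and `collection1` and the `localhost:8983` address are simply the values from this thread. The commented lines assume a running Solr instance and that the CoreAdmin STATUS request is reachable at the usual path.

```shell
#!/bin/sh
# Hypothetical helper: join a Solr base URL and a core name,
# tolerating a single trailing slash on the base URL.
solr_core_url() {
    base="${1%/}"                 # strip one trailing "/" if present
    printf '%s/%s\n' "$base" "$2"
}

# Step 1 (alternative to the admin UI): list the cores via the CoreAdmin API.
# Requires a running Solr, so it is left commented out here:
#   curl 'http://localhost:8983/solr/admin/cores?action=STATUS&wt=json'

# Step 2: pass the core-specific URL to Nutch instead of the bare /solr/ URL:
#   bin/nutch solrindex "$(solr_core_url http://localhost:8983/solr/ collection1)" \
#       crawl/crawldb -linkdb crawl/linkdb crawl/segments/*
```

Judging by the comments on the question, using the core-specific URL with the `-solr` option of `bin/nutch crawl` resolved the first error as well.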