ERROR: Adding 2 documents java.io.IOException: Job failed! (Solr 3.4, Nutch 1.4 bin on Windows using Cygwin)

Question

$ ./nutch crawl urls -solr http://localhost:8080/solr/ -depth 2 -topN 3
cygpath: can't convert empty path
crawl started in: crawl-20140115213017
rootUrlDir = urls
threads = 10
depth = 2
solrUrl=http://localhost:8080/solr/
topN = 3
Injector: starting at 2014-01-15 21:30:17
Injector: crawlDb: crawl-20140115213017/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: finished at 2014-01-15 21:30:21, elapsed: 00:00:03
Generator: starting at 2014-01-15 21:30:21
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: topN: 3
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls for politeness.
Generator: segment: crawl-20140115213017/segments/20140115213024
Generator: finished at 2014-01-15 21:30:26, elapsed: 00:00:04
Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
Fetcher: starting at 2014-01-15 21:30:26
Fetcher: segment: crawl-20140115213017/segments/20140115213024
Using queue mode : byHost
Fetcher: threads: 10
Fetcher: time-out divisor: 2
QueueFeeder finished: total 1 records + hit by time limit :0
Using queue mode : byHost
Using queue mode : byHost
Using queue mode : byHost
Using queue mode : byHost
Using queue mode : byHost
Using queue mode : byHost
Using queue mode : byHost
Using queue mode : byHost
Using queue mode : byHost
Using queue mode : byHost
Fetcher: throughput threshold: -1
Fetcher: throughput threshold retries: 5
fetching http://www.parkinson.org/
-finishing thread FetcherThread, activeThreads=3
-finishing thread FetcherThread, activeThreads=2
-finishing thread FetcherThread, activeThreads=7
-finishing thread FetcherThread, activeThreads=6
-finishing thread FetcherThread, activeThreads=5
-finishing thread FetcherThread, activeThreads=4
-finishing thread FetcherThread, activeThreads=3
-finishing thread FetcherThread, activeThreads=2
-finishing thread FetcherThread, activeThreads=1
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
-finishing thread FetcherThread, activeThreads=0
-activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=0
Fetcher: finished at 2014-01-15 21:30:32, elapsed: 00:00:06
ParseSegment: starting at 2014-01-15 21:30:32
ParseSegment: segment: crawl-20140115213017/segments/20140115213024
Parsing: http://www.parkinson.org/
ParseSegment: finished at 2014-01-15 21:30:34, elapsed: 00:00:01
CrawlDb update: starting at 2014-01-15 21:30:34
CrawlDb update: db: crawl-20140115213017/crawldb
CrawlDb update: segments: [crawl-20140115213017/segments/20140115213024]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: 404 purging: false
CrawlDb update: Merging segment data into db.
CrawlDb update: finished at 2014-01-15 21:30:36, elapsed: 00:00:01
Generator: starting at 2014-01-15 21:30:36
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: topN: 3
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls for politeness.
Generator: segment: crawl-20140115213017/segments/20140115213038
Generator: finished at 2014-01-15 21:30:39, elapsed: 00:00:03
Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
Fetcher: starting at 2014-01-15 21:30:39
Fetcher: segment: crawl-20140115213017/segments/20140115213038
Using queue mode : byHost
Fetcher: threads: 10
Fetcher: time-out divisor: 2
QueueFeeder finished: total 3 records + hit by time limit :0
Using queue mode : byHost
Using queue mode : byHost
fetching http://forum.parkinson.org/
Using queue mode : byHost
Using queue mode : byHost
Using queue mode : byHost
Using queue mode : byHost
Using queue mode : byHost
Using queue mode : byHost
Using queue mode : byHost
Using queue mode : byHost
Fetcher: throughput threshold: -1
Fetcher: throughput threshold retries: 5
fetching http://twitter.com/ParkinsonDotOrg
fetching http://www.youtube.com/user/NPFGuru
-finishing thread FetcherThread, activeThreads=9
-finishing thread FetcherThread, activeThreads=8
-finishing thread FetcherThread, activeThreads=7
-finishing thread FetcherThread, activeThreads=6
-finishing thread FetcherThread, activeThreads=5
-finishing thread FetcherThread, activeThreads=4
-finishing thread FetcherThread, activeThreads=3
-finishing thread FetcherThread, activeThreads=2
-activeThreads=2, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=2, spinWaiting=0, fetchQueues.totalSize=0
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=0
-activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=0
Fetcher: finished at 2014-01-15 21:30:44, elapsed: 00:00:04
ParseSegment: starting at 2014-01-15 21:30:44
ParseSegment: segment: crawl-20140115213017/segments/20140115213038
Parsing: http://forum.parkinson.org/
ParseSegment: finished at 2014-01-15 21:30:45, elapsed: 00:00:01
CrawlDb update: starting at 2014-01-15 21:30:45
CrawlDb update: db: crawl-20140115213017/crawldb
CrawlDb update: segments: [crawl-20140115213017/segments/20140115213038]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: 404 purging: false
CrawlDb update: Merging segment data into db.
CrawlDb update: finished at 2014-01-15 21:30:46, elapsed: 00:00:01
LinkDb: starting at 2014-01-15 21:30:46
LinkDb: linkdb: crawl-20140115213017/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
LinkDb: adding segment: file:/C:/cygwin/home/nutch/runtime/local/bin/crawl-20140115213017/segments/20140115213024
LinkDb: adding segment: file:/C:/cygwin/home/nutch/runtime/local/bin/crawl-20140115213017/segments/20140115213038
LinkDb: finished at 2014-01-15 21:30:47, elapsed: 00:00:01
SolrIndexer: starting at 2014-01-15 21:30:47
Adding 2 documents
java.io.IOException: Job failed!
SolrDeleteDuplicates: starting at 2014-01-15 21:30:52
SolrDeleteDuplicates: Solr url: http://localhost:8080/solr/
SolrDeleteDuplicates: finished at 2014-01-15 21:30:53, elapsed: 00:00:01
crawl finished: crawl-20140115213017

I'm new to Apache and need some help. I am trying to send crawled data to Solr for searching, but I get the error "java.io.IOException: Job failed!" during the SolrIndexer step.

I think it is related to your Solr configuration. Look at your Solr logs (and post them here if there are errors in them). Also check your Nutch logs (in the nutch/logs directory). - tahagh
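The console output only reports "Job failed!", so the underlying exception has to be read from the log file. A minimal sketch of one way to pull it out, assuming a Cygwin shell with GNU grep; `show_failures` is a hypothetical helper name, and the path in the usage comment is an assumption based on the install location shown in the question:

```shell
# Print each line mentioning an error or exception, plus the lines after it.
# Usage: show_failures <logfile> [context_lines]
# (hypothetical helper, not part of Nutch or Solr)
show_failures() {
  log_file="$1"
  context="${2:-20}"   # number of trailing lines to show, default 20
  if [ ! -f "$log_file" ]; then
    echo "log file not found: $log_file" >&2
    return 1
  fi
  # -n prefixes line numbers, -A prints the lines following each match
  grep -n -A "$context" -e 'Exception' -e 'ERROR' "$log_file"
}

# Example call (path is an assumption; adjust to your install):
# show_failures /home/nutch/runtime/local/logs/hadoop.log
```

The stack trace that follows the first match usually names the real cause, for example a field that Solr rejected.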

Allan Macmillan · Accepted Answer · 2014-01-16T20:43:20

It sounds like the schema files for Solr and Nutch don't match up. Check out this post; I use Solr 4.3, but I don't think it should be too different:

http://amac4.blogspot.com/2013/07/configuring-nutch-to-crawl-urls.html

The log files have more detailed information about the problem, so you could post them here too.
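If the logs do point to a schema mismatch, one common fix is to replace Solr's schema.xml with the one shipped in Nutch's conf directory and then restart Solr. A rough sketch of that step, assuming a Cygwin shell; `sync_schema` is a hypothetical helper name, and both paths in the usage comment are placeholders, since the location of the Solr home is not given in the question:

```shell
# Copy Nutch's schema over Solr's when the two differ, keeping a backup.
# Usage: sync_schema <nutch_schema.xml> <solr_schema.xml>
# (hypothetical helper, not part of Nutch or Solr)
sync_schema() {
  nutch_schema="$1"
  solr_schema="$2"
  if [ ! -f "$nutch_schema" ]; then
    echo "missing Nutch schema: $nutch_schema" >&2
    return 1
  fi
  # cmp -s is silent and succeeds only when the files are byte-identical
  if [ -f "$solr_schema" ] && cmp -s "$nutch_schema" "$solr_schema"; then
    echo "schemas already match"
    return 0
  fi
  # Keep the old Solr schema so the change can be undone.
  if [ -f "$solr_schema" ]; then
    cp "$solr_schema" "$solr_schema.bak" || return 1
  fi
  cp "$nutch_schema" "$solr_schema" && echo "schema copied; restart Solr to load it"
}

# Example call (both paths are assumptions; adjust to your install):
# sync_schema /home/nutch/runtime/local/conf/schema.xml /path/to/solr/conf/schema.xml
```

Solr only reads schema.xml at startup, so the servlet container has to be restarted before running the crawl again.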

