
I'm using Nutch and Solr to index a file share.

I first issue: bin/nutch crawl urls

Which gives me:

solrUrl is not set, indexing will be skipped...
crawl started in: crawl-20110804191414
rootUrlDir = urls
threads = 10
depth = 5
solrUrl=null
Injector: starting at 2011-08-04 19:14:14
Injector: crawlDb: crawl-20110804191414/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: finished at 2011-08-04 19:14:16, elapsed: 00:00:02
Generator: starting at 2011-08-04 19:14:16
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls for politeness.
Generator: segment: crawl-20110804191414/segments/20110804191418
Generator: finished at 2011-08-04 19:14:20, elapsed: 00:00:03
Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
Fetcher: starting at 2011-08-04 19:14:20
Fetcher: segment: crawl-20110804191414/segments/20110804191418
Fetcher: threads: 10
QueueFeeder finished: total 1 records + hit by time limit :0
-finishing thread FetcherThread, activeThreads=9
-finishing thread FetcherThread, activeThreads=8
-finishing thread FetcherThread, activeThreads=7
-finishing thread FetcherThread, activeThreads=6
-finishing thread FetcherThread, activeThreads=5
-finishing thread FetcherThread, activeThreads=4
-finishing thread FetcherThread, activeThreads=3
-finishing thread FetcherThread, activeThreads=2
-finishing thread FetcherThread, activeThreads=1
fetching file:///mnt/public/Personal/Reminder Building Security.htm
-finishing thread FetcherThread, activeThreads=0
-activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=0
Fetcher: finished at 2011-08-04 19:14:22, elapsed: 00:00:02
ParseSegment: starting at 2011-08-04 19:14:22
ParseSegment: segment: crawl-20110804191414/segments/20110804191418
ParseSegment: finished at 2011-08-04 19:14:23, elapsed: 00:00:01
CrawlDb update: starting at 2011-08-04 19:14:23
CrawlDb update: db: crawl-20110804191414/crawldb
CrawlDb update: segments: [crawl-20110804191414/segments/20110804191418]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: Merging segment data into db.
CrawlDb update: finished at 2011-08-04 19:14:24, elapsed: 00:00:01
Generator: starting at 2011-08-04 19:14:24
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: jobtracker is 'local', generating exactly one partition.
Generator: 0 records selected for fetching, exiting ...
Stopping at depth=1 - no more URLs to fetch.
LinkDb: starting at 2011-08-04 19:14:25
LinkDb: linkdb: crawl-20110804191414/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
LinkDb: adding segment: file:/home/nutch/nutch-1.3/runtime/local/crawl-20110804191414/segments/20110804191418
LinkDb: finished at 2011-08-04 19:14:26, elapsed: 00:00:01
crawl finished: crawl-20110804191414
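
For what it's worth, the solrUrl=null line above just means I ran the one-step crawl without passing a Solr URL, and the http.agent.name warning only asks that the agent name also be listed first in the http.robots.agents property in conf/nutch-site.xml. In Nutch 1.3 the crawl command can take the Solr URL directly, roughly like this (a sketch, matching the depth and thread counts from the log above):

# one-step crawl that also pushes results to Solr (Nutch 1.3 syntax)
bin/nutch crawl urls -solr http://localhost:8983/solr/ -depth 5 -threads 10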

Then I issue: bin/nutch solrindex http://localhost:8983/solr/ crawl-20110804191414/crawldb crawl-20110804191414/linkdb crawl-20110804191414/segments/*

Which gives me:

SolrIndexer: starting at 2011-08-04 19:17:07
SolrIndexer: finished at 2011-08-04 19:17:08, elapsed: 00:00:01
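
A one-second SolrIndexer run with no other output looked suspicious, so one sanity check is to dump the segment and confirm the fetched and parsed content is actually there (a sketch; segdump is just a hypothetical output directory name):

# dump the segment from the crawl above into readable text
bin/nutch readseg -dump crawl-20110804191414/segments/20110804191418 segdump
less segdump/dump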

When I run a *:* query against Solr I get:

<response>
     <lst name="responseHeader">
          <int name="status">0</int>
          <int name="QTime">2</int>
          <lst name="params">
               <str name="indent">on</str>
               <str name="start">0</str>
               <str name="q">*:*</str>
               <str name="version">2.2</str>
               <str name="rows">10</str>
          </lst>
     </lst>
     <result name="response" numFound="0" start="0"/>
</response>

:(
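
For reference, that response came from the catch-all match-everything query; the equivalent raw request against the default example Solr instance is roughly:

curl 'http://localhost:8983/solr/select?q=*:*&start=0&rows=10&indent=on&version=2.2'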

Note that this worked fine when I used protocol-http to crawl a website, but it does not work when I use protocol-file to crawl a file system.
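
For completeness: crawling a file system means the protocol-file plugin has to be enabled in conf/nutch-site.xml. My plugin.includes looks roughly like the stock Nutch 1.3 value with protocol-http swapped for protocol-file (a sketch, not my exact config):

<property>
  <name>plugin.includes</name>
  <value>protocol-file|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>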

---EDIT--- After trying this again today, I noticed that files with spaces in their names were causing 404 errors, and that accounts for a lot of the files on the share I'm indexing. However, the thumbs.db files were making it in OK, which tells me the problem is not what I thought it was.
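
If spaces in the file: URLs really were the culprit, one workaround would be to percent-encode them in the seed list before crawling (a hypothetical sketch; urls/seed.txt stands in for whatever the seed file is actually named):

# replace literal spaces with %20 in the seed URLs
sed -i 's/ /%20/g' urls/seed.txt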

I also did a segment dump and found that PDF text content is being indexed, which is GREAT since that's what I need this for. I can't figure out why Solr isn't being updated with all of the data. – Seth Griffin
I also tried indexing a single PDF file renamed to just one word. The segment data is there, and the text gets parsed out, but no search results show up in Solr after doing bin/nutch solrindex... – Seth Griffin
Still haven't been able to fix this problem. I've opened an issue with Apache regarding it; it seems to have at least one developer assigned: issues.apache.org/jira/browse/NUTCH-1076 – Seth Griffin

2 Answers


I've spent much of today retracing your steps. I eventually resorted to printf debugging in /opt/nutch/src/java/org/apache/nutch/indexer/IndexerMapReduce.java, which showed me that each URL I was trying to index appeared twice: once starting with file:///var/www/Engineering/, as I'd originally specified, and once starting with file:/u/u60/Engineering/. On this system, /var/www/Engineering is a symlink to /u/u60/Engineering. Furthermore, the /var/www/Engineering URLs were rejected because the parseText field wasn't supplied, while the /u/u60/Engineering URLs were rejected because the fetchDatum field wasn't supplied. Specifying the original seed URLs in the /u/u60/Engineering form solved my problem. Hope that helps the next sap in this situation.
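
In other words, make sure the seed URLs use the fully resolved path rather than a symlinked one. A quick way to see what a given path canonicalizes to (a sketch using the paths from my setup):

# print the canonical path with all symlinks resolved
readlink -f /var/www/Engineering
# -> /u/u60/Engineering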


This is because Solr didn't get any data to index. It seems you have not executed the previous commands properly. Restart the whole process and then try the last command. Copy the commands from the Nutch tutorial: https://wiki.apache.org/nutch/NutchTutorial or see my video on YouTube: https://www.youtube.com/watch?v=aEap3B3M-PU&t=449s