
I'm using Nutch and Solr to index a file share.

I first issue: bin/nutch crawl urls

Which gives me:

solrUrl is not set, indexing will be skipped...
crawl started in: crawl-20110804191414
rootUrlDir = urls
threads = 10
depth = 5
solrUrl=null
Injector: starting at 2011-08-04 19:14:14
Injector: crawlDb: crawl-20110804191414/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: finished at 2011-08-04 19:14:16, elapsed: 00:00:02
Generator: starting at 2011-08-04 19:14:16
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls for politeness.
Generator: segment: crawl-20110804191414/segments/20110804191418
Generator: finished at 2011-08-04 19:14:20, elapsed: 00:00:03
Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
Fetcher: starting at 2011-08-04 19:14:20
Fetcher: segment: crawl-20110804191414/segments/20110804191418
Fetcher: threads: 10
QueueFeeder finished: total 1 records + hit by time limit :0
-finishing thread FetcherThread, activeThreads=9
-finishing thread FetcherThread, activeThreads=8
-finishing thread FetcherThread, activeThreads=7
-finishing thread FetcherThread, activeThreads=6
-finishing thread FetcherThread, activeThreads=5
-finishing thread FetcherThread, activeThreads=4
-finishing thread FetcherThread, activeThreads=3
-finishing thread FetcherThread, activeThreads=2
-finishing thread FetcherThread, activeThreads=1
fetching file:///mnt/public/Personal/Reminder Building Security.htm
-finishing thread FetcherThread, activeThreads=0
-activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=0
Fetcher: finished at 2011-08-04 19:14:22, elapsed: 00:00:02
ParseSegment: starting at 2011-08-04 19:14:22
ParseSegment: segment: crawl-20110804191414/segments/20110804191418
ParseSegment: finished at 2011-08-04 19:14:23, elapsed: 00:00:01
CrawlDb update: starting at 2011-08-04 19:14:23
CrawlDb update: db: crawl-20110804191414/crawldb
CrawlDb update: segments: [crawl-20110804191414/segments/20110804191418]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: Merging segment data into db.
CrawlDb update: finished at 2011-08-04 19:14:24, elapsed: 00:00:01
Generator: starting at 2011-08-04 19:14:24
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: jobtracker is 'local', generating exactly one partition.
Generator: 0 records selected for fetching, exiting ...
Stopping at depth=1 - no more URLs to fetch.
LinkDb: starting at 2011-08-04 19:14:25
LinkDb: linkdb: crawl-20110804191414/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
LinkDb: adding segment: file:/home/nutch/nutch-1.3/runtime/local/crawl-20110804191414/segments/20110804191418
LinkDb: finished at 2011-08-04 19:14:26, elapsed: 00:00:01
crawl finished: crawl-20110804191414
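
For what it's worth, the solrUrl=null line above just means I ran the one-step crawl without passing a Solr URL, and the http.agent.name warning only asks that the agent name also be listed first in the http.robots.agents property in conf/nutch-site.xml. In Nutch 1.3 the crawl command can take the Solr URL directly, roughly like this (a sketch, matching the depth and thread counts from the log above):

# one-step crawl that also pushes results to Solr (Nutch 1.3 syntax)
bin/nutch crawl urls -solr http://localhost:8983/solr/ -depth 5 -threads 10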

Then I issue: bin/nutch solrindex http://localhost:8983/solr/ crawl-20110804191414/crawldb crawl-20110804191414/linkdb crawl-20110804191414/segments/*

Which gives me:

SolrIndexer: starting at 2011-08-04 19:17:07
SolrIndexer: finished at 2011-08-04 19:17:08, elapsed: 00:00:01
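
A one-second SolrIndexer run with no other output looked suspicious, so one sanity check is to dump the segment and confirm the fetched and parsed content is actually there (a sketch; segdump is just a hypothetical output directory name):

# dump the segment from the crawl above into readable text
bin/nutch readseg -dump crawl-20110804191414/segments/20110804191418 segdump
less segdump/dump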

When I run a *:* query against Solr I get:

<response>
     <lst name="responseHeader">
          <int name="status">0</int>
          <int name="QTime">2</int>
          <lst name="params">
               <str name="indent">on</str>
               <str name="start">0</str>
               <str name="q">*:*</str>
               <str name="version">2.2</str>
               <str name="rows">10</str>
          </lst>
     </lst>
     <result name="response" numFound="0" start="0"/>
</response>

:(
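
For reference, that response came from the catch-all match-everything query; the equivalent raw request against the default example Solr instance is roughly:

curl 'http://localhost:8983/solr/select?q=*:*&start=0&rows=10&indent=on&version=2.2'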

Note that this worked fine when I used protocol-http to crawl a website, but it does not work when I use protocol-file to crawl a file system.
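
For completeness: crawling a file system means the protocol-file plugin has to be enabled in conf/nutch-site.xml. My plugin.includes looks roughly like the stock Nutch 1.3 value with protocol-http swapped for protocol-file (a sketch, not my exact config):

<property>
  <name>plugin.includes</name>
  <value>protocol-file|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>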

---EDIT--- After trying this again today, I noticed that files with spaces in their names were causing 404 errors, and that accounts for a lot of the files on the share I'm indexing. However, the thumbs.db files were making it in OK, which tells me the problem is not what I thought it was.
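
If spaces in the file: URLs really were the culprit, one workaround would be to percent-encode them in the seed list before crawling (a hypothetical sketch; urls/seed.txt stands in for whatever the seed file is actually named):

# replace literal spaces with %20 in the seed URLs
sed -i 's/ /%20/g' urls/seed.txt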

I also did a segment dump and found that PDF text content is being indexed, which is GREAT since that's what I need this for. I can't figure out why Solr isn't being updated with all of the data. – Seth Griffin
I also tried indexing a single PDF file renamed to just one word. The segment data is there, and the text gets parsed out, but no search results show up in Solr after doing bin/nutch solrindex... – Seth Griffin
Still haven't been able to fix this problem. I've opened an issue with Apache regarding it; it seems to have at least one developer assigned: issues.apache.org/jira/browse/NUTCH-1076 – Seth Griffin

2 Answers


I've spent much of today retracing your steps. I eventually resorted to printf debugging in /opt/nutch/src/java/org/apache/nutch/indexer/IndexerMapReduce.java, which showed me that each URL I was trying to index appeared twice: once starting with file:///var/www/Engineering/, as I'd originally specified, and once starting with file:/u/u60/Engineering/. On this system, /var/www/Engineering is a symlink to /u/u60/Engineering. Furthermore, the /var/www/Engineering URLs were rejected because the parseText field wasn't supplied, while the /u/u60/Engineering URLs were rejected because the fetchDatum field wasn't supplied. Specifying the original seed URLs in the /u/u60/Engineering form solved my problem. Hope that helps the next sap in this situation.
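
In other words, make sure the seed URLs use the fully resolved path rather than a symlinked one. A quick way to see what a given path canonicalizes to (a sketch using the paths from my setup):

# print the canonical path with all symlinks resolved
readlink -f /var/www/Engineering
# -> /u/u60/Engineering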


This is because Solr didn't get any data to index. It seems you have not executed the previous commands properly. Restart the whole process and then try the last command. Copy the commands from the Nutch tutorial: https://wiki.apache.org/nutch/NutchTutorial or see my video on YouTube: https://www.youtube.com/watch?v=aEap3B3M-PU&t=449s