3 votes

I have a Nutch index crawled from a specific domain and I am using the solrindex command to push the crawled data to my Solr index. The problem is that only some of the crawled URLs seem to actually be indexed in Solr. I had the Nutch crawl output to a text file, so I can see the URLs that it crawled, but when I search for some of those URLs in Solr I get no results.

Command I am using to do the Nutch crawl: bin/nutch crawl urls -dir crawl -depth 20 -topN 2000000

This command is completing successfully and the output displays URLs that I cannot find in the resulting Solr index.

Command I am using to push the crawled data to Solr: bin/nutch solrindex http://localhost:8983/solr/ crawl/crawldb crawl/linkdb crawl/segments/*

The output for this command says it is also completing successfully, so it does not seem to be an issue with the process terminating prematurely (which is what I initially thought it might be).
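A quick way to quantify the gap (just a sketch - it assumes the single-core Solr URL from the solrindex command above and the standard readdb tool that ships with Nutch 1.x):

# Count the documents that actually made it into Solr:
curl "http://localhost:8983/solr/select?q=*:*&rows=0&wt=json"

# Count the pages Nutch considers successfully fetched:
bin/nutch readdb crawl/crawldb -stats

Comparing numFound in the Solr response with the db_fetched figure in the stats output shows roughly how many pages are going missing between the two stages.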

One final thing that I find strange is that the entire Nutch & Solr config is identical to a setup I used previously on a different server, and I had no problems that time. They are literally the same config files, copied onto this new server.

TL;DR: I have a set of URLs successfully crawled in Nutch, but when I run the solrindex command only some of them are pushed to Solr. Please help.

UPDATE: I've re-run all these commands and the output still insists it's all working fine. I've looked into every blocker to indexing that I can think of, but still no luck. The URLs being passed to Solr are all active and publicly accessible, so that's not the issue. I'm really banging my head against a wall here, so I would love some help.
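In case it helps anyone spot what I'm missing, here is a rough way to check an individual URL end-to-end (the dump directory name and the example URL are just placeholders):

# Dump the crawldb to plain text so individual entries can be inspected:
bin/nutch readdb crawl/crawldb -dump crawldb-dump

# Check whether a URL that is missing from Solr is present in the crawldb and marked as fetched
# (the part-* file names come from Hadoop's standard output naming):
grep -A2 "http://www.example.com/some-missing-page" crawldb-dump/part-*

# Query Solr for the same URL (assuming the stock Nutch schema, where the field is "url"):
curl "http://localhost:8983/solr/select?q=url:%22http://www.example.com/some-missing-page%22&wt=json"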

Try using the step-by-step commands. Maybe the output will enlighten you :) Here is an example: wiki.apache.org/nutch/IntranetRecrawl – mana
Did you ever resolve this? My issue is similar: every step before the solrindex seems to work fine, but I have no data in Solr. I imagine it's one of the XML files excluding data somehow. – Carlton
Nope - I never did get it worked out. The project eventually fell away and I've since left the company that it was a part of - now I don't even work with Solr or Nutch anymore. – Hugh Lashbrooke
Can you answer the following question? stackoverflow.com/questions/27597504/… – Hafiz Muhammad Shafiq

1 Answer

1 vote

I can only guess at what happened, based on my experience:

There is a component called the URL normalizer (the regex-based plugin is configured in conf/regex-normalize.xml) which rewrites or truncates some URLs (removing URL parameters, session IDs, ...).
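You can check whether the normalizer is the culprit by running a suspect URL through it directly. A rough sketch, assuming a Nutch 1.x install whose bin/nutch script accepts a class name and ships the URLNormalizerChecker tool (the URL is made up):

# Print the normalized form of a URL, using whichever normalizer plugins are enabled:
echo "http://www.example.com/foo.jsp?param=value" | bin/nutch org.apache.nutch.net.URLNormalizerChecker

# The regex-based rules (session id stripping etc.) live here:
cat conf/regex-normalize.xml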

Additionally, Nutch enforces a uniqueness constraint: by default, each URL is only stored once.

So, if the normalizer truncates two or more URLs ('foo.jsp?param=value', 'foo.jsp?param=value2', 'foo.jsp?param=value3', ...) down to exactly the same one ('foo.jsp'), they only get saved once, and Solr will only see a subset of all your crawled URLs.
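If that is what is happening, the numbers should give it away, because the crawldb will hold far fewer entries than the crawl log suggests. A rough comparison (crawl-output.txt is just a placeholder for the text file you wrote the crawl output to):

# Distinct URLs that the crawl reported:
sort -u crawl-output.txt | wc -l

# Total URLs actually kept in the crawldb after normalization and deduplication:
bin/nutch readdb crawl/crawldb -stats

If the TOTAL urls figure from the stats output is much smaller than the first number, collapsed URLs are a likely explanation.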

cheers