
We have thousands of Solr indexes/collections that share pages being crawled by Nutch.

Currently these pages are crawled multiple times, once for each Solr index that contains them.

Is it possible to crawl these sites once and share the crawl data between indexes?

Maybe by checking the existing crawldbs to see whether a site has already been crawled, and fetching the data from there for parsing and indexing.

Or crawl all sites in one go, and then selectively submit the crawl data to each index (e.g. one site per segment, though I am not sure how to identify which segment belongs to which site, since segment names are just numeric timestamps).
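One way to work out which site a segment holds is simply to read the URLs stored in it. Below is a rough sketch, assuming the Nutch 1.x on-disk segment layout (Hadoop SequenceFile/MapFile data under crawl_fetch, keyed by the page URL); exact paths and value types vary between versions, and the Nutch jar must be on the classpath so the stored value class can be loaded.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.Writable;
    import org.apache.hadoop.util.ReflectionUtils;

    public class SegmentUrls {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Path to a segment's fetch data, e.g.
        // crawl/segments/20240101123456/crawl_fetch/part-00000/data
        Path data = new Path(args[0]);
        try (SequenceFile.Reader reader =
                 new SequenceFile.Reader(conf, SequenceFile.Reader.file(data))) {
          Text url = new Text();
          Writable value = (Writable) ReflectionUtils.newInstance(
              reader.getValueClass(), conf);
          while (reader.next(url, value)) {
            // The key is the page URL; its host tells you which
            // site this segment's records belong to.
            System.out.println(url);
          }
        }
      }
    }

Grouping the printed URLs by host would then map each numerically named segment back to its site.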

Any ideas or help appreciated :)


1 Answer


You will need to write a new indexer plugin to do that; look at Nutch's SolrIndexer to see how an indexer is written. In that plugin, you should do the following:

  1. Define three or four Solr server instances, one for each core.
  2. Inside the write method of the indexer, examine the type of the document and use the right Solr core to add it. Ideally, you should have a field in Nutch that you can use to determine where to send the document; a sketch of this routing logic follows the list.
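Here is a minimal sketch of that routing step, not a complete Nutch plugin: it assumes SolrJ's HttpSolrClient and a hypothetical "index" field, attached to each document at indexing time (for example by a custom IndexingFilter), that names the target core. In a real plugin this logic would sit inside the indexer's write method.

    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;

    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.SolrServerException;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.common.SolrInputDocument;

    public class RoutingIndexWriter {

      // Assumed Solr base URL; each core lives under it.
      private static final String SOLR_BASE = "http://localhost:8983/solr/";

      // One client per target core, created lazily and reused across documents.
      private final Map<String, SolrClient> clients = new HashMap<>();

      private SolrClient clientFor(String core) {
        return clients.computeIfAbsent(core,
            c -> new HttpSolrClient.Builder(SOLR_BASE + c).build());
      }

      // In a real plugin this would be the IndexWriter's write(NutchDocument)
      // method; the document arrives here as a ready-made SolrInputDocument
      // for brevity.
      public void write(SolrInputDocument doc)
          throws IOException, SolrServerException {
        // "index" is the hypothetical routing field that must be attached to
        // each page during indexing, e.g. derived from the URL or seed list.
        Object target = doc.getFieldValue("index");
        if (target == null) {
          throw new IOException("Document is missing the 'index' routing field");
        }
        clientFor(target.toString()).add(doc);
      }

      public void commitAndClose() throws IOException, SolrServerException {
        for (SolrClient client : clients.values()) {
          client.commit();
          client.close();
        }
      }
    }

With something like this in place, each page is fetched and parsed once, and the routing field alone decides which core receives the document.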