We have thousands of Solr indexes/collections that share pages being crawled by Nutch.
Currently these pages are being crawled multiple times, once for each Solr index that contains them.
Is it possible to crawl these sites once and share the crawl data between indexes?
Maybe by checking the existing crawldbs to see whether a site has already been crawled, and pulling the data from there for parsing and indexing.
Or crawl all sites in one go, and then selectively submit crawl data to each index (e.g. one site per segment, but I'm not sure how to identify which segment belongs to which site, since segment names are just numeric).
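One way around the numeric segment names might be to dump each segment's URLs (e.g. with `bin/nutch readseg -dump`) and group them by host, then route each host's pages to the matching index. A minimal Python sketch of the grouping step, assuming you already have the URLs as plain strings (the `group_urls_by_host` helper and the sample URLs are hypothetical, not part of Nutch):

```python
from collections import defaultdict
from urllib.parse import urlparse

def group_urls_by_host(urls):
    """Group crawled URLs by host so each site's pages can be
    submitted to its own Solr index. `urls` is any iterable of
    URL strings, e.g. extracted from a segment dump."""
    by_host = defaultdict(list)
    for url in urls:
        host = urlparse(url).netloc  # e.g. "site-a.example"
        by_host[host].append(url)
    return dict(by_host)

# Example: decide which index each crawled page belongs to.
urls = [
    "http://site-a.example/page1",
    "http://site-b.example/page2",
    "http://site-a.example/page3",
]
grouped = group_urls_by_host(urls)
```

With a host-to-index mapping on the indexing side, you could then feed each group to the right collection instead of re-crawling per index.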
Any ideas or help appreciated :)