I was able to set up Apache Nutch and get the data indexed in Solr. While indexing, I am trying to make sure that only modified pages get indexed. Below are the two questions I have about this.
Is it possible to tell Nutch to send the ‘If-Modified-Since’ header while crawling the site, so that a page is downloaded only if it has changed since the last time it was crawled?
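To make the first question concrete, here is a minimal sketch of the kind of conditional request I mean. It uses only the JDK's `java.net.http` classes, not Nutch's own fetcher code. The class name `ConditionalGetSketch`, the method `conditionalGet`, and the example URL are made up for illustration. The request is only built here, not sent.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

public class ConditionalGetSketch {
    // HTTP-date format (fixed-length, always expressed in GMT).
    private static final DateTimeFormatter HTTP_DATE =
            DateTimeFormatter.ofPattern("EEE, dd MMM yyyy HH:mm:ss 'GMT'", Locale.ENGLISH);

    // Hypothetical helper: builds (but does not send) a GET request that
    // carries the time of the previous fetch in If-Modified-Since.
    static HttpRequest conditionalGet(String url, ZonedDateTime lastFetch) {
        String httpDate = HTTP_DATE.format(lastFetch.withZoneSameInstant(ZoneOffset.UTC));
        return HttpRequest.newBuilder(URI.create(url))
                .header("If-Modified-Since", httpDate)
                .GET()
                .build();
    }

    public static void main(String[] args) {
        ZonedDateTime lastFetch = ZonedDateTime.of(2024, 1, 5, 10, 0, 0, 0, ZoneOffset.UTC);
        HttpRequest request = conditionalGet("http://example.com/page.html", lastFetch);
        System.out.println(request.headers().firstValue("If-Modified-Since").orElse("(none)"));
    }
}
```

A server that honours the header would answer 304 Not Modified with no body when the page is unchanged, and that is the case in which I would like Nutch to skip the download.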
I can see that Nutch computes an MD5 digest of the retrieved page content, but even when the digest has not changed (compared to the previous version), it still indexes the page in Solr. Is there a setting within Nutch to make sure that a page whose content has not changed is not re-indexed in Solr?
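For the second question, the behaviour I am after can be sketched as follows. This is plain JDK code, not Nutch's actual signature or indexer classes. The names `DigestSkipSketch`, `md5Hex`, and `shouldIndex` are hypothetical, and I am assuming the previous digest is available as a hex string.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class DigestSkipSketch {
    // Returns the MD5 digest of the content as a lowercase hex string.
    static String md5Hex(byte[] content) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5").digest(content);
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is required to be present on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    // Hypothetical check: index only when there is no previous digest
    // or the newly fetched content hashes to a different value.
    static boolean shouldIndex(String previousDigest, byte[] newContent) {
        return previousDigest == null || !previousDigest.equals(md5Hex(newContent));
    }

    public static void main(String[] args) {
        byte[] page = "hello".getBytes(StandardCharsets.UTF_8);
        String previous = md5Hex(page);
        System.out.println(shouldIndex(previous, page)); // same content, so no re-index
        System.out.println(shouldIndex(previous, "hello!".getBytes(StandardCharsets.UTF_8)));
    }
}
```

In other words, when the stored digest and the new digest are equal, I would expect the page not to be sent to Solr again.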