Apache Nutch section pages handling trick

Question

I have configured Nutch 2.3.1 with the Hadoop/HBase ecosystem. The goal is mostly to crawl and index story pages, and for that I have prepared a seed list of some domains. I am now facing a logical problem: Nutch treats pages at every level of a domain the same way. For example, after the home pages of a few domains are fetched, some of the discovered documents are not story pages but section pages; news websites, for instance, link to different news categories. If a user clicks on a category, e.g. nation, the resulting page lists many news items from that category. Nutch crawls this page and stores fragments of text from many stories as its content. Over time these pages change (as the news is updated), so if such a page is indexed, a user who reaches it from a search result will find that the text has changed. Here is just an example page.

How and where should I handle such cases? I think it should be handled in some Nutch phase, so that Nutch fetches such pages and picks up their URLs to move forward, but does not index the pages themselves. Is this option available in Nutch, and if not, what are the possible ways?

Yossi Yossi · Accepted Answer · 2018-08-05T14:28:25

You need to implement an IndexingFilter that will return null for pages you don't want to index.

In Nutch 1.14, you may be able to use JexlIndexingFilter with a simple JEXL expression on the URLs, but I don't think this has been ported to Nutch 2.x.

As long as you know the format of the URLs you want to filter out from indexing, writing such a filter should be easy.
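As a rough illustration of the URL check such a filter could rely on, here is a minimal, self-contained sketch. The class name SectionPageDetector and the URL patterns are hypothetical and are not part of Nutch; real patterns would have to be derived from the sites in your seed list. The sketch assumes that section pages have short paths such as /nation or /category/sports, while story pages have deeper paths.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch: decides from the URL alone
// whether a page looks like a section/category listing.
public class SectionPageDetector {

    // Assumed patterns for illustration only; adjust them to the crawled sites.
    private static final Pattern[] SECTION_PATTERNS = {
        Pattern.compile("^/?$"),                               // home page
        Pattern.compile("^/(category|section|tag)/[^/]+/?$"),  // e.g. /category/sports
        Pattern.compile("^/[a-z-]+/?$")                        // single segment, e.g. /nation
    };

    // Returns true when the URL path matches one of the section patterns.
    public static boolean isSectionPage(String url) {
        String path;
        try {
            path = new URI(url).getPath();
        } catch (URISyntaxException e) {
            return false; // unparsable URL: make no claim about it
        }
        if (path == null) {
            path = ""; // opaque URIs have no path
        }
        for (Pattern p : SECTION_PATTERNS) {
            if (p.matcher(path).matches()) {
                return true;
            }
        }
        return false;
    }

    // Convenience inverse: true when the page should be sent to the index.
    public static boolean shouldIndex(String url) {
        return !isSectionPage(url);
    }

    public static void main(String[] args) {
        String[] samples = {
            "http://example.com/",
            "http://example.com/nation",
            "http://example.com/nation/2018/08/05/some-story.html"
        };
        for (String s : samples) {
            System.out.println(s + " index=" + shouldIndex(s));
        }
    }
}
```

Inside a custom IndexingFilter, the filter method would return null when isSectionPage(url) is true and return the document unchanged otherwise. Since the indexing filter runs only at indexing time, section pages should still be fetched and parsed, and their outlinks followed as usual.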
