0
votes

I have configured Nutch 2.3.1 with the Hadoop/HBase ecosystem to crawl Urdu-language content. For language detection, I customized the fetcher so that the language is detected at fetch time. If a document does not contain enough Urdu text (in bytes), I deliberately set its status to gone so that this edge does not keep growing with null content. I also need to discover new Urdu domains.

I still have a problem with how URLs are selected for fetching. As time passes, the inlinks data has grown, and it now includes many URLs that are not in Urdu. About 90% of the URLs that Nutch's Generator selects are ones without Urdu content. As a result my resources are wasted, because only a very small amount of new Urdu content is fetched.

How can I tell Nutch to prefer documents from domains that are likely to have Urdu content? I think I have to customize the ranking algorithm somehow. What are the possible ways to achieve this?


1 Answer

1
votes

I think the easiest solution would be to assign a really low score to these unimportant URLs, and perhaps to set a minimum score threshold for the generator (https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/crawl/Generator.java#L93).
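To make the idea concrete, here is a minimal sketch of the scoring logic in plain Java. It is not Nutch code: `UrduHostScorer`, `record`, `score` and `select` are hypothetical names, and I am assuming your customized fetcher can report, per URL, whether the page turned out to be Urdu. Each host gets a smoothed ratio of Urdu pages to fetched pages, and candidates below a threshold are skipped. Hosts never seen before start at 0.5, so new domains can still be explored. In a real setup this logic would probably live in a scoring plugin, with the threshold applied by the generator (in Nutch 1.x that is, as far as I know, the `generate.min.score` property; please verify whether 2.3.1 honors it).

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical helper, not part of Nutch: tracks how often each host yielded Urdu pages. */
class UrduHostScorer {
    // host -> {pages fetched, pages detected as Urdu}
    private final Map<String, int[]> stats = new HashMap<>();

    /** Record the outcome of one fetch (assumed to come from the customized fetcher). */
    void record(String url, boolean isUrdu) {
        int[] s = stats.computeIfAbsent(hostOf(url), h -> new int[2]);
        s[0]++;
        if (isUrdu) {
            s[1]++;
        }
    }

    /** Smoothed Urdu ratio in (0, 1); unseen hosts get 0.5 so new domains are still tried. */
    double score(String url) {
        int[] s = stats.getOrDefault(hostOf(url), new int[2]);
        return (s[1] + 1.0) / (s[0] + 2.0);
    }

    /** Keep only candidates whose score reaches minScore, mimicking a generator threshold. */
    List<String> select(List<String> candidates, double minScore) {
        List<String> selected = new ArrayList<>();
        for (String url : candidates) {
            if (score(url) >= minScore) {
                selected.add(url);
            }
        }
        return selected;
    }

    // Falls back to the raw string when the URL has no parseable host.
    private static String hostOf(String url) {
        String host = URI.create(url).getHost();
        return host == null ? url : host.toLowerCase(Locale.ROOT);
    }
}
```

The smoothing is a design choice rather than anything Nutch prescribes: a host needs several non-Urdu fetches before it drops below the threshold, so one bad page does not cut off an otherwise useful domain.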

Of course, all of this comes with certain concerns: at some point you could run out of URLs to fetch, either because the generator didn't find any suitable candidates (because of the score threshold, or because there are no more Urdu URLs to fetch), or because all the URLs you've discovered have already been fetched.

It is usually a good idea to plan for these edge cases.