I have configured Nutch 2.3.1 with the Hadoop/HBase ecosystem to crawl Urdu-language content. For language detection, I have customized the fetcher so that it detects the language at fetch time. If a document does not contain enough Urdu text (measured in bytes), I deliberately set its status to gone, so that the crawl does not keep growing along that edge with null content. I also need to discover new Urdu domains.
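To make the check described above concrete, here is a simplified sketch of that kind of test. It is not the actual fetcher code: the class name `UrduRatio`, the 0.5 threshold, and counting letters instead of bytes are assumptions made for illustration. It also treats any Arabic-script letter as Urdu, so it would not tell Urdu apart from Arabic or Persian.

```java
// Hypothetical helper, for illustration only (not part of Nutch).
final class UrduRatio {
    // Assumed threshold: at least half of the letters must be Arabic-script.
    static final double MIN_URDU_RATIO = 0.5;

    // Returns the share of letters in the text that belong to the Arabic script.
    static double urduRatio(String text) {
        int letters = 0;
        int arabicScript = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (!Character.isLetter(cp)) {
                continue; // skip digits, whitespace, punctuation
            }
            letters++;
            if (Character.UnicodeScript.of(cp) == Character.UnicodeScript.ARABIC) {
                arabicScript++;
            }
        }
        return letters == 0 ? 0.0 : (double) arabicScript / letters;
    }

    // True if the text passes the assumed threshold.
    static boolean hasEnoughUrdu(String text) {
        return urduRatio(text) >= MIN_URDU_RATIO;
    }
}
```

A page for which `hasEnoughUrdu` returns false is the kind of page whose status I set to gone.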
I am still facing a problem with the selection of URLs for fetching. As time passes, the inlinks data grows and includes many URLs that are not in Urdu. About 90% of the URLs selected by the Nutch Generator do not have Urdu content. As a result, my resources are wasted, because only a very small amount of new Urdu content is fetched.
How can I tell Nutch to prefer documents from domains that are likely to have Urdu content? I think I have to customize the ranking algorithm somehow. What are the possible ways to achieve this?