Nutch does not have any default configuration to achieve your task.
There are several places you can tune, such as changing the code of the plugins that parse the HTML and extract links (parse-html, parse-tika, etc.)
(OR) changing the Mapper code of the Parse phase.
(OR)
you can add the following regex to regex-urlfilter.txt (remember to disable URL filtering in the injection phase, because the input seed URLs might not have language information in their path).
-(?i).*?//.*?[/?].*?(?<=[/])(urdu)([/?.]|$).*
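To see which URLs that rule would reject, you can try the expression on a few sample URLs before putting it in the file. This is only a rough sketch: the class name, method name, and sample URLs are made up for illustration, and it assumes the rule is applied with a "find"-style match on the whole URL string. The leading `-` is the exclude sign of the filter file and is not part of the regex itself.

```java
import java.util.regex.Pattern;

class UrduUrlRuleCheck {
    // Same expression as the rule above, without the leading '-' (the exclude sign).
    static final Pattern URDU_PATH = Pattern.compile(
        "(?i).*?//.*?[/?].*?(?<=[/])(urdu)([/?.]|$).*");

    // True if the exclude rule would reject this URL (assumes a find()-style match).
    static boolean isExcluded(String url) {
        return URDU_PATH.matcher(url).find();
    }

    public static void main(String[] args) {
        // Hypothetical sample URLs, only for trying the rule out.
        String[] samples = {
            "https://example.com/urdu/news",
            "https://example.com/URDU",
            "https://example.com/en/news",
            "https://example.com/urdupoint/x"
        };
        for (String url : samples) {
            System.out.println(url + " excluded=" + isExcluded(url));
        }
    }
}
```

The rule only fires when `urdu` is a whole path segment (followed by `/`, `?`, `.` or the end of the URL), so a path such as `/urdupoint/x` is not rejected.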
But I would prefer the following way.
In Nutch 1.16 you can customize the code of ParseOutputFormat, which provides the RecordWriter used in the Parse Reducer phase of ParseSegment.
What happens in ParseOutputFormat?
If you check the RecordWriter implementation returned by the getRecordWriter method,
it basically gets all the outlinks from a given page, keeps at most
db.max.outlinks.per.page URLs per page, scores them using the OPIC scoring filter, creates a CrawlDatum with the
necessary status for each of them, and saves them to the Nutch DB. (Note: it also applies a number
of filters to the extracted URLs and normalizes them, based on your
nutch-site conf or the default values.)
Look at this particular line of code inside getRecordWriter:
Outlink[] links = parseData.getOutlinks(); // this returns the outlinks extracted from the page
Replace the above line with something like this:
Outlink[] links = filter(parseData.getOutlinks(),langValue);
You can write a custom filter method that returns only those outlinks which do not have the corresponding langValue in their path.
langValue --> you can hard-code this value directly (OR)
you can define a property (like allowed.lang.per.page) in nutch-site.xml, read it in the getConf method, and use it inside the filter method.
If you want to handle multiple langValues,
pass comma-separated values, split them when reading the property, and customize your filter method accordingly.
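A minimal sketch of such a filter method, under a few assumptions: it works on plain URL strings so that it can be run without Nutch on the classpath (inside ParseOutputFormat you would apply the same check to each outlink's target URL, e.g. via Outlink.getToUrl(), and read the property with something like conf.get("allowed.lang.per.page")), it treats a langValue as matching only when it equals a whole path segment, and the class and helper names are hypothetical.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class LangPathFilter {

    // Splits a comma-separated property value such as "urdu, hindi" into a lowercase set.
    static Set<String> parseLangs(String csv) {
        Set<String> langs = new HashSet<>();
        if (csv == null) {
            return langs;
        }
        for (String part : csv.split(",")) {
            String lang = part.trim().toLowerCase(Locale.ROOT);
            if (!lang.isEmpty()) {
                langs.add(lang);
            }
        }
        return langs;
    }

    // True if any path segment of the URL equals one of the language values.
    static boolean hasLangSegment(String url, Set<String> langs) {
        String path;
        try {
            path = new URI(url).getPath();
        } catch (URISyntaxException e) {
            return false; // unparsable URL: leave it to the regular URL filters
        }
        if (path == null) {
            return false;
        }
        for (String segment : path.split("/")) {
            if (langs.contains(segment.toLowerCase(Locale.ROOT))) {
                return true;
            }
        }
        return false;
    }

    // Keeps only the URLs whose path has no segment matching a language value.
    static List<String> filter(List<String> urls, Set<String> langs) {
        List<String> kept = new ArrayList<>();
        for (String url : urls) {
            if (!hasLangSegment(url, langs)) {
                kept.add(url);
            }
        }
        return kept;
    }
}
```

For example, with the property value "urdu, hindi", a list containing https://example.com/urdu/a, https://example.com/en/a and https://example.com/Hindi would be reduced to just the /en/a URL. Unlike the regex above, this sketch does not treat a segment like urdu.html as a match; adjust hasLangSegment if you need that.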