Restrict Nutch to Seed path and its following webpages only

0

votes

I have set up Nutch 2.x to crawl a few multilingual domains. I can restrict Nutch to internal links only, but not to subfolders. For example, for the following seed,

https://www.bbc.com/urdu

I only want to crawl URLs under /urdu, because this website also contains pages in other languages. How can I configure or customize Nutch to handle such cases?

web-crawler nutch nutch2

2

votes

You can edit the conf/regex-urlfilter.txt file. A comment at the bottom of the file says "accept anything else". If you replace the +. rule below it with a regex that matches the URLs you want, everything else will be dropped. For example, you may want: +.*\/urdu\/.* (note that this pattern requires a slash after urdu, so it does not match the bare seed https://www.bbc.com/urdu).
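To see how such a rule behaves before editing the file, the first-match-wins evaluation can be imitated in plain Java. This is only a sketch: UrlRuleSketch is a hypothetical stand-in, not a Nutch class; it assumes rules are applied with unanchored (find-style) matching; the anchored rule and the /urdu/pakistan and /hindi/india URLs are made up for illustration.

```java
import java.util.List;
import java.util.regex.Pattern;

/** Illustrative stand-in for first-match-wins URL filtering; not Nutch's own class. */
final class UrlRuleSketch {

    /**
     * Applies rules in order. Each rule is '+' (accept) or '-' (reject)
     * followed by a regex; the first rule whose regex is found in the URL
     * decides. A URL that matches no rule is rejected.
     */
    static boolean accepts(List<String> rules, String url) {
        for (String rule : rules) {
            boolean accept = rule.charAt(0) == '+';
            Pattern pattern = Pattern.compile(rule.substring(1));
            if (pattern.matcher(url).find()) {
                return accept;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Hypothetical rule anchored to the seed's host and path.
        List<String> anchored = List.of("+^https?://www\\.bbc\\.com/urdu(/|$)");
        // The rule suggested in the answer, written as a Java string literal.
        List<String> loose = List.of("+.*\\/urdu\\/.*");

        for (String url : List.of(
                "https://www.bbc.com/urdu",
                "https://www.bbc.com/urdu/pakistan",
                "https://www.bbc.com/hindi/india")) {
            System.out.println(url
                    + " anchored=" + accepts(anchored, url)
                    + " loose=" + accepts(loose, url));
        }
    }
}
```

If the bare seed has to pass the filter as well, a rule anchored on the host and path, like the hypothetical one above, may be a better fit than a pattern that insists on a trailing slash.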

2

votes

Nutch does not have a default configuration that achieves this.

There are several places you can tune, such as changing the code of the plugins that parse HTML and extract links (parse-html, parse-tika, etc.) (OR) changing the Mapper code of the parse phase.

(OR)

You can add the following regex to regex-urlfilter.txt (note that you should disable the URL filter in the injection phase, because the input seed might not have language information in its URL path).

-(?i).*?//.*?[/?].*?(?<=[/])(urdu)([/?.]|$).*
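Because the expression is fairly dense, it can help to try it against a few URLs in plain Java before putting it in the filter file. The sketch below uses the expression without its leading rule prefix; LangPathPatternSketch is a hypothetical helper and every URL except the seed is invented for illustration. Keep in mind that in regex-urlfilter.txt a leading - rejects the URLs a pattern matches, while a leading + accepts them.

```java
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical helper for trying the language-path expression outside Nutch. */
final class LangPathPatternSketch {

    // The expression from the answer, without the leading '-' rule prefix.
    static final Pattern URDU_PATH =
            Pattern.compile("(?i).*?//.*?[/?].*?(?<=[/])(urdu)([/?.]|$).*");

    /** True if the URL has "urdu" as a whole path segment, ignoring case. */
    static boolean matches(String url) {
        return URDU_PATH.matcher(url).find();
    }

    public static void main(String[] args) {
        for (String url : List.of(
                "https://www.bbc.com/urdu",
                "https://www.bbc.com/URDU/regional",
                "https://www.bbc.com/hindi/india",
                "https://www.bbc.com/news/urdu-poetry")) {
            System.out.println(url + " -> " + matches(url));
        }
    }
}
```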

But I would prefer the following way.

In Nutch 1.16, you can customize the code of ParseOutputFormat, which is used as a RecordWriter in the reducer phase of ParseSegment.

What happens in ParseOutputFormat?

If you look inside the getRecordWriter method, the RecordWriter implementation basically gets all the outlinks from a given page, keeps at most db.max.outlinks.per.page URLs per page, scores them using the OPIC scoring filter, creates a CrawlDatum with the necessary status for each, and saves them to the Nutch DB. (Note: it also applies a number of filters to the extracted links and normalizes them, based on your nutch-site configuration or the default values.)

Look at this particular line of code inside getRecordWriter:

Outlink[] links = parseData.getOutlinks(); // this returns the page's outlinks

Replace it with something like this:

Outlink[] links = filter(parseData.getOutlinks(), langValue);

You can write a custom filter method that returns only those outlinks whose path contains the corresponding langValue, so that links to pages in other languages are dropped.

langValue --> you can hard-code this value directly (OR) define a property (such as allowed.lang.per.page) in nutch-site.xml, read it in the getConf method, and use it inside the filter method.

If you want to allow multiple langValues, pass them as comma-separated values, split them when reading the property, and customize your filter method accordingly.
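A minimal sketch of such a filter method is shown below. To stay self-contained it works on plain URL strings rather than Nutch's Outlink objects (inside ParseOutputFormat you would read each URL with Outlink.getToUrl()), and it keeps only links that have one of the allowed values as a path segment, which is what the question asks for. The class name LangOutlinkFilterSketch and the example URLs are hypothetical, and matching on whole path segments is an assumption about what "in its path" should mean.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.stream.Collectors;

/** Hypothetical outlink filter; operates on URL strings instead of Outlink. */
final class LangOutlinkFilterSketch {

    /** Parses a comma-separated value such as "urdu,hindi" into a set. */
    static Set<String> parseAllowed(String langValues) {
        return Arrays.stream(langValues.split(","))
                .map(s -> s.trim().toLowerCase(Locale.ROOT))
                .filter(s -> !s.isEmpty())
                .collect(Collectors.toSet());
    }

    /** Keeps only URLs with at least one path segment equal to an allowed value. */
    static List<String> filter(List<String> outlinkUrls, String langValues) {
        Set<String> allowed = parseAllowed(langValues);
        List<String> kept = new ArrayList<>();
        for (String url : outlinkUrls) {
            if (hasAllowedSegment(url, allowed)) {
                kept.add(url);
            }
        }
        return kept;
    }

    private static boolean hasAllowedSegment(String url, Set<String> allowed) {
        String path;
        try {
            path = URI.create(url).getPath();
        } catch (IllegalArgumentException e) {
            return false; // malformed URL: drop it
        }
        if (path == null) {
            return false; // opaque URI without a path
        }
        for (String segment : path.split("/")) {
            if (allowed.contains(segment.toLowerCase(Locale.ROOT))) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> links = List.of(
                "https://www.bbc.com/urdu/pakistan",
                "https://www.bbc.com/hindi/india",
                "https://www.bbc.com/urdu");
        System.out.println(filter(links, "urdu"));
    }
}
```

Passing "urdu,hindi" as the second argument would keep links for both languages, which corresponds to the comma-separated property value described above.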
