Nutch cannot find outlinks for this kind of website

Question

I am a beginner at web crawling. So far I have tried crawler4j on static websites.

Now I would like to crawl this website (https://weedmaps.com/brands) with Nutch + HBase + Solr, but I can't get any further.

I have tried other websites, such as http://sports.sina.com.cn, and there I can index the information into Solr.

For https://weedmaps.com/brands, the page source doesn't contain any explicit outlinks. How can I crawl it? Can anybody suggest tools or articles, or explain why Nutch doesn't work here?

Thank you so much.

Jorge Luis · Accepted Answer · 2018-02-13T08:51:13

The problem is that https://weedmaps.com/brands is built with AngularJS, meaning the page is rendered almost entirely by JavaScript in the browser and the HTML the server actually sends contains very little. You can see this by fetching the page with curl and looking at the source. By default, Nutch relies only on the HTML sent by the server and doesn't do any client-side processing (such as interpreting JavaScript), so it finds no outlinks to follow.
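To make the difference concrete, here is a minimal Python sketch of what a parser that only sees the server's HTML can extract. The two sample pages are made up for illustration (they are not the real markup of either site), and extract_outlinks is a hypothetical helper, not part of Nutch. It only mimics the idea of collecting href values from anchor tags without running any JavaScript.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects href values from <a> tags; no JavaScript is executed."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_outlinks(html):
    """Return the outlinks visible in the raw HTML (hypothetical helper)."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links


# Made-up example of a server-rendered page: links are in the HTML itself.
STATIC_PAGE = (
    '<html><body><a href="/news/1">One</a> '
    '<a href="/news/2">Two</a></body></html>'
)

# Made-up example of an AngularJS-style shell: the links only appear
# after the browser runs /app.js, so the raw HTML has no <a> tags.
JS_SHELL = (
    '<html><body><div ng-app="app"><div ng-view></div></div>'
    '<script src="/app.js"></script></body></html>'
)

print(extract_outlinks(STATIC_PAGE))
print(extract_outlinks(JS_SHELL))
```

The first page yields its two links, while the JavaScript shell yields none, which is the situation a crawler is in when it reads only the raw HTML of a page like this.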

You can take a look at https://github.com/apache/nutch/tree/master/src/plugin/protocol-selenium and configure that protocol plugin. In this case, Nutch will pipe the HTML through Selenium (which is able to interpret JavaScript) and then send the resulting HTML down the normal Nutch pipeline.
