I would like to use Apache Nutch in my Java application to crawl web pages from one or more websites. Basically, I need to call a method of my Java application for each web page found by the crawler, in order to process the page content (text, etc.). How can I achieve this?
1 Answer
Well, your question appears to be an "XY problem". Nutch can be used as a library in your custom Java application: the bin/nutch and bin/crawl scripts basically just execute several Java classes with the right parameters, so your application can call those same classes with the right parameters itself. Taking a look at the bin/crawl script will give you the right sequence of steps (and classes) to call for a full crawl cycle. This approach should only be used for small crawls.
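To make the idea concrete, here is a minimal dry-run sketch of one crawl cycle as bin/crawl drives it. The main class names are the real Nutch 1.x entry points, but the paths (crawl/crawldb, urls, the segment directory) are hypothetical placeholders; the sketch only assembles and prints the per-step invocations so it runs without Nutch installed. In a real application, with Nutch on the classpath, you would hand each class and its arguments to Hadoop's ToolRunner instead of printing:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of one crawl cycle as driven by bin/crawl. The class names are the
// actual Nutch 1.x entry points; the paths are hypothetical placeholders.
public class CrawlCycleSketch {

    // Returns the Nutch main class for each step, mapped to example arguments.
    static Map<String, String[]> crawlSteps(String crawlDb, String seedDir, String segment) {
        Map<String, String[]> steps = new LinkedHashMap<>();
        steps.put("org.apache.nutch.crawl.Injector",     new String[] { crawlDb, seedDir });
        steps.put("org.apache.nutch.crawl.Generator",    new String[] { crawlDb, "crawl/segments" });
        steps.put("org.apache.nutch.fetcher.Fetcher",    new String[] { segment });
        steps.put("org.apache.nutch.parse.ParseSegment", new String[] { segment });
        steps.put("org.apache.nutch.crawl.CrawlDb",      new String[] { crawlDb, segment }); // updatedb
        return steps;
    }

    public static void main(String[] args) {
        // Dry run: print each step. With Nutch on the classpath you would instead
        // call something like ToolRunner.run(NutchConfiguration.create(),
        // new Injector(), stepArgs) for each class, checking exit codes as you go.
        crawlSteps("crawl/crawldb", "urls", "crawl/segments/20240101000000")
            .forEach((cls, a) -> System.out.println(cls + " " + String.join(" ", a)));
    }
}
```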
Now, going back to the XY problem: if all you need is to extract custom text/metadata from the web pages, you could just extend Nutch itself without writing your own application. From what you describe, it looks like you are after a custom parser/indexing plugin. If that is the case, I recommend taking a look at the headings plugin (https://github.com/apache/nutch/tree/master/src/plugin/headings), which is a very good starting point for writing your own HtmlParseFilter plugin. You'll still need to write custom code, but it will be contained in a Nutch plugin.
Also, you could check out https://issues.apache.org/jira/browse/NUTCH-1870: that plugin lets you extract custom portions of the HTML using XPath expressions.
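To illustrate the kind of extraction such a plugin performs, here is a minimal, self-contained sketch using only the JDK's built-in XPath support (javax.xml.xpath). It works on well-formed XHTML; the plugin itself runs inside Nutch's parse chain and handles real-world HTML, so treat this as an illustration of the technique, not the plugin's actual code:

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class XPathExtractSketch {

    // Evaluates an XPath expression against well-formed (X)HTML and returns
    // the text content of the first matching node, or null if nothing matched.
    static String extractFirst(String xhtml, String expr) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
        NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate(expr, doc, XPathConstants.NODESET);
        return nodes.getLength() > 0 ? nodes.item(0).getTextContent() : null;
    }

    public static void main(String[] args) throws Exception {
        String page = "<html><body><h1>Hello</h1>"
                + "<p class=\"lead\">First paragraph</p></body></html>";
        System.out.println(extractFirst(page, "//h1"));               // Hello
        System.out.println(extractFirst(page, "//p[@class='lead']")); // First paragraph
    }
}
```

In a real plugin you would evaluate expressions like these against the parsed page and store the results as parse metadata, so they end up in the index alongside the page.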