Apache Nutch: crawl only new pages for semantics analysis

Question

I plan to tune Nutch 2.2.x so that, after an initial crawl of a list of sites, I launch the crawler daily and get the HTML or plain text of only the new pages that appeared on those sites that day. Number of sites: hundreds.

Please note that I'm not interested in updated pages, only new ones. Also, I need new pages only starting from a certain date. Let's suppose it is the date of the initial crawl.

After reading the documentation and searching the Web, I have the following questions that I can't find answered anywhere else:

Which backend should I use with Nutch for my task? I need each page's text only once, and then I never return to it. MySQL doesn't seem to be an option, as it is no longer supported by Gora. I tried HBase, but it seems I have to roll back to Nutch 2.1.x to get it working correctly. What are your ideas? How can I minimize disk space and other resource usage?
Can I perform my task without an indexing engine like Solr? I'm not sure I need to store large full-text indexes. Can Nutch 2.2 and later be launched without Solr, and does it need specific options to run that way? The tutorials don't explain this clearly: everybody needs Solr except me.
If I'd like to add a site to the crawl list, what is the best way to do it? Let's suppose I am already crawling a list of sites and want to add a site to monitor from now on. So I need to crawl the new site, skipping page content, to add it to the WebDB, and then run the daily crawl as usual. With Nutch 1.x it may be possible to perform separate crawls and then merge them. What might this look like in Nutch 2.x?
Can this task be performed without custom plugins, and can it be performed with Nutch at all? Perhaps I could write a custom plugin that somehow detects whether a page is already indexed or is new, in which case its content should be written to XML, a database, etc. Should I write the plugin at all, or is there a less painful way to solve the task? And what might the plugin's algorithm look like if there is no way to do without it?

P.S. There are a lot of Nutch questions, answers, and tutorials, and I honestly searched the Web for 2 weeks, but I haven't found answers to the questions above.

I am trying to solve something similar. Please update if you have come across any solution. — bhalkian

NH Narumi · Accepted Answer · 2015-04-02T07:16:46

I'm not using Solr either. I just checked this documentation: https://today.java.net/pub/a/today/2006/01/10/introduction-to-nutch-1.html

It seems there are commands that can show the fetched data using the WebDB. I'm new to Nutch and am just following this documentation. Check it out.
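
For the "new pages only" part, one possible approach that avoids a custom plugin is to post-process each day's fetched data outside Nutch. The following is only a minimal Python sketch of that idea, not something taken from the Nutch documentation. It assumes you have already exported the current run into a mapping of URL to page text (how you get that out of Nutch is not shown here), and every name in it (load_seen, select_new_pages, daily_run, the seen.json file) is hypothetical.

```python
import json
from pathlib import Path


def load_seen(path):
    """Return the set of URLs recorded by earlier runs (empty on the first run)."""
    p = Path(path)
    if not p.exists():
        return set()
    return set(json.loads(p.read_text(encoding="utf-8")))


def save_seen(path, seen):
    """Persist the set of known URLs as a sorted JSON list."""
    Path(path).write_text(json.dumps(sorted(seen)), encoding="utf-8")


def select_new_pages(crawled, seen):
    """Keep only the pages whose URL has not been seen in any earlier run.

    `crawled` is assumed to map URL -> page text for the current run.
    """
    return {url: text for url, text in crawled.items() if url not in seen}


def daily_run(crawled, seen_path, baseline=False):
    """Return the pages that are new in this run and remember all crawled URLs.

    With baseline=True (the initial crawl, or the first crawl of a newly
    added site) every URL is recorded as seen but no content is returned.
    """
    seen = load_seen(seen_path)
    new_pages = {} if baseline else select_new_pages(crawled, seen)
    save_seen(seen_path, seen | set(crawled))
    return new_pages
```

Under these assumptions, the initial crawl (or the first crawl of a site added later) would be passed through daily_run with baseline=True, so its pages are remembered but not reported. Every later daily run then returns only URLs that were never seen before. Updated pages keep the same URL, so they are skipped, which matches what the question asks for.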
