Apache Nutch 2.3.1 checkingpointing not working

Question

I have configured apache Nutch 2.3.1 with single node cluster (Hadoop 2.7.x and hbase 1.2.6). I have to check its checkpointing feature. according to my information, resuming is available in Fetch and parse. I assume that at any stage during fetching (or parsing), my complete cluster goes down due to some problem e.g. power failure. I assume that when I restart cluster and crawler with -resume flag, it should start to fetch only those URLs that was not fetched.

But what I observed is that (with debug enabled) that it start to refetch all URLs (with same batchID) till end even with resume flag. Resume flag only works when a job (e.g. fetch) was completely finished. I have cross checked it from its logs with a message like "Skipping express.pk; already fetched".

Does my interprestation is no correct about resume option in Nutch?

or there is some problem in cluster/configuration ?

Jorge Luis Jorge Luis · Accepted Answer · 2018-03-05T11:06:11

Your interpretation is right. Also, in this case the output from Nutch (logs) is also correct.

If you check the code on https://github.com/apache/nutch/blob/release-2.3.1/src/java/org/apache/nutch/fetcher/FetcherJob.java#L119-L124, Nutch is only logging that is skipping that URL because it was already fetched. Since Nutch works in batches, in needs to check all the URLs on the same batchId, but if you specify the resume flag then (only on DEBUG) will log that it is skipping certain URLs. This is done mainly for troubleshooting if you have some issue.

This happens Nutch doesn't keep a record of the last processed URL, it needs to start at the beginning of the same batch and work its way from there. Even knowing the last URL its not enough because, you would also need the position of that URL in the batch.

Apache Nutch 2.3.1 checkingpointing not working

1 Answers