
I'm new to using Nutch, and I want to crawl the whole seed list that I have as input.

First: I used the script: bin/crawl -i -D elastic.server.url=http://localhost:9200/index_name/ urls ksu_Crawldb/ 30

with: 2 CPUs and 7.5 GB of memory.

But after 2 days it had only fetched 63,500 documents, and CPU usage stayed around 50% and was not sustained over time.


I want to know how to fetch the maximum number of documents in a short time.

Second: what is the difference between topN, depth, and rounds?

Thanks for any help.

1 Answer


I recently published some benchmarks on Nutch with an explanation of why resources are not used at their maximum continuously. Basically, Apache Nutch is based on Hadoop and as such is batch-driven: the different operations are carried out in succession. See also this Q&A.

There are various ways in which performance can be tuned, but the key factors are simply the diversity of the hosts you are fetching from and the politeness settings.
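For example, the politeness settings live in conf/nutch-site.xml. The property names below are genuine Nutch settings, but the values are purely illustrative and should be tuned for your own crawl:

```xml
<!-- Illustrative overrides for conf/nutch-site.xml; values are examples, not recommendations. -->
<property>
  <name>fetcher.server.delay</name>
  <value>1.0</value>
  <description>Seconds to wait between successive requests to the same host.</description>
</property>
<property>
  <name>fetcher.threads.fetch</name>
  <value>50</value>
  <description>Total number of fetcher threads.</description>
</property>
<property>
  <name>fetcher.threads.per.queue</name>
  <value>1</value>
  <description>Threads allowed to fetch from the same host queue in parallel;
  raising this trades politeness for throughput.</description>
</property>
```

Note that if your seed list is dominated by a few hosts, per-host politeness will cap throughput no matter how many threads you configure.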

Second: what is the difference between topN, depth, and rounds?

topN is the number of URLs selected for fetching in each round, based on their score.

depth is the number of outlinks from the seeds needed to reach a particular URL.

rounds is the number of fetch/parse/update iterations.

depth and rounds are often the same, but not necessarily.
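To make the relationship concrete, one round of the crawl script roughly corresponds to this sequence of Nutch commands (the crawldb/ and segments/ paths are illustrative; the actual segment name is a timestamp created by the generate step):

```shell
# One round of crawling, sketched; crawldb/ and segments/ are example paths.
bin/nutch generate crawldb/ segments/ -topN 50000  # select at most topN URLs by score
SEGMENT=segments/$(ls -t segments/ | head -1)      # newest segment created by generate
bin/nutch fetch "$SEGMENT"                         # fetch the selected URLs
bin/nutch parse "$SEGMENT"                         # parse pages, extract outlinks
bin/nutch updatedb crawldb/ "$SEGMENT"             # merge results back into the crawldb
```

Since each round can only discover pages one link further from the seeds, a URL at depth N needs at least N rounds before it can be fetched, which is why the two numbers often coincide.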