0 votes

I've set up Nutch and given it a seed list of URLs to crawl. I configured it so that it will not crawl anything outside of my seed list. The seed list contains ~1.5 million URLs. I followed the guide and kicked off Nutch like so:

bin/nutch inject crawl/crawldb urls
bin/nutch generate crawl/crawldb crawl/segments
s1=`ls -d crawl/segments/2* | tail -1`
bin/nutch fetch $s1
bin/nutch parse $s1
bin/nutch invertlinks crawl/linkdb -dir crawl/segments
bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb -linkdb crawl/linkdb $s1 -addBinaryContent -base64

Aside: I really wish I knew how to crawl and index at the same time (e.g., crawl a page -> index it -> crawl the next page), because currently I have to wait for the entire crawl to finish before anything is indexed at all.

Anyway, right now, from checking the hadoop.log, I believe I've crawled about 40k links in 48 hours. However, I'd like to make sure that it's grabbing all the content correctly. I'd also like to see which links have been crawled, and which links are left. I've read all the documentation and I can't seem to figure out how to get the status of a Nutch crawl unless it was started as a job.

I'm running Nutch 1.10 with Solr 4.10.

1

> how to crawl and index at the same time

Have a look at [github.com/DigitalPebble/storm-crawler]. Nutch is batch driven and does everything step by step. [digitalpebble.blogspot.co.uk/2015/09/…] contains a comparison between Nutch and SC that you might find useful. +1 to what Sujen suggested about the nutch readdb command. You can specify a given URL to get its status, but as he pointed out this will be updated at the end of a crawl iteration only. – Julien Nioche

1 Answer

4 votes

As of now, apart from the logs, there is no way to see the status of a crawl while it is being fetched. You can query the crawldb only after the fetch-parse-updatedb jobs are over.

And I think you are missing the bin/nutch updatedb job before running bin/nutch solrindex.
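A minimal sketch of where that step would fit, reusing the crawl/crawldb layout and the $s1 segment variable from your own commands (the updatedb job folds the fetched segment back into the crawldb, which is also what the -stats report below reads from):

bin/nutch fetch $s1
bin/nutch parse $s1
# update the crawldb with the results of this segment before indexing
bin/nutch updatedb crawl/crawldb $s1
bin/nutch invertlinks crawl/linkdb -dir crawl/segments
bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb -linkdb crawl/linkdb $s1 -addBinaryContent -base64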

From what you have described, it seems like you are not using the ./bin/crawl script but are calling each job individually.

For a crawl as large as yours, one approach is to use the ./bin/crawl script, which by default generates 50k URLs for fetching per iteration. After every iteration you can run the:

./bin/nutch readdb <crawl_db> -stats

command given at https://wiki.apache.org/nutch/CommandLineOptions to check the crawldb status.
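If you want the status of an individual URL (as Julien mentions in the comments), readdb can also be pointed at a single entry. A quick sketch, with http://example.com/ standing in for one of your seed URLs:

./bin/nutch readdb <crawl_db> -url http://example.com/

This prints the CrawlDatum for that URL (status, fetch time, etc.), but again it only reflects the state as of the last updatedb.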

If you want to check progress more frequently, lower the '-topN' parameter (which is passed to the generate job) in the ./bin/crawl script. By varying the number of iterations you should then be able to crawl your entire seed list.
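For reference, a rough sketch of what the invocation could look like, assuming the Nutch 1.10 crawl script still takes the Solr URL as a positional argument and reusing the seed directory, crawl directory and Solr URL from your commands (30 rounds is only an illustrative figure; at 50k URLs per round you would need on the order of 30 iterations to cover ~1.5 million seeds):

./bin/crawl urls crawl http://127.0.0.1:8983/solr/ 30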

Hope this helps :)