I've set up Nutch and given it a seed list of URLs to crawl. I configured it so that it will not crawl anything outside of my seed list. The seed list contains ~1.5 million URLs. I followed the guide and kicked off Nutch like so:
bin/nutch inject crawl/crawldb urls
bin/nutch generate crawl/crawldb crawl/segments
s1=`ls -d crawl/segments/2* | tail -1`
bin/nutch fetch $s1
bin/nutch parse $s1
bin/nutch invertlinks crawl/linkdb -dir crawl/segments
bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb -linkdb crawl/linkdb $s1 -addBinaryContent -base64
Aside: I really wish I knew how to crawl and index at the same time (e.g., crawl a page -> index it, crawl the next page), because right now I have to wait for the entire crawl to finish before anything is indexed at all.
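From what I understand of the generate/fetch/parse cycle, maybe I could approximate this by running smaller rounds and indexing after each one, something like the sketch below, but I haven't actually tried it (the -topN value is an arbitrary batch size I made up, and the updatedb step is taken from the tutorial):

# Hypothetical: run smaller rounds and index after each one so results
# show up in Solr incrementally instead of only after the whole crawl.
for round in 1 2 3; do
    bin/nutch generate crawl/crawldb crawl/segments -topN 10000
    s=`ls -d crawl/segments/2* | tail -1`
    bin/nutch fetch $s
    bin/nutch parse $s
    bin/nutch updatedb crawl/crawldb $s
    bin/nutch invertlinks crawl/linkdb -dir crawl/segments
    bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb -linkdb crawl/linkdb $s -addBinaryContent -base64
done

Is that the right idea, or is there a supported way to index as the crawl progresses?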
Anyway, right now, from checking the hadoop.log, I believe I've crawled about 40k links in 48 hours. However, I'd like to make sure that it's grabbing all the content correctly. I'd also like to see which links have been crawled, and which links are left. I've read all the documentation and I can't seem to figure out how to get the status of a Nutch crawl unless it was started as a job.
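The closest things I've spotted in bin/nutch's command listing are readdb and readseg, so I'm guessing something like the following would at least show counts of fetched vs. unfetched URLs and list what's in a segment, but I'm not sure this is the intended way to monitor a crawl (crawl/dump_dir is just a placeholder path I picked):

# Guess: print overall crawldb statistics (fetched vs. unfetched counts?)
bin/nutch readdb crawl/crawldb -stats

# Guess: dump the crawldb so I can see each URL and its status
bin/nutch readdb crawl/crawldb -dump crawl/dump_dir

# Guess: list generated/fetched/parsed counts for the segment I'm fetching
bin/nutch readseg -list $s1

If these aren't the right tools, what should I be using to check progress?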
I'm running Nutch 1.10 with Solr 4.10.