I have just started using Nutch 1.9 and Solr 4.10.
After browsing a few pages, I see that the syntax for running this version has changed, and that I have to update certain XML files to configure Nutch and Solr.
This version of the package doesn't require Tomcat to run. I started Solr:
java -jar start.jar
and checked localhost:8983/solr/admin; it's working.
I put a seed URL, "simpleweb.org", in bin/url/seed.txt.
Then I ran this command in Nutch: ./crawl urls -dir crawl -depth 3 -topN 5
I got a few IO exceptions partway through, so to avoid them I downloaded patch-hadoop_7682-1.0.x-win.jar, added an entry for it in nutch-site.xml, and placed the jar file in Nutch's lib folder.
After running Nutch, the following folders were created:
apache-nutch-1.9\bin\-dir\crawldb\current\part-00000
I can see the following files in that path:
data
index
.data.crc
.index.crc
I want to know what to do with these files. What are the next steps? Can we view these files? If yes, how?
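For reference, Nutch 1.x ships a `readdb` tool that reads these crawldb files and prints them as plain text. The following is a minimal sketch, not a verified recipe for this setup: it assumes the commands are run from the Nutch install directory (apache-nutch-1.9) and that the crawl output really sits in `bin/-dir`, as the path above suggests. The dump folder name `crawldb_dump` is hypothetical.

```shell
#!/bin/sh
# Assumptions: run from the Nutch install directory (apache-nutch-1.9), and the
# crawl output sits in "bin/-dir" as the path in the question suggests.
NUTCH="bin/nutch"
CRAWLDB="bin/-dir/crawldb"
DUMP_DIR="crawldb_dump"   # hypothetical output folder; must not exist yet

if [ -x "$NUTCH" ] && [ -d "$CRAWLDB" ]; then
    # Summary of the crawldb: number of URLs and counts per fetch status.
    "$NUTCH" readdb "$CRAWLDB" -stats
    # Plain-text dump of every crawldb entry into $DUMP_DIR.
    "$NUTCH" readdb "$CRAWLDB" -dump "$DUMP_DIR"
else
    echo "Nutch or crawldb not found; adjust NUTCH and CRAWLDB first."
fi
```

The dump can then be opened in any text editor. The `data` and `index` files themselves are binary Hadoop files, so they are not meant to be opened directly.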
I indexed the crawled data from Nutch into Solr.
To link Solr with Nutch I used this command, which completed successfully: ./crawl urls solr http://localhost:8983/solr/ -depth 3 -topN 5
Why do we need to index the data crawled by Nutch into Solr?
After crawling with Nutch (command used: ./crawl urls -dir crawl -depth 3 -topN 5), can we view the crawled data? If yes, where?
Or can we view the crawled data entries only after indexing the data crawled by Nutch into Solr?
How can I view the crawled data in the Solr web interface?
Command used for this: ./crawl urls solr localhost:8983/solr/ -depth 3 -topN 5
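One way to check what has been indexed is to ask Solr for all documents with the match-all query `*:*`. This is a minimal sketch under assumptions: Solr is running on localhost:8983, and the index lives in the `collection1` core that the Solr 4.x example setup uses by default (change `CORE` if yours is named differently). The `rows=10` limit is arbitrary.

```shell
#!/bin/sh
# Assumption: Solr is listening on localhost:8983, as in the question.
SOLR_BASE="http://localhost:8983/solr"
# Assumption: default core name of the Solr 4.x example setup.
CORE="collection1"

# Match-all query, JSON output, first 10 documents only.
QUERY_URL="$SOLR_BASE/$CORE/select?q=*:*&wt=json&indent=true&rows=10"
echo "$QUERY_URL"

# Fetch the results if curl is installed; report if Solr does not answer.
if command -v curl >/dev/null 2>&1; then
    curl -s --max-time 5 "$QUERY_URL" || echo "Solr is not reachable at $SOLR_BASE"
fi
```

The printed URL can also be pasted into a browser. If the response reports zero documents, the indexing step probably did not send anything to Solr.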