2 votes

I have just started using Nutch 1.9 and Solr 4.10.

After browsing through certain pages I see that the syntax for running this version has changed, and I have to update certain XML files to configure Nutch and Solr.

This version of the package doesn't require Tomcat to run. I started Solr with:

java -jar start.jar

and checked localhost:8983/solr/admin; it's working.

I planted a seed in bin/url/seed.txt, and the seed is "simpleweb.org".
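
(For what it's worth, Nutch expects full URLs in the seed file, one per line, so the entry probably needs to look like this rather than a bare host name:)

http://simpleweb.org/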

I ran this command in Nutch: ./crawl urls -dir crawl -depth 3 -topN 5

I got a few IO exceptions along the way, so to avoid them I downloaded patch-hadoop_7682-1.0.x-win.jar, made an entry in nutch-site.xml, and placed the jar file in Nutch's lib directory.
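
The nutch-site.xml entry, following the patch's instructions, is along these lines (the fs.file.impl class name is taken from the patch writeup, so check it against the jar you downloaded):

<property>
  <name>fs.file.impl</name>
  <value>com.conceptcove.hadoop.fs.WinLocalFileSystem</value>
  <description>Workaround for HADOOP-7682 on Windows</description>
</property>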

After running Nutch, the following folders were created:

apache-nutch-1.9\bin\-dir\crawldb\current\part-00000

I can see the following files in that path:

data
index
.data.crc
.index.crc

I want to know what to do with these files and what the next steps are. Can we view these files? If yes, how?

I indexed the crawled data from Nutch into Solr:

For linking Solr with Nutch (the command completed successfully): ./crawl urls solr http://localhost:8983/solr/ -depth 3 -topN 5

Why do we need to index the data crawled by Nutch into Solr?

After crawling using Nutch (command used for this: ./crawl urls -dir crawl -depth 3 -topN 5), can we view the crawled data? If yes, where?

Or can we view the crawled data entries only after indexing the data crawled by Nutch into Solr?

How do I view the crawled data in the Solr web interface?

Command used for this: ./crawl urls solr localhost:8983/solr/ -depth 3 -topN 5


2 Answers

1 vote

Although Nutch was built to be a web-scale search engine, this is not the case any more. Currently, the main purpose of Nutch is to do large-scale crawling. What you do with that crawled data is up to your requirements. By default, Nutch allows you to send data into Solr. That is why you can run

crawl <url dir> <crawl dir> <solr address> <depth level>

You can also omit the solr url parameter. In that case, Nutch will not send the crawled data into Solr. Without sending the crawled data to Solr, you will not be able to search it. Crawling data and searching data are two different things, but they are very related.
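
For example, the two commands from the question map exactly onto these two modes:

./crawl urls -dir crawl -depth 3 -topN 5                          # crawl only; nothing is sent to Solr, so nothing is searchable yet
./crawl urls solr http://localhost:8983/solr/ -depth 3 -topN 5    # crawl and then index into Solr, so the data becomes searchable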

Generally, you will find the crawled data in crawl/segments, not crawl/crawldb. The crawldb folder stores information about the crawled URLs, their fetch status and the next fetch time, plus some other useful information for crawling. Nutch stores the actual crawled data in crawl/segments.
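
If you want to look inside those crawldb and segment files, you do not open them by hand (they are Hadoop map/sequence files); you use the Nutch command-line tools instead, roughly like this (paths assume the crawl/ layout above, and <segment_dir> stands for one of the timestamped folders under crawl/segments):

bin/nutch readdb crawl/crawldb -stats                               # summary statistics about the crawled urls
bin/nutch readdb crawl/crawldb -dump crawldb_dump                   # plain-text dump of the crawldb
bin/nutch readseg -dump crawl/segments/<segment_dir> segment_dump   # dump the fetched and parsed content of one segment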

If you want an easy way to view crawled data, you might try Nutch 2.x, as it can store its crawled data in several back ends such as MySQL, HBase and Cassandra through the Gora component.

To view data in Solr, you simply issue a query to Solr like this:

curl http://127.0.0.1:8983/solr/collection1/select/?q=*:*

Otherwise, you can always push your data into different stores by adding indexer plugins. Currently, Nutch supports sending data to Solr and Elasticsearch. These indexer plugins send structured data such as title, text, author and other metadata.
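
As a rough sketch, with the Solr indexer plugin the indexing step on its own looks like this (the exact flags vary a little between Nutch versions, so check the usage printed by bin/nutch solrindex):

bin/nutch solrindex http://localhost:8983/solr/ crawl/crawldb -linkdb crawl/linkdb crawl/segments/<segment_dir>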

The following summarizes what happens in Nutch:

seed list -> crawldb -> fetching raw data (download site contents)
-> parsing the raw data -> structuring the parsed data into fields (title, text, anchor text, metadata and so on)
-> sending the structured data to storage for usage (like Elasticsearch and Solr).
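
In terms of commands, the crawl script essentially loops over the individual tools in that order, roughly like this (a sketch only; <segment_dir> is the timestamped folder that the generate step creates):

bin/nutch inject crawl/crawldb urls                                # seed list -> crawldb
bin/nutch generate crawl/crawldb crawl/segments -topN 5            # pick urls to fetch and create a new segment
bin/nutch fetch crawl/segments/<segment_dir>                       # download site contents (raw data)
bin/nutch parse crawl/segments/<segment_dir>                       # parse the raw data into fields
bin/nutch updatedb crawl/crawldb crawl/segments/<segment_dir>      # feed the results back into the crawldb
bin/nutch invertlinks crawl/linkdb -dir crawl/segments             # build the link database (anchor text)
# finally, run the indexing command shown above to send the structured data to Solr or Elasticsearch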

Each of these stages is extendable and allows you to add your logic to suit your requirements.

I hope that clears your confusion.

0 votes

You can run Nutch on Windows. I am also a beginner; yes, it is a bit tough to install on Windows, but it does work! The "input path doesn't exist" problem can be solved by replacing the hadoop-core-1.2.0.jar file in apache-nutch-1.9/lib with hadoop-core-0.20.2.jar (from Maven), then renaming the new file to hadoop-core-1.2.0.jar.
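
A rough sketch of that jar swap, assuming a Cygwin shell and the usual Maven Central path (verify the download URL yourself, and keep a backup of the original jar):

# download hadoop-core 0.20.2 from Maven Central
curl -O https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-core/0.20.2/hadoop-core-0.20.2.jar
# back up the jar that ships with Nutch, then drop in the older one under the same name
mv apache-nutch-1.9/lib/hadoop-core-1.2.0.jar apache-nutch-1.9/lib/hadoop-core-1.2.0.jar.bak
cp hadoop-core-0.20.2.jar apache-nutch-1.9/lib/hadoop-core-1.2.0.jar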