I am new to Nutch and Solr, so I apologize in advance if this is a basic question.

Details of environment:

  1. Virtual Box with Guest OS: Ubuntu 12.04.4, Host OS: Windows 8
  2. Nutch Release: Apache Nutch 1.7
  3. Solr Release: Apache Solr 3.6.2
  4. Referring to wiki.apache.org/nutch/NutchTutorial

I started the crawl with this command:

bin/nutch crawl urls -solr http://mylocalhost:8983/solr/ -depth 3 -topN 5

This command succeeded with no errors.

After that, I opened the Solr admin page in a browser and tried to search with the default query string: \*:*. However, this returned a page with the following content:

This XML file does not appear to have any style information associated with it. The document tree is shown below.
<response>
    <lst name="responseHeader">
        <int name="status">0</int>
        <int name="QTime">0</int>
        <lst name="params">
            <str name="start">0</str>
            <str name="q">*:*</str>
            <str name="rows">10</str>
            <str name="indent">on</str>
            <str name="version">2.2</str>
        </lst>
    </lst>
    <result name="response" numFound="0" start="0"/>
</response>

When I tried to search for 'nutch' in Solr, it returned an error: "HTTP Error 400".

Could you please help me see the data crawled by Nutch so that I can validate it?

1 Answer

The simplest way to validate your data is what you are already trying to do: query it and check that it returns the expected results. Some help with that:

When you say you tried a basic query string, do you mean from the Solr admin page or through the REST API? If you are using the Solr admin page, you don't need to escape that first *, so q would simply be *:*. In the REST API, the query needs to be URL-encoded (the colon becomes %3A). Something like this:

http://your_host_name:8888/solr/your_core_name/select?q=*%3A*&wt=json&indent=true
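
If you would rather build that URL from a script than encode it by hand, here is a minimal Python sketch. The function names build_select_url and count_documents are made up for illustration, the host, port, and core are the same placeholders as above, and I am assuming your Solr returns JSON for wt=json with the count under response.numFound. On a single-core setup the core segment is usually absent, so core is optional here.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def build_select_url(host, port, core=None, query="*:*"):
    """Return a Solr select URL with the query string URL-encoded."""
    # Assumption: single-core setups have no core segment in the path.
    path = f"/solr/{core}/select" if core else "/solr/select"
    # safe="*" leaves the asterisks readable; the colon still becomes %3A.
    params = urlencode({"q": query, "wt": "json", "indent": "true"}, safe="*")
    return f"http://{host}:{port}{path}?{params}"


def count_documents(url):
    """Fetch the URL and return numFound from the JSON response."""
    with urlopen(url) as resp:
        body = json.load(resp)
    return body["response"]["numFound"]


# Placeholder values; replace them with your own host, port, and core.
print(build_select_url("your_host_name", 8888, "your_core_name"))
```

Calling count_documents on the printed URL should give you the same numFound value you see in the XML response, so a result of 0 would mean the index really is empty rather than the query being wrong.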

Another thing you can do is validate some of Nutch's intermediate data by dumping the crawl db or link db with bin/nutch commands such as readdb, readlinkdb, and mergedb.
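
As a rough sketch of what those dumps could look like, the Python snippet below only assembles the command lines and prints them. The helper names and the crawl and dump directory names are assumptions of mine; point crawl_dir at whatever directory your crawl actually wrote (it should contain crawldb and linkdb subdirectories). mergedb is left out because it merges crawl dbs rather than printing them.

```python
import subprocess


def nutch_dump_commands(crawl_dir, out_dir, nutch="bin/nutch"):
    """Build the readdb/readlinkdb command lines for one crawl directory."""
    crawldb = f"{crawl_dir}/crawldb"
    linkdb = f"{crawl_dir}/linkdb"
    return [
        # Summary counts of URLs by status (fetched, unfetched, ...).
        [nutch, "readdb", crawldb, "-stats"],
        # Plain-text dump of every crawl db entry.
        [nutch, "readdb", crawldb, "-dump", f"{out_dir}/crawldb_dump"],
        # Plain-text dump of the inlinks recorded for each URL.
        [nutch, "readlinkdb", linkdb, "-dump", f"{out_dir}/linkdb_dump"],
    ]


def run_all(commands):
    """Run each command from the Nutch home directory, stopping on failure."""
    for cmd in commands:
        subprocess.run(cmd, check=True)


# "crawl" and "dump" are hypothetical directory names.
for cmd in nutch_dump_commands("crawl", "dump"):
    print(" ".join(cmd))
```

If the -stats output shows only unfetched URLs, the problem is in the crawl itself and not in the Solr indexing step.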