I am using Nutch 1.9 under Cygwin with Solr 4.8.0. I can index the crawled data into Solr with the command below.

bin/crawl urls/ crawlresult/ http://localhost:8983/solr/ 1

However, I want to add some additional fields while indexing, such as indexed_by, crawled_by, crawl_name, etc.
How can I do this?

Thanks in advance.

1 Answer

If the values of the additional fields do not change, you can use Nutch's index-static plugin, which adds a fixed set of fields and values to every indexed document. First enable the plugin in nutch-site.xml by adding index-static to the plugin.includes property, then list the fields in the index.static property, as shown below:

<property>
 <name>index.static</name>
 <value>indexed_by:solr,crawled_by:nutch-1.8,crawl_name:nutch</value>
 <description>
  Used by the index-static plugin to add fields with static data at indexing time.
  You can specify a comma-separated list of fieldname:fieldcontent per Nutch job.
  Each fieldcontent can have multiple values separated by spaces, e.g.,
  field1:value1.1 value1.2 value1.3,field2:value2.1 value2.2 ...
  It can be useful when collections can't be created by URL patterns,
  as in subcollection, but on a per-job basis.
 </description>
</property>
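
For reference, enabling the plugin could look roughly like the property below, also placed in nutch-site.xml. Treat this as a sketch: the other entries in the value are an assumption based on a typical Nutch 1.x setup, so keep whatever plugin list your installation already uses and only add static to the index-(...) group.

```xml
<property>
 <name>plugin.includes</name>
 <!-- Assumed plugin list; the only change that matters here is adding "static" -->
 <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor|static)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
```

Note that Solr will probably reject documents containing fields it does not know, so indexed_by, crawled_by and crawl_name should also be declared in the Solr schema unless a matching dynamic field already covers them.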

If the values of these fields are not static, for example because they depend on the indexed document, you will need to write your own IndexingFilter plugin. Have a look at the source of the index-static plugin to see how to implement yours.
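
As a starting point, a custom plugin's plugin.xml descriptor might look something like the sketch below, modeled on the layout index-static uses. The plugin id index-crawlmeta, the jar name and the class com.example.nutch.CrawlMetaIndexingFilter are hypothetical names chosen for illustration; the extension point org.apache.nutch.indexer.IndexingFilter is the one indexing filters register against.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical plugin descriptor; the id, jar and class names are placeholders -->
<plugin id="index-crawlmeta" name="Crawl metadata indexing filter"
        version="1.0.0" provider-name="example.com">
   <runtime>
      <!-- Jar produced by building the plugin -->
      <library name="index-crawlmeta.jar">
         <export name="*"/>
      </library>
   </runtime>
   <requires>
      <import plugin="nutch-extensionpoints"/>
   </requires>
   <!-- Registers the class as an indexing filter -->
   <extension id="com.example.nutch.crawlmeta"
              name="Crawl metadata indexing filter"
              point="org.apache.nutch.indexer.IndexingFilter">
      <implementation id="CrawlMetaIndexingFilter"
                      class="com.example.nutch.CrawlMetaIndexingFilter"/>
   </extension>
</plugin>
```

The referenced class would implement the IndexingFilter interface and add its fields to the NutchDocument inside filter(...). Like index-static, the new plugin only runs once its id is matched by plugin.includes.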