After a webpage has been crawled with Apache Nutch 2.2.1, the contents of that page are pushed to Solr. Solr stores the contents of entire webpages in the "content" field, so the data in that field is usually very large. Here are my concerns:

Should I index the "content" field in Solr? Indexing such a large field will increase index size. In Solr's schema.xml file I found the following recommendation:

NOTE: This field is not indexed by default, since it is also copied to "text"
using copyField below. This is to save space. Use this field for returning and
highlighting document content. Use the "text" field to search the content.

<field name="content" type="text_general" indexed="false" stored="true" multiValued="true"/>

However, if I leave this field unindexed, will search response time increase significantly?

I'd greatly appreciate any information that helps me understand the benefits of indexing this large field versus leaving it unindexed.

1 Answer

If you're going to search against a field, it needs to be indexed. The example in the schema assumes that, since you're going to search against "text" instead of "content", there is no need to index the same data twice. It does, however, keep "content" as a stored field so that the original text can be displayed in the application or used for highlighting (which requires the whole field content to be available).

If you don't see any situation where you'll need to query the field directly, there is no need to index it.
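
For illustration, the arrangement the schema comment describes might look roughly like this in schema.xml. This is a sketch modeled on the stock Solr example schema, not your exact file: the "text" field definition and the copyField line are assumptions about what your Nutch-supplied schema contains, so check them against your own copy.

```xml
<!-- Stored only: returned in results and used for highlighting, not searched directly -->
<field name="content" type="text_general" indexed="false" stored="true" multiValued="true"/>

<!-- Indexed only (assumed definition): the catch-all field that queries run against -->
<field name="text" type="text_general" indexed="true" stored="false" multiValued="true"/>

<!-- Copies the incoming "content" value into "text" at index time (assumed to be present) -->
<copyField source="content" dest="text"/>
```

With a layout like this, the page text is indexed once (in "text") and stored once (in "content"). Setting indexed="true" on "content" as well would build a second index over the same text.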