4
votes

Lately I have been trying to facet on a field in which some values contain multiple words (a phrase). It has been suggested that I use shingles, but I am not sure that would work as expected, since the required phrases should be taken from a given list.

For example: when I facet on a field, I get separate facets for 'Information' and 'Technology', whereas I want a single facet like 'Information Technology'.

How can I facet on a particular phrase in a particular field?

EDIT: The schema for the required field looks like this:

<fieldType name="text_en_splitting_tight" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>

        <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_en.txt"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="0" generateNumberParts="0" catenateWords="1" catenateNumbers="1" catenateAll="0"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="false"/>
        <filter class="solr.ShingleFilterFactory" maxShingleSize="2" outputUnigrams="true"/>
        <filter class="solr.EnglishMinimalStemFilterFactory"/>
        <!-- this filter can remove any duplicate tokens that appear at the same position - sometimes
             possible with WordDelimiterFilter in conjunction with stemming. -->
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
</fieldType>

The shingle filter doesn't work, as it produces three facets for 'Information Technology': information, technology, and information technology.

1
Would you post the analyzer from your schema.xml that you are using on that facet field? - cheffe
@cheffe, do check out the analyzer of the field type I am using for the required field. I have added it as an EDIT in my question. - abhilashLenka
Why do you have outputUnigrams="true" if you don't want to use unigrams? - soulcheck
@soulcheck: When I set outputUnigrams to false, it does not index entries that consist of only one word, like 'physics'. - abhilashLenka

1 Answer

4
votes

The problem seems to be that the values of the facet field are being split into separate tokens in the index by the analyzer. If you want to facet on a field whose values may contain multiple words, use a field type that does not tokenize the values, such as string. You can populate it through a copyField in Solr, so that your indexing process doesn't really change. For example, you could have something like the field below.

<field name="facet_text_en_nosplit" type="string" indexed="true" stored="false" multiValued="true"/>

Use the above field in your facet query.
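
To fill that field, a copyField rule in schema.xml could look like the sketch below. Note that "category" is only a placeholder I made up for the name of your existing analyzed field; substitute your real field name.

```xml
<!-- Hypothetical source field name "category": replace it with the field you index today. -->
<!-- copyField copies the raw, unanalyzed input value into the string field, -->
<!-- so "Information Technology" is kept as one untokenized value. -->
<copyField source="category" dest="facet_text_en_nosplit"/>
```

After reindexing, a request with facet=true&facet.field=facet_text_en_nosplit should return 'Information Technology' as a single facet value. Keep in mind that a string field is matched exactly, so values that differ only in case or whitespace will show up as separate facets.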