2
votes

I am using Solr 4.6.0 and I am trying to get the most frequent terms grouped by year. Because my stopwords may change often, I do not apply them at indexing time. Instead, all dynamic word lists (stopwords, protwords and synonyms) are applied at query time. But although the stopword list includes terms like "of" and "the", they still show up in the result list (see Results).

Question: How can I get faceted and stopword-filtered results when I use the StopFilterFactory only at query time?

Additional information

If I use the StopFilterFactory at indexing time, everything works as expected: terms like "of" and "the" are filtered out when I run my query.

I have also tested the fieldtype text_en with the Solr admin analysis tool and the results are as expected: "of" and "the" are filtered out. Does that mean that the SearchHandler somehow does not call the right analyzer?

Query

http://ip:port/solr/collection1/select?q=*:*&rows=0&facet=true&facet.pivot=year,text

Results

[..]
<lst name="facet_pivot">
  <arr name="year,text">
    <lst>
      <str name="field">year</str>
      <int name="value">2009</int>
      <int name="count">139</int>
      <arr name="pivot">
        <lst>
          <str name="field">text</str>
          <str name="value">of</str>
          <int name="count">135</int>
        </lst>
        <lst>
          <str name="field">text</str>
          <str name="value">the</str>
          <int name="count">135</int>
        </lst>
        <lst>
          <str name="field">text</str>
          <str name="value">and</str>
          <int name="count">123</int>
[..]

Schema.xml

<field name="year" type="int" indexed="true" stored="true" />
<field name="text" type="text_en" indexed="true" stored="true" multiValued="true" />
[..]
<fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPossessiveFilterFactory"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_en.txt" />
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPossessiveFilterFactory"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>
4
Could you expound a little on why your stop words change frequently? I'm wondering if a different approach is needed here. - Mark Leighton Fisher

4 Answers

1
votes

Please see the thread "does solr support query time only stopwords?" on the Solr mailing list.

It sounds very similar to your requirements. The workaround there was to enable the StopFilterFactory at index time as well, but without specifying a stopwords file, to get it working as expected.

1
votes

Is it not because of your query?

http://ip:port/solr/collection1/select?q=*:*&rows=0&facet=true&facet.pivot=year,text

From what I can see, you are searching for everything, so the stopwords are returned as well. If the query is passed to the analyzer, the filters of the analyzer only see

*:* 

as the query, so I don't think it will remove anything from the query string that way.

If you really want to search for everything but without any stopwords, you can try searching with a negative query. If you do this, you will need a different configuration that does not filter stopwords from the query, and then you add the stopwords manually as a negative query. You are then searching for anything, but leaving out the results that match the negative query.

An easier way (and a better one, in my opinion) to get what you want is to use a copy field in the field configuration, although this will increase your index size. What we do with our Solr is this: besides the normal field, we have additional language fields such as text_en, text_de, text_es, etc. A language detector detects the language, copies the field to the appropriate language field, and the correct stopword filter is run on it.

You can do the same in your schema.xml: create a new field, text_en_filtered, copy the content of the text field into it, and filter the stopwords there. Then you can search (and facet) on that field, which no longer contains any stopwords.

<!-- multiValued="true" here because the copyField source "text" is multiValued -->
<field name="text_en_filtered" type="text_en_filtered" indexed="true" stored="false" multiValued="true"/>
<copyField source="text" dest="text_en_filtered"/>
<fieldType name="text_en_filtered" class="solr.TextField" positionIncrementGap="100">
    <!-- ... analyzer with stopword filtering here ... -->
</fieldType>
0
votes

Sorry, your question is not clear, so I am guessing at what you might be asking. Here is how stopwords are processed. If you have <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_en.txt" /> at index time, Solr will not index the stopwords and you will not see them in your resulting facets. You also need to use the filter at query time for proper matches.

If you have <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_en.txt" /> at query time, you are only removing the stopwords from your query phrase before Solr executes your query.

Update: Your confusion seems to stem from a misunderstanding of the analysis chain. Your q parameter is *:*, so a StopFilterFactory at query time, as mentioned above, filters stopwords from *:* and not from the query results. You still end up with stopwords in your results because you are faceting on text. Query-time analysis applies to the QUERY, not to the results. Your "text" field still contains the stopwords, so they show up in the facets. In this case it is better and easier to remove the results that you do not want on the client side.
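
A minimal Python sketch of such client-side filtering follows. It assumes the response was requested with wt=json and that the list under facet_counts.facet_pivot["year,text"] has already been parsed into Python dicts. The function name drop_stopword_pivots is hypothetical, and the sample data is hand-written to mirror the counts in the question, plus one made-up term. Note that facet values are indexed terms (here lowercased and stemmed), so the stopword list has to be given in that form.

```python
def drop_stopword_pivots(pivot_entries, stopwords):
    """Return a copy of a facet.pivot list with stopword terms
    removed from the nested (second-level) pivot of each entry."""
    stop = {w.lower() for w in stopwords}
    cleaned = []
    for entry in pivot_entries:
        kept = [p for p in entry.get("pivot", [])
                if str(p["value"]).lower() not in stop]
        # Copy the outer entry so the parsed response is left untouched.
        cleaned.append({**entry, "pivot": kept})
    return cleaned


# Hand-written sample in the shape of the question's "year,text" pivot.
sample = [{
    "field": "year", "value": 2009, "count": 139,
    "pivot": [
        {"field": "text", "value": "of", "count": 135},
        {"field": "text", "value": "the", "count": 135},
        {"field": "text", "value": "and", "count": 123},
        {"field": "text", "value": "exampl", "count": 40},  # made-up term
    ],
}]

filtered = drop_stopword_pivots(sample, ["of", "the", "and"])
print([p["value"] for p in filtered[0]["pivot"]])
```

Because the stopwords are removed after Solr has applied facet.limit, the request would need a limit large enough that real terms are still left once the stopwords are dropped.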

0
votes

I am afraid that you will have to reindex, unless you can dig into the faceting code and filter the stopwords out before the aggregation step. You can speed up the process by reindexing only the documents that contain the new stopword(s).
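
One way to find the documents that need reindexing is sketched below in Python. It only builds the select URL. The helper name reindex_candidates_url, the unique key field "id" and the rows value are assumptions for illustration. The sketch uses Solr's {!term} query parser so that the word is looked up as an indexed term without going through the query analyzer, which might otherwise strip a word that is already on the stopword list. The word therefore has to be passed in its indexed (lowercased, stemmed) form.

```python
from urllib.parse import urlencode


def reindex_candidates_url(base_url, field, indexed_term,
                           id_field="id", rows=1000):
    """Build a select URL that lists the ids of documents whose
    `field` contains `indexed_term` as an indexed term."""
    params = {
        # {!term} bypasses query-time analysis for the given field.
        "q": "{!term f=%s}%s" % (field, indexed_term),
        "fl": id_field,   # assumed unique key field
        "rows": rows,     # assumed page size; page through larger sets
        "wt": "json",
    }
    return base_url.rstrip("/") + "/select?" + urlencode(params)


# Placeholder host and core taken from the question's own URL.
url = reindex_candidates_url("http://ip:port/solr/collection1", "text", "of")
print(url)
```

The returned ids can then be fed to whatever process originally indexed the documents.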