1
votes

Solr provides field types out of the box in the managed schema for different languages such as English, French, and Japanese.

We are using the common field type "text_general" for our field declarations and stopwords.txt for stopword filtering.

    <fieldType name="text_general" class="solr.TextField">
      <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.EdgeNGramFilterFactory" maxGramSize="20" minGramSize="1"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
        <filter class="solr.SynonymGraphFilterFactory" expand="true" ignoreCase="true" synonyms="synonyms.txt"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>

While syncing data to the Solr core, we import text in different languages into these fields, such as French, English, and German.

My question: should we put the stopwords for all languages into the same "stopwords.txt" file, or how does Solr handle stopwords for different languages?

1
You'll want to define fields specific to each language with relevant settings. You probably don't want the same synonyms applied to each language either, and the same goes for stop words. You probably also want language-specific stemming in some cases. Define fieldname_en, fieldname_jp, etc. - MatsLindh
A standard Solr installation already defines language-specific field types (e.g. text_en and text_cjk), and each of them uses different analyzers, stop words, and synonyms. You can see this via curl http://your-solr/solr/your-core/schema/fieldtypes/text_cjk and curl http://your-solr/solr/your-core/schema/fieldtypes/text_en - Hector Correa
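
To make the per-language suggestion in these comments concrete, here is a rough sketch of what separate field types and fields could look like. The field names (title_en, title_fr, title_de) are hypothetical, and the stopword paths assume the lang/ files shipped with Solr's default configset are present in your core's conf directory. If your schema already defines text_en, text_fr, and text_de, only the field declarations at the bottom would be needed.

```xml
<!-- Sketch only: these elements would live inside <schema> in the managed schema. -->
<!-- English: its own stopword list plus an English stemmer. -->
<fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" words="lang/stopwords_en.txt" ignoreCase="true"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>

<!-- French: the shipped stopword file is in Snowball format, hence format="snowball". -->
<fieldType name="text_fr" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" words="lang/stopwords_fr.txt" format="snowball" ignoreCase="true"/>
    <filter class="solr.FrenchLightStemFilterFactory"/>
  </analyzer>
</fieldType>

<!-- German: same pattern with German normalization and stemming. -->
<fieldType name="text_de" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" words="lang/stopwords_de.txt" format="snowball" ignoreCase="true"/>
    <filter class="solr.GermanNormalizationFilterFactory"/>
    <filter class="solr.GermanLightStemFilterFactory"/>
  </analyzer>
</fieldType>

<!-- Hypothetical per-language fields; route each document's text to the matching one at index time. -->
<field name="title_en" type="text_en" indexed="true" stored="true"/>
<field name="title_fr" type="text_fr" indexed="true" stored="true"/>
<field name="title_de" type="text_de" indexed="true" stored="true"/>
```

With this layout each language keeps its own stopword and stemming rules, so a word that is a stop word in one language but meaningful in another is not dropped everywhere.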

1 Answer

0
votes

Do not remove stop words. Stop word removal is a disk space saving hack left over from 32-bit machines in the 1970s.

I've never removed stop words and I started working in search 25 years ago at Infoseek (which did not remove stop words).

Removing them from the index makes some queries impossible, like "vitamin a". When I was building search at Netflix, I accidentally left the stop word removal configured and found a whole set of movie titles that were 100% stop words. That list is in this blog post.

https://observer.wunderwood.org/2007/05/31/do-all-stopword-queries-matter/

The "idf" score in a tf.idf system like Solr does the same job as stop words, but better. It gives common words a lower score based on the statistics of this particular collection.
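
As a rough illustration of that effect, assuming the BM25 similarity that recent Solr versions use by default, the idf term for a word appearing in n of N documents can be worked through with made-up numbers (the collection size and document counts below are hypothetical):

```latex
\mathrm{idf}(t) = \ln\!\left(1 + \frac{N - n_t + 0.5}{n_t + 0.5}\right)

% Hypothetical collection: N = 1{,}000{,}000 documents
% "the" appears in 950,000 documents:
\mathrm{idf}(\text{the}) = \ln\!\left(1 + \frac{50\,000.5}{950\,000.5}\right) \approx 0.05

% "vitamin" appears in 1,000 documents:
\mathrm{idf}(\text{vitamin}) = \ln\!\left(1 + \frac{999\,000.5}{1\,000.5}\right) \approx 6.91
```

Under these assumptions the common word contributes less than a hundredth of the rare word's weight, yet it stays in the index and remains searchable.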

Do not remove stop words.