2
votes

I am trying to get multi-language stemming working with Solr. I have set up language detection with LangDetectLanguageIdentifierUpdateProcessorFactory as per the official Solr guides. The language is recognized, and now I have a whole bunch of dynamic fields like:

  • description_en
  • description_de
  • description_fr
  • ...

which are properly stemmed.

The question now is: how do I search across so many fields? Building a long query every time that searches across dozens of possible language fields doesn't seem like a smart option. I have tried using copyField like:

<copyField source="description_*" dest="text"/>

but the stemming is lost in the text field when I do that.

The text field is defined as solr.TextField with solr.WhitespaceTokenizerFactory. Maybe I am not setting up the text field properly? How is this supposed to be done?

1
See wiki.apache.org/solr/SchemaXml#Copy_Fields, where it says: "The original text is sent from the 'source' field to the 'dest' field, before any configured analyzers for the originating or destination field are invoked." copyField does not take the tokens from the description_* fields after all the analysis is done. It takes the raw input to the description_* fields and applies the analysis defined for its own field type, which in your case is just TextField with a whitespace tokenizer. So copyField is not a solution for this. - arun
This may help you: lucene.472066.n3.nabble.com/… - arun
Thank you, Arun. I see now why copyField didn't work. The second link is also very helpful. So at this time my only choice is to list all the possible description_[en|fr|de|...] fields to search on in each query. That is still OK, I guess; I was just wondering whether there were other ways to do it. Thank you again for your help, Arun! - user2113581

1 Answer

0
votes

You have multiple options:

  1. Search over all the fields you mentioned. There will always be some overhead: the more fields you query, the slower the search becomes (the slowdown is gradual).

  2. Try to recognise the language of the query and search only the necessary fields, for example the recognised language plus a default one. A language-detection library can be used for this.

  3. Develop a custom solution that keeps multiple languages in one field, which is possible and could work in production according to Trey Grainger.
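
Options 1 and 2 can be sketched roughly as follows. This is only an illustration, not a tested setup: it assumes the edismax query parser with its standard `defType` and `qf` parameters, and the language list, the `build_params` helper, and the `detected` argument are hypothetical names made up for the example. How the query language gets detected is left to whatever library you choose.

```python
from urllib.parse import urlencode

# Assumed language suffixes; these should match the dynamic fields
# that language detection actually produces in your index.
LANGUAGES = ["en", "de", "fr"]


def build_params(user_query, languages=LANGUAGES, field_prefix="description_",
                 detected=None, default="en"):
    """Build Solr request parameters that search the per-language fields.

    Hypothetical helper: 'detected' is the language code returned by some
    query-language detector, or None if no detection was done.
    """
    if detected is not None and detected in languages:
        # Option 2: only the detected language plus a default fallback.
        langs = [detected] if detected == default else [detected, default]
    else:
        # Option 1: search every known language field.
        langs = list(languages)
    return {
        "defType": "edismax",  # Extended DisMax query parser
        "q": user_query,
        # qf takes a space-separated list of fields to query
        "qf": " ".join(f"{field_prefix}{lang}" for lang in langs),
    }


if __name__ == "__main__":
    # Query string to append to the collection's /select URL.
    print(urlencode(build_params("running shoes")))
    print(urlencode(build_params("laufschuhe", detected="de")))
```

Because each description_* field is queried directly, its own stemming analyzer is applied to the query terms, which is what copyField could not provide. The same `qf` list could probably also be set as a default on the request handler in solrconfig.xml, so clients would not have to send it with every request.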

The question is a bit old, but maybe this answer will help other people.