Lucene Multilingual text field

Question

I have looked at the question "Indexing multilingual words in lucene", and it confirmed some of my suspicions.

I have an entity with a number of fields I wish to index. One of these fields can be in any one of several languages, and I need to use a different analyzer for each language.

Is it better to implement this as different fields in the same index, or as a separate index for each language?

I am guessing that the trade-off is between the overhead of running multiple indexes and the suckiness of cluttering up a single index.

Any advice appreciated.

Will you ever need to search multiple languages at the same time? If so, you can't use multiple indexes. — Xodarap

Xodarap · Accepted Answer · 2011-03-21T15:06:00

One additional idea that you didn't mention: you can make each language a non-stored, non-indexed field. Then you can copy all the (analyzed) data to a single stored and indexed field, and it will behave as though you're searching a single field. (This is analogous to Solr's copy fields. I'm not sure how hard it would be to do in Hibernate.)

If you keep them in separate indexes, you should note that you won't be able to search across languages easily (or, arguably, at all). So if you want to allow queries like "english:foo dutch:foo", you'll need them in the same index.
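To make the single-index layout concrete, here is a toy in-memory model of it: one index, one field per language, each field with its own analyzer, and an OR query that spans fields. This is only a sketch and not the Lucene API. The class `MultilingualIndexSketch`, its methods, and the stop-word "analyzers" are all hypothetical stand-ins. In Lucene itself, the per-field analyzer mapping is the job of `PerFieldAnalyzerWrapper`.

```java
import java.util.*;
import java.util.function.Function;

// Toy model only (hypothetical names, NOT the Lucene API):
// a single index holding one field per language, each with its own analyzer.
class MultilingualIndexSketch {
    // field name -> analyzer (raw text -> list of terms)
    private final Map<String, Function<String, List<String>>> analyzers = new HashMap<>();
    // "field:term" -> ids of the documents that contain the term in that field
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    void registerField(String field, Function<String, List<String>> analyzer) {
        analyzers.put(field, analyzer);
    }

    // Index text into the given field using that field's analyzer.
    void add(int docId, String field, String text) {
        for (String term : analyzers.get(field).apply(text)) {
            postings.computeIfAbsent(field + ":" + term, k -> new TreeSet<>()).add(docId);
        }
    }

    // OR together clauses of the form "field:text", e.g. "english:foo", "dutch:foo".
    Set<Integer> searchAny(String... clauses) {
        Set<Integer> hits = new TreeSet<>();
        for (String clause : clauses) {
            String[] parts = clause.split(":", 2);
            // The query text goes through the same analyzer as the field it targets.
            for (String term : analyzers.get(parts[0]).apply(parts[1])) {
                hits.addAll(postings.getOrDefault(parts[0] + ":" + term,
                        Collections.<Integer>emptySet()));
            }
        }
        return hits;
    }

    // Stand-in "analyzer": lowercase, split on non-letters, drop stop words.
    static Function<String, List<String>> analyzer(Set<String> stopWords) {
        return text -> {
            List<String> terms = new ArrayList<>();
            for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
                if (!t.isEmpty() && !stopWords.contains(t)) {
                    terms.add(t);
                }
            }
            return terms;
        };
    }
}
```

Because both language fields live in the same structure, a query such as `searchAny("english:foo", "dutch:foo")` can return documents from either language in one pass. With two separate indexes you would have to run two searches and merge the results yourself.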

From a performance standpoint, it would depend on how much data is shared. If the documents are disjoint (i.e. no document has two languages in it) then there probably won't be that much of a difference between having it in one index vs. two. The more data they share, the more memory Lucene will duplicate, so it will become better to have one index. My guess is that this is only an issue if you have a lot of stored data, but YMMV.
