9
votes

I am building a multilingual search, and I will use Lucene to do it.

I already have the translated content; each document exists in 3 or 4 languages.

For indexing and search, I see 4 possible strategies. For each document/content:

  1. each language is indexed in a different index/directory.
  2. each language is indexed in a different document, but in the same index.
  3. each language is indexed in a different Field, but in the same document.
  4. all the languages are indexed in the same Field of one document.

I have not tested any of these approaches yet. Could anyone with experience tell me which one is the better way to do multilingual search?

Thanks!

2 Answers

5
votes

Although the question was asked a couple of years ago, it's still a great question.

There are a couple of aspects to consider when evaluating the different approaches:

  1. are language specific analyzers used at indexing time?
  2. is the query language always known (e.g. user selectable)?
  3. does the query language always match one of the "content" languages?
  4. should only content matching the query language be returned?
  5. is relevancy important?

If (1.) and (5.) apply to your project, you should not consider any strategy that (re-)uses the same field for multiple languages in the same inverted index: the term frequencies of the various languages all get mixed up, regardless of whether you index your multilingual content as one document or as multiple documents. It is also worth knowing that adding "n" language-specific fields does not result in an "n"-times larger index, although it does come with some overhead.
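
To make the "mixed up" term frequencies concrete, here is a minimal sketch in plain Java. It is a toy model of per-field document frequency, not Lucene code; the class name, the field names "content", "content_en" and "content_de", and the six sample sentences are all made up for illustration. The word "die" is rare in the English sentences but appears in every German one, so a shared field makes it look common to an English query as well.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Toy model of per-field document frequency (df); illustration only, not Lucene. */
final class FieldStatsDemo {
    // field name -> term -> number of documents containing the term in that field
    private final Map<String, Map<String, Integer>> docFreq = new HashMap<>();
    // field name -> number of documents that have a value in that field
    private final Map<String, Integer> docCount = new HashMap<>();

    /** Adds one document's text to the given field, counting each term once per document. */
    void add(String field, String text) {
        docCount.merge(field, 1, Integer::sum);
        Set<String> unique = new HashSet<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (!token.isEmpty()) {
                unique.add(token);
            }
        }
        Map<String, Integer> terms = docFreq.computeIfAbsent(field, f -> new HashMap<>());
        for (String term : unique) {
            terms.merge(term, 1, Integer::sum);
        }
    }

    /** Number of documents whose given field contains the term. */
    int df(String field, String term) {
        Map<String, Integer> terms = docFreq.get(field);
        return terms == null ? 0 : terms.getOrDefault(term, 0);
    }

    /** Number of documents that were added to the given field. */
    int docs(String field) {
        return docCount.getOrDefault(field, 0);
    }

    /** Indexes a tiny made-up corpus into one shared field or into one field per language. */
    static FieldStatsDemo build(boolean sharedField) {
        String[][] corpus = {
            {"en", "the battery will die soon"},
            {"en", "the screen is bright"},
            {"en", "charge the battery"},
            {"de", "die batterie ist leer"},
            {"de", "die anzeige ist hell"},
            {"de", "die batterie laden"},
        };
        FieldStatsDemo stats = new FieldStatsDemo();
        for (String[] doc : corpus) {
            String field = sharedField ? "content" : "content_" + doc[0];
            stats.add(field, doc[1]);
        }
        return stats;
    }

    public static void main(String[] args) {
        FieldStatsDemo shared = build(true);
        FieldStatsDemo perLang = build(false);
        // Shared field: "die" occurs in 4 of 6 documents.
        System.out.println("shared:     df(die) = " + shared.df("content", "die")
                + " of " + shared.docs("content"));
        // English-only field: "die" occurs in 1 of 3 documents.
        System.out.println("content_en: df(die) = " + perLang.df("content_en", "die")
                + " of " + perLang.docs("content_en"));
    }
}
```

Since relevance scoring that uses inverse document frequency rewards rare terms, the shared field would weight "die" much lower for an English query than the English-only statistics suggest it should.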


Single Field (Strategies 2 & 4)


+ only one field to query
+ scales well for additional languages
+ can distinguish/filter languages (if multiple documents, and extra language field)
- cannot distinguish/filter languages (if single document)
- cannot just display the queried language (if single document)
- "wrong" term frequencies (as all languages mixed up)

Multiple Fields (Strategy 3)


+ correct term frequencies
+ can easily restrict/filter queries for particular language(s)
+ facilitates Auto-Complete & Spellcheck / Did-You-Mean
- more fields to index
- more fields to query
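
As a rough sketch of what "more fields to query" means in practice, the helper below expands one user query over a set of per-language fields, using the field grouping form of Lucene's classic query syntax. The "content_" + language naming scheme and the class itself are assumptions made for illustration, and the sketch assumes the user input has already been escaped for the query parser.

```java
import java.util.List;
import java.util.StringJoiner;

/** Builds a classic-syntax query string over hypothetical per-language fields. */
final class LanguageFieldQuery {
    /** Assumed naming scheme: one field per language, e.g. "content_en". */
    static String fieldFor(String language) {
        return "content_" + language;
    }

    /**
     * Expands the (already escaped) terms over the given languages.
     * Passing a single language restricts the search to that language.
     */
    static String build(String escapedTerms, List<String> languages) {
        StringJoiner clauses = new StringJoiner(" OR ");
        for (String language : languages) {
            clauses.add(fieldFor(language) + ":(" + escapedTerms + ")");
        }
        return clauses.toString();
    }

    public static void main(String[] args) {
        // Search English and German content at once.
        System.out.println(build("battery charger", List.of("en", "de")));
        // Restrict the search to English only.
        System.out.println(build("battery charger", List.of("en")));
    }
}
```

With per-language fields, each field would normally also get its own analyzer at index and query time, for example through Lucene's PerFieldAnalyzerWrapper.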

Multiple Indices (Strategy 1)


+ correct term frequencies
+ can easily restrict/filter queries for particular language(s)
+ facilitates Auto-Complete & Spellcheck / Did-You-Mean
- each additional language requires its own index

Regardless of whether you use a single field or multiple fields, if you index your content as multiple documents your solution might need to handle result collapsing for matches in the "wrong" language. One approach is to add a language field and filter on it.
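
The two options just mentioned (filtering on a language field, or collapsing the translations of one piece of content into a single result) can be sketched in plain Java as post-processing over a hit list. The Hit class, the contentId shared by all translations, and the language codes are hypothetical; in Lucene itself the filter would more likely be a term query on the language field, added to the main query as a filter clause.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy post-processing of hits when every translation is its own document. */
final class LanguageCollapseDemo {
    /** One search hit; contentId is a hypothetical id shared by all translations. */
    static final class Hit {
        final String contentId;
        final String language;
        final double score;

        Hit(String contentId, String language, double score) {
            this.contentId = contentId;
            this.language = language;
            this.score = score;
        }
    }

    /** Keeps only the hits whose language field equals the requested language. */
    static List<Hit> filterByLanguage(List<Hit> hits, String language) {
        List<Hit> kept = new ArrayList<>();
        for (Hit hit : hits) {
            if (hit.language.equals(language)) {
                kept.add(hit);
            }
        }
        return kept;
    }

    /**
     * Collapses translations: keeps the first hit per contentId.
     * Assumes the input is already sorted by descending score.
     */
    static List<Hit> collapseByContent(List<Hit> hitsSortedByScore) {
        Map<String, Hit> best = new LinkedHashMap<>();
        for (Hit hit : hitsSortedByScore) {
            best.putIfAbsent(hit.contentId, hit);
        }
        return new ArrayList<>(best.values());
    }

    public static void main(String[] args) {
        List<Hit> hits = List.of(
                new Hit("doc-a", "de", 2.0),
                new Hit("doc-a", "en", 1.5),
                new Hit("doc-b", "en", 1.0));
        for (Hit hit : filterByLanguage(hits, "en")) {
            System.out.println("filtered:  " + hit.contentId + " " + hit.language);
        }
        for (Hit hit : collapseByContent(hits)) {
            System.out.println("collapsed: " + hit.contentId + " " + hit.language);
        }
    }
}
```

Filtering returns only the translations in the requested language, while collapsing returns one entry per piece of content in whichever language scored best.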

Recommendation: The approach/strategy you choose depends on your project's requirements. Whenever possible, I would opt for a multiple fields or multiple indices approach.

3
votes

In short, it depends on your needs, but I would go with option 3 or 1.

1) would probably be the best way if there are no overlapping / shared fields between the languages at all.

3) would be the way to go if there are several fields that need to be shared across languages, as this saves disk space and allows a larger part of the index to fit in the file system cache.

I would not recommend 2): it makes your search queries more complex and forces Lucene to consider more documents.

4) will make your search query very complex, unless you want users to be able to search in any language without selecting it first.